Stop words are common words such as a, an, the, is, has, of and are. They show up in almost every sentence but carry little meaning on their own, so most of the time they add noise to the features.
Removing stop words therefore helps build a cleaner dataset with better features for a machine learning model. For text-based problems, the bag-of-words approach is a common technique. Let’s create a bag of words with no stop words.
By instantiating CountVectorizer with the stop_words parameter set to 'english', we tell it to drop the words in scikit-learn's built-in English stop word list.
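# Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Let's create sample data
data = ['We are good',
        'We are becoming better',
        'We will be great',
        'This project is of great importance to us']

# Instantiate count vectorizer and ask it to drop English stop words
countvec = CountVectorizer(stop_words='english')

# Learn the vocabulary and build the document-term count matrix
cdf = countvec.fit_transform(data)

# One row per sentence, one column per remaining word.
# get_feature_names_out() replaces the older get_feature_names(),
# which recent scikit-learn releases no longer provide.
bow = pd.DataFrame(cdf.toarray(), columns=countvec.get_feature_names_out())
bow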
Now this cleaned dataset is ready for machine learning algorithms. Learn to build a complete spam classifier from start to end.
That’s all for this mini tutorial. To sum it up, we learned about stop words removal.
I hope it was easy, cool and simple to follow. Now it’s over to you.