• Stop words are very common words such as a, an, the, is, has, of and are. Most of the time they carry little meaning and only add noise to the features.

• Removing stop words therefore helps build a cleaner dataset with better features for a machine learning model. For text-based problems, the bag-of-words approach is a common technique. Let’s create a bag of words with no stop words.

• Instantiating CountVectorizer with stop_words='english' tells it to drop every word found in scikit-learn's built-in English stop word list.
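Before building the bag of words, it can help to peek at what the built-in list contains. The snippet below is a minimal sketch, assuming scikit-learn is installed; the variable names (examples, in_list, content_words, kept) are only illustrative.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# The stop words mentioned above: check each against the built-in list
examples = ['a', 'an', 'the', 'is', 'has', 'of', 'are']
in_list = {word: word in ENGLISH_STOP_WORDS for word in examples}
print(in_list)

# Content words are not in the list, so they survive the filtering
content_words = ['good', 'great', 'project']
kept = [word for word in content_words if word not in ENGLISH_STOP_WORDS]
print(kept)
```

ENGLISH_STOP_WORDS is the same list that stop_words='english' uses, so a quick membership check like this shows in advance which words will disappear from the features.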

# Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Let's create sample data
data = ['We are good',
        'We are becoming better',
        'We will be great',
        'This project is of great importance to us']

# Instantiate count vectorizer that drops English stop words
countvec = CountVectorizer(stop_words='english')

# Learn the vocabulary and count the words in each sentence
cdf = countvec.fit_transform(data)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow = pd.DataFrame(cdf.toarray(), columns=countvec.get_feature_names_out())
bow
   better  good  great  importance  project
0       0     1      0           0        0
1       1     0      0           0        0
2       0     0      1           0        0
3       0     0      1           1        1
 

Now this cleaned dataset is ready for machine learning algorithms. Learn to build a complete spam classifier from start to end.

That’s all for this mini tutorial. To sum it up, we learned about stop words removal.

Hope it was easy, cool and simple to follow. Now it’s over to you.