Stop words are common words such as a, an, the, is, has, of and are. Most of the time they add noise to the features rather than useful signal.
Removing stop words therefore helps build a cleaner dataset with better features for a machine learning model. For text-based problems, the bag-of-words approach is a common technique. Let’s create a bag of words with no stop words.
Instantiating CountVectorizer with the stop_words parameter tells it to remove stop words. Passing stop_words='english' uses scikit-learn's built-in English stop word list.
# Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Let's create sample data
data = ['We are good',
        'We are becoming better',
        'We will be great',
        'This project is of great importance to us']

# Instantiate count vectorizer; stop_words='english' drops the built-in English stop words
countvec = CountVectorizer(stop_words='english')

# Learn the vocabulary and build the document-term matrix
cdf = countvec.fit_transform(data)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow = pd.DataFrame(cdf.toarray(), columns=countvec.get_feature_names_out())
bow
| | better | good | great | importance | project |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 |
This cleaned dataset is now ready to be passed to a machine learning algorithm.
That’s all for this mini tutorial. To sum up, we learned how to remove stop words when building a bag of words.
Hope it was easy and simple to follow. Now it’s over to you.