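The bag of words (BoW) model is a way to preprocess text data before building machine learning models.
Natural language processing (NLP) uses the BoW technique to convert text documents into a numeric form that a machine can work with. In this tutorial each sentence is treated as a document, and the words in the sentence are its tokens. A count vectorizer builds a matrix with one row per document and one column per token, where each cell holds the number of times that token occurs in that document. Word order is discarded and only the counts are kept, hence the name "bag" of terms/tokens. This matrix is also known as a document term matrix (DTM).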
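To make the idea concrete before bringing in scikit-learn, here is a minimal sketch that counts tokens with only the Python standard library. The helper name `build_dtm` and the two sample sentences are made up for illustration, and the tokenizer is a plain lowercase-and-split, which is simpler than what CountVectorizer does by default (it also strips punctuation and drops single-character tokens).

```python
from collections import Counter


def build_dtm(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Simplified tokenizer (assumption): lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct token, sorted alphabetically.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a missing term counts as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["the cat sat", "the cat saw the dog"]
vocab, dtm = build_dtm(docs)
print(vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
for row in dtm:
    print(row)  # [1, 0, 1, 0, 1] then [1, 1, 0, 1, 2]
```

Each row is one document and each column is one vocabulary term, which is the same layout as a document term matrix.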
Code to create Bag of Words from sentences
# Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Let's create sample data: each sentence is one document
data = ['We are good',
        'We are becoming better',
        'We will be great']
# Instantiate count vectorizer
countvec = CountVectorizer()
# Learn the vocabulary and count tokens; the result is a sparse matrix
cdf = countvec.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow = pd.DataFrame(cdf.toarray(), columns=countvec.get_feature_names_out())
bow
| | are | be | becoming | better | good | great | we | will |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
That's how we build a bag of words model
That's all for this mini tutorial. To sum it up, we learned how to build a bag of words model from a few sentences using CountVectorizer.
Hope it was easy, cool and simple to follow. Now it's over to you.