The bag-of-words (BoW) model is a way to preprocess text data for building machine learning models.

Natural language processing (NLP) uses the BoW technique to convert text documents into a machine-readable form. Each sentence is a document, and the words in the sentence are tokens. A count vectorizer creates a matrix with one row per document and one column per token, where each cell holds the number of times that token occurs in that document (a bag of terms/tokens). This is why the result is also known as a document-term matrix (DTM).
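To make the idea concrete before reaching for a library, here is a minimal sketch that builds a document-term matrix by hand. The helper names `tokenize` and `bag_of_words` and the two sample sentences are made up for illustration, and the regular expression only approximates how a count vectorizer splits text by default.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # Tokenize every document once.
    tokenized = [tokenize(doc) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Each row holds the count of every vocabulary term in one document.
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Illustrative sentences (not the tutorial's sample data)
docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column is one token, so the repeated word "the" in the second sentence shows up as a count of 2 in its column.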

Code to create a bag of words from sentences

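# Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Let's create sample data: each sentence is one document
data = ['We are good',
        'We are becoming better',
        'We will be great']

# Instantiate the count vectorizer
countvec = CountVectorizer()

# Learn the vocabulary and count the tokens in each document (returns a sparse matrix)
cdf = countvec.fit_transform(data)

# Build a DataFrame with one column per token.
# get_feature_names_out() needs scikit-learn 1.0 or later; the older
# get_feature_names() was removed in scikit-learn 1.2.
bow = pd.DataFrame(cdf.toarray(), columns=countvec.get_feature_names_out())
bow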
   are  be  becoming  better  good  great  we  will
0    1   0         0       0     1      0   1     0
1    1   0         1       1     0      0   1     0
2    0   1         0       0     0      1   1     1

That's all for this mini tutorial. To sum it up, we learned how to build a bag-of-words model.

I hope it was easy and simple to follow. Now it's over to you.
