• TF-IDF (term frequency-inverse document frequency) scores how important each word is to a document within a collection of documents. It is a common way to turn text data into numeric features for building machine learning models.

TF stands for term frequency. It is the number of times the word “x” appears in a sentence.

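IDF stands for inverse document frequency. Document frequency is the number of documents that contain the word “x”. IDF inverts it, so a word that appears in many documents gets a lower weight than a word that appears in only a few.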
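
As a rough sketch of these definitions, the snippet below counts term frequency and document frequency by hand for three short example sentences. The helper names (`term_frequency`, `document_frequency`, `smoothed_idf`) are made up for illustration, and the smoothed formula ln((1 + n) / (1 + df)) + 1 is an assumption that matches scikit-learn's default settings; other definitions of IDF exist.

```python
import math
from collections import Counter

# Three short example sentences; each one is treated as a document.
docs = ["We are good", "We are becoming better", "We will be great"]

# Simplified tokenization (an assumption): lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]


def term_frequency(tokens):
    """Count how many times each word occurs in one document."""
    return Counter(tokens)


def document_frequency(tokenized_docs):
    """Count how many documents contain each word at least once."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def smoothed_idf(df_value, n_docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + df_value)) + 1


tf_first = term_frequency(tokenized[0])
df = document_frequency(tokenized)
idf = {word: smoothed_idf(count, len(docs)) for word, count in df.items()}

print(tf_first["good"])        # occurrences of "good" in the first sentence
print(df["we"], df["good"])    # number of sentences containing each word
print(round(idf["we"], 4), round(idf["good"], 4))
```

A word found in every sentence, such as "we", ends up with the lowest IDF, while a word found in only one sentence, such as "good", ends up with the highest.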

Natural language processing (NLP) uses the TF-IDF technique to convert text documents into a machine-readable form. Here each sentence is a document, and the words in the sentence are tokens. TfidfVectorizer creates a matrix with one row per document and one column per token, filled with TF-IDF scores, which is why the result is also known as a document-term matrix (DTM).
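
To make the document and token picture concrete, here is a small sketch in plain Python that builds a document-term matrix of raw counts. The regular expression mirrors TfidfVectorizer's default token pattern (lowercased words of two or more characters). The names `tokenize`, `vocabulary` and `count_matrix` are illustrative, and the matrix holds counts rather than TF-IDF scores.

```python
import re

docs = ["We are good", "We are becoming better", "We will be great"]

# Same pattern as TfidfVectorizer's default token_pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and extract its tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokenized = [tokenize(doc) for doc in docs]

# The vocabulary is the sorted set of all tokens; each token gets one column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per token, each cell a raw count.
count_matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row has one entry per vocabulary word, so three sentences with eight distinct tokens give a 3 x 8 matrix.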

 

# Imports (only the libraries the example actually uses)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Let's create sample data
data = ['We are good',
        'We are becoming better',
        'We will be great']

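# Instantiate the TF-IDF vectorizer and fit it to the sample data
tfvec = TfidfVectorizer()
tdf = tfvec.fit_transform(data)  # sparse matrix: one row per document, one column per token
# Recent scikit-learn releases replace get_feature_names() with get_feature_names_out()
bow = pd.DataFrame(tdf.toarray(), columns=tfvec.get_feature_names_out())
bow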
        are        be  becoming    better      good     great        we      will
0  0.547832  0.000000  0.000000  0.000000  0.720333  0.000000  0.425441  0.000000
1  0.444514  0.000000  0.584483  0.584483  0.000000  0.000000  0.345205  0.000000
2  0.000000  0.546454  0.000000  0.000000  0.000000  0.546454  0.322745  0.546454
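
The scores in the table can be checked by hand. The sketch below assumes TfidfVectorizer's default settings (raw counts for TF, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row) and uses a simplified whitespace tokenizer. `tfidf_row` is an illustrative helper, not part of scikit-learn.

```python
import math
from collections import Counter

docs = ["We are good", "We are becoming better", "We will be great"]
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(docs)

# Document frequency: number of documents containing each token.
df = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed IDF, assumed to match TfidfVectorizer's defaults.
idf = {token: math.log((1 + n_docs) / (1 + count)) + 1 for token, count in df.items()}


def tfidf_row(tokens):
    """Return L2-normalized TF-IDF scores for one document."""
    tf = Counter(tokens)
    raw = {token: count * idf[token] for token, count in tf.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {token: value / norm for token, value in raw.items()}


scores = [tfidf_row(tokens) for tokens in tokenized]
for row in scores:
    print({token: round(value, 6) for token, value in sorted(row.items())})
```

Under these assumptions the first sentence comes out at roughly 0.5478 for "are", 0.7203 for "good" and 0.4254 for "we", which agrees with the first row of the table. "we" scores lowest because it appears in every sentence.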

That’s all for this mini tutorial. To sum it up, we learned what TF-IDF scores are and how to compute them with TfidfVectorizer.

Hope it was easy, cool and simple to follow. Now it’s over to you.