Gini index vs Entropy
Gini index and entropy are the criteria for calculating information gain. Decision tree algorithms use information gain to decide where to split a node.
Both Gini index and entropy are measures of the impurity of a node. A node containing samples from multiple classes is impure, whereas a node containing samples from only one class is pure. Entropy in statistics is analogous to entropy in thermodynamics, where it signifies disorder: if multiple classes are mixed together in a node, that node is disordered.
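For a node whose samples belong to classes with proportions $p_i$, the two impurity measures follow the standard definitions, which are what the helper functions below implement:

$$\text{Gini} = 1 - \sum_i p_i^{2}, \qquad \text{Entropy} = -\sum_i p_i \log_2 p_i$$

Both are zero for a pure node and largest when the classes are evenly mixed.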

Information gain is the entropy of the parent node minus the sum of the weighted entropies of the child nodes. The weight of a child node is the number of samples in that node divided by the total number of samples across all child nodes. Information gain is calculated the same way with the Gini score.
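Written out, with $n_k$ samples in child $k$ and $n$ samples in the parent:

$$\text{Gain} = \text{Impurity}(\text{parent}) - \sum_k \frac{n_k}{n}\,\text{Impurity}(\text{child}_k)$$

where the impurity is either the Gini score or the entropy. This is exactly what the `infogain` function below computes.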
# Let's create functions to calculate gini and entropy scores
# Imports
from math import log

# calcpercent calculates the number of samples and percentages of each class
def calcpercent(node):
    nodesum = sum(node.values())
    percents = {c: v / nodesum for c, v in node.items()}
    return nodesum, percents

# giniscore calculates the score for a node using above formula
def giniscore(node):
    nodesum, percents = calcpercent(node)
    score = round(1 - sum([i**2 for i in percents.values()]), 3)
    print('Gini Score for node {} : {}'.format(node, score))
    return score

# entropyscore calculates the score for a node using above formula
def entropyscore(node):
    nodesum, percents = calcpercent(node)
    score = round(sum([-i * log(i, 2) for i in percents.values()]), 3)
    print('Entropy Score for node {} : {}'.format(node, score))
    return score

# infogain calculates the information gain given parent node, child nodes and criterion
def infogain(parent, children, criterion):
    score = {'gini': giniscore, 'entropy': entropyscore}
    metric = score[criterion]
    parentscore = metric(parent)
    parentsum = sum(parent.values())
    weighted_child_score = sum([metric(i) * sum(i.values()) / parentsum for i in children])
    gain = round((parentscore - weighted_child_score), 2)
    print('Information gain: {}'.format(gain))
    return gain
# Parent node
parent_node = {'Red': 3, 'Blue':4, 'Green':5 }
# Let's say after the split the child nodes are
node1 = {'Red':3, 'Blue':4}
node2 = {'Green':5}
gini_gain = infogain(parent_node, [node1, node2], 'gini')
Gini Score for node {'Red': 3, 'Green': 5, 'Blue': 4} : 0.653
Gini Score for node {'Red': 3, 'Blue': 4} : 0.49
Gini Score for node {'Green': 5} : 0.0
Information gain: 0.37
entropy_gain = infogain(parent_node, [node1, node2], 'entropy')
Entropy Score for node {'Red': 3, 'Green': 5, 'Blue': 4} : 1.555
Entropy Score for node {'Red': 3, 'Blue': 4} : 0.985
Entropy Score for node {'Green': 5} : 0.0
Information gain: 0.98
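As a quick sanity check, the parent node's Gini score can be verified by hand: with 3 red, 4 blue and 5 green samples out of 12,

$$1 - \left(\left(\tfrac{3}{12}\right)^{2} + \left(\tfrac{4}{12}\right)^{2} + \left(\tfrac{5}{12}\right)^{2}\right) \approx 1 - 0.347 = 0.653$$

which matches the printed output above.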
# Performance-wise there is not much difference between entropy and gini scores. Let's verify this on a real dataset with scikit-learn.
# Imports
import numpy as np
import pandas as pd
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
# Load Dataset
# Dataset can be found at: https://www.kaggle.com/uciml/sms-spam-collection-dataset
df = pd.read_csv('spam.csv', encoding='latin-1')
# Keep only necessary columns
df = df[['v2', 'v1']]
# Rename columns
df.columns = ['SMS', 'Type']
# Let's view top 5 rows of the loaded dataset
df.head()
|   | SMS                                             | Type |
|---|-------------------------------------------------|------|
| 0 | Go until jurong point, crazy.. Available only … | ham  |
| 1 | Ok lar… Joking wif u oni…                       | ham  |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina… | spam |
| 3 | U dun say so early hor… U c already then say…   | ham  |
| 4 | Nah I don’t think he goes to usf, he lives aro… | ham  |
# Let's process the text data
# Instantiate count vectorizer
countvec = CountVectorizer(ngram_range=(1,4), stop_words='english', strip_accents='unicode', max_features=1000)
cdf = countvec.fit_transform(df.SMS)
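If you want a quick look at what the vectorizer produced, an optional inspection step like the one below works; it assumes the cell above has been run and a recent scikit-learn version where `get_feature_names_out` is available (older versions expose `get_feature_names` instead):
# Optional: inspect the document-term matrix and a few learned n-grams
print(cdf.shape)                              # (number of SMS messages, 1000 features)
print(countvec.get_feature_names_out()[:10])  # first few n-grams in the vocabulary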
# Instantiate algos
dt_gini = DecisionTreeClassifier(criterion='gini')
dt_entropy = DecisionTreeClassifier(criterion='entropy')
# ests = {'Logistic Regression':lr,'Decision tree': dt,'Random forest': rf, 'Naive Bayes': mnb}
ests = {'Decision tree with gini index': dt_gini, 'Decision tree with entropy': dt_entropy}
for est in ests:
    print("{} score: {}%".format(est, round(cross_val_score(ests[est], X=cdf.toarray(), y=df.Type.values, cv=5).mean()*100, 3)))
    print("\n")
Decision tree with gini index score: 96.572%
Decision tree with entropy score: 96.464%
As we can see, there is not much performance difference between using the Gini index and entropy as the splitting criterion. In practice the Gini index is marginally cheaper to compute because it avoids the logarithm, but the resulting trees are usually very similar, so either criterion can be used.
That's how we learned about Gini index vs Entropy
That's all for this mini tutorial. To sum it up, we learned about Gini index vs entropy, computed both impurity scores from scratch, and compared them as splitting criteria on a real dataset.
Hope it was easy, cool and simple to follow. Now it's on you.
Related Resources:
- Build Decision Tree classification model in Python
- Reset Index in Pandas Dataframe | Pandas tutorial
- Pandas Series Index | Pandas tutorial
- Pandas groupby tutorial | Understand Group by
- Visualize Decision tree | Machine Learning
- The Ultimate Guide to Classes in Python
- Python Lists | The No 1 Ultimate Guide
- Pandas series from Dictionary | Pandas Tutorial
- One Hot Encoding | What is one hot encoding?
- Not Operation in Pandas Conditions | Pandas tutorial