The Gini index and entropy are two criteria for calculating information gain. Decision tree algorithms use information gain to decide how to split a node.

Both Gini and entropy are measures of the impurity of a node. A node containing multiple classes is impure, whereas a node containing only one class is pure. Entropy in statistics is analogous to entropy in thermodynamics, where it signifies disorder: if there are multiple classes in a node, there is disorder in that node.
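The scoring functions later in this tutorial compute the two measures as written out here. The notation is an assumption introduced for illustration: $p_i$ is the fraction of samples in the node that belong to class $i$, and $k$ is the number of classes.

```latex
\[
\text{Gini} = 1 - \sum_{i=1}^{k} p_i^{2},
\qquad
\text{Entropy} = -\sum_{i=1}^{k} p_i \log_2 p_i
\]
```

Both expressions are 0 for a pure node and grow as the classes become more evenly mixed.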
[Figure: Gini index vs entropy]

Information gain is the entropy of the parent node minus the weighted sum of the entropies of the child nodes. The weight of a child node is the number of samples in that node divided by the total number of samples across all child nodes. Information gain is calculated the same way with the Gini score.
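Written as a formula, this looks roughly as follows. The symbols are assumptions chosen for illustration: $I$ stands for the chosen impurity measure (entropy or Gini), $n_j$ for the number of samples in child $j$, and $n$ for the total number of samples across all children.

```latex
\[
\text{Gain} = I(\text{parent}) - \sum_{j} \frac{n_j}{n}\, I(\text{child}_j)
\]
```

For instance, a parent with 10 samples split into children of 4 and 6 samples gives weights of 0.4 and 0.6.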

# Let's create functions to calculate gini and entropy scores

# Imports
from math import log

# calcpercent returns the total number of samples in a node and the fraction of samples in each class
def calcpercent(node):
    nodesum = sum(node.values())
    percents = {c: v / nodesum for c, v in node.items()}
    return nodesum, percents

# giniscore calculates the Gini score for a node: 1 - sum of squared class fractions
def giniscore(node):
    nodesum, percents = calcpercent(node)
    score = round(1 - sum(i**2 for i in percents.values()), 3)
    print('Gini Score for node {} : {}'.format(node, score))
    return score

# entropyscore calculates the entropy score for a node: sum of -p * log2(p) over the class fractions
# Classes with zero samples are skipped, because log(0) is undefined
def entropyscore(node):
    nodesum, percents = calcpercent(node)
    score = round(sum(-i * log(i, 2) for i in percents.values() if i > 0), 3)
    print('Entropy Score for node {} : {}'.format(node, score))
    return score

# infogain calculates the information gain given parent node, child nodes and criterion
def infogain(parent, children, criterion):
    score = {'gini': giniscore, 'entropy': entropyscore}
    metric = score[criterion]
    parentscore = metric(parent)
    parentsum = sum(parent.values())
    weighted_child_score = sum([metric(i)*sum(i.values())/parentsum  for i in children])
    gain = round((parentscore - weighted_child_score),2)
    print('Information gain: {}'.format(gain))
    return gain

# Parent node
parent_node = {'Red': 3, 'Blue': 4, 'Green': 5}

# Let's say after the split nodes are 
node1 = {'Red':3, 'Blue':4}
node2 = {'Green':5}

gini_gain = infogain(parent_node, [node1, node2], 'gini')

Output:
Gini Score for node {'Red': 3, 'Green': 5, 'Blue': 4} : 0.653
Gini Score for node {'Red': 3, 'Blue': 4} : 0.49
Gini Score for node {'Green': 5} : 0.0
Information gain: 0.37

entropy_gain = infogain(parent_node, [node1, node2], 'entropy')

Output:
Entropy Score for node {'Red': 3, 'Green': 5, 'Blue': 4} : 1.555
Entropy Score for node {'Red': 3, 'Blue': 4} : 0.985
Entropy Score for node {'Green': 5} : 0.0
Information gain: 0.98
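
Since a decision tree uses information gain to choose a split, it may help to see two candidate splits of the same parent compared side by side. The following is a minimal sketch, not the tutorial's own code: the helper names `entropy` and `gain`, and the second candidate split (`split_b`), are hypothetical and introduced only for illustration. The first candidate is the split used above.

```python
from math import log2

def entropy(node):
    """Entropy (in bits) of a node given as a {class: count} dict."""
    total = sum(node.values())
    # Skip empty classes, since log2(0) is undefined
    probs = [count / total for count in node.values() if count > 0]
    return sum(-p * log2(p) for p in probs)

def gain(parent, children):
    """Parent entropy minus the sample-weighted entropy of the children."""
    total = sum(parent.values())
    weighted = sum(entropy(c) * sum(c.values()) / total for c in children)
    return entropy(parent) - weighted

parent = {'Red': 3, 'Blue': 4, 'Green': 5}

# Two candidate splits of the same parent; split_b is a made-up alternative
candidate_splits = {
    'split_a': [{'Red': 3, 'Blue': 4}, {'Green': 5}],
    'split_b': [{'Red': 2, 'Blue': 2, 'Green': 2},
                {'Red': 1, 'Blue': 2, 'Green': 3}],
}

gains = {name: gain(parent, children)
         for name, children in candidate_splits.items()}
best_split = max(gains, key=gains.get)

for name, value in gains.items():
    print('{}: information gain = {:.3f}'.format(name, value))
print('Best split:', best_split)
```

Here `split_a` isolates one class completely, so it yields the larger gain and would be the split preferred under this criterion; `split_b` leaves both children almost as mixed as the parent.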

# Performance-wise, there is not much difference between the entropy and gini criteria.

# Imports
import numpy as np
import pandas as pd
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Load Dataset 
# Dataset can be found at: https://www.kaggle.com/uciml/sms-spam-collection-dataset

df = pd.read_csv('spam.csv', encoding = 'latin-1' )

# Keep only necessary columns
df = df[['v2', 'v1']]

# Rename columns
df.columns = ['SMS', 'Type']

# Let's view the top 5 rows of the loaded dataset
df.head()

Output:
   SMS                                              Type
0  Go until jurong point, crazy.. Available only …  ham
1  Ok lar… Joking wif u oni…                        ham
2  Free entry in 2 a wkly comp to win FA Cup fina…  spam
3  U dun say so early hor… U c already then say…    ham
4  Nah I don’t think he goes to usf, he lives aro…  ham

# Let's process the text data
# Instantiate count vectorizer
countvec = CountVectorizer(ngram_range=(1,4), stop_words='english',  strip_accents='unicode', max_features=1000)
cdf = countvec.fit_transform(df.SMS)

# Instantiate algos
dt_gini = DecisionTreeClassifier(criterion='gini')
dt_entropy = DecisionTreeClassifier(criterion='entropy')

# ests = {'Logistic Regression':lr,'Decision tree': dt,'Random forest': rf, 'Naive Bayes': mnb}
ests = {'Decision tree with gini index': dt_gini, 'Decision tree with entropy': dt_entropy}

for name, est in ests.items():
    score = cross_val_score(est, X=cdf.toarray(), y=df.Type.values, cv=5).mean()
    print("{} score: {}%".format(name, round(score * 100, 3)))
    print("\n")

Output:
Decision tree with gini index score: 96.572%

Decision tree with entropy score: 96.464%


As we can see, there is not much difference in performance between the Gini index and entropy as the splitting criterion. Either one can therefore be used.
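
One way to get a feel for why the two criteria tend to behave similarly is to tabulate both for a two-class node as the class mix changes. This is an illustrative sketch only: the helper names `gini_binary` and `entropy_binary` are made up for this example, and it does not use the SMS dataset.

```python
from math import log2

def gini_binary(p):
    """Gini score of a two-class node where one class has fraction p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy_binary(p):
    """Entropy (in bits) of a two-class node; empty classes contribute 0."""
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)

# Fractions of the first class from 0.0 to 1.0 in steps of 0.1
fractions = [i / 10 for i in range(11)]
rows = [(p, gini_binary(p), entropy_binary(p)) for p in fractions]

for p, g, e in rows:
    print('p = {:.1f}   gini = {:.3f}   entropy = {:.3f}'.format(p, g, e))

# Class fraction at which each measure is largest
gini_peak = max(rows, key=lambda r: r[1])[0]
entropy_peak = max(rows, key=lambda r: r[2])[0]
print('Gini peaks at p =', gini_peak, '| entropy peaks at p =', entropy_peak)
```

Both measures are 0 for a pure node and reach their maximum at an even 50/50 mix, so they usually rank candidate splits in a similar order even though their absolute values differ.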


That’s all for this mini tutorial. To sum it up, we learned about Gini index vs Entropy.

Hope it was easy, cool and simple to follow. Now it’s on you.
