# Gini index vs Entropy

The Gini index and entropy are two criteria for calculating information gain. Decision tree algorithms use information gain to decide how to split a node.

Both Gini and entropy are measures of the impurity of a node. A node containing multiple classes is impure, whereas a node containing only one class is pure. Entropy in information theory is analogous to entropy in thermodynamics, where it signifies disorder: if there are multiple classes in a node, there is disorder in that node.
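To make "pure" and "impure" concrete, here is a minimal sketch for a two-class node, where `p` is the proportion of one class. The helper names `binary_gini` and `binary_entropy` are illustrative only and are not used elsewhere in this tutorial.

```python
from math import log2

def binary_gini(p):
    """Gini impurity of a two-class node where one class has proportion p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def binary_entropy(p):
    """Entropy (in bits) of a two-class node; a class with zero samples contributes 0."""
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)

# p = 0.0 or 1.0 is a pure node; p = 0.5 is the most impure two-class node
for p in (0.0, 0.1, 0.3, 0.5, 1.0):
    print(f"p={p:.1f}  gini={binary_gini(p):.3f}  entropy={binary_entropy(p):.3f}")
```

Both measures are zero for a pure node and reach their maximum when the two classes are equally mixed (0.5 for Gini, 1 bit for entropy).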

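Information gain is the entropy of the parent node minus the weighted sum of the entropies of the child nodes.
The weight of a child node is the number of samples in that node divided by the total number of samples across all child nodes (which equals the number of samples in the parent). Information gain is calculated the same way with the Gini score, substituting Gini impurity for entropy.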
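For reference, the standard definitions can be written as follows. The notation is an assumption made here for illustration: $p_i$ is the proportion of class $i$ in a node, $I(\cdot)$ is either impurity measure, $n_j$ is the number of samples in child $j$, and $n$ is the number of samples in the parent.

$$
\text{Gini} = 1 - \sum_i p_i^2
\qquad
\text{Entropy} = -\sum_i p_i \log_2 p_i
$$

$$
\text{Gain} = I(\text{parent}) - \sum_j \frac{n_j}{n}\, I(\text{child}_j)
$$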

```python
# Let's create functions to calculate gini and entropy scores

# Imports
from math import log


# calcpercent returns the total number of samples in a node
# and the proportion of each class
def calcpercent(node):
    nodesum = sum(node.values())
    percents = {c: v / nodesum for c, v in node.items()}
    return nodesum, percents


# giniscore calculates the Gini impurity of a node: 1 - sum(p_i^2)
def giniscore(node):
    nodesum, percents = calcpercent(node)
    score = round(1 - sum(p ** 2 for p in percents.values()), 3)
    print('Gini Score for node {} : {}'.format(node, score))
    return score


# entropyscore calculates the entropy of a node: -sum(p_i * log2(p_i))
# Classes with zero samples are skipped, since log(0) is undefined
def entropyscore(node):
    nodesum, percents = calcpercent(node)
    score = round(sum(-p * log(p, 2) for p in percents.values() if p > 0), 3)
    print('Entropy Score for node {} : {}'.format(node, score))
    return score


# infogain calculates the information gain given parent node, child nodes and criterion
def infogain(parent, children, criterion):
    score = {'gini': giniscore, 'entropy': entropyscore}
    metric = score[criterion]
    parentscore = metric(parent)
    parentsum = sum(parent.values())
    # Each child's score is weighted by its share of the parent's samples
    weighted_child_score = sum(metric(c) * sum(c.values()) / parentsum for c in children)
    gain = round(parentscore - weighted_child_score, 2)
    print('Information gain: {}'.format(gain))
    return gain
```
```python
# Parent node
parent_node = {'Red': 3, 'Blue': 4, 'Green': 5}

# Let's say after the split nodes are
node1 = {'Red': 3, 'Blue': 4}
node2 = {'Green': 5}
```
```python
gini_gain = infogain(parent_node, [node1, node2], 'gini')
```
```
Gini Score for node {'Red': 3, 'Green': 5, 'Blue': 4} : 0.653
Gini Score for node {'Red': 3, 'Blue': 4} : 0.49
Gini Score for node {'Green': 5} : 0.0
Information gain: 0.37
```
```python
entropy_gain = infogain(parent_node, [node1, node2], 'entropy')
```
```
Entropy Score for node {'Red': 3, 'Green': 5, 'Blue': 4} : 1.555
Entropy Score for node {'Red': 3, 'Blue': 4} : 0.985
Entropy Score for node {'Green': 5} : 0.0
Information gain: 0.98
```
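As a check on the two outputs above, the gains can be worked out by hand from the printed scores. Node 1 holds 7 of the 12 samples and node 2 holds the remaining 5:

$$
\text{Gain}_{\text{gini}} = 0.653 - \left(\tfrac{7}{12}\times 0.49 + \tfrac{5}{12}\times 0\right) \approx 0.653 - 0.286 \approx 0.37
$$

$$
\text{Gain}_{\text{entropy}} = 1.555 - \left(\tfrac{7}{12}\times 0.985 + \tfrac{5}{12}\times 0\right) \approx 1.555 - 0.575 \approx 0.98
$$

The two gains are on different scales, because entropy for $k$ classes can reach $\log_2 k$ while Gini impurity never exceeds $1 - 1/k$. A gain of 0.98 under entropy is therefore not directly comparable to a gain of 0.37 under Gini; each is only meaningful when comparing candidate splits under the same criterion.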
Performance-wise, there is not much difference between using entropy and Gini scores.
```python
# Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Dataset can be found at: https://www.kaggle.com/uciml/sms-spam-collection-dataset

df = pd.read_csv('spam.csv', encoding='latin-1')

# Keep only necessary columns
df = df[['v2', 'v1']]

# Rename columns
df.columns = ['SMS', 'Type']

# Let's view top 5 rows of the loaded dataset
df.head()
```
| | SMS | Type |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only … | ham |
| 1 | Ok lar… Joking wif u oni… | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina… | spam |
| 3 | U dun say so early hor… U c already then say… | ham |
| 4 | Nah I don’t think he goes to usf, he lives aro… | ham |
```python
# Let's process the text data
# Instantiate count vectorizer
countvec = CountVectorizer(ngram_range=(1, 4), stop_words='english', strip_accents='unicode', max_features=1000)
cdf = countvec.fit_transform(df.SMS)

# Instantiate algos
dt_gini = DecisionTreeClassifier(criterion='gini')
dt_entropy = DecisionTreeClassifier(criterion='entropy')

ests = {'Decision tree with gini index': dt_gini, 'Decision tree with entropy': dt_entropy}

# Report the mean 5-fold cross-validation accuracy for each criterion
for name, est in ests.items():
    score = cross_val_score(est, X=cdf.toarray(), y=df.Type.values, cv=5).mean()
    print("{} score: {}%".format(name, round(score * 100, 3)))
    print("\n")
```
```
Decision tree with gini index score: 96.572%

Decision tree with entropy score: 96.464%

```

As we can see, there is not much difference in performance between the Gini index and entropy as the splitting criterion on this dataset. Either one can therefore be used as the splitting criterion.
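One way to build intuition for this result is to look at a single split. The sketch below searches for the best threshold on one numeric feature under each criterion. It is an illustration only, not how scikit-learn is implemented, and the feature (`exclamations`), the labels, and the helper `best_threshold` are hypothetical.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, impurity):
    """Return (threshold, gain) for the split `x <= threshold` with the highest gain."""
    n = len(labels)
    parent = impurity(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = parent - (len(left) / n) * impurity(left) - (len(right) / n) * impurity(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical toy feature: number of exclamation marks in each message
exclamations = [0, 0, 1, 1, 2, 3, 4, 5]
labels = ['ham', 'ham', 'ham', 'spam', 'ham', 'spam', 'spam', 'spam']

for name, fn in (('gini', gini), ('entropy', entropy)):
    threshold, gain = best_threshold(exclamations, labels, fn)
    print(f"{name}: split at <= {threshold}, gain = {gain:.3f}")
```

On this toy data both criteria pick the same threshold (2.5), even though the gain values differ. That agreement is not guaranteed in general, but when the two criteria rank candidate splits similarly, the resulting trees and their accuracy tend to be similar too.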

## That's how we learned about Gini index vs Entropy

That’s all for this mini tutorial. To sum it up, we learned about Gini index vs Entropy.

Hope it was easy, cool and simple to follow. Now it’s on you.
