ROC curve and Area under the Curve (AUC)

  • ROC – the Receiver Operating Characteristic curve plots the true positive rate against the false positive rate for various threshold values. 

  • The ROC curve tells us how good or bad a model's performance is: the larger the area under the ROC curve, the better the model. Depending on the machine learning problem, we may prefer to minimize one of two errors, namely false positives or false negatives. The ROC curve lets us choose a threshold that trades off these errors. But it does not improve the model itself; it only adjusts the threshold. For model improvement we must try other techniques like spot checking, hyperparameter tuning, feature engineering, etc. 
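The threshold selection mentioned above can be sketched with `sklearn.metrics.roc_curve`. This is a minimal example on hypothetical scores (not the spam data used later): one common rule of thumb, Youden's J statistic, picks the threshold that maximises TPR − FPR.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted scores, just to illustrate the idea
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.8, 0.9, 0.6, 0.7])

# roc_curve returns one (FPR, TPR) pair per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: choose the threshold maximising TPR - FPR
best = np.argmax(tpr - fpr)
print('Best threshold:', thresholds[best])
```

A different cost trade-off (e.g. false negatives being much more expensive than false positives) would call for a different selection rule; Youden's J simply weights both errors equally.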

Confusion matrix for our classifier:

True Positives [TP] = 221

True Negatives [TN] = 1414

False Positives [FP] = 20

False Negatives [FN] = 17

Recall or True Positive rate [TPR] = TP/(TP + FN)

False Positive Rate = FP/(FP + TN)
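Plugging the confusion-matrix counts above into these two formulas:

```python
# Confusion-matrix counts from above
TP, TN, FP, FN = 221, 1414, 20, 17

tpr = TP / (TP + FN)   # recall / true positive rate = 221/238
fpr = FP / (FP + TN)   # false positive rate = 20/1434

print(f'TPR = {tpr:.3f}')  # 0.929
print(f'FPR = {fpr:.3f}')  # 0.014
```

Each threshold value yields one such (FPR, TPR) pair; the ROC curve is simply these pairs traced out over all thresholds.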

# Imports 
import numpy as np
import pandas as pd
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt  
import warnings
warnings.filterwarnings("ignore")  # silence matplotlib deprecation warnings


# Load Dataset 
# Dataset can be found at: https://www.kaggle.com/uciml/sms-spam-collection-dataset

df = pd.read_csv('spam.csv', encoding='latin-1')

# Keep only necessary columns
df = df[['v2', 'v1']]

# Rename columns
df.columns = ['SMS', 'Type']

# Let's process the text data 
# Instantiate count vectorizer 
countvec = CountVectorizer(ngram_range=(1,4), 
                           stop_words='english',  
                           strip_accents='unicode', 
                           max_features=1000)

X = df.SMS.values
y = df.Type.values

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, 
                                                    random_state = 0)

# Instantiate classifier
mnb = MultinomialNB()

# Create bag of words
X_train = countvec.fit_transform(X_train)
X_test = countvec.transform(X_test)

# Train the classifier/Fit the model
mnb.fit(X_train, y_train)

# Make predictions
y_pred = mnb.predict(X_test)
y_pred_prob = mnb.predict_proba(X_test)
spam_probs = y_pred_prob[:,1]

# Compute (FPR, TPR) pairs across thresholds for the ROC curve
fpr, tpr, threshold = roc_curve(y_true=y_test, y_score=spam_probs, pos_label='spam')
# Plot
plt.plot(fpr,tpr, color='red')
plt.title('Receiver Operating Characteristic Curve', size=20)
plt.plot([0, 1], [0, 1], color='green', linestyle=':')
plt.xlabel('False Positive Rate', size=15)
plt.ylabel('True Positive Rate', size=15)
plt.show()

The dotted green line denotes the performance of a random-guess model.

The larger the area under the curve, the better the classifier.

Clearly, our classifier beats the random-guess model.

# Encode labels as 1 for 'spam' and 0 for 'ham'
y_true = np.array(list(map(lambda x: 1 if x == 'spam' else 0, y_test)))
# Let's see the area under the curve (AUC) for our model
auc = roc_auc_score(y_true=y_true, y_score=spam_probs)
print('Area under curve is {}'.format(round(auc, 2)))
 
Area under curve is 0.99

There are other metrics for evaluating model performance, such as precision, recall, and accuracy. Check them out. 
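Those metrics are available from `sklearn.metrics` with the same calling style used above. A minimal sketch on hypothetical labels (in the tutorial above you would pass `y_test` and `y_pred` with `pos_label='spam'`):

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score

# Hypothetical true labels and predictions, just to show the API
y_true = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']
y_pred = ['spam', 'ham', 'ham', 'ham', 'spam', 'spam']

precision = precision_score(y_true, y_pred, pos_label='spam')  # TP/(TP+FP)
recall = recall_score(y_true, y_pred, pos_label='spam')        # TP/(TP+FN)
accuracy = accuracy_score(y_true, y_pred)                      # correct/total

print(precision, recall, accuracy)
```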

That’s all for this mini tutorial. To sum it up, we learned how to build an ROC curve and compute the area under the curve (AUC).

Hope it was easy, cool and simple to follow. Now it’s on you.

It's Your Turn Now!!!

  • Feel free to ask any doubts or questions in the comments.
  • Moreover, if you have a cooler approach to the above operations, please share the code in the comments.
  • In addition to the above, if you need any help in your Python or Machine learning journey, comment box is all yours.
  • Further, you can also send us an email.
  • For more cool stuff, follow thatascience on social media Twitter, Facebook, Linkedin, Instagram.
