# ROC curve and Area under the Curve (AUC)

ROC (receiver operating characteristic) curve: a plot of the true positive rate against the false positive rate at various threshold values.

The ROC curve tells us how well a model performs: the larger the area under the ROC curve, the better the model. Depending on the machine learning problem, we may prefer to minimize one of the two error types, false positives or false negatives. The ROC curve lets us choose a threshold that trades one of these errors against the other. This does not improve the model itself; it only moves the decision threshold. To improve the model we must try other techniques such as spot checking, hyperparameter tuning, and feature engineering.

True Positives [TP] = 221

True Negatives [TN] = 1414

False Positives [FP] = 20

False Negatives [FN] = 17

Recall or True Positive rate [TPR] = TP/(TP + FN)

False Positive Rate [FPR] = FP/(FP + TN)
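As a quick worked example, the two formulas can be applied to the counts listed above. This is only a sketch in plain Python; the variable names are chosen here for illustration and are not part of the tutorial's code.

```python
# Confusion-matrix counts listed above
tp, tn, fp, fn = 221, 1414, 20, 17

# Recall / true positive rate: share of actual positives that were caught
tpr_value = tp / (tp + fn)

# False positive rate: share of actual negatives that were wrongly flagged
fpr_value = fp / (fp + tn)

print(f"TPR = {tpr_value:.3f}")  # TPR = 0.929
print(f"FPR = {fpr_value:.3f}")  # FPR = 0.014
```

Each threshold produces one such (FPR, TPR) pair, and the ROC curve is the set of these pairs across all thresholds.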

# Imports
import numpy as np
import pandas as pd
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import matplotlib
# Silence Matplotlib deprecation warnings
warnings.filterwarnings("ignore", category=matplotlib.MatplotlibDeprecationWarning)

# Dataset can be found at: https://www.kaggle.com/uciml/sms-spam-collection-dataset

df = pd.read_csv('spam.csv', encoding='latin-1')

# Keep only necessary columns
df = df[['v2', 'v1']]

# Rename columns
df.columns = ['SMS', 'Type']

# Let's process the text data
# Instantiate count vectorizer
countvec = CountVectorizer(ngram_range=(1, 4),
                           stop_words='english',
                           strip_accents='unicode',
                           max_features=1000)

X = df.SMS.values
y = df.Type.values

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0)

# Instantiate classifier
mnb = MultinomialNB()

# Create bag of words
X_train = countvec.fit_transform(X_train)
X_test = countvec.transform(X_test)

# Train the classifier/Fit the model
mnb.fit(X_train, y_train)

# Make predictions
y_pred = mnb.predict(X_test)
y_pred_prob = mnb.predict_proba(X_test)
spam_probs = y_pred_prob[:,1]

# Compute the ROC curve points (FPR and TPR at each threshold)
fpr, tpr, threshold = roc_curve(y_true=y_test, y_score=spam_probs, pos_label='spam')

# Plot
plt.plot(fpr,tpr, color='red')
plt.plot([0, 1], [0, 1], color='green', linestyle=':')
plt.xlabel('False Positive Rate', size=15)
plt.ylabel('True Positive Rate', size=15)
plt.show()

The dotted green line denotes the performance of a random-guess model.

The larger the area under the curve, the better the classifier.

Clearly, our classifier beats the random-guess model.
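To make the idea of "choosing a threshold" concrete, here is a minimal sketch of how points on a ROC curve arise from a threshold sweep and how one might pick a threshold under a cap on the false positive rate. The labels, scores, the 20% cap, and all names below are made up for illustration; this is a simplified stand-in, not scikit-learn's `roc_curve` implementation.

```python
# Toy data, made up for illustration: 1 = spam, 0 = ham
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]


def rates_at(threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted as spam."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


# Each distinct score is a candidate threshold and yields one ROC point
roc_points = [(t, *rates_at(t)) for t in sorted(set(scores), reverse=True)]

# Hypothetical requirement: flag at most 20% of ham as spam
max_fpr = 0.2
allowed = [p for p in roc_points if p[1] <= max_fpr]

# Among the allowed thresholds, keep the one that catches the most spam
chosen_threshold, chosen_fpr, chosen_tpr = max(allowed, key=lambda p: p[2])
print(f"threshold={chosen_threshold}, FPR={chosen_fpr:.3f}, TPR={chosen_tpr:.3f}")
# threshold=0.6, FPR=0.167, TPR=0.750
```

With the tutorial's own arrays, the same kind of selection could be applied to `fpr`, `tpr` and `threshold`, and predictions at the chosen cut-off would then be `spam_probs >= chosen_threshold` instead of the default `mnb.predict` output.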

# Encode labels as 1 (spam) and 0 (ham)
y_true = np.where(y_test == 'spam', 1, 0)

# Compute the area under the ROC curve (AUC) for our model
auc = roc_auc_score(y_true=y_true, y_score=spam_probs)
print('Area under curve is {}'.format(round(auc, 2)))


Area under curve is 0.99
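One way to read an AUC value: it is the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below checks this interpretation on a tiny toy data set; the labels and scores are invented for illustration and are not the tutorial's SMS data.

```python
# Toy data, made up for illustration: 1 = spam, 0 = ham
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

pos = [s for y, s in zip(labels, scores) if y == 1]
neg = [s for y, s in zip(labels, scores) if y == 0]

# Count (spam, ham) pairs where spam gets the higher score; ties count as half
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
pairwise_auc = wins / (len(pos) * len(neg))

print(round(pairwise_auc, 3))  # 0.833
```

Calling `roc_auc_score` on the same toy labels and scores should give the same number. Read this way, an AUC of 0.99 means the classifier ranks a spam message above a ham message in about 99% of such pairs.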


Other metrics for evaluating model performance include precision, recall, and accuracy. They are worth checking as well.
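As a rough illustration, these three metrics can be computed by hand from the confusion-matrix counts listed earlier in this tutorial. The variable names are chosen here for illustration only.

```python
# Confusion-matrix counts listed earlier in this tutorial
tp, tn, fp, fn = 221, 1414, 20, 17

# Precision: of the messages flagged as positive, how many really were positive
precision = tp / (tp + fp)

# Recall: of the actual positives, how many were flagged (same as TPR)
recall = tp / (tp + fn)

# Accuracy: share of all messages classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision = {precision:.3f}")  # precision = 0.917
print(f"recall    = {recall:.3f}")     # recall    = 0.929
print(f"accuracy  = {accuracy:.3f}")   # accuracy  = 0.978
```

With the tutorial's variables, `precision_score(y_test, y_pred, pos_label='spam')` and `recall_score(y_test, y_pred, pos_label='spam')` compute the first two directly from the predictions.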

## That's how we build a ROC curve and compute the Area under the Curve (AUC)

That’s all for this mini tutorial. To sum up, we learned how to build a ROC curve and compute the area under the curve (AUC).

Hope it was easy, cool and simple to follow. Now it’s over to you.