Spot Checking ML algorithms

  • Spot checking means quickly trying several different algorithms on the same problem to see which ones are worth pursuing.

There is no ‘one algorithm fits all’ in machine learning. An algorithm that works well on one problem may perform poorly on another, so it is worth checking a few before settling on one.

Let’s do it in a simple way.

 

# Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Load Dataset 
# Dataset can be found at: https://www.kaggle.com/uciml/sms-spam-collection-dataset

df = pd.read_csv('spam.csv', encoding = 'latin-1' )

# Keep only necessary columns
df = df[['v2', 'v1']]

# Rename columns
df.columns = ['SMS', 'Type']

# Let's view top 5 rows of the loaded dataset
df.head()
   SMS                                              Type
0  Go until jurong point, crazy.. Available only …  ham
1  Ok lar… Joking wif u oni…                        ham
2  Free entry in 2 a wkly comp to win FA Cup fina…  spam
3  U dun say so early hor… U c already then say…    ham
4  Nah I don’t think he goes to usf, he lives aro…  ham
# Let's see how many spams and hams are there
df.Type.value_counts()
ham     4825
spam     747
Name: Type, dtype: int64
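The counts above show that the two classes are imbalanced, so it helps to know what accuracy a trivial model would reach before reading any scores. The following is a minimal sketch that uses only the two counts printed above; the variable names are illustrative and not part of the original tutorial.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())

# A "model" that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting '{majority_label}' scores {baseline_accuracy:.2%} accuracy")
# prints: Always predicting 'ham' scores 86.59% accuracy
```

Any accuracy figure for this dataset is best read against that baseline rather than against 0%.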
# Let's process the text data 
# Instantiate count vectorizer
countvec = CountVectorizer(ngram_range=(1, 4), stop_words='english', strip_accents='unicode', max_features=1000)
# Alternative: TF-IDF weighting (needs `from sklearn.feature_extraction.text import TfidfVectorizer`)
# countvec = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', strip_accents='unicode', max_features=100)
cdf = countvec.fit_transform(df.SMS)

# Instantiate algos
lr = LogisticRegression(penalty='l2')
dt = DecisionTreeClassifier(class_weight="balanced")
mnb = MultinomialNB()
rf = RandomForestClassifier(n_jobs=-1)

ests = {'Logistic Regression':lr,'Decision tree': dt,'Random forest': rf, 'Naive Bayes': mnb}

# Mean 5-fold cross-validated accuracy for each estimator
X = cdf.toarray()
y = df.Type.values
for name, est in ests.items():
    scores = cross_val_score(est, X=X, y=y, cv=5)
    print("{} score: {}%".format(name, round(scores.mean() * 100, 3)))
    print("\n")
Naive Bayes score: 98.187%


Decision tree score: 94.311%


Random forest score: 97.469%


Logistic Regression score: 97.864%
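
The scores above are plain accuracy, and the vectorizer was fitted on the full dataset before cross-validation. One possible refinement, sketched here under assumptions, is to wrap the vectorizer and the model in a pipeline so the vocabulary is learned inside each fold, and to report macro-F1, which is less flattering on imbalanced classes. The `spot_check` helper and the tiny corpus are hypothetical and only there to keep the example self-contained; they are not the tutorial's data or method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (hypothetical; stands in for the SMS column)
texts = [
    "win a free prize now", "free entry to win cash", "claim your free prize",
    "urgent win cash now", "free cash prize claim", "win free entry today",
    "see you at lunch", "ok call me later", "are you coming home",
    "thanks see you soon", "call me when home", "lunch later today ok",
]
labels = ["spam"] * 6 + ["ham"] * 6


def spot_check(models, texts, labels, n_splits=3):
    """Return the mean macro-F1 of each model under stratified k-fold CV."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for name, model in models.items():
        # The vectorizer is refitted on the training part of every fold
        pipe = make_pipeline(CountVectorizer(), model)
        scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
        results[name] = scores.mean()
    return results


models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
}
results = spot_check(models, texts, labels)
for name, score in results.items():
    print(f"{name} macro-F1: {score:.3f}")
```

On the real dataset you would pass `df.SMS` and `df.Type` in place of the made-up lists and add the other estimators to `models`.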


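That's how to spot check ML algorithms

That’s all for this mini tutorial. To sum it up, we learned how to spot check several ML algorithms on the same dataset using cross-validation. Here, Naive Bayes scored highest (98.187%), with Logistic Regression close behind (97.864%).

Hope it was easy, cool and simple to follow. Now it’s over to you.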
