In this regression tutorial, we will cover all the in-depth regression concepts and clear up the common doubts that may arise for a reader.

Author: Sahil

Last Modified: 31 Jan, 2022


What is Regression

In this part of the Regression Tutorial, I will explain what Regression actually is.

  • Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable y based on the value of one or more predictor variables x.
  • More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held constant.
  • It predicts continuous/real values such as temperature, age, salary, price, etc.
  • In regression, we fit a line or curve that best matches the given data points; using this fit, the machine learning model can make predictions about the data.
  • For example, given data about the dimensions, location and condition of a house, we can predict the approximate cost of that house using regression.

Why do we use Regression? What is the need of Regression?

In this part of the Regression Tutorial, I will explain why we need Regression.

  • Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables.
  • Regression works on the principles of machine learning, meaning the machine can find patterns in the dataset that might escape the human eye.
  • So, to gain higher accuracy, lower processing times and better efficiency, we make use of regression techniques.
  • Regression also enables us to use multiple techniques depending on the distribution of the dataset.

How to use Regression

Now, in this part of the Regression Tutorial, I will explain how regression is used.

  • The first thing you have to do is split your data into two arrays, X and y.
  • Each element of X is a feature, and the corresponding element of y is the associated target.
  • Once you have that, you will want to use sklearn.linear_model.LinearRegression to do the regression.
  • As with every sklearn model, there are two steps: first, you fit the model to your data.
  • Then, put the features for which you want to predict the target in another array, X_predict, and predict the target values using the predict method.
  • To check whether the model is working in the desired way, you can compare the original target values (y) used for training with the predicted values, and check the RMSE and R2 score.

[How to use Regression Code]

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

#Dataset that contains 3 features and 1 target column
df = pd.read_csv("advertising.csv")

X = df[["TV"]]
y = df["Sales"]

model = LinearRegression().fit(X, y)

#predicting the values for the entire dataset
y_pred = model.predict(X)

#Using RMSE and R2 score for checking the performance
RMSE = np.sqrt(mean_squared_error(y, y_pred))
R2 = r2_score(y, y_pred)
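
The snippet above fits and evaluates the model on the full dataset. In practice, you would usually hold out part of the data for evaluation; below is a minimal sketch using train_test_split (the split ratio and random_state here are arbitrary choices):

from sklearn.model_selection import train_test_split

#Holding out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

#Evaluating on data the model has not seen during training
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))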
 

Simple and Multivariate Regression

In this part of the Regression Tutorial, I will explain the major differences between simple and multivariate regression.

  • Simple linear regression has only one x and one y variable.
  • Multiple linear regression has one y and two or more x variables.
  • For instance, when we predict rent based on the age of a building alone, that is simple linear regression.
  • When we predict rent based on the dimensions and age of the building, that is an example of multiple linear regression.
  • Multiple regression is based on the assumption that there is a linear relationship between both the dependent and independent variables.
  • It also assumes no major correlation between the independent variables.

[Simple and Multivariate Regression Code]

#For Simple Linear Regression
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#Dataset that contains 3 features and 1 target column
df = pd.read_csv("advertising.csv")

X = df[["TV"]]
y = df["Sales"]

model = LinearRegression().fit(X, y)

#predicting the values for the entire dataset
y_pred = model.predict(X)

#Using RMSE method for checking the performance of the model
RMSE = np.sqrt(mean_squared_error(y, y_pred))

#For Multivariate Linear Regression we use

#Allocating 3 feature columns instead of only one
X = df.iloc[:, :3]
y = df.iloc[:, 3]

multi_linear_model = LinearRegression().fit(X, y)

#predicting the values for the entire dataset
y_pred = multi_linear_model.predict(X)

#Using RMSE method for checking the performance of the model
RMSE = np.sqrt(mean_squared_error(y, y_pred))
 

Different Types of Regression

In this next part of the Regression Tutorial, I will explain all the different kinds of Regression methods.

Linear Regression:

 

  • Linear regression is perhaps one of the most well-known and well-understood algorithms in statistics and machine learning.

The Linear Regression model is represented using the following expression:

y = c0 + c1*x

  • In higher dimensions, when we have more than one input (x), the line is called a plane or a hyper-plane. The representation, therefore, is the form of the equation together with the specific values used for the coefficients.
  • In this given equation, c0 and c1 are the coefficients used (see the sketch after this list).
  • Before attempting to fit a linear model to observed data, a developer should first determine whether there is a linear relationship between the variables of interest.
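
For concreteness, here is a minimal sketch showing how to read c0 and c1 off a fitted sklearn model (reusing the advertising data from earlier; sklearn exposes them as intercept_ and coef_):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")
lin_model = LinearRegression().fit(df[["TV"]], df["Sales"])

#c0 and c1 from the expression y = c0 + c1*x
c0 = lin_model.intercept_
c1 = lin_model.coef_[0]
print(c0, c1)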

Polynomial Regression:

 

  • In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x.
  • We use polynomial regression when the relationship between x and y is non-linear, so a straight line does not fit the data well.

The Polynomial Regression model is represented using the following expression:

y = c0 + c1*x + c2*x^2

  • In this given equation, c0, c1 and c2 are the coefficients used.
  • It is also possible to increase the degree of the model to fit it better to the data (see the sketch after this list).
  • But do remember not to increase the degree of the model too much, as it may cause the model to overfit, and it won’t be able to generalize to new data anymore.
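
As a minimal sketch (with made-up toy data that roughly follows a quadratic trend), polynomial regression can be built in sklearn by expanding the features with PolynomialFeatures and then fitting an ordinary LinearRegression:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

#Toy data roughly following a quadratic trend (illustrative values)
X_toy = np.array([[1], [2], [3], [4], [5]])
y_toy = np.array([2.1, 4.8, 10.2, 17.1, 26.0])

#degree=2 adds the x^2 term from y = c0 + c1*x + c2*x^2
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_toy, y_toy)

y_pred_toy = poly_model.predict(X_toy)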

Lasso Regression:

 

  • Lasso regression is a type of linear regression that uses shrinkage.
  • Shrinkage is where data values shrink towards a central point, like the mean. The lasso procedure encourages simple, sparse models.
  • Lasso Regression performs L1 regularization, meaning it adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.
  • You also need to choose a suitable value of alpha when using Lasso Regression (see the sketch after this list).
  • Alpha decides the amount of shrinkage of the features.
  • If the value of alpha is too high, it may even remove relevant features from the dataset.
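
Here is a minimal sketch (reusing the multivariate X and y from earlier; alpha=0.1 is an arbitrary choice that you would normally tune, e.g. with cross-validation):

from sklearn.linear_model import Lasso

#alpha controls the amount of shrinkage (arbitrary value here)
lasso_model = Lasso(alpha=0.1).fit(X, y)

#Coefficients of less important features are shrunk, possibly to exactly zero
print(lasso_model.coef_)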

Ridge Regression:

 

  • Ridge Regression works very similarly to Lasso Regression.
  • The major difference between the two methods is that Ridge Regression uses L2 regularization, meaning it adds the “squared magnitude” of the coefficients as a penalty term to the loss function.
  • The key difference between these techniques is that Lasso shrinks the less important features’ coefficients all the way to zero, thus removing some features altogether, while Ridge only shrinks them towards zero.
  • So, Lasso works well for feature selection in case we have a huge number of features (see the sketch after this list).
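
A minimal sketch mirroring the Lasso example (again, the value of alpha is an arbitrary choice):

from sklearn.linear_model import Ridge

#The L2 penalty shrinks coefficients towards zero but rarely to exactly zero
ridge_model = Ridge(alpha=1.0).fit(X, y)
print(ridge_model.coef_)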

Elasticity in Regression:

  • The concept of elasticity is borrowed from engineering and physics, where it is used to measure a material’s responsiveness to a force, typically a physical force such as a stretching/pulling force.
  • The price elasticity of defections, for instance, is a measure of the relationship between the change in the quantity of defections and a change in service price.
  • A small change in service price that results in a large change in the quantity of defections is said to be elastic.
  • Elasticity can be defined as the change in the target value for a unit change in the feature value (Δy/Δx), multiplied by the ratio of the mean of the feature to the mean of the target.
  • Elasticity in Regression can be represented using the following expression (and computed as in the sketch below):

elasticity = (Δy/Δx) * (mean(x)/mean(y))
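
A minimal sketch, reusing the simple TV-vs-Sales model fitted earlier (the fitted slope approximates Δy/Δx):

#The slope of the fitted line approximates dy/dx
slope = model.coef_[0]

#elasticity = (dy/dx) * (mean(x) / mean(y))
elasticity = slope * (df["TV"].mean() / df["Sales"].mean())
print(elasticity)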

Difference between Regression and Classification

  • Regression, as stated earlier, is the process of predicting a continuous value given the parameters/features provided by the user.
  • Regression is the process of finding a model that predicts a continuous value based on its input variables. In regression problems, the goal is to mathematically estimate a mapping function.
  • Classification is the process of finding a model that separates input data into multiple discrete classes or labels.
  • In other words, a classification problem determines whether an input value can be part of a pre-identified group.
  • For Classification: consider the example of a military training system that checks a recruit’s weight, height and muscle mass, and classifies them as either “accepted” or “denied” according to those values.
  • For Regression: consider an example where we take the location, dimensions and condition of a house, and using those we predict the estimated cost of the house.

Difference between Linear and Logistic Regression

Linear Regression:

  • Linear regression is used to predict the continuous dependent variable using a given set of independent variables.
  • In linear regression, we find the best fit line, by which we can easily predict the output.
  • The least squares estimation method is used to estimate the model’s coefficients.
  • In Linear Regression, the relationship between the dependent variable and the independent variables must be linear.
  • In linear regression, there may be collinearity between the independent variables.

Logistic Regression:

  • Logistic Regression is used to predict the categorical dependent variable using a given set of independent variables.
  • In Logistic Regression, we find the S-curve by which we can classify the samples (see the sketch after this list).
  • The maximum likelihood estimation method is used to estimate the model’s coefficients.
  • In Logistic Regression, a linear relationship between the dependent and independent variables is not required.
  • In Logistic Regression, there should not be collinearity between the independent variables.
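
To make the contrast concrete, here is a minimal sketch with made-up toy data (one feature, binary labels):

import numpy as np
from sklearn.linear_model import LogisticRegression

#Toy data: one feature, binary labels (illustrative values)
X_clf = np.array([[1], [2], [3], [4], [5], [6]])
y_clf = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_clf, y_clf)

#Unlike linear regression, predict returns discrete class labels
print(clf.predict([[2.5], [4.5]]))

#predict_proba returns the probabilities behind the S-curve
print(clf.predict_proba([[2.5], [4.5]]))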

Conclusion of Regression Tutorial

  • And there you have it, a complete in-depth guide to Regression methods.
  • Regression is often called the backbone of machine learning since, rather than just classifying inputs into given categories, it gives us a clear view of the model’s output, which helps us understand the dataset better.
  • For that same reason, regression is harder to use and master, as it involves many more constraints and methods.
  • Now that you have gone through this tutorial, why not pick up a dataset with continuous values and try your skills on it too? (For example, predicting one of the continuous measurements in the Iris dataset.)
  • Thank you for reading this Regression Tutorial.
