Mastering Linear Regression: Unraveling the Art of Data Prediction

Ashutosh Sahu · Published in Analytics Vidhya · 5 min read · Aug 5, 2023


Hello everyone 👋! When we begin the model-building part of the data science lifecycle, the first model that everyone tries is Simple Linear Regression, often just called Linear Regression. In this blog, we will learn the geometric and mathematical understanding behind the algorithm, as well as the Python code implementation.

Introduction

Linear Regression is the simplest and easiest-to-understand regression model, with one independent (input) variable and one dependent (output) variable.
Assume we have data from a college placement cell with two columns, CGPA and CTC (in lakhs per annum). We have to create a model such that, if we give it the CGPA of a student, it will predict the CTC.

Geometric Intuition

The first step will be to plot the scatterplot between CGPA and CTC, which reveals that the data is roughly linear in nature. The first question that will arise in your mind is why the data is only roughly linear rather than perfectly linear. Obviously, obtaining perfectly linear data in real-world circumstances is quite difficult. The scatterplot of CGPA versus CTC is given below:

Scatterplot between CGPA and CTC

The idea here is to find a line that best fits these points in such a way that it minimizes the distance between the points on the line and the actual data points. This line is known as the best-fit line or regression line, and it has the following equation:

Y = mX + b

where :

  • Y is the dependent variable (in our case, CTC)
  • X is the independent variable (in our case, CGPA)
  • m is the slope of the line, which represents the rate of change of Y with respect to X
  • b is the y-intercept, which is the value of Y when X is zero
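
For example, with illustrative values of m = 0.55 and b = −0.9 (made up purely for this example), a student with a CGPA of 8 would be predicted a CTC of 0.55 × 8 − 0.9 = 3.5 lakhs per annum.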

The objective of linear regression is to determine the values of m and b that minimize the difference between the actual and predicted Y values. For example, in the figure below, comparing lines l1 and l2, it is evident that l1 is the better-fit line.

Best Fit Line Comparison

We will understand the actual error terms and formulation of m and b in the next section.

Mathematical Formulation

There are two ways to get the values of m and b for the regression line:

  1. Closed Form Solution
  2. Non-Closed Form Solution

The scikit-learn LinearRegression class uses the Closed Form solution, which means the values of m and b are obtained directly from a formula. This technique is also called OLS, or the Ordinary Least Squares method.

The non-closed form solutions largely rely on approximation and calculus-based approaches for obtaining the values of m and b, which we will cover when we learn Gradient Descent, implemented in the SGDRegressor class of the scikit-learn library.

Again, why do we have two techniques to solve this equation? The answer is that in a lower-dimensional space, the values of m and b are easily obtained using the Closed Form Solution. When we have a high-dimensional space, we employ the non-closed form solution, because the closed-form calculations become too expensive.
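
As a rough sketch of the non-closed-form route (purely illustrative; the data and hyperparameters such as max_iter and eta0 below are arbitrary choices, not from the original post), scikit-learn's SGDRegressor fits the same kind of line using gradient descent:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Illustrative CGPA (X) and CTC (y) values -- not the real placement dataset
X = np.array([[5.1], [6.2], [6.8], [7.4], [8.0], [8.9], [9.3]])
y = np.array([1.8, 2.4, 2.9, 3.3, 3.8, 4.6, 5.0])

# Gradient-descent-based linear regression (non-closed-form solution)
sgd = SGDRegressor(max_iter=10000, eta0=0.01, random_state=42)
sgd.fit(X, y)

print("m (slope):", sgd.coef_[0])
print("b (intercept):", sgd.intercept_[0])
```

Note that gradient-descent-based estimates are approximate and sensitive to feature scaling, so the coefficients will be close to, but not exactly equal to, the closed-form values.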

The setup for the mathematical formulation is shown below:

Initial Setup

Let’s understand the setup here: we have plotted the CGPA vs CTC graph, and we draw the best-fit line, the one with minimum total error across all the points. The error for a point can be depicted as the distance between the actual value and the value predicted by the line.

Let’s suppose the distances of the first, second, …, nth points from the line are d1, d2, …, dn. To reduce the influence of negative distances, we square each distance so that positive and negative errors do not cancel each other out. Finally, we get our error function by substituting ŷi (y-hat), the value the line predicts for xi. Since this is the error function, we need to minimize it to get the values of m and b.
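
In symbols, writing ŷi = m·xi + b for the line's prediction at xi, the error function we want to minimize is the sum of squared errors:

E(m, b) = Σ (yi − ŷi)² = Σ (yi − (m·xi + b))², summed over i = 1, …, n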

We already know that to determine the minimum of a function, we differentiate it and set the derivative equal to zero, which gives the point where the slope of the function is zero.

Since our error function depends on both m and b, we take partial derivatives with respect to each, which gives us the equations for m and b. The derivation is shown below:

Derivation of m and b
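
Solving those two partial-derivative equations gives the standard closed-form (OLS) estimates, where x̄ and ȳ denote the means of X and Y:

m = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
b = ȳ − m·x̄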

Bamm!! That was difficult, but we made it. Sklearn uses the same equations to determine the values of m and b. To compare, we will first fit the sklearn LinearRegression class, then write our own LinearRegression class based on the above equations, and then compare the results to validate them.

Code Implementation

As discussed above, we will use the same student placement data for our implementation. The following code demonstrates the approach using the scikit-learn LinearRegression class:

Sklearn Simple Linear Regression
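
A minimal sketch of this step (the CGPA/CTC values below are made-up illustrative numbers, not the actual placement dataset from the post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative CGPA (X) and CTC (y) values -- stand-ins for the placement data
X = np.array([[5.1], [6.2], [6.8], [7.4], [8.0], [8.9], [9.3]])
y = np.array([1.8, 2.4, 2.9, 3.3, 3.8, 4.6, 5.0])

# Closed-form (OLS) linear regression
lr = LinearRegression()
lr.fit(X, y)

print("m (slope):", lr.coef_[0])
print("b (intercept):", lr.intercept_)
print("Predicted CTC for CGPA 8.5:", lr.predict([[8.5]])[0])
```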

Now, we will write our own Linear Regression class using the equations we derived in the previous section:
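
A minimal sketch of what such a class might look like, based on the closed-form formulas above and reusing the illustrative X and y from the previous snippet (the class and method names here are illustrative, not the exact code from the post):

```python
import numpy as np

class MyLinearRegression:
    """Simple linear regression using the closed-form (OLS) formulas for m and b."""

    def __init__(self):
        self.m = None
        self.b = None

    def fit(self, X, y):
        X = np.asarray(X).ravel()
        y = np.asarray(y).ravel()
        x_mean, y_mean = X.mean(), y.mean()
        # m = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
        self.m = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
        # b = y_mean - m * x_mean
        self.b = y_mean - self.m * x_mean
        return self

    def predict(self, X):
        X = np.asarray(X).ravel()
        return self.m * X + self.b

# Fit on the same illustrative data and compare m and b with sklearn's output
my_lr = MyLinearRegression().fit(X, y)
print("m (slope):", my_lr.m)
print("b (intercept):", my_lr.b)
```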

As we can see, we get the same values for m and b. All the code and the dataset used are available here.

Congratulations!!! You now understand Simple Linear Regression, the geometric and mathematical principles behind it, and the code implementation using Sklearn and your own custom class.

Thank you for taking the time to read this post. If you liked this read, hit the 👏 button and share it with others. You can also check other interesting articles under my Medium profile. If you have any questions, please leave them in the comments section and I will do my best to answer them.

You can connect with me on LinkedIn, Facebook, and Instagram.

Until next time, Adios Amigo!!!!
