Linear Regression in Python, from Scratch
By: Trevor Rowland (dBCooper2)
Creating Linear Regression Models from Scratch.
Links to Files
Scripts containing each of the functions written in this notebook can be found at the following links:

- OLS_Linear_Regression Class
- MLR_Class
Table of Contents
1. The Theory
   - a. Introduction to Simple Linear Regression
   - b. The Mean Squared Error
   - c. The Partial Derivatives of the Error Function
   - d. The Gradient Descent
2. Applying the Theory to Python
   - a. The Gradient Descent Function in Python
   - b. Performing Simple Linear Regression
   - c. Testing the Linear Regression Model
   - d. Plotting the Regression Line

3-4. Multiple Linear Regression

3. The Theory
   - a. The Model
   - b. The Error Function
   - c. Computing the Error Function for the MLR Model
   - d. The Partial Derivative of the Error Function
   - e. Interpreting the Theory to Translate into an Algorithm
4. Applying the Theory to Python
   - a. Function Definitions
   - b. Accessing the Data and Performing Multiple Linear Regression
     - i. The 3-Factor Model
     - ii. The 3-Factor Model in Python
     - iii. The 5-Factor Model
     - iv. The 5-Factor Model in Python
   - c. Visualizing the Regression Results
1-2. Simple Linear Regression
Regression analysis is a tool used in statistics and finance to see how strongly related a dependent variable and one or more independent variables are.

The Simple Linear Regression model was built by following the Simple Linear Regression tutorial by NeuralNine.
1. The Theory
1.a. Introduction to Simple Linear Regression
The Simple Linear Regression model uses an Ordinary Least Squares (OLS) approach to regression. The OLS model plots a line on a scatter plot, measures how far away it is from each point, then iteratively adjusts the slope and y-intercept of the linear equation to produce the line of best fit for the data.
How does this happen?
The Regression Model plots a line through all of the points in our dataset.
When the line is plotted, the points on the line will differ from the points in the dataset. The difference between the actual point $y_i$ and the point estimated by the line ($\hat{y}_i$) can be called an error.
The sum of those errors can be calculated to find the total error in the regression line.
Squaring those errors and dividing the sum of all squared errors by the number of y-values gives us a measure called the Mean Squared Error, or $E$:

$$E = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (mx_i + b)\right)^2$$
1.b. The Mean Squared Error
The Mean Squared Error describes what the average squared error is, and to produce the best-fit regression line, that error must be minimized.
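To make the definition concrete, here is a minimal sketch of computing the MSE directly; the data points and the candidate line are made up for illustration and are not from this notebook:

```python
import numpy as np

# Made-up data points and an arbitrary candidate line y = m*x + b
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])
m, b = 1.5, 0.5

predictions = m * x + b
mse = np.mean((y - predictions) ** 2)  # average of the squared errors
print(mse)
```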
Because the data cannot be modified, to develop the best-fit regression line the slope ($m$) and the y-intercept ($b$) must be modified. This involves iterating over many different calculated values of $m$ and $b$, so how will the program know how to adjust the values across iterations?
The program will adjust the values by calculating the gradient of the Error function $E$ with respect to $m$ and with respect to $b$. This can be done using partial derivatives of the Error function, because derivatives measure a rate of change and therefore point toward the fastest way to change the Error. Here are the calculations to find those gradient descent functions:

$$E = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (mx_i + b)\right)^2$$

Which decomposes into:

$$E = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^2 - 2y_i(mx_i + b) + (mx_i + b)^2\right)$$

And to calculate the formulas for the optimization of the regression line that will be performed later, the formula can be fully expanded into:

$$E = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^2 - 2mx_iy_i - 2by_i + m^2x_i^2 + 2mbx_i + b^2\right)$$
Now that the fully expanded Error Function has been found, the gradient descent formulas to optimize the regression line can be computed.
1.c. The Partial Derivatives of the Error Function
Taking the partial derivative of $E$ with respect to $m$:

$$\frac{\partial E}{\partial m} = \frac{1}{n}\sum_{i=1}^{n}\left(-2x_iy_i + 2mx_i^2 + 2bx_i\right) = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - (mx_i + b)\right)$$

Taking the partial derivative of $E$ with respect to $b$:

$$\frac{\partial E}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}\left(-2y_i + 2mx_i + 2b\right) = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - (mx_i + b)\right)$$

After these calculations, the partial derivatives of the error function are as follows:

$$\frac{\partial E}{\partial m} = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - (mx_i + b)\right), \qquad \frac{\partial E}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - (mx_i + b)\right)$$
These Partial Derivatives are now ready to be plugged into the Gradient Descent.
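Before moving on, the derived formulas can be sanity-checked numerically: on a tiny made-up dataset, the analytic partials should match finite-difference approximations of the error. This is an illustrative sketch, not part of the original tutorial; the function names and toy values are invented for the check:

```python
import numpy as np

def analytic_gradients(m, b, x, y):
    """Partial derivatives of the MSE with respect to m and b."""
    n = len(x)
    dm = (-2 / n) * np.sum(x * (y - (m * x + b)))
    db = (-2 / n) * np.sum(y - (m * x + b))
    return dm, db

def mse(m, b, x, y):
    return np.mean((y - (m * x + b)) ** 2)

# Toy data and an arbitrary starting point
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
m, b, h = 0.5, 0.1, 1e-6

dm, db = analytic_gradients(m, b, x, y)

# Central finite differences should closely match the analytic results
dm_fd = (mse(m + h, b, x, y) - mse(m - h, b, x, y)) / (2 * h)
db_fd = (mse(m, b + h, x, y) - mse(m, b - h, x, y)) / (2 * h)

print(dm, dm_fd)  # nearly identical
print(db, db_fd)
```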
1.d. The Gradient Descent
The gradient descent is an optimization technique that seeks to locate the values of the regression line's coefficients that minimize the error. A good way to think about gradient descent is to visualize someone hiking down a mountain. Every 5 or so steps, the hiker looks at their surroundings, determines the steepest path down, and takes that path. After another 5 steps, the hiker checks their surroundings again, determines the steepest path down, and repeats this process until they reach the ground.
In this example, those 5 steps the hiker takes are analogous to the Learning Rate in the gradient descent process. The learning rate is a coefficient that scales the partial derivatives, controlling how large each adjustment to the line is. The closer the learning rate is to 0, the finer (and slower) each adjustment becomes.

Note: a poorly chosen learning rate can cause the model to converge too slowly or to overshoot the minimum, so an optimized learning rate should be found through testing.
The gradient descent function is calculated by iteratively subtracting the scaled partial derivative from the current value of $m$ or $b$, respectively, so the update equations look like this:

$$m_{new} = m - L \cdot \frac{\partial E}{\partial m}, \qquad b_{new} = b - L \cdot \frac{\partial E}{\partial b}$$

Where $L$ is the learning rate.
Now that the math for the functions has been worked out, the Python code for the simple linear regression model can be written.
2. Applying the Theory to Python
2.a. The Gradient Descent Function in Python
The gradient_descent function performs a single step of the gradient descent: it computes the partial derivatives of the error function at the current values of m and b, then subtracts them, scaled by the learning rate, to return updated values of m and b.
```python
def gradient_descent(m_current, b_current, df, learning_rate):
    m_gradient = 0
    b_gradient = 0
    n = len(df)  # the number of rows in the dataset

    # Calculate the partial derivative summations
    for i in range(n):
        x = df.iloc[i].x
        y = df.iloc[i].y
        # These are a pythonic representation of the partial derivative
        # equations found in the theory section
        m_gradient += (-2/n) * x * (y - (m_current * x + b_current))
        b_gradient += (-2/n) * (y - (m_current * x + b_current))

    # Apply the gradient descent update equations from the theory section
    m = m_current - learning_rate * m_gradient
    b = b_current - learning_rate * b_gradient
    return m, b
```
2.b. Performing Simple Linear Regression
Now that the gradient descent function is complete, a function that calls it iteratively is needed to minimize the error of the regression line:
```python
def ols_regression(learning_rate, iterations, df):
    m = 0
    b = 0
    for i in range(iterations):
        m, b = gradient_descent(m, b, df, learning_rate)
    return m, b
```
2.c. Testing the Linear Regression Model
To apply the OLS linear regression functions, the program will take a dataset and perform the regression on it. The dataset being used is a simple CSV file by luddarell from Kaggle.
```python
import pandas as pd

# Import the data
file = '/path/to/repos/github/portfolio-backtesting/docs/data/1.01_Simple_linear_regression.csv'
df = pd.read_csv(file)

# Rename the columns to the x and y names the regression functions expect
df2 = pd.DataFrame()
df2['y'] = df['SAT']
df2['x'] = df['GPA']
df = df2
```
Finally, set the learning rate and number of iterations, then call the linear regression function:
```python
# Run the regression model
learning_rate = .001
iterations = 10000

m, b = ols_regression(learning_rate, iterations, df)
print(m, b)
```
```
477.47759913775326 250.49383109375495
```
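With the fitted slope and intercept in hand, the line can be used to estimate a response for a new observation. Here is a small illustrative example; the GPA value is arbitrary:

```python
# Predict a SAT score for a new GPA using the fitted line y = m*x + b
new_x = 3.4  # an arbitrary example GPA
predicted_y = m * new_x + b
print(predicted_y)  # the SAT score the model estimates for a 3.4 GPA
```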
2.d. Plotting the Regression Line
Now that the regression line has been computed, it should be visualized to gain a better understanding of the analysis that was performed.
```python
import seaborn as sns

# Plot the data with a fitted regression line
sns.regplot(x='x', y='y', data=df)
```
From this plot, the line's fit to the data can be clearly seen, and the shaded band around the line shows the confidence interval of the regression estimate.
3-4. Multiple Linear Regression
3. The Theory
Regression models take a series of predictor ($X$) variables and a single response ($Y$) variable, and estimate a line of best fit that can be used to predict unknown response values.
This regression model can be applied to any series of predictor and response variables. For the purposes of the pythonic-finance project, however, it will be used in the Fama-French 3- and 5-factor analyses of portfolios, which are briefly introduced in the Python section, where the beta coefficients for this model are calculated.
3.a. The Model
The Multiple Linear Regression Model is:

$$y = \beta_0 x_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$$

Which can be translated into the matrix form:

$$Y = X\beta + \epsilon$$
Setting $x_0 = 1$ allows the matrices to be the same size, which simplifies the calculations by including the y-intercept beta ($\beta_0$) in the coefficient matrix.
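To make the matrix form concrete, here is a minimal sketch of building a design matrix $X$ with a leading column of ones (the $x_0 = 1$ convention); the observation values are made up:

```python
import numpy as np

# Three observations of two predictor variables (made-up values)
predictors = np.array([
    [2.0, 5.0],
    [3.0, 1.0],
    [4.0, 7.0],
])

# Prepend a column of ones so beta_0 (the intercept) joins the coefficient matrix
X = np.column_stack([np.ones(len(predictors)), predictors])
print(X)
# [[1. 2. 5.]
#  [1. 3. 1.]
#  [1. 4. 7.]]
```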
3.b. The Error Function
The error function that will be minimized in the model is the Sum of Squared Errors, which measures the variation of the observed data around the fitted values.
The formula for the Sum of Squared Errors ($SSE$) is:

$$SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

This is a sum of each of the squared differences between the observed response variable $y_i$ and the estimated response variable $\hat{y}_i$.

The matrix form of the SSE formula is:

$$SSE = \epsilon^T\epsilon = (Y - X\beta)^T(Y - X\beta)$$
Instead of squaring the matrices, the error matrix is multiplied by its transpose. This is done because the errors form an $n \times 1$ matrix, and computing $\epsilon\epsilon$ is not possible since the dimensions do not conform, so the transpose is used instead.
An expansion of this equation using vectors is provided below:

$$SSE = \begin{bmatrix} \epsilon_1 & \epsilon_2 & \cdots & \epsilon_n \end{bmatrix}\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} = \epsilon_1^2 + \epsilon_2^2 + \cdots + \epsilon_n^2$$
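A quick numerical check of this identity, using a made-up error vector:

```python
import numpy as np

# A made-up error vector
eps = np.array([0.5, -1.2, 0.3, 2.0])

# eps.T @ eps gives the same result as summing the squared errors
print(eps.T @ eps)       # 5.78
print(np.sum(eps ** 2))  # 5.78
```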
3.c. Computing the Error Function for the MLR Model
In linear algebra, the transpose of a sum or a product can be decomposed in the following ways:

$$(A + B)^T = A^T + B^T, \qquad (AB)^T = B^T A^T$$

Which means the transpose operator in $(Y - X\beta)^T$ can be distributed, making the function:

$$SSE = (Y^T - \beta^T X^T)(Y - X\beta)$$

Multiplying out the substituted matrix form of the error function returns:

$$SSE = Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X\beta$$

In order to finish simplifying the equation, the terms $Y^T X\beta$ and $\beta^T X^T Y$ must be proven equal.

Let $A = Y^T X\beta$. Because $A$ is a $1 \times 1$ matrix (a scalar), it equals its own transpose, so $A = A^T$.

Therefore the equation becomes:

$$A^T = (Y^T X\beta)^T = \beta^T X^T Y$$

Therefore:

$$Y^T X\beta = \beta^T X^T Y$$

Substituting this back into the equation allows it to be simplified:

$$SSE = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta$$
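The scalar identity used above is easy to verify numerically with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 1))      # made-up response vector
X = rng.normal(size=(4, 3))      # made-up design matrix
beta = rng.normal(size=(3, 1))   # made-up coefficients

# Both products are 1x1 matrices with the same value
print((Y.T @ X @ beta).item())
print((beta.T @ X.T @ Y).item())
```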
3.d. The Partial Derivative of the Error Function
Now that the error function is expanded to include the equation of the MLR Model, the partial derivative of the error function can be computed.
The partial derivative is used to compute how much the error within the model changes, and minimizing it with respect to each coefficient $\beta_i$ produces the best-fit coefficients. It is important to note that this partial derivative is merely an estimate, as the data is a series of discrete observations and not continuous.
The vector of the minimizing values of each $\beta_i$ is labeled $\hat{\beta}$ in the Normal Equations, which are the result of the minimization process. Taking the partial derivative of the simplified $SSE$ with respect to $\beta$:

$$\frac{\partial SSE}{\partial \beta} = -2X^T Y + 2X^T X\beta$$

Then, setting the partial derivative equal to zero and solving for $\hat{\beta}$, the equation becomes:

$$X^T X\hat{\beta} = X^T Y$$

Lastly, the Normal Equations can be found by rearranging the equation:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
3.e. Interpreting the Theory to Translate into an Algorithm
$\hat{\beta}$ is the Least Squares Estimator for the model. This means that it is a coefficient matrix that can take the observed $X$ and $Y$ values and estimate a value for each $\beta_i$ in the model.
In the Simple Linear Regression Model, the formula for the gradient descent update of the slope was:

$$m_{new} = m - L \cdot \frac{\partial E}{\partial m}$$
The Normal Equations bypass this iterative gradient descent process and solve the minimization in a single step. By plugging the dataset into the Normal Equations formula for $\hat{\beta}$, the optimal coefficients for each predictor variable are computed without the iteration the Simple Linear Regression Model required.
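To see this equivalence in practice, here is a sketch comparing the two approaches on a made-up one-predictor dataset, reusing the ols_regression function from the simple regression section; both should land on nearly the same slope and intercept:

```python
import numpy as np
import pandas as pd

# Made-up data roughly following y = 2x + 1
demo_df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                        'y': [3.1, 4.9, 7.2, 9.0, 10.8]})

# Iterative solution via gradient descent
m_gd, b_gd = ols_regression(learning_rate=0.01, iterations=10000, df=demo_df)

# One-step Normal Equations solution: beta-hat = (X^T X)^-1 X^T Y
X = np.column_stack([np.ones(len(demo_df)), demo_df['x'].values])
Y = demo_df['y'].values
b_ne, m_ne = np.linalg.inv(X.T @ X) @ X.T @ Y

print(m_gd, b_gd)  # gradient descent estimates
print(m_ne, b_ne)  # Normal Equations estimates (nearly identical)
```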
4. Applying the Theory to Python
The necessary packages for this section are NumPy, pandas, yfinance, Matplotlib, and Seaborn.
The CSV Data for the 3 and 5 Factor Models can be found in Dr. Kenneth French's Data Library and will be downloaded and added to a CSV file in another Python Script to reduce the complexity of this notebook. The CSV files with the combined datasets will be available here.
```python
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
```
4.a. Function Definitions
Because the Normal Equations solve for the gradient descent in one step, the Multiple Linear Regression function needs to take a DataFrame and convert it into the necessary NumPy arrays, then compute each part of the Normal Equations to solve for $\hat{\beta}$. Recall that the formula for the $\hat{\beta}$ vector is:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
```python
def multiple_linear_regression(df: pd.DataFrame) -> pd.DataFrame:
    # Split the DataFrame into predictor (x_*) and response (y_*) arrays
    x = df.filter(like='x_').values
    y = df.filter(like='y_').values

    # Compute each piece of the Normal Equations: beta-hat = (X^T X)^-1 X^T Y
    xT = x.T
    xTx = np.dot(xT, x)
    xTx_inv = np.linalg.inv(xTx)
    xTy = np.dot(xT, y)

    betas = np.dot(xTx_inv, xTy)
    return betas
```
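A quick way to sanity-check the function is to feed it synthetic data with known coefficients and confirm that it recovers them. The column names below follow the x_/y_ convention the function filters on; all values are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic predictors, with x_0 fixed at 1 for the intercept
test_df = pd.DataFrame({'x_0': np.ones(n),
                        'x_1': rng.normal(size=n),
                        'x_2': rng.normal(size=n)})

# True coefficients: intercept 0.5, slopes 2.0 and -1.0, plus a little noise
test_df['y_1'] = (0.5 + 2.0 * test_df['x_1'] - 1.0 * test_df['x_2']
                  + rng.normal(scale=0.1, size=n))

print(multiple_linear_regression(test_df))  # approximately [[0.5], [2.0], [-1.0]]
```

As a design note, np.linalg.solve(xTx, xTy) is generally preferred over an explicit matrix inverse for numerical stability, but the explicit inverse above mirrors the Normal Equations exactly as derived.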
4.b. Accessing the Data and Performing Multiple Linear Regression
The data was written to a CSV file using the script dataset_creator.py in the /notebooks/regression_models/multiple_linear_regression_files/ directory in the project. These CSV files contain the Fama-French library data, as well as the stock returns for AMD.
```python
ff_3_df = pd.read_csv('/path/to/repos/github/pythonic-finance/notebooks/regression_models/multiple_linear_regression_files/ff_3_factor.csv')
ff_5_df = pd.read_csv('/path/to/repos/github/pythonic-finance/notebooks/regression_models/multiple_linear_regression_files/ff_5_factor.csv')
```
Now that the data has been accessed, the Multiple Linear Regression can be run on each dataset:
4.b.i. The 3-Factor Model
The Fama-French 3-Factor Model is an extension of the Capital Asset Pricing Model, aiming to describe a stock or portfolio's returns through market risk as well as the outperformance of small-cap companies relative to large-cap companies and the outperformance of high book-to-market (value) companies relative to low book-to-market (growth) companies.

The model suggests that both small-cap stocks and stocks with a high book-to-market ratio tend to regularly outperform the overall market, and thus should be factored into the model.
This data can be found in Dr. Kenneth French's data library and will be used for this model.
The formula for the 3-factor model is:

$$R_{it} - R_{ft} = \alpha + \beta_1\left(R_{Mt} - R_{ft}\right) + \beta_2 SMB_t + \beta_3 HML_t + \epsilon_{it}$$

Where:

- $R_{it}$ is the expected rate of return
- $R_{ft}$ is the risk-free rate
- $SMB$ = Small Minus Big, the historic excess returns of small-caps over large-caps
- $HML$ = High Minus Low, the historic excess returns of high book-to-market (value) companies over low book-to-market (growth) companies
- $\beta_{1,2,3}$ are the coefficients of each factor, estimated by the regression model
- $\alpha$ is the excess return on investment
- $\epsilon_{it}$ is the noise within the model
4.b.ii. The 3-Factor Model in Python
```python
betas_3_factor = pd.DataFrame(multiple_linear_regression(ff_3_df)).T

new_cols = {0: 'alpha', 1: 'mkt-rf', 2: 'smb', 3: 'hml'}
betas_3_factor.rename(columns=new_cols, inplace=True)

betas_3_factor
```
| alpha | mkt-rf | smb | hml |
|---|---|---|---|
| 0.117517 | 1.520856 | 0.104164 | -0.78295 |
4.b.iii. The 5-Factor Model
The Fama-French 5-Factor model is another iteration of the 3-Factor Model, including 2 new factors. These are:
- $RMW$ = Robust Minus Weak, the average return on two robust operating-profitability portfolios minus the average return on two weak operating-profitability portfolios.
- $CMA$ = Conservative Minus Aggressive, the average return on two conservative investment portfolios minus the average return on two aggressive investment portfolios.
These Factors are also found in Dr. Kenneth French's data library, and will be used for this model.
The formula for the 5-Factor Model is:

$$R_{it} - R_{ft} = \alpha + \beta_1\left(R_{Mt} - R_{ft}\right) + \beta_2 SMB_t + \beta_3 HML_t + \beta_4 RMW_t + \beta_5 CMA_t + \epsilon_{it}$$
4.b.iv. The 5-Factor Model in Python
```python
betas_5_factor = pd.DataFrame(multiple_linear_regression(ff_5_df)).T

new_cols = {0: 'alpha', 1: 'mkt-rf', 2: 'smb', 3: 'hml', 4: 'rmw', 5: 'cma'}
betas_5_factor.rename(columns=new_cols, inplace=True)

betas_5_factor
```
Doing this returns the following values:
| alpha | mkt-rf | smb | hml | rmw | cma |
|---|---|---|---|---|---|
| 0.127646 | 1.467586 | -0.067626 | -0.557057 | -0.151023 | -0.574484 |
4.c. Visualizing the Regression Results
Now that the Regression Results have been computed for both the 3 and 5 factor models, some visualizations are needed to examine what the results look like. Currently, I am trying to put these notebooks together for a school project, so the results are not yet finished. This will be updated later, but for now I need to get these posts done. Check back soon!
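In the meantime, one possible starting point (a minimal sketch, not the finished analysis) is a bar chart of the estimated coefficients from each model:

```python
import matplotlib.pyplot as plt

# Plot the estimated betas from the 3- and 5-factor models side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

betas_3_factor.iloc[0].plot(kind='bar', ax=ax1, title='3-Factor Model Betas')
betas_5_factor.iloc[0].plot(kind='bar', ax=ax2, title='5-Factor Model Betas')

# Draw a zero line so negative loadings stand out
ax1.axhline(0, color='black', linewidth=0.8)
ax2.axhline(0, color='black', linewidth=0.8)

plt.tight_layout()
plt.show()
```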
The Functions used in this Notebook will be translated into Python Classes within the scripts folder of this project. The last part of this project involves generating ANOVA Tables for each model, and then the OOP implementations will begin to be added. If the programs are not here when you are reading this, they will be soon so check back later : )
References
Simple Linear Regression
NeuralNine. "Linear Regression from Scratch in Python." NeuralNine, https://www.neuralnine.com/linear-regression-from-scratch-in-python/.
Multiple Linear Regression
Dash, Debidutta. "Multiple Linear Regression from scratch using only numpy." Medium, 2022, https://medium.com/analytics-vidhya/multiple-linear-regression-from-scratch-using-only-numpy-98fc010a1926.
Ramesh, Bhanumathi. "Deriving Normal Equation for Multiple Linear Regression." Medium, 2022, https://medium.com/@bhanu0925/deriving-normal-equation-for-multiple-linear-regression-85241965ee3b.
LearnChemE. "Matrix Approach to Multiple Linear Regression." YouTube, uploaded by LearnChemE, 2022, https://youtu.be/NzuK4iAfxhU?si=cxU-v8ZBgbA1s-FG.
Boer Commander. "Matrix Form Multiple Linear Regression MLR." YouTube, uploaded by Boer Commander, 2022, https://youtu.be/Imjfp1cxy6g?si=gWXnA9F_XisVzFA4.
French, Kenneth R. "Data Library." Kenneth R. French - Data Library, Tuck School of Business at Dartmouth, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.