## Derivation of simple linear regression formula

Finally, I have time to post another mathematics blog post after an absence of almost two years. This time I will discuss the formula of simple linear regression. Suppose we have a set of data as follows:

$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

We are going to fit those points using a linear equation $y = \beta_0 + \beta_1 x$. This classical problem is known as simple linear regression and is usually taught in elementary statistics classes around the world. However, due to the rise of computers, students are often only given the formulas to compute the best estimates of $\beta_0$ and $\beta_1$, without learning how to derive them. These formulas are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad (1)$$

where $\bar{x}$ and $\bar{y}$ are the averages of the $x_i$ and $y_i$ respectively. In this post, I am going to derive these formulas using four methods. The goal is always the same: this note will serve as a reminder for me when I forget how to derive them. If you study statistics, there are many statistical assumptions under which this model and these formulas work, but we are not going to discuss that here.
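As a quick sanity check of formula (1), here is a small Python sketch; the five data points are the illustrative ones used in the Python example later in this post.

```python
# Compute the least-squares estimates directly from formula (1).
x = [1, 2, 3, 4, 5]
y = [1, 2, 2.5, 4, 4.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# beta_1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
# beta_0 = y_bar - beta_1 * x_bar
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)  # approximately 0.1 and 0.9 for this data
```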

**Calculus**

First, let us take a look at the following figure.

We see that there are five points that are fitted using a line. To derive $\beta_0$ and $\beta_1$ using Calculus, we have to define the so-called error function, $E(\beta_0, \beta_1)$. The error function is defined as the sum of the squared errors over the data. The formula is given as follows:

$$E(\beta_0, \beta_1) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2,$$

where $\hat{y}_i = \beta_0 + \beta_1 x_i$ is the prediction of the $y$ coordinate using the fitted line. If we take a look at the figure above, the error function is the sum of the squares of the lengths of all the red lines. It is obvious that whenever $\beta_0$ and $\beta_1$ change, the error function changes. Therefore we have to find $\beta_0$ and $\beta_1$ such that the error function is minimal. This is a minimization problem, and in Calculus we already know how to minimize a function of two variables. First we find the first partial derivatives of $E$ with respect to $\beta_0$ and $\beta_1$ and equate them to zero, i.e.

$$\frac{\partial E}{\partial \beta_0} = 0 \quad \text{and} \quad \frac{\partial E}{\partial \beta_1} = 0.$$

We then have to find the $\beta_0$ and $\beta_1$ that satisfy the above equations. Second, we have to check whether this point is a minimum, a maximum, or perhaps a saddle point. However, our error function is a convex function, so the solution of the first-derivative equations always yields a minimum point. Let us find the expressions for the first derivatives of $E$:

$$\frac{\partial E}{\partial \beta_0} = -2\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \quad \text{and} \quad \frac{\partial E}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\left(y_i - \beta_0 - \beta_1 x_i\right) = 0.$$

Tidying up the above two equations, we get:

$$n\beta_0 + \beta_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \quad \text{and} \quad \beta_0\sum_{i=1}^{n} x_i + \beta_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i,$$

which is a system of linear equations in two variables. We can solve it using Cramer's rule or any other method, and we should get the following solution:

$$\beta_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \quad \text{and} \quad \beta_0 = \frac{\sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}.$$
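To make the Cramer's rule step concrete, here is a minimal Python sketch of the closed form above, using the same illustrative data as before:

```python
# Solve the normal equations with Cramer's rule:
#   n*b0  + Sx*b1  = Sy
#   Sx*b0 + Sxx*b1 = Sxy
x = [1, 2, 3, 4, 5]
y = [1, 2, 2.5, 4, 4.5]

n   = len(x)
Sx  = sum(x)
Sy  = sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

det   = n * Sxx - Sx ** 2              # determinant of the 2x2 system
beta0 = (Sy * Sxx - Sx * Sxy) / det    # Cramer's rule for beta_0
beta1 = (n * Sxy - Sx * Sy) / det      # Cramer's rule for beta_1
print(beta0, beta1)                    # approximately 0.1 and 0.9
```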

With a little bit of algebraic manipulation, the above form is the same as the one in (1). And this is the end of the Calculus part.

**Statistics**

In this approach, we assume that one is familiar with concepts from mathematical statistics such as random variables, expectations, variances and covariances. Suppose $X$ and $Y$ are random variables with means $\mu_X$ and $\mu_Y$ respectively. The random variables $X$ and $Y$ also have variances $\sigma_X^2$ and $\sigma_Y^2$, and they have covariance $\sigma_{XY}$.

We want to find $\beta_0$ and $\beta_1$ such that the random variables $Y$ and $\beta_0 + \beta_1 X$ are as close as possible. How are we going to measure "as close as possible"? Let us define another random variable $e = Y - (\beta_0 + \beta_1 X)$. We also need to define a function $g$ of $\beta_0$ and $\beta_1$ as follows:

$$g(\beta_0, \beta_1) = E\left[e^2\right] = E\left[\left(Y - \beta_0 - \beta_1 X\right)^2\right].$$

Therefore, our problem will be similar to our approach in the first section: we need to find $\beta_0$ and $\beta_1$ such that $g$, the expectation of $e^2$, is minimal. Expanding the equation above, we obtain:

$$g(\beta_0, \beta_1) = E[Y^2] - 2\beta_0 E[Y] - 2\beta_1 E[XY] + \beta_0^2 + 2\beta_0\beta_1 E[X] + \beta_1^2 E[X^2].$$

The above expression is quadratic in $\beta_0$ and $\beta_1$, so the point where its first derivatives vanish is the minimum point. The first derivatives with respect to $\beta_0$ and $\beta_1$ are:

$$\frac{\partial g}{\partial \beta_0} = -2E[Y] + 2\beta_0 + 2\beta_1 E[X] = 0,$$

$$\frac{\partial g}{\partial \beta_1} = -2E[XY] + 2\beta_0 E[X] + 2\beta_1 E[X^2] = 0,$$

which we can rewrite as:

$$\beta_0 + \beta_1 E[X] = E[Y],$$

$$\beta_0 E[X] + \beta_1 E[X^2] = E[XY].$$

Multiplying the first equation by $E[X]$ and subtracting it from the second equation, we obtain:

$$\beta_1\left(E[X^2] - E[X]^2\right) = E[XY] - E[X]E[Y] \quad\Longrightarrow\quad \beta_1 = \frac{\sigma_{XY}}{\sigma_X^2},$$

and

$$\beta_0 = E[Y] - \beta_1 E[X] = \mu_Y - \beta_1 \mu_X.$$

The above solution is in terms of random variables. However, what we have is data in the form $(x_i, y_i)$. So, instead of $\mu_X$ we have $\bar{x}$, and instead of $\mu_Y$ we have $\bar{y}$. The same applies to the variances and covariance, which we replace by their sample counterparts, so in the end the above solution is the same as equation (1). This is the end of the statistics approach.

**Linear Algebra**

This time, we are going to derive the formula using a linear algebra approach. Personally, I think this is the most elegant approach of all. However, I need to say that one must know a little bit of linear algebra before reading this, otherwise it will not make sense. So, using a set of data $(x_1, y_1), \ldots, (x_n, y_n)$, we want to find $\beta_0$ and $\beta_1$. Plugging all the data into the linear equation $y = \beta_0 + \beta_1 x$, we obtain:

$$\beta_0 + \beta_1 x_1 = y_1,$$

$$\beta_0 + \beta_1 x_2 = y_2,$$

$$\vdots$$

$$\beta_0 + \beta_1 x_n = y_n.$$

The above system can be rewritten in matrix form as follows:

$$\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},$$

or

$$A\boldsymbol{\beta} = \mathbf{y},$$

where $A$ is an $n \times 2$ matrix, $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$, and $\mathbf{y}$ is a vector containing $y_1, \ldots, y_n$. In linear algebra, the equation above is known as an over-determined linear system. It may have a solution, but mostly it does not. One method to handle such a system uses the theory of projection. Suppose we have a linear equation $A\mathbf{x} = \mathbf{b}$. According to the theory of linear algebra, this equation has a solution only if $\mathbf{b}$ is inside the column space of $A$. In our case, where we want to find $\beta_0$ and $\beta_1$, our $\mathbf{y}$ is generally not in the column space of $A$.

Therefore, to find a solution of the linear equation, we must bring $\mathbf{y}$ into the column space of the matrix $A$: we are going to project $\mathbf{y}$ onto the column space of $A$. We could use many different projections, but according to the theory of linear algebra we must use an orthogonal projection, so that our estimates of $\beta_0$ and $\beta_1$ bring $A\boldsymbol{\beta}$ "closest" to $\mathbf{y}$. Let us now discuss the projection theory first.

Consider the following figure.

Suppose we want to project $\mathbf{b}$ perpendicularly onto the vector space $W$, where $W$ is the column space of $A$. Suppose $\mathbf{p}$ is the result of this orthogonal projection. Then the linear equation $A\hat{\mathbf{x}} = \mathbf{p}$ has a solution, since $\mathbf{p}$ is inside the column space of $A$. Consider $W^{\perp}$, the orthogonal complement of $W$. Clearly, the vector $\mathbf{b} - \mathbf{p}$ lies inside this space. From the theory of linear algebra, we know that if $W$ is the column space of $A$, then $W^{\perp}$ is the null space of $A^T$. Therefore we have $A^T(\mathbf{b} - \mathbf{p}) = \mathbf{0}$, or $A^T\mathbf{b} = A^T\mathbf{p}$. Since we know that $\mathbf{p} = A\hat{\mathbf{x}}$, we have $A^T A\hat{\mathbf{x}} = A^T\mathbf{b}$. If the matrix $A^T A$ is invertible (which it mostly is in our case), we can find the solution of our problem, which is $\hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}$.

That's it, we are done. If we substitute every variable back into our original variables, we have our solution, i.e.

$$\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \left(A^T A\right)^{-1} A^T \mathbf{y}.$$

With a little bit of algebra (again), we can easily show that this formula is the same as equation (1).
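The normal-equation solution above can be sketched in a few lines of NumPy; the data is the same illustrative set as before, and solving the system directly is preferred over forming the inverse explicitly:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 2.5, 4, 4.5])

# Design matrix A: a column of ones (for beta_0) and the x values (for beta_1).
A = np.column_stack([np.ones_like(x), x])

# Solve the normal equations A^T A beta = A^T y.
beta = np.linalg.solve(A.T @ A, A.T @ y)
print(beta)  # [beta_0, beta_1]
```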

**Numerical approach**

Our last approach uses the theory of numerical methods. Since this approach is numerical, we are not going to get the exact solution as in equation (1); we will only approximate it. This approach uses a result obtained in the Calculus approach, namely the error function, i.e.,

$$J(\beta_0, \beta_1) = \sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2.$$

However, we are not going to call it an error function; instead we are going to call it a cost function. We call it a cost function because every time we pick the wrong $\beta_0$ and $\beta_1$, there is a cost that we have to pay. So we have to find the right $\beta_0$ and $\beta_1$ such that the cost is minimal.

There are many numerical methods to solve such a problem, but this time we are going to discuss only one method: gradient descent. The basic idea is to start with arbitrary values of $\beta_0$ and $\beta_1$ and continually change them in such a way that the cost function decreases. Hopefully, we end up at values where the cost function is minimal.

The following is the general procedure of the gradient descent method:

*repeat until convergence* {

$$\beta_j := \beta_j - \alpha \frac{\partial J}{\partial \beta_j} \qquad \text{(for } j = 0 \text{ and } j = 1\text{)}$$

}

The above procedure is executed repeatedly until the values of $\beta_0$ and $\beta_1$ converge, which means we are getting close to the values where the cost function is minimal. The symbol $\alpha$ is called the learning rate. It always takes a positive value; if it is too large, $\beta_j$ will change drastically and we sometimes cannot get the result we want. But if $\alpha$ is too small, our computation will be very slow, although we are guaranteed to approach the minimum of the cost function.

Finally, we need some Calculus to derive the partial derivatives of the cost function with respect to $\beta_0$ and $\beta_1$. Readers who know Calculus can derive these themselves; we obtain:

$$\frac{\partial J}{\partial \beta_0} = -2\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right),$$

$$\frac{\partial J}{\partial \beta_1} = -2\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)x_i.$$

Let's plug these into the gradient descent algorithm, and we have:

*repeat until convergence* {

$$\beta_0 := \beta_0 + 2\alpha\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right),$$

$$\beta_1 := \beta_1 + 2\alpha\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)x_i$$

}

Let’s take an example. Here is a simple python code to illustrate the idea.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 2.5, 4, 4.5])
b0, b1 = 0.0, 0.1

val_x = np.linspace(1.0, 5.0, 100)
val_y = b0 + b1*val_x

plt.plot(x, y, 'r.')
plt.plot(val_x, val_y, 'b-')
plt.grid(True)
plt.show()
```

Running the above code, we will get the following picture.

The red dots are the data that we are going to fit. The blue line is just one example of $\beta_0$ and $\beta_1$ which is, of course, a poor fit. We are going to fix it using the gradient descent method.

```python
def costfunction(b0, b1, x, y):
    # Sum of squared errors for the current b0, b1.
    total = 0
    for i in range(len(x)):
        total = total + (y[i] - (b0 + b1*x[i]))**2
    return total

def grad(b0, b1, x, y):
    # Gradient of the cost function with respect to b0 and b1.
    sum1, sum2 = 0, 0
    for i in range(len(x)):
        sum1 = sum1 + (y[i] - (b0 + b1*x[i]))
        sum2 = sum2 + (y[i] - (b0 + b1*x[i]))*x[i]
    return [(-2)*sum1, (-2)*sum2]

b0, b1 = 0.0, 0.1
maxiter = 300
alpha = 0.005
c = np.zeros(maxiter)
c[0] = costfunction(b0, b1, x, y)
for i in range(1, maxiter):
    v = grad(b0, b1, x, y)
    b0 = b0 - alpha*v[0]
    b1 = b1 - alpha*v[1]
    c[i] = costfunction(b0, b1, x, y)

print("beta_0 =", b0, "beta_1 =", b1)

val_x = np.linspace(1.0, 5.0, 100)
val_y = b0 + b1*val_x
plt.plot(x, y, 'r.')
plt.plot(val_x, val_y, 'b-')
plt.grid(True)
plt.show()
```

Running the second code above, we get the desired result. From the figure, we see that our fit is close enough to the data.

The theoretical values of $\beta_0$ and $\beta_1$ are 0.1 and 0.9 respectively, but the code above yields only approximate values, obtained by performing 300 iterations.

## Random Numbers in Matlab

In the past two years, I have been supervising bachelor's degree students on their final projects. Unfortunately, most of the projects were related to time series, forecasting, stochastic processes, financial mathematics and many other topics involving applications of statistics in finance. To be honest, this is not my strong suit. As a result, we struggled when reading and applying many statistical techniques, especially those in time series. With these notes, I hope to understand and remember those techniques. Moreover, as I am teaching Time Series Analysis this semester, I believe this post will be very useful for my students.

In this post, I concentrate on how to generate random numbers with certain distributions in Matlab and their application in my time series lecture. I know we could easily google how to do this, but as always, it is very nice to have it ready on my blog. When I forget how to do it, I don't have to google it again, which saves me a few minutes.

**Generating random numbers and introduction to the subplot, histogram and state commands in Matlab**

To generate random numbers in Matlab with certain distributions, I type:

```matlab
x1 = normrnd(mean,std,[m,n]);  %normal distribution
x2 = binornd(N,p,[m,n]);       %binomial distribution
x3 = exprnd(lambda,[m,n]);     %exponential distribution
x4 = gamrnd(a,b,[m,n]);        %Gamma distribution
x5 = rand(m,n);                %uniform distribution
x6 = poissrnd(lambda,[m,n]);   %Poisson distribution
```

Each of the commands above returns an m x n matrix containing pseudorandom values drawn from the normal, binomial, exponential, gamma, uniform and Poisson distributions respectively. Some useful commands related to random number generation in Matlab are:

```matlab
subplot(2,2,1)  %top left figure
hist(x1)        %histogram of normal random numbers
subplot(2,2,2)  %top right figure
hist(x2)        %histogram of binomial random numbers
subplot(2,2,3)  %bottom left figure
hist(x3)        %histogram of exponential random numbers
subplot(2,2,4)  %bottom right figure
hist(x4)        %histogram of Gamma random numbers
```

The above commands produce histograms of the random numbers generated by the previous commands. The output is the following figure.

Another useful command regarding random number generators is the 'state' option. This is very useful if we want to repeat a computation that involves random numbers. The following code helps us understand how to use the state command.

```matlab
clc;
clear;
theta = -0.8;
phi = 0.9;
mu = 0;
sigma = 0.1;
%-----generating random numbers-------------------------
randn('state',100);  %fix the generator state so normrnd is reproducible
e = normrnd(mu,sigma,[100 1]);
%-----generating MA(1) and AR(1) process----------------
y(1) = 0;
z(1) = 1;
for i = 2:length(e)
    y(i) = e(i) - theta*e(i-1);
    z(i) = phi*z(i-1) + e(i);
end
%-----plotting MA(1) and AR(1) process------------------
subplot(2,1,1)
plot(y,'r*-')
subplot(2,1,2)
plot(z,'b*-')
```

Setting the 'state' option in line 8 of the above code allows us to repeat our computation later: we can reproduce today's result exactly tomorrow, even though the code involves random numbers. The result of the above code is plotted in the following figure.
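For comparison, a similar reproducibility trick can be sketched in Python with NumPy, using a seeded generator instead of Matlab's 'state' option:

```python
import numpy as np

# Two generators seeded identically produce identical draws,
# so a computation run today can be reproduced exactly tomorrow.
rng_today    = np.random.default_rng(100)
rng_tomorrow = np.random.default_rng(100)

e_today    = rng_today.normal(loc=0.0, scale=0.1, size=100)
e_tomorrow = rng_tomorrow.normal(loc=0.0, scale=0.1, size=100)

print(np.array_equal(e_today, e_tomorrow))  # True
```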

**Plotting the auto correlation function and introduction to bar command**

In time series analysis, when we have a time series, it is common to plot the sample autocorrelation function (ACF) and compare it to the time series models we have in textbooks. The thing is, students today rely so heavily on statistical software to plot the ACF that they forget how to compute it from first principles. Recall that the sample ACF of an observed series $Y_1, \ldots, Y_n$ is defined as

$$r_k = \frac{\sum_{t=k+1}^{n}\left(Y_t - \bar{Y}\right)\left(Y_{t-k} - \bar{Y}\right)}{\sum_{t=1}^{n}\left(Y_t - \bar{Y}\right)^2}. \qquad (1)$$
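The sample ACF definition above translates directly into a few lines of Python; this is just a sketch, and the short series in the example is a toy input:

```python
def sample_acf(y, max_lag):
    """Sample autocorrelation r_k for k = 1..max_lag, per the definition above."""
    n = len(y)
    y_bar = sum(y) / n
    denom = sum((yt - y_bar) ** 2 for yt in y)
    acf = []
    for k in range(1, max_lag + 1):
        num = sum((y[t] - y_bar) * (y[t - k] - y_bar) for t in range(k, n))
        acf.append(num / denom)
    return acf

print(sample_acf([1, 2, 3, 4, 5], 2))  # r_1 = 0.4, r_2 = -0.1
```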

The following MATLAB code plots the ACF of both time series generated by the code above.

```matlab
sumY = 0;
sumZ = 0;
for i = 1:length(y)
    sumY = sumY + y(i);
    sumZ = sumZ + z(i);
end
ybar = sumY/length(y);
zbar = sumZ/length(z);
sum2Y = 0;
sum2Z = 0;
for i = 1:length(y)
    sum2Y = sum2Y + (y(i)-ybar)^2;
    sum2Z = sum2Z + (z(i)-zbar)^2;
end
for k = 1:length(y)
    sum3Y = 0;
    sum3Z = 0;
    for t = k+1 : length(y)
        sum3Y = sum3Y + (y(t)-ybar)*(y(t-k)-ybar);
        sum3Z = sum3Z + (z(t)-zbar)*(z(t-k)-zbar);
    end
    ry(k) = sum3Y/sum2Y;
    rz(k) = sum3Z/sum2Z;
end
subplot(2,1,1);
bar(ry);
subplot(2,1,2);
bar(rz);
```

The last two commands above plot the ACF of the time series $y$ and $z$ in the bar style that is customary for the ACF, as shown in the following figure.

It does not, however, really show the expected ACF of an MA(1) (top) and an AR(1) (bottom) process. Perhaps this is because I have not yet studied the sampling properties of the ACF estimate (1).

**How to export/import data between Matlab and Excel**

Finally, we are going to write Matlab code to export data to, and import data from, Excel. Again, this is to save myself a few minutes rather than googling around the internet.

```matlab
%--------------write data to Excel-----------------------
y = transpose(y);
z = transpose(z);
filename = 'ma_ar_data.xlsx';
xlswrite(filename,y,1,'A1');
xlswrite(filename,z,1,'B1');
%-------------read data from Excel------------------------
filename = 'ma_ar_data.xlsx';
sheet = 1;
xlRange = 'A1:A100';
dataY = xlsread(filename, sheet, xlRange);
```

The transpose commands above turn the previously generated variables into column vectors.

**Conclusion**

In this post, I have created various codes so that I am able to remember

- generating random numbers
- using subplot
- using histogram and bar
- exporting/importing data to/from Excel
- plotting the ACF
