Archive for the ‘time series’ Category

Derivation of simple linear regression formula

May 20, 2018 Leave a comment

Finally, I have a time to post another mathematics blog post after absence for almost two years. This time I will discuss formula of simple linear regression. Suppose we have a set of data as follow :

(x_1,y_1),  (x_2,y_2), \ldots, (x_n,y_n).

We are going to fit those points using a linear equation Y = \beta_0 + \beta_1 X. This classical problem is known as a simple linear regression and is usually taught in elementary statistics class around the world. However, due to the rise of computer, students are only given the formula to compute the best estimation of \beta_0 and \beta_1 without the knowledge on how to derive them. These formula are:

\beta_1 = \displaystyle{\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}} , and

\beta_0 = \displaystyle{\bar{y}-\beta_1 \bar{x}}


where \bar{x}, \bar{y} are the average of x_i and y_i respectively. In this post, I am going to derive these formula using four methods. The goal is always the same, this note will serve as a reminder for me when I forget how to derive these formula. If you learn statistics, these are many statistical assumptions such that this model and this formula work, but we are not going to discuss that here.


First, let us take a look at the following figure

We see that there are five points that are fitted using a line. To derive \beta_0 and \beta_1 using a Calculus, we have to define the so-called error function, E(\beta_0, \beta_1). The error function is defined as the sum of square error of each data. The formula is given as follow,

E(\beta_0,\beta_1)=\displaystyle{\sum_{i=1}^n (y_i - \hat{y}_i)^2=\sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2},

where \hat{y}_i is the prediction of the y coordinate using the fitted line. If we take a look at the figure above, the error function is the sum of squares of all the red lines. It is obvious that whenever \beta_0 and \beta_1 change, the error function changes. Therefore we have to find \beta_0 and \beta_1 such that the error function is minimal. This is a minimization problem and in Calculus we already know about how to minimize function of two variables. First we have to find the first partial derivative of E with respect to \beta_0 and \beta_1 and equate them to zero, i.e.

\frac{\partial E}{\partial \beta_0 }=0 and \frac{\partial E}{\partial \beta_1 }=0.

And we have to find \beta_o and \beta_1 that satistfy the above equations. Second, we have to check whether or not \beta_o and \beta_1 is the minimum or the maximum point or it could be a saddle point. However, our error function E is a convex function, therefore the solution of the first derivative equation always yields a minimum point. Let us find out what is the expression of the first derivative of E,

\frac{\partial E}{\partial \beta_0} = \sum_{i=1}^n (-2)(y_i-(\beta_0+\beta_1 x_i)) =0 and \frac{\partial E}{\partial \beta_1} = \sum_{i=1}^n (-2x_i)(y_i-(\beta_0+\beta_1 x_i)) =0.

Tidying up the above two equations we will get :

\sum_{i=1}^n y_i = \beta_0 n + \beta_1 \sum_{i=1}^n x_i, and \sum_{i=1}^n x_iy_i = \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2,

which are a system of linear equations with two variables. We can solve the above linear equations using Cramer’s rule or any other methods, and we should get the following solution :

\beta_0 = \frac{(\sum_{i=1}^n y_i)(\sum_{i=1}^n x_i^2)-(\sum_{i=1}^n x_i)(\sum_{i=1}^n x_iy_i)}{n(\sum_{i=1}^n x_i^2)-(\sum_{i=1}^n x_i)^2} and

\beta_1 = \frac{n(\sum_{i=1}^n x_iy_i)-(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n(\sum_{i=1}^n x_i^2)-(\sum_{i=1}^n x_i)^2}.

With a little bit of algebra manipulation, the above form is the same as the one in (1). And this is the end of the Calculus part.


In this approach, we assume that one is familiar with the concept of mathematical statistics such as random variables, expectations, variances and co-variances. Suppose X and Y are random variables with mean \mu_X and \mu_Y respectively. Random variable X and Y also have variances \sigma_X^2 and \sigma_Y^2 and they also have co-variance Cov(X, Y).

We want to find \beta_0 and \beta_1 such that random variable Y and \beta_0 + \beta_1 X is as close as possible. How are going to measure “as close as possible?” Let us define another random variable Z=(Y-(\beta_0 + \beta_1 X))^2. We also need to define a function g as a function of \beta_0, \beta_1 as follow :

g(\beta_0, \beta_1) = E(Z) = E[(Y-\beta_0 - \beta_1 X)^2].

Therefore, our problem will be similar to our approach in the first section. We need to find \beta_0 and \beta_1 such that g is minimum or expectation of Z is minimum. Expanding the equation above, we obtain:

g(\beta_0, \beta_1)=E(Y^2)-2\beta_0E(Y)-2\beta_1E(XY)+ \beta_0^2+2\beta_0\beta_1E(X)+\beta_1^2E(X^2).

The above equation is quadratic in \beta_0 and \beta_1 so the zeros of the first derivatives of the above expression is the minimum point. The first derivatives with respect to \beta_0 and \beta_1 are :

\frac{\partial g}{\partial \beta_0} = -2E(Y)+2\beta_0+2\beta_1E(X)=0,

\frac{\partial g}{\partial \beta_1} = -2E(XY)+2\beta_0E(X)+2\beta_1E(X^2)=0,

which we can rewrite as:



Multiplying the first equation by E(X) and subtracting it to the second equation, we obtain:

\beta_1 = \frac{E(XY)-E(X)E(Y)}{E(X^2)-E(X)^2} = \frac{Cov(X,Y)}{\sigma_X^2},


\beta_0 = E(Y) - \beta_1 E(X)=\mu_Y - \beta_1 \mu_X.

The above solution is in the form of random variables. However, what we have is data which is in the form (x_1,y_1),  (x_2,y_2), \ldots, (x_n,y_n). So, instead of \mu_X we have \bar{x}, and instead of \mu_Y we have \bar{y}. The same also applies on variances and co-variance, so in the end, the above solution is the same as equation (1). This is the end of the statistics approach.

Linear Algebra

This time, we are going to derive the formula using linear algebra approach. Personally, I think this approach is the most elegant of all other approaches. However, I need to say that one must know a little bit about linear algebra first before reading this, otherwise it would not make sense. So, using a set of data (x_1,y_1),  (x_2,y_2), \ldots, (x_n,y_n), we want to find \beta_0 and \beta_1. Plugging all the data into the linear equation y = \beta_0 + \beta_1 x, we will obtain:

\beta_0 + \beta_1 x_1 = y_1,
\beta_0 + \beta_1 x_2 = y_2,
\beta_0 + \beta_1 x_n = y_n.

The above systems can be rewritten in matrix form, as follow:

\left( \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array}\right) \left( \begin{array}{c} \beta_0 \\ \beta_1 \end{array}\right)= \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right),


Ax =w,

where A is a n \times 2 matrix, x = (\beta_0,\beta_1)^T and w is a vector containing y_1, \dots, y_n. In linear algebra, the equation above is known as the over-determined linear equation. It may have a solution but mostly it does not have one. One method to solve such system is using the theory of projection. Suppose we have a linear equation Ax=w. According to the theory of linear algebra, this equation has a solution only if w is inside the column space of A. In the above case, where we want to find \beta_0 and \beta_1, our w is clearly not in the column space of A.

Therefore, to find the solution of the linear equation, we must bring w to the column space of matrix A. We are going to project w to the column space of A. We can do many different projections but according to the theory of linear algebra, we must do an orthogonal projections such that our estimation of \beta_0 and \beta_1 will bring Ax “closest” to w. Let us now discuss the projection theory first.

Consider the following figure.

Suppose we want to project w perpendicularly to the vector space V. Suppose w' is the result of the orthogonal projection and V is the column space of A. Then the linear equation Ax=w' has a solution since w' is inside the column space of A. Consider V^{\perp} which is the orthogonal complement of V. Clearly, vector w-w' is inside this space. From the theory of linear algebra, we know that if V is the column space of A, V^{\perp} is the null space of A^T. Therefore we have A^T (w-w')=0, or A^tw'=A^Tw. Since we know that w' = Ax then we have A^TA x = A^T w. If the matrix A^TA is invertible (mostly it is in our case) we can find the solution of our problem which is x = (A^TA)^{-1}A^T w.

That’s it, we have done it. If we substitute back every variable to our original variable, we will have our solution, i.e.

A = \left( \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array}\right), x= \left( \begin{array}{c} \beta_0 \\ \beta_1 \end{array}\right), w = \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right).

With a little bit of algebra (again), we can easily show that the formula x=(A^TA)^{-1}A^Tw is the same as the equation (1).

Numerical approach

Our last approach is using the theory of numerical methods. Since this approach is numerical, we are not going to get the solution as in the equation (1). This approach will only try to approximate it. This approach will use some of the result obtained in the Calculus approach, which is the error function i.e.,

E(\beta_0,\beta_1)=\displaystyle{\sum_{i=1}^n (y_i - \hat{y}_i)^2=\sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2}.

However, we are not going to label it as an error function, instead we are going to label it as a cost function. We say this function as a cost function because every time we pick the wrong \beta_0 and \beta_1 there is a cost that we have to pay. So we have to find the right \beta_0 and \beta_1 such that the cost is minimum.

There are many numerical methods to solve such a problem, but at this time we are going to discuss only one methods, i.e. gradient descent. The basic idea is we start with arbitrary values of \beta_0 and \beta_1 and we are going to continually change the value of \beta_0 and \beta_1 in such a way so that the cost function decreases. Hopefully, we end up at the value where the cost function is minimum.

The following is the general procedure of the gradient descent method :

repeat until convergence {

\beta_j := \beta_j - \alpha \frac{\partial}{\partial \beta_j} E(\beta_0,\beta_1)     (for j=0 and j=1)


The above procedure will be executed continually until the value of \beta_0 and \beta_1 convergent, which means that we are getting close to the value where the cost function is minimum. The symbol \alpha is called the learning rate. It always takes a positive value and if this variable is too large, \beta_j will change drastically and we sometimes cannot get the result that we want. But if \alpha is too small, our computation will be very slow, although it is guaranteed that we will get the minimum cost function.

Finally, we need to know some Calculus to derive the partial derivative of the cost function with respect to \beta_0 and \beta_1. For the readers who know Calculus, they can derive themselves, and we will obtain :

\frac{\partial }{\partial \beta_0}E=-2\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i)),

\frac{\partial }{\partial \beta_1}E=-2\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i))x_i.

Let’s plug this into the gradient descent algorithm and we will have:

repeat until convergence {

\beta_0 := \beta_0 - \alpha (-2\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i)))

\beta_1 := \beta_1 - \alpha (-2\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i))x_i)


Let’s take an example. Here is a simple python code to illustrate the idea.

import math
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 2.5, 4, 4.5])
b0,b1 = 0.0, 0.1
val_x = np.linspace(1.0,5.0,100)
val_y = b0+b1*val_x

Running the above code, we will get the following picture.

The red dots are the data that we are going to fit. The blue line is just one example of \beta_0 and \beta_1 that of course, is a poor fit. We are going to fix it using gradient descent method.

def costfunction(b0,b1,x,y):
    sum = 0
    n = len(x)
    for i in range(0,n):
        sum = sum + (y[i]-(b0+b1*x[i]))**2
return sum
def grad(b0,b1,x,y):
    n = len(x)
    sum1,sum2 = 0,0
    for i in range(0,n):
        sum1 = sum1 + (y[i]-(b0+b1*x[i]))
        sum2 = sum2 + (y[i]-(b0+b1*x[i]))*x[i]
return [(-2)*sum1, (-2)*sum2]
b0,b1 = 0.0, 0.1
maxiter = 300
alpha = 0.005
c = np.zeros(maxiter)
c[0] = costfunction(b0,b1,x,y)
for i in range(1,maxiter):
    v = grad(b0,b1,x,y)
    temp1 = b0 - alpha*v[0]
    temp2 = b1 - alpha*v[1]
    b0 = temp1
    b1 = temp2
    c[i] = costfunction(b0,b1,x,y)
print("$\beta_0$ = ",b0,"$\beta_1$ = ",b1)
val_x = np.linspace(1.0,5.0,100)
val_y = b0+b1*val_x

Running the second code above, we would get the desired result.  From the figure above, we see that our fit is close enough with our data. 

The theoretical value of \beta_0 and \beta_1 are 0.1 and 0.9 respectively, but in the code above we will obtain only approximation value that was obtained by performing 300 iterations.


Random Numbers in Matlab

May 23, 2014 3 comments

In the past two years, I have been supervising bachelor degree students for their final projects. Unfortunately, most of the projects were related to time series, forecasting, stochastics process, financial mathematics and many topics related to application to statistics in finance. To be honest, this is not my strong suit. As a result, we struggled when reading and using many statistical techniques especially the one in time series. With this notes, I hope to understand and to keep remember those techniques. Moreover, as I am teaching Time Series Analysis this semester, I believe this post will be very useful for my students.

In this post, I concentrate on how to generate random numbers with certain distributions in Matlab and its application in my time series lecture. I know that we can easily google on how to do these, but as always, it would be very nice to have this ready in my blog. When I forget to do this, I don’t have to google it again and save myself a few minutes.

Generating random numbers and introduction to subplot, histogram, state commands in matlab

To generate random numbers in Matlab with certain distributions, I  need to type

x1 = normrnd(mean,std,[m,n]); %normal distribution
x2 = binornd(N,p,[m,n]); %binomial distribution
x3 = exprnd(lambda,[m,n]); %exponential distribution
x4 = gamrnd(a,b,[m,n]); %Gamma distribution
x5 = rand(m,n); %uniform distribution
x6 = poissrnd(lambda,[m,n]); %Poisson distribution

The commands above returns an m x n matrix containing pseudorandom values drawn from normal, binomial, exponential, gamma, uniform and Poisson distributions respectively. Some usefull commands related to random number generator in Matlab are

subplot(2,2,1) %top left figure
hist(x1) %histogram of normal random numbers
subplot(2,2,2) %top right figure
hist(x2) %histogram of binomial random numbers
subplot(2,2,3) %bottom left figure
hist(x3) %histogram of exponential random numbers
subplot(2,2,4) %bottom right figure
hist(x4) %histogram of Gamma random numbers

The above commands return histograms of random numbers generated by the previous commands.  The output of those commands is the following figure.


Another useful command regarding random number generators is the ‘state’ option. This is very usefull if we want to repeat our computation that involves random number. The following code will help us understand how to use the state command.

theta = -0.8;
phi = 0.9;
mu = 0;
sigma = 0.1;
%-----generating random numbers-------------------------
e = normrnd(mu,sigma,[100 1]);
%-----generating MA(1) and AR(1) process----------------
y(1) = 0;
z(1) = 1;
for i = 2:length(e)
y(i) = e(i) - theta*e(i-1);
z(i) = phi*z(i-1) + e(i);
%-----plotting MA(1) and AR(1) process----------------

Using the above command, especially when we write the ‘ state’ option in line 8, it allows us to repeat our computation later. We can show the result of our computation tomorrow exactly the same as our computation we conduct today, even though our code involves a random number.  The result of the above command is plotted in the following figure.


Plotting the auto correlation function and introduction to bar command

In time series analysis, when we have a time series, it is common to plot the sample auto correlation function (ACF) to be compared to time series models we have in the textbooks. The thing is today, our students now rely heavily on statistical softwares to plot the ACF so that they forget how to plot it in the beginning.  Recall that the sample ACF of the observed series Y_1, Y_2, \dots, Y_n is defined as

(1)——–\displaystyle{r_k = \frac{\sum_{t=k+1}^{n}(Y_t - \bar{Y})(Y_{t-k}-\bar{Y})}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2}}

The following code shows a MATLAB code that will plot the ACF of both time series generated by the code above.

sumY = 0;
sumZ = 0;
for i = 1:length(y)
sumY = sumY + y(i);
sumZ = sumZ + z(i);
ybar = sumY/length(y);
zbar = sumZ/length(z);
sum2Y = 0;
sum2Z = 0;
for i = 1:length(y)
sum2Y = sum2Y + (y(i)-ybar)^2;
sum2Z = sum2Z + (z(i)-zbar)^2;
for k = 1:length(y)
sum3Y = 0;
sum3Z = 0;
for t = k+1 : length(y)
sum3Y = sum3Y + (y(t)-ybar)*(y(t-k)-ybar);
sum3Z = sum3Z + (z(t)-zbar)*(z(t-k)-zbar);
ry(k) = sum3Y/sum2Y;
rz(k) = sum3Z/sum2Z;

The last two commands above will plot the ACF of time series y and z in the bar style as it is always for the ACF, as shown in the following figure.




It doesn’t show, really, the supposed to be acf of MA(1) (above) and AR(1) (below) process. Perhaps this is I have not learned  the sample acf measurement (1).

How to export/import data between Matlab and Excell

Finally, we are going to write a code in Matlab on how to export or import data to/from Excell in Matlab. Again, this is to save myself a few minutes rather than googling it around the internet.

%--------------write data to excell-----------------------
y = transpose(y);
z = transpose(z);
filename = 'ma_ar_data.xlsx';
%-------------read data from excell----------------------
filename = 'ma_ar_data.xlsx';
sheet = 1;
xlRange = 'A1:A100';
dataY = xlsread(filename, sheet, xlRange);

The transpose command above puts the variable y generated before to be a column vector.


In this post, I have created various codes so that I am able to remember

  • generating random numbers
  • using subplot
  • using histogram and bar
  • exporting/importing data to/from excell
  • plotting the ACF