### Archive

Archive for the ‘time series’ Category

## Derivation of simple linear regression formula

Finally, I have a time to post another mathematics blog post after absence for almost two years. This time I will discuss formula of simple linear regression. Suppose we have a set of data as follow :

$(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$.

We are going to fit those points using a linear equation $Y = \beta_0 + \beta_1 X$. This classical problem is known as a simple linear regression and is usually taught in elementary statistics class around the world. However, due to the rise of computer, students are only given the formula to compute the best estimation of $\beta_0$ and $\beta_1$ without the knowledge on how to derive them. These formula are:

$\beta_1 = \displaystyle{\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}$, and

$\beta_0 = \displaystyle{\bar{y}-\beta_1 \bar{x}}$

(1)

where $\bar{x}, \bar{y}$ are the average of $x_i$ and $y_i$ respectively. In this post, I am going to derive these formula using four methods. The goal is always the same, this note will serve as a reminder for me when I forget how to derive these formula. If you learn statistics, these are many statistical assumptions such that this model and this formula work, but we are not going to discuss that here.

Calculus

First, let us take a look at the following figure

We see that there are five points that are fitted using a line. To derive $\beta_0$ and $\beta_1$ using a Calculus, we have to define the so-called error function, $E(\beta_0, \beta_1)$. The error function is defined as the sum of square error of each data. The formula is given as follow,

$E(\beta_0,\beta_1)=\displaystyle{\sum_{i=1}^n (y_i - \hat{y}_i)^2=\sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2}$,

where $\hat{y}_i$ is the prediction of the $y$ coordinate using the fitted line. If we take a look at the figure above, the error function is the sum of squares of all the red lines. It is obvious that whenever $\beta_0$ and $\beta_1$ change, the error function changes. Therefore we have to find $\beta_0$ and $\beta_1$ such that the error function is minimal. This is a minimization problem and in Calculus we already know about how to minimize function of two variables. First we have to find the first partial derivative of $E$ with respect to $\beta_0$ and $\beta_1$ and equate them to zero, i.e.

$\frac{\partial E}{\partial \beta_0 }=0$ and $\frac{\partial E}{\partial \beta_1 }=0$.

And we have to find $\beta_o$ and $\beta_1$ that satistfy the above equations. Second, we have to check whether or not $\beta_o$ and $\beta_1$ is the minimum or the maximum point or it could be a saddle point. However, our error function $E$ is a convex function, therefore the solution of the first derivative equation always yields a minimum point. Let us find out what is the expression of the first derivative of $E$,

$\frac{\partial E}{\partial \beta_0} = \sum_{i=1}^n (-2)(y_i-(\beta_0+\beta_1 x_i)) =0$ and $\frac{\partial E}{\partial \beta_1} = \sum_{i=1}^n (-2x_i)(y_i-(\beta_0+\beta_1 x_i)) =0$.

Tidying up the above two equations we will get :

$\sum_{i=1}^n y_i = \beta_0 n + \beta_1 \sum_{i=1}^n x_i$, and $\sum_{i=1}^n x_iy_i = \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2$,

which are a system of linear equations with two variables. We can solve the above linear equations using Cramer’s rule or any other methods, and we should get the following solution :

$\beta_0 = \frac{(\sum_{i=1}^n y_i)(\sum_{i=1}^n x_i^2)-(\sum_{i=1}^n x_i)(\sum_{i=1}^n x_iy_i)}{n(\sum_{i=1}^n x_i^2)-(\sum_{i=1}^n x_i)^2}$ and

$\beta_1 = \frac{n(\sum_{i=1}^n x_iy_i)-(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n(\sum_{i=1}^n x_i^2)-(\sum_{i=1}^n x_i)^2}$.

With a little bit of algebra manipulation, the above form is the same as the one in (1). And this is the end of the Calculus part.

Statistics

In this approach, we assume that one is familiar with the concept of mathematical statistics such as random variables, expectations, variances and co-variances. Suppose $X$ and $Y$ are random variables with mean $\mu_X$ and $\mu_Y$ respectively. Random variable $X$ and $Y$ also have variances $\sigma_X^2$ and $\sigma_Y^2$ and they also have co-variance $Cov(X, Y)$.

We want to find $\beta_0$ and $\beta_1$ such that random variable $Y$ and $\beta_0 + \beta_1 X$ is as close as possible. How are going to measure “as close as possible?” Let us define another random variable $Z=(Y-(\beta_0 + \beta_1 X))^2$. We also need to define a function $g$ as a function of $\beta_0, \beta_1$ as follow :

$g(\beta_0, \beta_1) = E(Z) = E[(Y-\beta_0 - \beta_1 X)^2]$.

Therefore, our problem will be similar to our approach in the first section. We need to find $\beta_0$ and $\beta_1$ such that $g$ is minimum or expectation of $Z$ is minimum. Expanding the equation above, we obtain:

$g(\beta_0, \beta_1)=E(Y^2)-2\beta_0E(Y)-2\beta_1E(XY)+ \beta_0^2+2\beta_0\beta_1E(X)+\beta_1^2E(X^2)$.

The above equation is quadratic in $\beta_0$ and $\beta_1$ so the zeros of the first derivatives of the above expression is the minimum point. The first derivatives with respect to $\beta_0$ and $\beta_1$ are :

$\frac{\partial g}{\partial \beta_0} = -2E(Y)+2\beta_0+2\beta_1E(X)=0$,

$\frac{\partial g}{\partial \beta_1} = -2E(XY)+2\beta_0E(X)+2\beta_1E(X^2)=0$,

which we can rewrite as:

$E(Y)=\beta_0+\beta_1E(X)$,

$E(XY)=\beta_0E(X)+\beta_1E(X^2)$,

Multiplying the first equation by $E(X)$ and subtracting it to the second equation, we obtain:

$\beta_1 = \frac{E(XY)-E(X)E(Y)}{E(X^2)-E(X)^2} = \frac{Cov(X,Y)}{\sigma_X^2}$,

and

$\beta_0 = E(Y) - \beta_1 E(X)=\mu_Y - \beta_1 \mu_X$.

The above solution is in the form of random variables. However, what we have is data which is in the form $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$. So, instead of $\mu_X$ we have $\bar{x}$, and instead of $\mu_Y$ we have $\bar{y}$. The same also applies on variances and co-variance, so in the end, the above solution is the same as equation (1). This is the end of the statistics approach.

Linear Algebra

This time, we are going to derive the formula using linear algebra approach. Personally, I think this approach is the most elegant of all other approaches. However, I need to say that one must know a little bit about linear algebra first before reading this, otherwise it would not make sense. So, using a set of data $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$, we want to find $\beta_0$ and $\beta_1$. Plugging all the data into the linear equation $y = \beta_0 + \beta_1 x$, we will obtain:

$\beta_0 + \beta_1 x_1 = y_1$,
$\beta_0 + \beta_1 x_2 = y_2$,
$\vdots$
$\beta_0 + \beta_1 x_n = y_n$.

The above systems can be rewritten in matrix form, as follow:

$\left( \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array}\right) \left( \begin{array}{c} \beta_0 \\ \beta_1 \end{array}\right)= \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right)$,

or

$Ax =w$,

where $A$ is a $n \times 2$ matrix, $x = (\beta_0,\beta_1)^T$ and $w$ is a vector containing $y_1, \dots, y_n$. In linear algebra, the equation above is known as the over-determined linear equation. It may have a solution but mostly it does not have one. One method to solve such system is using the theory of projection. Suppose we have a linear equation $Ax=w$. According to the theory of linear algebra, this equation has a solution only if $w$ is inside the column space of $A$. In the above case, where we want to find $\beta_0$ and $\beta_1$, our $w$ is clearly not in the column space of $A$.

Therefore, to find the solution of the linear equation, we must bring $w$ to the column space of matrix $A$. We are going to project $w$ to the column space of $A$. We can do many different projections but according to the theory of linear algebra, we must do an orthogonal projections such that our estimation of $\beta_0$ and $\beta_1$ will bring $Ax$ “closest” to $w$. Let us now discuss the projection theory first.

Consider the following figure.

Suppose we want to project $w$ perpendicularly to the vector space $V$. Suppose $w'$ is the result of the orthogonal projection and $V$ is the column space of $A$. Then the linear equation $Ax=w'$ has a solution since $w'$ is inside the column space of $A$. Consider $V^{\perp}$ which is the orthogonal complement of $V$. Clearly, vector $w-w'$ is inside this space. From the theory of linear algebra, we know that if $V$ is the column space of $A$, $V^{\perp}$ is the null space of $A^T$. Therefore we have $A^T (w-w')=0$, or $A^tw'=A^Tw$. Since we know that $w' = Ax$ then we have $A^TA x = A^T w$. If the matrix $A^TA$ is invertible (mostly it is in our case) we can find the solution of our problem which is $x = (A^TA)^{-1}A^T w$.

That’s it, we have done it. If we substitute back every variable to our original variable, we will have our solution, i.e.

$A = \left( \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array}\right), x= \left( \begin{array}{c} \beta_0 \\ \beta_1 \end{array}\right), w = \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right)$.

With a little bit of algebra (again), we can easily show that the formula $x=(A^TA)^{-1}A^Tw$ is the same as the equation (1).

Numerical approach

Our last approach is using the theory of numerical methods. Since this approach is numerical, we are not going to get the solution as in the equation (1). This approach will only try to approximate it. This approach will use some of the result obtained in the Calculus approach, which is the error function i.e.,

$E(\beta_0,\beta_1)=\displaystyle{\sum_{i=1}^n (y_i - \hat{y}_i)^2=\sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2}$.

However, we are not going to label it as an error function, instead we are going to label it as a cost function. We say this function as a cost function because every time we pick the wrong $\beta_0$ and $\beta_1$ there is a cost that we have to pay. So we have to find the right $\beta_0$ and $\beta_1$ such that the cost is minimum.

There are many numerical methods to solve such a problem, but at this time we are going to discuss only one methods, i.e. gradient descent. The basic idea is we start with arbitrary values of $\beta_0$ and $\beta_1$ and we are going to continually change the value of $\beta_0$ and $\beta_1$ in such a way so that the cost function decreases. Hopefully, we end up at the value where the cost function is minimum.

The following is the general procedure of the gradient descent method :

repeat until convergence {

$\beta_j := \beta_j - \alpha \frac{\partial}{\partial \beta_j} E(\beta_0,\beta_1)$     (for j=0 and j=1)

}

The above procedure will be executed continually until the value of $\beta_0$ and $\beta_1$ convergent, which means that we are getting close to the value where the cost function is minimum. The symbol $\alpha$ is called the learning rate. It always takes a positive value and if this variable is too large, $\beta_j$ will change drastically and we sometimes cannot get the result that we want. But if $\alpha$ is too small, our computation will be very slow, although it is guaranteed that we will get the minimum cost function.

Finally, we need to know some Calculus to derive the partial derivative of the cost function with respect to $\beta_0$ and $\beta_1$. For the readers who know Calculus, they can derive themselves, and we will obtain :

$\frac{\partial }{\partial \beta_0}E=-2\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i))$,

$\frac{\partial }{\partial \beta_1}E=-2\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i))x_i$.

Let’s plug this into the gradient descent algorithm and we will have:

repeat until convergence {

$\beta_0 := \beta_0 - \alpha (-2\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i)))$

$\beta_1 := \beta_1 - \alpha (-2\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i))x_i)$

}

Let’s take an example. Here is a simple python code to illustrate the idea.

import math
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 2.5, 4, 4.5])
b0,b1 = 0.0, 0.1
val_x = np.linspace(1.0,5.0,100)
val_y = b0+b1*val_x
plt.plot(x,y,'r.')
plt.plot(val_x,val_y,'b-')
plt.grid(True)
plt.show()


Running the above code, we will get the following picture.

The red dots are the data that we are going to fit. The blue line is just one example of $\beta_0$ and $\beta_1$ that of course, is a poor fit. We are going to fix it using gradient descent method.

def costfunction(b0,b1,x,y):
sum = 0
n = len(x)
for i in range(0,n):
sum = sum + (y[i]-(b0+b1*x[i]))**2
return sum
n = len(x)
sum1,sum2 = 0,0
for i in range(0,n):
sum1 = sum1 + (y[i]-(b0+b1*x[i]))
sum2 = sum2 + (y[i]-(b0+b1*x[i]))*x[i]
return [(-2)*sum1, (-2)*sum2]
b0,b1 = 0.0, 0.1
maxiter = 300
alpha = 0.005
c = np.zeros(maxiter)
c[0] = costfunction(b0,b1,x,y)
for i in range(1,maxiter):
temp1 = b0 - alpha*v[0]
temp2 = b1 - alpha*v[1]
b0 = temp1
b1 = temp2
c[i] = costfunction(b0,b1,x,y)
print("$\beta_0$ = ",b0,"$\beta_1$ = ",b1)
val_x = np.linspace(1.0,5.0,100)
val_y = b0+b1*val_x
plt.plot(x,y,'r.')
plt.plot(val_x,val_y,'b-')
plt.grid(True)
plt.show()


Running the second code above, we would get the desired result.  From the figure above, we see that our fit is close enough with our data.

The theoretical value of $\beta_0$ and $\beta_1$ are 0.1 and 0.9 respectively, but in the code above we will obtain only approximation value that was obtained by performing 300 iterations.

## Random Numbers in Matlab

May 23, 2014 3 comments

In the past two years, I have been supervising bachelor degree students for their final projects. Unfortunately, most of the projects were related to time series, forecasting, stochastics process, financial mathematics and many topics related to application to statistics in finance. To be honest, this is not my strong suit. As a result, we struggled when reading and using many statistical techniques especially the one in time series. With this notes, I hope to understand and to keep remember those techniques. Moreover, as I am teaching Time Series Analysis this semester, I believe this post will be very useful for my students.

In this post, I concentrate on how to generate random numbers with certain distributions in Matlab and its application in my time series lecture. I know that we can easily google on how to do these, but as always, it would be very nice to have this ready in my blog. When I forget to do this, I don’t have to google it again and save myself a few minutes.

Generating random numbers and introduction to subplot, histogram, state commands in matlab

To generate random numbers in Matlab with certain distributions, I  need to type

x1 = normrnd(mean,std,[m,n]); %normal distribution x2 = binornd(N,p,[m,n]); %binomial distribution x3 = exprnd(lambda,[m,n]); %exponential distribution x4 = gamrnd(a,b,[m,n]); %Gamma distribution x5 = rand(m,n); %uniform distribution x6 = poissrnd(lambda,[m,n]); %Poisson distribution

The commands above returns an m x n matrix containing pseudorandom values drawn from normal, binomial, exponential, gamma, uniform and Poisson distributions respectively. Some usefull commands related to random number generator in Matlab are

subplot(2,2,1) %top left figure hist(x1) %histogram of normal random numbers subplot(2,2,2) %top right figure hist(x2) %histogram of binomial random numbers subplot(2,2,3) %bottom left figure hist(x3) %histogram of exponential random numbers subplot(2,2,4) %bottom right figure hist(x4) %histogram of Gamma random numbers

The above commands return histograms of random numbers generated by the previous commands.  The output of those commands is the following figure.

Another useful command regarding random number generators is the ‘state’ option. This is very usefull if we want to repeat our computation that involves random number. The following code will help us understand how to use the state command.

clc; clear; theta = -0.8; phi = 0.9; mu = 0; sigma = 0.1; %-----generating random numbers------------------------- normrnd('state',100); e = normrnd(mu,sigma,[100 1]); %-----generating MA(1) and AR(1) process---------------- y(1) = 0; z(1) = 1; for i = 2:length(e) y(i) = e(i) - theta*e(i-1); z(i) = phi*z(i-1) + e(i); end %-----plotting MA(1) and AR(1) process---------------- subplot(2,1,1) plot(y,'r*-') subplot(2,1,2) plot(z,'b*-')

Using the above command, especially when we write the ‘ state’ option in line 8, it allows us to repeat our computation later. We can show the result of our computation tomorrow exactly the same as our computation we conduct today, even though our code involves a random number.  The result of the above command is plotted in the following figure.

Plotting the auto correlation function and introduction to bar command

In time series analysis, when we have a time series, it is common to plot the sample auto correlation function (ACF) to be compared to time series models we have in the textbooks. The thing is today, our students now rely heavily on statistical softwares to plot the ACF so that they forget how to plot it in the beginning.  Recall that the sample ACF of the observed series $Y_1, Y_2, \dots, Y_n$ is defined as

(1)——–$\displaystyle{r_k = \frac{\sum_{t=k+1}^{n}(Y_t - \bar{Y})(Y_{t-k}-\bar{Y})}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2}}$

The following code shows a MATLAB code that will plot the ACF of both time series generated by the code above.

sumY = 0; sumZ = 0; for i = 1:length(y) sumY = sumY + y(i); sumZ = sumZ + z(i); end ybar = sumY/length(y); zbar = sumZ/length(z); sum2Y = 0; sum2Z = 0; for i = 1:length(y) sum2Y = sum2Y + (y(i)-ybar)^2; sum2Z = sum2Z + (z(i)-zbar)^2; end for k = 1:length(y) sum3Y = 0; sum3Z = 0; for t = k+1 : length(y) sum3Y = sum3Y + (y(t)-ybar)*(y(t-k)-ybar); sum3Z = sum3Z + (z(t)-zbar)*(z(t-k)-zbar); end ry(k) = sum3Y/sum2Y; rz(k) = sum3Z/sum2Z; end subplot(2,1,1); bar(ry); subplot(2,1,2); bar(rz);

The last two commands above will plot the ACF of time series $y$ and $z$ in the bar style as it is always for the ACF, as shown in the following figure.

It doesn’t show, really, the supposed to be acf of MA(1) (above) and AR(1) (below) process. Perhaps this is I have not learned  the sample acf measurement (1).

How to export/import data between Matlab and Excell

Finally, we are going to write a code in Matlab on how to export or import data to/from Excell in Matlab. Again, this is to save myself a few minutes rather than googling it around the internet.

%--------------write data to excell----------------------- y = transpose(y); z = transpose(z); filename = 'ma_ar_data.xlsx'; xlswrite(filename,y,1,'A1'); xlswrite(filename,z,1,'B1'); %-------------read data from excell---------------------- filename = 'ma_ar_data.xlsx'; sheet = 1; xlRange = 'A1:A100'; dataY = xlsread(filename, sheet, xlRange);

The transpose command above puts the variable $y$ generated before to be a column vector.

Conclusion

In this post, I have created various codes so that I am able to remember

• generating random numbers
• using subplot
• using histogram and bar
• exporting/importing data to/from excell
• plotting the ACF