Monday, August 31, 2015

Data Pre-processing

Download the data (here) and the R codes (here) to replicate the results of the video in this blog. The video is given at the end of the blog.

When we talk about economic forecasting, a-theoretical models (models with no economic theory behind them) often perform better than theoretical models. However, the data in an a-theoretical model are pre-processed differently from the data in a theoretical model. An economist can learn many lessons from machine learning algorithms and can gain a real advantage by using them when forecasting. The entire process of forecasting can be roughly represented in the following diagram.



At first, I would like to discuss data pre-processing. Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data. The data can be pre-processed in the following steps.

1. Transformation (centering, scaling, skewness transformations, transformation to resolve the outliers)
2. Dealing with Missing Variables via data imputation
3. Data Reductions and Feature Extractions
4. Filtering (Removal of Redundant Predictors)
5. Binning Predictors (Development of Dummy Variables)

In this blog, I would like to show you how we can pre-process the data. The dimension of the tutorial data is 190 x 118, i.e. there are 190 observations on 118 variables, and there is no missing data. We will first transform the data to make it more normal, then center and scale it, then filter out the redundant variables and finally perform feature extraction. For this we will use the “caret” and “corrplot” packages.

First, let's load the data and the required packages:
data <- read.csv("Data Preprocessing.csv")   # read the tutorial data set
library("caret")
library("corrplot")

The data can be transformed using the “BoxCox” method or the “YeoJohnson” method. These transformations correct the skewness of the data and can make the data look more normal. But unlike the “BoxCox” method, which requires all the data to be strictly positive, the “YeoJohnson” method also handles negative values; since our data ranges over positive as well as negative values, we will use “YeoJohnson”.

After we make the data look more normal, we will center and scale it. To center the data, we simply subtract the average of each predictor from all of its values, i.e. $\left( {{X}_{i}}-\overline{X} \right)$. After centering, each predictor has zero mean. Similarly, to scale the data, each value of a variable is divided by the standard deviation of that variable, i.e. $\left( \frac{{{X}_{i}}}{\sqrt{\sum\limits_{i=1}^{n}{{{({{X}_{i}}-\overline{X})}^{2}}/N}}} \right)$. After scaling, each predictor has unit standard deviation. Hence, centering and scaling means transforming each variable to zero mean and unit standard deviation. After this, the usual step is to deal with missing values; fortunately we have none. A common approach is a K-nearest-neighbor model, in which a new sample is imputed by finding the samples “closest” to it and averaging these nearby points to fill in the value. However, one should ask why the data are missing and whether such imputation is really necessary.
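For completeness, here is a minimal sketch of how such a K-nearest-neighbor imputation could be requested from caret (not needed for our data, since nothing is missing; k = 5 neighbours is caret's default):

# Illustration only: impute missing values using the 5 nearest neighbours.
# Note that knnImpute also centers and scales the predictors as part of the computation.
imp <- preProcess(data, method = "knnImpute", k = 5)
imputed.data <- predict(imp, data)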

The “YeoJohnson” transformation, centering and scaling of the data can all be done with a single “preProcess” command from the “caret” package. The transformation is then applied to the data using the “predict” command of the same package.

trans <- preProcess(data, method = c("YeoJohnson", "center", "scale"))
trans
trans.data <- predict(trans, data)
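As a quick check (a small sketch, not part of the original code), each transformed predictor should now have approximately zero mean and unit standard deviation:

# After centering and scaling, every column should have mean ~ 0 and sd ~ 1
summary(apply(trans.data, 2, mean))
summary(apply(trans.data, 2, sd))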

After this, to reduce the computational burden, one can filter out and eliminate redundant variables. If two variables are highly correlated, then both variables are measuring almost the same underlying information. Removing one doesn’t compromise the performance of the model and might lead to a more parsimonious and interpretable model. This also suppresses the multicollinearity problem.

At first let’s develop a correlation matrix object called “correlations” using the “cor” command, then let’s find the variables with absolute pair-wise correlations above 0.75 using the “findCorrelation” command of the “caret” package and save them in an object called “highCorr”. Then, at last, let’s remove those variables.

correlations <- cor(trans.data)
highCorr <- findCorrelation(correlations, cutoff = .75)
filter.data<- trans.data[ , -highCorr]
dim(filter.data)
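As a quick sanity check (a small sketch, not in the original code), we can verify that no remaining pair-wise correlation exceeds the cutoff:

# The largest remaining absolute pair-wise correlation should now be below 0.75
remaining <- cor(filter.data)
max(abs(remaining[upper.tri(remaining)]))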

Now we can also do feature extraction by implementing Principal Component Analysis (PCA). PCA is a data reduction technique which generates a smaller set of predictors that seek to capture a majority of the information in the original variables. This method seeks to find linear combinations of the predictors, known as principal components (PCs), which capture the most possible variance.

trans.pca <- preProcess(filter.data, method = "pca")
trans.pca
filter.trans.data <- predict(trans.pca, filter.data)

In the example of this blog, the initial dimension of the data was 190 x 118. We then performed the “YeoJohnson” transformation and centered and scaled the variables. Further, we removed the redundant variables to suppress the multicollinearity problem, after which the dimension of the filtered data was 190 x 59. Finally, we performed PCA and found that 36 PCs explain 95% of the variability of the data, so the dimension of the final data was 190 x 36.
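By default preProcess() keeps enough components to explain 95% of the variance; if a different proportion is wanted, this can be changed through the thresh argument (a small sketch):

# Keep components explaining 99% of the variance instead of the default 95%
trans.pca99 <- preProcess(filter.data, method = "pca", thresh = 0.99)
trans.pca99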

Instead of removing the redundant variables first, try to transform, center, scale and extract the PCs all at once with the following command. You will see that it also extracts 36 PCs.

trans.all <- preProcess(data, method = c("YeoJohnson", "center", "scale", "pca"))
trans.all
trans.all.pca <- predict(trans.all, data)
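To confirm that this one-step version extracts the same number of components, we can simply check the dimension of the resulting data (a small check, not in the original code):

# Should report 190 rows and 36 principal-component columns
dim(trans.all.pca)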

Finally, to generate the correlation plot use the “corrplot” command of the “corrplot” package, and to find the variable loadings, in which the rows correspond to predictor variables and the columns are associated with the components, use the trans.pca$rotation command.

corrplot(correlations, order = "hclust")

The plot looks like following:



Here is the video for more elaboration:



Reference:
Kuhn and Johnson (2013), Applied Predictive Modeling, Springer, New York

Wednesday, August 26, 2015

Partially Linear Models and Use of Spline

Before proceeding to this blog, please kindly watch my previous blog (here). 

Let’s try to fit following model:
\[\log (wage)={{\beta }_{1}}+g(experience)+{{\beta }_{2}}education+{{\beta }_{3}}ethnicity+\varepsilon \]
Here, $g$  is an unknown function which has to be estimated from the data. We will try to fit a spline. A spline is a numeric function that is piecewise-defined by polynomial functions, and which possesses a high degree of smoothness at the places where the polynomial pieces connect (which are known as knots).  Spline with order (or the degree of the piecewise polynomial) 3 represents a cubic spline (see here for more).

First, let’s find the best order of the spline based on the AIC and BIC. You can find the blog (here) on where to use AIC and BIC. We will use the bs() command for the spline, with its order given by the df argument. This command is available in the splines package; however, when we load the AER package, the splines package is automatically loaded.

First, let’s generate an object called “spline.reg” which holds the linear models fitted with splines of df 3 to 20 using the lapply() command. Then let’s extract the AIC and BIC of each model using sapply(spline.reg, AIC) and sapply(spline.reg, BIC), save them in matrix form in objects called “Best.AIC” and “Best.BIC”, and pick the best model as the one with the minimum AIC (or BIC).

library(AER)
data("CPS1988")
spline.reg <- lapply(3:20, function(i)
              lm(log(wage) ~ bs(experience, df = i) + education + ethnicity, 
                  data = CPS1988))

Then we can find the minimum AIC and BIC values using the Best.AIC[which.min(Best.AIC)] and Best.BIC[which.min(Best.BIC)] commands. Finally, we can see the order of the best model. Since the matrix corresponds to df values from 3 to 20, make sure to add 2 to the position of the minimum to find the best order.

Best.AIC=as.matrix(sapply(spline.reg, AIC))
Best.AIC[which.min(Best.AIC)]
Best.AIC

Best.BIC=as.matrix(sapply(spline.reg, BIC))
Best.BIC[which.min(Best.BIC)]
Best.BIC
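Since the list of models starts at df = 3, the position of the minimum plus 2 gives the chosen df directly (a small convenience sketch):

# Convert the position of the minimum back to the actual df (the list starts at df = 3)
which.min(Best.AIC) + 2   # df chosen by AIC
which.min(Best.BIC) + 2   # df chosen by BIC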

As per the AIC the best spline should have df = 15 and as per the BIC the best spline should have df = 5. Let’s develop the partial regressions “p.reg1” and “p.reg2” as follows:

p.reg1 <- lm(log(wage) ~ bs(experience, df = 15) + education + ethnicity, data = CPS1988)
p.reg2 <- lm(log(wage) ~ bs(experience, df = 5) + education + ethnicity, data = CPS1988)

Visualization:

The model has the log of real wage as the dependent variable and experience, education and ethnicity as the independent variables. To plot the log of wage against experience we can use the plot command. However, to visualize the fitted relations, we will take experience from 0 to 60; for the value of education we will use the mean education of the Caucasian ethnicity in the CPS1988 data. This data subset is stored in an object called “data”.

data <- data.frame(experience = 0:60, education = with(CPS1988, mean(education[ethnicity == "cauc"])), ethnicity = "cauc")

Then let’s use multiple regressions which we did in the previous blog (here):

reg1 <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)

Now, let’s predict the log of wage based on the “reg1”, “p.reg1” and “p.reg2” models using the data in “data”.

data$yhat1=predict(reg1, newdata=data)
data$yhat2=predict(p.reg1, newdata=data)
data$yhat3=predict(p.reg2, newdata=data)

Now, let's plot the log of wage against experience using the CPS1988 data. We will use the jitter command on experience to give a sense of density, plot the points in grey and use alpha blending (transparent or opaque depending upon the value of alpha):
plot(log(wage) ~ jitter(experience, factor = 3), pch = 19, col = rgb(0.5, 0.5, 0.5, alpha = 0.02), data = CPS1988)

Now, on this plot, we can draw the quadratic relation obtained from the “reg1” model, stored in the data frame called “data”. Similarly, we can also draw the partial regressions obtained from the “p.reg1” and “p.reg2” models. Let’s do that:

lines(yhat1 ~ experience, data = data, lty = 1, col = 1, lwd = 2)   # quadratic fit (reg1)
lines(yhat2 ~ experience, data = data, lty = 1, col = 2, lwd = 1)   # spline with df = 15 (p.reg1)
lines(yhat3 ~ experience, data = data, lty = 1, col = 4, lwd = 3)   # spline with df = 5 (p.reg2)
legend("topleft", c("quadratic", "spline with df = 15", "spline with df = 5"), 
       lty = c(1, 1, 1), col = c(1, 2, 4), lwd = c(2, 1, 3), 
       bty = "n")


Interpretation

The plot of “reg1” shows only the quadratic relation, whereas “p.reg1” overfits the data and is far away from application in real life. The “p.reg2” model, however, shows that for roughly the first 2-3 years of experience the wage rises quickly, around 7-8 years of experience the wage increases more passively, the wage increases only sluggishly from 8-30 years of experience, and after 30 years of experience the wage tends to fall.

Reference:

C. Kleiber and A. Zeileis (2008), Applied Econometrics with R, Springer Science+Business Media, LLC, New York. DOI: 10.1007/978-0-387-77318-6

Tuesday, August 25, 2015

Multiple Linear Regression and Model Comparison

Let’s install the package called AER using the code install.packages("AER"), then load it into R using the command library(AER). Now, let’s use the CPS1988 data. This data was collected in the Current Population Survey (CPS) by the US Census Bureau. The data is a cross-section on males aged 18 to 70 with positive annual income greater than US$ 50 in 1992 who are not self-employed or working without pay. Wages are deflated by the deflator of personal consumption expenditures for 1992. A summary of the data set can be obtained using the command summary(CPS1988).

Try to answer the following question (I have answered all these questions in the video below):
Which are the quantitative and categorical variables?

The wage is the wage in dollars per week, education and experience are measured in years, and ethnicity is a factor with levels Caucasian ("cauc") and African-American ("afam"). There are three further factors, smsa, region, and parttime, indicating residence in a standard metropolitan statistical area (SMSA), the region within the United States of America, and whether the individual works part-time.
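If you are unsure which variables are quantitative and which are categorical, a quick way to check is to inspect the structure of the data frame (a minimal sketch):

# Numeric columns (wage, education, experience) are quantitative;
# factor columns (ethnicity, smsa, region, parttime) are categorical.
str(CPS1988)
sapply(CPS1988, class)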

Let’s use the model:
\[\log (wage)={{\beta }_{1}}+{{\beta }_{2}}experience+{{\beta }_{3}}experienc{{e}^{2}}+{{\beta }_{4}}education+{{\beta }_{5}}ethnicity+\varepsilon\]
The model is in semilogarithmic form and also has the squared regressor, i.e. $experienc{{e}^{2}}$. Let’s use the lm() command and store the result in the reg1 object. Then, to see the summary of that reg1 object, let’s use the summary(reg1) command.

reg1 <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)
summary(reg1)

Try to answer the following questions:
Are all the signs of the regressors as expected?
Why do we insulate $experienc{{e}^{2}}$ with I() in the formula?
What does the negative sign of $I(experienc{{e}^{2}})$ represent?
Which coefficient shows the return on education? And what does that mean?
What does this coefficient mean and how do we interpret it?

ANOVA 
Let’s do an ANOVA on reg1 using the command anova(reg1). Try to answer the following:
Where can you find the total variance?
Of that total variance, how much is explained by the variables of the model and how much by the residual?
How is the F-ratio calculated?

Comparison of Models by Evaluating ANOVA
Let’s make another model called reg2, the same as reg1 except this time we will drop the ethnicity variable, using the following command:

reg2 = update(reg1, formula = . ~ . - ethnicity)
summary(reg2)

The expression . ~ . - ethnicity specifies to take the LHS and RHS in the formula (signaled by the “.”), only removing ethnicity on the RHS. You can also find ANOVA using anova(reg2).

Now let’s compare model reg2 with reg1 using the command anova(reg1, reg2). This comparison shows how much the residual sum of squares (RSS) increases when ethnicity is dropped and whether that increase is statistically significant. Try to answer the following questions:

Which model is better?
Where can you find the RSS of each model?

Comparison of Models with the Wald Test
Now let’s install the lmtest package using install.packages("lmtest") and load it with library(lmtest). Then, let’s perform a Wald test on reg1, dropping the ethnicity variable.

waldtest(reg1, . ~ . - ethnicity)

Try to answer following question:
Even though the Wald test here gives the same result as the previous ANOVA comparison, why is the Wald test sometimes preferred over the ANOVA comparison?

By the way, the plot of the log of wage against experience looks like the following:
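The plotting code is not reproduced in this post; a minimal sketch that produces a similar figure (jittered experience and semi-transparent grey points) is:

# Scatter plot of log(wage) against experience with jitter and alpha blending
plot(log(wage) ~ jitter(experience, factor = 3), pch = 19,
     col = rgb(0.5, 0.5, 0.5, alpha = 0.02), data = CPS1988)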


Here is the video in which I have answered all those questions:



References:
C. Kleiber and A. Zeileis (2008), Applied Econometrics with R, Springer Science+Business Media, LLC, New York. DOI: 10.1007/978-0-387-77318-6

Saturday, August 1, 2015

Simple Regression in Matrix Notation

Here is the video. Check it before you read.


Here in the table, there are $n=25$ observations of variable $Y$ and variable $X$.


Say we state that $Y$ depends on $X$; in functional form this can be written as $Y=f(X)$. Say such a function is linear; then ${{Y}_{i}}={{\beta }_{0}}+{{\beta }_{1}}{{X}_{i}}$. The true values of the Y-intercept ${{\beta }_{0}}$ and the slope ${{\beta }_{1}}$ are known only by GOD.

A man, however, can estimate them. The estimate of $Y$ is $\hat{Y}$, and ${{\hat{Y}}_{i}}={{\hat{\beta }}_{0}}+{{\hat{\beta }}_{1}}{{X}_{i}}$, where ${{\hat{\beta }}_{0}}$ is the estimate of ${{\beta }_{0}}$ and ${{\hat{\beta }}_{1}}$ is the estimate of ${{\beta }_{1}}$. For the estimate, a man could try to make the sum of deviations between ${{Y}_{i}}$ and ${{\hat{Y}}_{i}}$ zero, i.e. $\sum\limits_{i=1}^{n}{\left( {{Y}_{i}}-{{{\hat{Y}}}_{i}} \right)}=0$. However, since positive and negative deviations cancel, this criterion cannot be interpreted easily; hence a man tries another way: he minimizes the sum of squares of the deviations. Here note that ${{Y}_{i}}={{\hat{\beta }}_{0}}+{{\hat{\beta }}_{1}}{{X}_{i}}+{{\varepsilon }_{i}}$.

Hence the function to minimize, the sum of squared deviations $S$, can be written as:
\[\begin{align}
& S=\sum\limits_{i=1}^{n}{{{({{Y}_{i}}-\hat{Y})}^{2}}} \\
& or,S=\sum\limits_{i=1}^{n}{{{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}^{2}}}\to (i) \\
\end{align}\]
For such, we can take first order derivatives w.r.t ${{\hat{\beta }}_{0}}$and ${{\hat{\beta }}_{1}}$ and set them to zero. At first let’s take first order derivatives w.r.t ${{\hat{\beta }}_{0}}$

\[\begin{align}
  & \frac{\partial S}{\partial {{{\hat{\beta }}}_{0}}}=\frac{\partial \sum\limits_{i=1}^{n}{{{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}^{2}}}}{\partial {{{\hat{\beta }}}_{0}}} \\
 & or,\frac{\partial S}{\partial {{{\hat{\beta }}}_{0}}}=\frac{\partial \sum\limits_{i=1}^{n}{{{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}^{2}}}}{\partial ({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}\frac{\partial ({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}{\partial {{{\hat{\beta }}}_{0}}} \\
 & or,\frac{\partial S}{\partial {{{\hat{\beta }}}_{0}}}=2\sum\limits_{i=1}^{n}{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}(-1) \\
 & or,\frac{\partial S}{\partial {{{\hat{\beta }}}_{0}}}=-2\sum\limits_{i=1}^{n}{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})} \\
\end{align}\]
Let’s set the above equation to zero
\[\begin{align}
  & 0=-2\sum\limits_{i=1}^{n}{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})} \\
 & or,\sum\limits_{i=1}^{n}{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}=0 \\
 & or,\sum\limits_{i=1}^{n}{{{Y}_{i}}}-\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{0}}-}\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{1}}{{X}_{i}}}=0 \\
 & \therefore \sum\limits_{i=1}^{n}{{{Y}_{i}}}=n{{{\hat{\beta }}}_{0}}+{{{\hat{\beta }}}_{1}}\sum\limits_{i=1}^{n}{{{X}_{i}}}\to (ii) \\
\end{align}\]
Let’s take first order derivatives w.r.t ${{\hat{\beta }}_{1}}$
\[\begin{align}
  & \frac{\partial S}{\partial {{{\hat{\beta }}}_{1}}}=\frac{\partial \sum\limits_{i=1}^{n}{{{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}^{2}}}}{\partial {{{\hat{\beta }}}_{1}}} \\
 & or,\frac{\partial S}{\partial {{{\hat{\beta }}}_{1}}}=\frac{\partial \sum\limits_{i=1}^{n}{{{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}^{2}}}}{\partial ({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}\frac{\partial ({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}{\partial {{{\hat{\beta }}}_{1}}} \\
 & or,\frac{\partial S}{\partial {{{\hat{\beta }}}_{1}}}=2\sum\limits_{i=1}^{n}{({{Y}_{i}}-{{{\hat{\beta }}}_{0}}-{{{\hat{\beta }}}_{1}}{{X}_{i}})}(-{{X}_{i}}) \\
 & or,\frac{\partial S}{\partial {{{\hat{\beta }}}_{1}}}=-2\sum\limits_{i=1}^{n}{({{Y}_{i}}{{X}_{i}}-{{{\hat{\beta }}}_{0}}{{X}_{i}}-{{{\hat{\beta }}}_{1}}X_{i}^{2})} \\
\end{align}\]
Let’s set the above equation to zero
\[\begin{align}
  & 0=-2\sum\limits_{i=1}^{n}{({{Y}_{i}}{{X}_{i}}-{{{\hat{\beta }}}_{0}}{{X}_{i}}-{{{\hat{\beta }}}_{1}}X_{i}^{2})} \\
 & or,\sum\limits_{i=1}^{n}{({{Y}_{i}}{{X}_{i}}-{{{\hat{\beta }}}_{0}}{{X}_{i}}-{{{\hat{\beta }}}_{1}}X_{i}^{2})}=0 \\
 & or,\sum\limits_{i=1}^{n}{{{Y}_{i}}{{X}_{i}}}-\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{0}}{{X}_{i}}}-\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{1}}X_{i}^{2}}=0 \\
 & \therefore \sum\limits_{i=1}^{n}{{{Y}_{i}}{{X}_{i}}}=\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{0}}{{X}_{i}}}+\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{1}}X_{i}^{2}}\to (iii) \\
\end{align}\]
Let’s consider equation (ii) and (iii) and expressed them in matrix form:
\[\begin{align}
  & \sum\limits_{i=1}^{n}{{{Y}_{i}}}=n{{{\hat{\beta }}}_{0}}+{{{\hat{\beta }}}_{1}}\sum\limits_{i=1}^{n}{{{X}_{i}}} \\
 & \sum\limits_{i=1}^{n}{{{Y}_{i}}{{X}_{i}}}=\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{0}}{{X}_{i}}}+\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{1}}X_{i}^{2}} \\
 & or,\left[ \begin{matrix}
   \sum\limits_{i=1}^{n}{{{Y}_{i}}}  \\
   \sum\limits_{i=1}^{n}{{{Y}_{i}}{{X}_{i}}}  \\
\end{matrix} \right]=\left[ \begin{matrix}
   n{{{\hat{\beta }}}_{0}}+{{{\hat{\beta }}}_{1}}\sum\limits_{i=1}^{n}{{{X}_{i}}}  \\
   \sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{0}}{{X}_{i}}}+\sum\limits_{i=1}^{n}{{{{\hat{\beta }}}_{1}}X_{i}^{2}}  \\
\end{matrix} \right] \\
 & \therefore \left[ \begin{matrix}
   \sum\limits_{i=1}^{n}{{{Y}_{i}}}  \\
   \sum\limits_{i=1}^{n}{{{Y}_{i}}{{X}_{i}}}  \\
\end{matrix} \right]=\left[ \begin{matrix}
   n & \sum\limits_{i=1}^{n}{{{X}_{i}}}  \\
   \sum\limits_{i=1}^{n}{{{X}_{i}}} & \sum\limits_{i=1}^{n}{X_{i}^{2}}  \\
\end{matrix} \right]\left[ \begin{matrix}
   {{{\hat{\beta }}}_{0}}  \\
   {{{\hat{\beta }}}_{1}}  \\
\end{matrix} \right]\to (iv) \\
\end{align}\]
Now we can solve equation (iv), which is known as the normal equation. This is the traditional algebraic form and can be solved as such. However, I wish to introduce the matrix form here, so let’s develop some matrix notation.

Let’s consider matrix now.

Y: ${{\left[ \begin{matrix}
   10  \\
   11  \\
   12  \\
   \vdots   \\
   12  \\
\end{matrix} \right]}_{25*1}}={{\left[ \begin{matrix}
   {{Y}_{1}}  \\
   {{Y}_{2}}  \\
   {{Y}_{3}}  \\
   \vdots   \\
   {{Y}_{25}}  \\
\end{matrix} \right]}_{25*1}}$. This has 25 rows and 1 column. Say the matrix of ones can be given as I =${{\left[ \begin{matrix}
   1  \\
   1  \\
   1  \\
   \vdots   \\
   1  \\
\end{matrix} \right]}_{25*1}}$. For matrix of X we will column wise concatenate the matrix I with observations of X. X =${{\left[ \begin{matrix}
   \begin{matrix}
   1  \\
   1  \\
   \vdots   \\
   1  \\
   1  \\
\end{matrix} & \begin{matrix}
   35  \\
   29  \\
   \vdots   \\
   33  \\
   28  \\
\end{matrix}  \\
\end{matrix} \right]}_{25*2}}={{\left[ \begin{matrix}
   \begin{matrix}
   1  \\
   1  \\
   \vdots   \\
   1  \\
   1  \\
\end{matrix} & \begin{matrix}
   {{X}_{1}}  \\
   {{X}_{2}}  \\
   \vdots   \\
   {{X}_{24}}  \\
   {{X}_{25}}  \\
\end{matrix}  \\
\end{matrix} \right]}_{25*2}}$. The matrix of estimates can be given as: B = ${{\left[ \begin{matrix}
   {{{\hat{\beta }}}_{0}}  \\
   {{{\hat{\beta }}}_{1}}  \\
\end{matrix} \right]}_{2*1}}$ and matrix of errors can be represented as: Є =${{\left[ \begin{matrix}
   {{\varepsilon }_{1}}  \\
   {{\varepsilon }_{2}}  \\
   \vdots   \\
   {{\varepsilon }_{24}}  \\
   {{\varepsilon }_{25}}  \\
\end{matrix} \right]}_{25*1}}$. Then equation ${{Y}_{i}}={{\hat{\beta }}_{0}}+{{\hat{\beta }}_{1}}{{X}_{i}}+{{\varepsilon }_{i}}$ can be expressed as: Y = XB + Є. Here, XB is not the same as BX.

Let’s consider the transposition. 


Y’ = ${{\left[ \begin{matrix}
   \begin{matrix}
   10 & 11 & 12 & \cdots  & 12  \\
\end{matrix}  \\
\end{matrix} \right]}_{1*25}}={{\left[ \begin{matrix}
   \begin{matrix}
   {{Y}_{1}} & {{Y}_{2}} & {{Y}_{3}} & \cdots  & {{Y}_{25}}  \\
\end{matrix}  \\
\end{matrix} \right]}_{1*25}}$ ,

I’ =${{\left[ \begin{matrix}
   \begin{matrix}
   1 & 1 & 1 & \cdots  & 1  \\
\end{matrix}  \\
\end{matrix} \right]}_{1*25}}$,

X’ = ${{\left[ \begin{matrix}
   1 & 1 & \cdots  & 1 & 1  \\
   35 & 29 & \cdots  & 33 & 28  \\
\end{matrix} \right]}_{2*25}}={{\left[ \begin{matrix}
   1 & 1 & \cdots  & 1 & 1  \\
   {{X}_{1}} & {{X}_{2}} & \cdots  & {{X}_{24}} & {{X}_{25}}  \\
\end{matrix} \right]}_{2*25}}$ ,

B’= ${{\left[ \begin{matrix}
   {{{\hat{\beta }}}_{0}} & {{{\hat{\beta }}}_{1}}  \\
\end{matrix} \right]}_{1*2}}$ ,

Є’ = ${{\left[ \begin{matrix}
   {{\varepsilon }_{1}} & {{\varepsilon }_{2}} & \cdots  & {{\varepsilon }_{24}} & {{\varepsilon }_{25}}  \\
\end{matrix} \right]}_{1*25}}$ .

Now, Є’Є = $\varepsilon _{1}^{2}+\varepsilon _{2}^{2}+\cdots +\varepsilon _{24}^{2}+\varepsilon _{25}^{2}=\sum\limits_{i=1}^{n}{\varepsilon _{i}^{2}}$ . Similarly, Y’Y = $\sum\limits_{i=1}^{n}{Y_{i}^{2}}$,

I’Y = $\sum\limits_{i=1}^{n}{{{Y}_{i}}}$  and

Y’II’Y/n = $\frac{{{\left( \sum\limits_{i=1}^{n}{{{Y}_{i}}} \right)}^{2}}}{n}$.

Let’s consider
X’X =
${{\left[ \begin{matrix}
   1 & 1 & \cdots  & 1 & 1  \\
   {{X}_{1}} & {{X}_{2}} & \cdots  & {{X}_{24}} & {{X}_{25}}  \\
\end{matrix} \right]}_{2*25}}{{\left[ \begin{matrix}
   \begin{matrix}
   1  \\
   1  \\
   \vdots   \\
   1  \\
   1  \\
\end{matrix} & \begin{matrix}
   {{X}_{1}}  \\
   {{X}_{2}}  \\
   \vdots   \\
   {{X}_{24}}  \\
   {{X}_{25}}  \\
\end{matrix}  \\
\end{matrix} \right]}_{25*2}}=\left[ \begin{matrix}
   n & \sum\limits_{i=1}^{n}{{{X}_{i}}}  \\
   \sum\limits_{i=1}^{n}{{{X}_{i}}} & \sum\limits_{i=1}^{n}{X_{i}^{2}}  \\
\end{matrix} \right]\to (v)$

Let’s consider X’Y =

${{\left[ \begin{matrix}
   1 & 1 & \cdots  & 1 & 1  \\
   {{X}_{1}} & {{X}_{2}} & \cdots  & {{X}_{24}} & {{X}_{25}}  \\
\end{matrix} \right]}_{2*25}}{{\left[ \begin{matrix}
   {{Y}_{1}}  \\
   {{Y}_{2}}  \\
   {{Y}_{3}}  \\
   \vdots   \\
   {{Y}_{25}}  \\
\end{matrix} \right]}_{25*1}}=\left[ \begin{matrix}
   \sum\limits_{i=1}^{n}{{{Y}_{i}}}  \\
   \sum\limits_{i=1}^{n}{{{X}_{i}}{{Y}_{i}}}  \\
\end{matrix} \right]\to (vi)$

If we consider equation (iv)
\[\left[ \begin{matrix}
   \sum\limits_{i=1}^{n}{{{Y}_{i}}}  \\
   \sum\limits_{i=1}^{n}{{{Y}_{i}}{{X}_{i}}}  \\
\end{matrix} \right]=\left[ \begin{matrix}
   n & \sum\limits_{i=1}^{n}{{{X}_{i}}}  \\
   \sum\limits_{i=1}^{n}{{{X}_{i}}} & \sum\limits_{i=1}^{n}{X_{i}^{2}}  \\
\end{matrix} \right]\left[ \begin{matrix}
   {{{\hat{\beta }}}_{0}}  \\
   {{{\hat{\beta }}}_{1}}  \\
\end{matrix} \right]\] then,

X’Y = X’XB holds. Now, to solve for B, we can write
B = (X’X)⁻¹X’Y
where (X’X)⁻¹ is the inverse matrix of X’X.

Interestingly,
Ŷ = XB and, putting in the value of B, we get
Ŷ = X(X’X)⁻¹X’Y. Let’s write X(X’X)⁻¹X’ = H, the hat matrix. The dimension of H is n*n. Then
Ŷ = HY

The hat matrix H is symmetric and idempotent, i.e. HH = H.

The residuals Є, like the fitted values Ŷ, can be expressed as linear combinations of the response variable observations Y:

Є = Y - Ŷ = Y – HY = (I – H)Y, where I here denotes the n*n identity matrix.

Now, the matrix normal equation can be derived by minimizing S = Є’Є = (Y - XB)'(Y - XB) with respect to B.
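To close, here is a minimal sketch in R, using simulated data (the 25 observations from the table above are not reproduced here), that verifies these matrix results: it computes B = (X’X)⁻¹X’Y and the hat matrix by hand and compares them with lm().

# Simulated data standing in for the 25 observations of X and Y
set.seed(1)
n <- 25
x <- round(runif(n, 20, 40))
y <- 5 + 0.3 * x + rnorm(n)

X <- cbind(1, x)                          # design matrix: column of ones and X
B <- solve(t(X) %*% X) %*% t(X) %*% y     # B = (X'X)^(-1) X'Y
B
coef(lm(y ~ x))                           # should match B

H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix H = X(X'X)^(-1)X'
all.equal(H %*% H, H)                     # idempotent: HH = H
yhat <- H %*% y                           # fitted values: Y-hat = HY
all.equal(as.numeric(yhat), as.numeric(fitted(lm(y ~ x))))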