Download the data (here) and the R codes (here) to replicate the results shown in the video of this blog. The video is given at the end of the blog.
When it comes to economic forecasting, a-theoretical models (models with no economic theory) often perform better than theoretical models. However, the data for an a-theoretical model are pre-processed differently from the data for a theoretical model. An economist can learn many lessons from machine learning algorithms and can gain a real advantage by using them when forecasting. The entire forecasting process can be roughly represented in the following diagram.
First, I would like to discuss data pre-processing. Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data. The data can be pre-processed in the following steps:
1. Transformation (centering, scaling, skewness transformations, transformation to resolve the outliers)
2. Dealing with Missing Values via Data Imputation
3. Data Reduction and Feature Extraction
4. Filtering (Removal of Redundant Predictors)
5. Binning Predictors (Development of Dummy Variables)
In this blog, I would like to show you how we can pre-process the data. The dimension of the tutorial data is 190 x 118, i.e., there are 190 rows for 118 variables, and there are no missing data. Now let's transform the data to make it more normal, then center and scale it, then filter out the redundant variables, and finally perform the feature extraction. For this we will use the “caret” and “corrplot” packages.
At first, let's load the data and the required packages:
# read the data and load the required packages
data <- read.csv("Data Preprocessing.csv")
library("caret")
library("corrplot")
The data can be transformed using the “BoxCox” or the “YeoJohnson” method. These transformations correct the skewness of the data and can make the data look more normal. However, unlike the “BoxCox” method, which requires all the data to be strictly positive, the “YeoJohnson” method also accepts negative values; we will use “YeoJohnson” because our data contain both positive and negative values.
After we make the data look more normal, we will center and scale them. To center the data, we simply subtract the average of each variable from all of its values, i.e. $\left( {{X}_{i}}-\overline{X} \right)$. After centering, each predictor has zero mean. Similarly, to scale the data, each value of a variable is divided by the standard deviation of that variable, i.e. $\left( \frac{{{X}_{i}}}{\sqrt{\sum\limits_{i=1}^{n}{{{({{X}_{i}}-\overline{X})}^{2}}/n}}} \right)$. After scaling, each predictor has a unit standard deviation. Hence, centering and scaling transform each variable to zero mean and unit standard deviation. After this, the usual step is to deal with missing values; fortunately, we have none. Missing values could be dealt with using a K-nearest-neighbour model, in which a new sample is imputed by finding the samples “closest” to it and averaging these nearby points to fill in the value. However, one should ask why the data are missing and whether such imputation is really necessary.
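As an illustration only (our data have no missing values), centering and scaling can also be done by hand with scale(), and K-nearest-neighbour imputation is available in “caret” through the “knnImpute” method of preProcess; note that “knnImpute” centers and scales the predictors as a side effect. A minimal sketch:
# manual centering and scaling, equivalent to the formulas above
centered.scaled <- scale(data, center = TRUE, scale = TRUE)
# hypothetical K-nearest-neighbour imputation, had there been missing values
imp <- preProcess(data, method = "knnImpute", k = 5)
imputed.data <- predict(imp, data)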
The “YeoJohnson” transformation, centering and scaling of data can be done with one single “preProcess“ command from the “caret” package. Then such transformation can be imposed in data using “predict” command of same package.
# estimate the YeoJohnson, centering and scaling parameters
trans <- preProcess(data, method = c("YeoJohnson", "center", "scale"))
# apply them to the data
trans.data <- predict(trans, data)
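To verify that the pre-processing behaved as described, one can check that each transformed predictor now has (approximately) zero mean and unit standard deviation; for example:
trans                                    # summary of the estimated transformations
round(colMeans(trans.data)[1:3], 3)      # means are (approximately) zero
round(apply(trans.data, 2, sd)[1:3], 3)  # standard deviations are one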
After this, to reduce the computational burden, one can filter out and eliminate redundant variables. If two variables are highly correlated, both are measuring almost the same underlying information. Removing one of them does not compromise the performance of the model and might lead to a more parsimonious and interpretable model. This also suppresses the multicollinearity problem.
First, let's create a correlation matrix object called “correlations” using the “cor” command, then find the variables with absolute pairwise correlations above 0.75 using the “findCorrelation” command of the “caret” package and save them in an object called “highCorr”. Finally, let's remove those variables.
# pairwise correlations of the transformed predictors
correlations <- cor(trans.data)
highCorr <- findCorrelation(correlations, cutoff = .75)
filter.data <- trans.data[ , -highCorr]
dim(filter.data)
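As a quick check (not in the original post) that the filtering worked, one can count how many predictors were flagged and confirm that no remaining pair exceeds the cutoff:
length(highCorr)                                          # number of predictors removed
max(abs(cor(filter.data)[upper.tri(cor(filter.data))]))   # should be below 0.75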
Now we can also perform feature extraction by implementing principal component analysis (PCA). PCA is a data reduction technique which generates a smaller set of predictors that seek to capture a majority of the information in the original variables. The method finds linear combinations of the predictors, known as principal components (PCs), which capture the largest possible variance.
# extract the principal components that explain 95% of the variance
trans.pca <- preProcess(filter.data, method = "pca", thresh = 0.95)
# compute the component scores
filter.trans.data <- predict(trans.pca, filter.data)
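Printing the preProcess object and checking the dimension of the scores confirms how many components were retained; for example:
trans.pca               # reports how many components capture 95% of the variance
dim(filter.trans.data)  # 190 observations by the number of retained PCs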
In the example of this blog, the initial dimension of the data was 190 x 118. We then performed the “YeoJohnson” transformation and centered and scaled the variables. Next, we removed the redundant variables to suppress the multicollinearity problem, which left a filtered data set of dimension 190 x 59. Finally, we performed the PCA and found that 36 PCs explain 95% of the variability of the data, so the dimension of the final data set was 190 x 36.
Instead of filtering out the redundant variables first, try transforming, centering, scaling and extracting the PCs all at once with the following command. You will see that it also extracts 36 PCs.
# transform, center, scale and extract the PCs in one step
trans.all <- preProcess(data, method = c("YeoJohnson", "center", "scale", "pca"))
trans.all
trans.all.pca <- predict(trans.all, data)
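A quick comparison of the two routes (a check I have added, not in the original post) should show that both retain the same number of components:
ncol(filter.trans.data)  # PCs after filtering the redundant predictors
ncol(trans.all.pca)      # PCs from the all-at-once pre-processing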
Finally, to generate the correlation plot, use the “corrplot” command, and to inspect the variable loadings, in which the rows correspond to the predictor variables and the columns to the components, use the trans.pca$rotation command.
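A minimal sketch of both commands (the clustering order and label size are my own cosmetic choices, not from the original post):
# correlation plot of the filtered predictors
corrplot(cor(filter.data), order = "hclust", tl.cex = 0.4)
# variable loadings: rows are predictors, columns are components
head(trans.pca$rotation[, 1:5])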
The plot looks like the following:
Here is the video for more elaboration:
Reference:
Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer, New York.