# Thread: Interpreting and applying results PCA

1. ## Interpreting and applying results PCA

I have a problem executing the methodology I use in my paper (Master's thesis at the University of Amsterdam). I have the following data:
- Returns of seventeen countries per year (1993 until 2010)
- Returns of investable indices (about 1994 until 2010)
The methodology is as follows.
A covariance matrix of the returns of the seventeen countries has to be constructed for every year in the data sample. The data sample runs from 1993 until 2009, because the investable indices do not start earlier than 1994.
The eigenvectors are calculated from the covariance matrices and are then used to calculate the principal components of the next year's returns. For example, the eigenvectors of 1994 are used to calculate the principal components of the returns of 1995. This calculation is done for the entire time horizon, which gives 16 years of principal components. The number of principal components to retain should be about ten (according to the paper I'm replicating).
This is a principal component analysis, which I have done using SPSS (so not by walking through the steps mentioned above, but just using the procedure provided by SPSS). But I have trouble understanding and using the output of the principal component analysis.
The investable indices, which are described in the data section, will be regressed on a multi-factor model. The explanatory variables will be constructed from the principal components. Every investable index will be regressed yearly, which will result in sixteen regressions, each with an R-squared as output. The R-squared will be the measure of market integration.
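
The steps above can be sketched roughly as follows, in Python with NumPy rather than SPSS. This is only an illustration under stated assumptions: the return arrays are randomly generated placeholders, the names `returns_1994`, `returns_1995`, `index_1995`, `top_eigenvectors` and `r_squared` are hypothetical, and it is assumed that each year holds daily observations and that the components are plain projections of next year's returns onto this year's covariance eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data (assumption): 250 daily returns for 17 countries in two
# consecutive years, plus one investable index observed in the second year.
n_days, n_countries, n_keep = 250, 17, 10
returns_1994 = rng.normal(0.0, 0.01, size=(n_days, n_countries))
returns_1995 = rng.normal(0.0, 0.01, size=(n_days, n_countries))
index_1995 = returns_1995[:, :5].mean(axis=1) + rng.normal(0.0, 0.002, size=n_days)


def top_eigenvectors(returns, k):
    """Return the k eigenvectors of the covariance matrix with the largest eigenvalues."""
    cov = np.cov(returns, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order]


def r_squared(y, X):
    """R-squared of an ordinary least squares regression of y on X with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)


# Eigenvectors estimated on 1994, applied to the 1995 returns.
loadings_1994 = top_eigenvectors(returns_1994, n_keep)
factors_1995 = (returns_1995 - returns_1995.mean(axis=0)) @ loadings_1994

# One regression per index and year; its R-squared is the integration measure.
r2_1995 = r_squared(index_1995, factors_1995)
print(f"R-squared for 1995: {r2_1995:.3f}")
```

Because the regression includes an intercept, centering the 1995 returns before projecting does not change the R-squared. Repeating the last three statements for each pair of consecutive years would give the sixteen yearly values.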
Now I have the output for the years 1993 until 2009, and it looks a bit like this.
I have a Rotated Component Matrix, with a RAW section and a SCALED section, with ten components (which I retained).
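Rescaled, by component:

| Variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| AUSTRALIA | 0.059 | 0.108 | 0.931 | 0.038 | 0.164 | 0.087 | 0.072 | -0.014 | -0.022 | 0.193 |
| AUSTRIA | 0.785 | -0.099 | -0.026 | -0.099 | -0.003 | 0.094 | 0.134 | 0.123 | 0.026 | -0.003 |
| BELGIUM | 0.784 | 0.007 | -0.035 | 0.062 | 0.026 | 0.062 | 0.095 | 0.123 | 0.194 | 0.056 |
| S&P/TSX COMPOSITE INDEX | 0.115 | 0.845 | 0.200 | 0.063 | -0.007 | -0.014 | -0.029 | -0.011 | -0.032 | 0.121 |
| MSCI DENMARK | 0.221 | 0.012 | -0.014 | -0.010 | 0.029 | 0.030 | 0.142 | 0.962 | 0.013 | -0.034 |
| FRANCE | 0.561 | 0.274 | -0.067 | 0.132 | 0.079 | 0.073 | -0.041 | 0.038 | 0.650 | 0.082 |
| DAX30 | 0.743 | 0.012 | 0.300 | 0.111 | -0.061 | -0.088 | 0.071 | -0.005 | 0.359 | 0.191 |
| HANG SENG | 0.181 | 0.095 | 0.255 | 0.056 | 0.333 | 0.064 | 0.027 | -0.045 | 0.064 | 0.876 |
| IRELAND | 0.724 | 0.324 | 0.154 | 0.080 | 0.070 | 0.211 | 0.153 | 0.059 | -0.263 | -0.022 |
| ITALY | 0.181 | 0.042 | 0.040 | 0.976 | 0.006 | -0.011 | -0.003 | -0.011 | 0.086 | 0.042 |
| TOPIX | 0.173 | -0.008 | 0.070 | -0.002 | 0.083 | 0.010 | 0.965 | 0.140 | -0.007 | 0.023 |
| NETHERLANDS | 0.704 | 0.213 | 0.091 | 0.220 | 0.207 | 0.042 | 0.008 | 0.113 | 0.329 | 0.111 |
| SINGAPORE | 0.151 | 0.072 | 0.169 | 0.006 | 0.915 | -0.025 | 0.095 | 0.035 | 0.060 | 0.269 |
| SOUTH AFRICA-DS MARKET | 0.175 | -0.073 | 0.076 | -0.010 | -0.019 | 0.970 | 0.009 | 0.029 | 0.045 | 0.046 |
| SWITZ | 0.801 | 0.145 | -0.017 | 0.152 | 0.140 | 0.059 | -0.039 | 0.029 | -0.004 | 0.078 |
| UK | 0.463 | 0.479 | -0.005 | 0.181 | 0.159 | 0.168 | 0.045 | -0.049 | 0.416 | 0.007 |
| S&P 500 COMPOSITE | 0.029 | 0.798 | -0.059 | -0.028 | 0.059 | -0.082 | 0.008 | 0.027 | 0.142 | -0.024 |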

But I have no idea how to interpret these results or how to apply them to the original returns in order to build the multi-factor model. I can also provide different outputs if needed.
My question is: how do I apply these results to construct the explanatory variables I need for my multi-factor model?
Do I simply construct the components by multiplying these numbers by the returns of the countries in that specific year?

I hope somebody can help me since my supervisor is not giving me any help in this.
Your effort is really appreciated.

2. ## 16 pca?

Hi bterstege,

At first glance I think that a better approach for this problem would be an analysis technique called STATIS DUAL. That would combine all your indexes, for all countries and all years, in a single matrix. But that may be more complicated.

I have some questions regarding your PCA analysis. You are calculating the principal components from the covariance matrix. This is not common, since this approach gives more importance to the indexes that fluctuate the most, that is, the unstable ones. I've seen some economic studies with that objective, but I'd like to know if that's your case. Most PCA applications are based on the correlation matrix instead. Now, if I understood well, you are trying to fit a regression model using the index returns as the response variable and the principal components of the country returns as regressors. And you do that for every year? I would probably need to see your dataset in order to provide further assistance, since I cannot see exactly what you are doing.

The rotated components are used to obtain a cleaner interpretation. Keep in mind that PCA creates indexes, and the eigenvectors let you know how much each variable contributes to those indexes. You can use the scores as predictors in a model; that is a common practice.

Anyway, I would need to see your dataset first, at least an extract from it.

3. First, thanks for the reply.

Second, I used the covariance method because I'm replicating a paper from the Journal of Finance and the authors of this paper also use the covariance method. I could perhaps reconsider this, but I would need a good justification to alter the methodology.
Third, your interpretation of the regression is correct. The principal components are the independent variables and the returns of the investable indices are the dependent variable.

I will post the data-sets I have right now.

PCA_output.xls = The output I created trying to perform a PCA.
pca_returns_per year.xls = Returns of the seventeen countries, which have to be used in the PCA.
S&P_(large&mid&small).xls = An example of the dependent variable.
(I eliminated some years to reduce the file size, but the principle remains the same)

The last part about the scores needs a bit of clarification for me; maybe you can illustrate it with part of my data or results.

Your help is really appreciated.

Bas

4. Hi again bterstege,

I'm glad to be of assistance.

Now that I have a better idea of what you are doing, let me tell you that your process seems correct so far, considering that you would fit one model for each year (you could also fit a single model that includes the information of every year, but that would be harder. Maybe for a future project).

The output you uploaded does not include the component scores (at least I couldn't see them). I'd recommend using these scores as the explanatory variables in your models. The scores are the projections of the original variables onto the retained components. If you think of PCA as a way to create indexes, the scores are the values that each case (in your case, each day) gets on each index. There is an option in SPSS that lets you save these as new variables.
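
To make the idea of scores concrete, here is a minimal NumPy sketch (not SPSS output). The data matrix `X` is a randomly generated placeholder and all variable names are hypothetical; the sketch only assumes unrotated components computed from the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data (assumption): 200 cases (days) by 4 correlated variables.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# Center the variables and decompose their covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: one row per case, one column per component (the "index" values).
scores = Xc @ eigvecs

# The scores are uncorrelated and each column's variance is its eigenvalue.
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))
```

The columns of `scores` (or only the first few of them) are what would enter a regression as explanatory variables.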

You can have some information regarding Factor Analysis and PCA in SPSS in the following link:

Code:
``http://faculty.chass.ncsu.edu/garson/PA765/factor.htm``
There you can find some good examples and a better explanation of the terms than what I may be able to offer. Of course, you can see how to obtain and interpret the scores. With these you can easily fit the model.

Now, about the use of the covariance matrix: if you have some backing for its use, you probably do not want to change it. On the above website you can read about some of the effects that this matrix has on the analysis (it's in the FAQ). As a comment, choosing between the correlation and the covariance matrix depends on the scale and the variance of your variables, so it requires some deeper analysis. You probably want to avoid that, since it was presumably done in the original study and they decided on the covariance matrix. It is easy to change the matrix you use in SPSS, so you can try changing it if you are curious.

If you need any further help, please feel free to ask.

5. I think I managed to do a proper PCA with the creation of new variables. I attached the output for one year (2004) and the created variables (2004 as well) and I think they are correct now.

At the bottom of the output-XLS, the factor scores are displayed.
Now I have to do a lot of PCAs and after that a great many multiple regressions. But I think the multiple regressions will be no problem for me.

I really would like to thank you for your assistance, since I think I've managed my difficulties.

6. I still have one problem left.

According to the methodology I have to replicate, I have to multiply the factor scores of this year with next year's returns.

When I save the factor scores in SPSS, it saves them for the same year.

First I tried to replicate the factor scores SPSS created by just multiplying the factor loadings with the returns on a certain date (but I reckon that this is not the appropriate way, because I can't get even close to the factor scores).

The problem is that I do something wrong constructing the factors.
Any ideas what I am doing wrong?

I just found out that I first have to standardize the returns before multiplying them with the factor scores.

Then I do come fairly close, but I have read somewhere that the remaining difference is caused by the handling of missing values.
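
One possible reason why multiplying standardized returns by the loadings does not reproduce saved scores is that loadings and score coefficients are different matrices. The NumPy sketch below illustrates this for unrotated components of a correlation matrix. It is an assumption-laden illustration, not a reproduction of SPSS: the data are random placeholders, all names are hypothetical, rotation and covariance-based extraction are not handled, and standardizing next year's returns with their own mean and standard deviation is just one possible choice.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder data (assumption): two years of 250 daily returns for 5 markets.
this_year = rng.normal(size=(250, 5)) @ rng.normal(size=(5, 5))
next_year = rng.normal(size=(250, 5)) @ rng.normal(size=(5, 5))


def standardize(x):
    """Convert each column to z-scores (mean 0, sample standard deviation 1)."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


z_this = standardize(this_year)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(this_year, rowvar=False))
order = np.argsort(eigvals)[::-1][:3]  # keep the 3 largest components
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: correlations between each variable and each component.
loadings = eigvecs * np.sqrt(eigvals)
# Score coefficients: weights that turn z-scores into unit-variance scores.
score_coef = eigvecs / np.sqrt(eigvals)

scores_this = z_this @ score_coef

# Using the loadings instead inflates each score column by its eigenvalue.
naive = z_this @ loadings
print(np.allclose(naive, scores_this * eigvals))

# Applying this year's coefficients to next year's standardized returns.
scores_next = standardize(next_year) @ score_coef
```

In this setup the only difference between the two products is a per-component scale factor equal to the eigenvalue, which may explain results that are close in pattern but not in magnitude.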

8. Hi again bterstege,

The scores can indeed be affected by missing data. Most likely SPSS will use the 'listwise deletion' method for missing values, that is, it excludes cases with missing values from the analysis. Review your paper to verify whether that is what was done there. Maybe they resorted to some imputation procedure (there are some simple ones, such as replacing missing data with the average values).
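
The difference between the two treatments can be seen on a tiny example. This is a NumPy sketch with made-up numbers; the array `returns` and the other names are hypothetical, and it does not reproduce any specific SPSS option.

```python
import numpy as np

# Made-up returns (assumption): 5 days by 3 markets, with two missing values.
returns = np.array([
    [0.010, 0.020, -0.010],
    [0.000, np.nan, 0.005],
    [-0.020, 0.010, 0.000],
    [0.015, -0.005, np.nan],
    [0.005, 0.000, 0.010],
])

# Listwise deletion: drop every day that has at least one missing value.
complete_rows = ~np.isnan(returns).any(axis=1)
listwise = returns[complete_rows]

# Mean imputation: replace each missing value with its column average.
col_means = np.nanmean(returns, axis=0)
imputed = np.where(np.isnan(returns), col_means, returns)

# The two covariance matrices differ, so eigenvectors and scores differ too.
cov_listwise = np.cov(listwise, rowvar=False)
cov_imputed = np.cov(imputed, rowvar=False)
print(listwise.shape, imputed.shape)
```

Since the PCA starts from the covariance (or correlation) matrix, any difference in how missing values are treated carries through to the eigenvectors and the scores.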

Good luck

9. Hello Terzi,
First of all, I noticed from your answer to bterstege that most applications use the correlation matrix (PCA2) instead of the covariance matrix (PCA1). Could you please tell me why, statistically?
My main question is this: why is KDE used in PCA? Is it only because it shows the outliers, so that I can exclude these objects from the PCA to get more accurate results (providing a graphical answer)? I have also searched a lot about this, and it seems that it is applied to the PCs. Why? Is it for this reason?
What if the data is not normal and contains a lot of outliers?
Thanks so much

10. Originally Posted by mostafa.salama
Hello Terzi,
First of all, I noticed from your answer to bterstege that most applications use the correlation matrix (PCA2) instead of the covariance matrix (PCA1). Could you please tell me why, statistically?
I'll try to be clear with this, and I hope to succeed (you can complain if I don't). In PCA you decompose the covariance or correlation matrix into eigenvalues and eigenvectors (for a symmetric positive semi-definite matrix such as these, a singular value decomposition gives the same result). This process is affected by variables with high variance: these variables receive more weight in the PCA, which may produce results that give too much importance to a few variables and largely ignore those with lower variance. Even worse, if you are measuring different things (so you have different scales), your PCA will largely ignore the variables measured in small numbers. An example could be the following:

Let's say I measured five variables from different cities in order to study life quality:

Percentage of citizens with access to electricity
Percentage of citizens with access to high elementary education
Percentage of citizens with access to clean water
Average income per person per year
Life expectancy

The first three measures are percentages, so the values range from 0 to 100 and the variance can be somewhere around 20 or 30. On the other hand, the average income can be measured in dollars, so those quantities can be 10,000 or 20,000 or so, and the standard deviation will be measured in thousands. Life expectancy may have the smallest spread, with a standard deviation of only 10 or 15. If you ran a PCA on the covariance matrix here, your results would tend to emphasize the effect of income, since its variance is by far the greatest. The percentages and life expectancy would be largely ignored.

Well, the reason for using the correlation matrix is that doing so is exactly the same as using the standardized variables. In that case every variable has unit variance, so the variables are comparable within the PCA and none of them is overrepresented. Since most studies include different scales, using the correlation matrix is more common.

In brief, if all the variables are measured on the same scale and their variances are similar, you can go with the covariance matrix (it has some theoretical advantages when doing inference). Otherwise, the correlation matrix is the better choice. Of course, there are exceptions, although those should be justified theoretically.
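
The effect described in the example can be checked numerically. The NumPy sketch below uses simulated, independent city-level variables with invented means and spreads (all numbers and names are assumptions for illustration only) and compares the first component from the covariance matrix with the one from the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Simulated variables on very different scales (all parameters are invented).
electricity_pct = rng.normal(90, 5, n)
income_usd = rng.normal(15000, 4000, n)
life_expectancy = rng.normal(72, 4, n)
X = np.column_stack([electricity_pct, income_usd, life_expectancy])


def first_component(matrix):
    """Return the leading eigenvector and the share of variance it explains."""
    eigvals, eigvecs = np.linalg.eigh(matrix)  # ascending eigenvalues
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum()


pc1_cov, share_cov = first_component(np.cov(X, rowvar=False))
pc1_corr, share_corr = first_component(np.corrcoef(X, rowvar=False))

# The correlation matrix equals the covariance matrix of standardized data.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_standardized = np.cov(Z, rowvar=False)

print("covariance PC1 weights: ", np.round(pc1_cov, 4), "share:", round(share_cov, 4))
print("correlation PC1 weights:", np.round(pc1_corr, 4), "share:", round(share_corr, 4))
```

With the covariance matrix, the first component is essentially the income variable on its own; with the correlation matrix, no variable dominates merely because of its units.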

Originally Posted by mostafa.salama
My main question is this: why is KDE used in PCA? Is it only because it shows the outliers, so that I can exclude these objects from the PCA to get more accurate results (providing a graphical answer)? I have also searched a lot about this, and it seems that it is applied to the PCs. Why? Is it for this reason?
What if the data is not normal and contains a lot of outliers?
Thanks so much
Normality is not an assumption of PCA, not even multivariate normality. If you can assume a multivariate normal distribution, you can perform some inference on the eigenvectors and eigenvalues, but the main analysis does not require it. Outliers can be tricky, but not necessarily in PCA. In fact, PCA is a recommended tool for detecting some multivariate outliers. Yet a high number of outliers can become a problem. For that kind of issue, some nonparametric alternatives to PCA have been proposed. Using kernels is one of them. Although these are still proposals, such techniques may result in more efficient and robust estimates.

I really hope this sheds some light on your doubts.

11. ## Re: Interpreting and applying results PCA

Hi all

This is my first time on this forum, and I'm using this opportunity to say that this thread is extremely useful for me.

I would appreciate it if terzi could explain further how to fit a single model that includes the information of every year, as my current project is related to this.

thanks
