# Thread: Analysis of elections data?

1. Hi SAJ, and Merry Christmas.

As for building the contingency table(s):
I do not know what type of data you have (sorry, but I do not remember the details of the files you posted).
If you have presence/absence data, you can use "0" when there is no match, "1" when there is a match.

Frequency figures should probably suit your data: I understand that political data would contain frequencies of votes by, e.g., age category, gender category, etc.
In other words, I guess that you could have a table indicating that, say, Party A received 325 votes from men, 400 from women, 200 from graduates, etc.
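
If the poll is available as one record per respondent rather than as ready-made totals, the counting can also be scripted. The following is only a minimal sketch in Python: the records, the party names and the helper `contingency_table` are made up for illustration and are not taken from the files discussed in this thread.

```python
from collections import Counter

# Hypothetical respondent-level records: (party, category) pairs.
records = [
    ("Party A", "male"), ("Party A", "female"), ("Party A", "female"),
    ("Party B", "male"), ("Party B", "male"), ("Party B", "female"),
]

def contingency_table(pairs):
    """Count how often each (row label, column label) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = contingency_table(records)
print(cols)
for label, row in zip(rows, table):
    print(label, row)
```

The resulting list of lists has parties in rows and categories in columns, which is the layout a CA program expects; it can be pasted into Excel or PAST from there.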

As for missing cases, I believe it would be sensible to build a contingency table with all the data you have, and then to "polish" the table in later steps. At least, that is the way I work.
It would be nice if you could build the table in Excel (or, at least, export the data into Excel for ease of handling).

The same holds true for the parties: first, include all the parties you have, and then we will select only the ones you are interested in (or it could be interesting to sum up the minor ones).

Regards,
Gm

2. Right, got it. In the July poll I do not see a problem keeping the parties, instead of 5 there will be 8, but in the April poll instead of 5 there will be 20. (Interestingly the electorate consolidated their votes in the new elections to the parties they thought had a chance to enter parliament). The contingency table will thus become a very tall one.

On polishing data later: do you not think there would be a problem (for the CA) if the total number of cases were, let's say, 1,000 and we had, e.g., 500 men and 500 women, but 930 members of ethnic groups?

Note: I see now from your example that the cases within "age_class" are actually twice as many as men and women taken together. It looks a bit strange, but I think I have answered my own question: for CA it does not really matter how the variables are related within the cases. The thing to do, then, is to state in the article text exactly what data the table includes.

Yes, I will see that the table is in Excel. I have also started to have a look at PAST so I will see what I can do with that program as well.

SAJ

3. Hi again!

I think it would be easier for me to inspect your contingency table, get a general and deeper idea of the data from it, and see what kind of problems (if any) may exist (and work out a way to circumvent them).

If you want, you can attach the Excel file here or, if you think it is better, send it to me as a private message.

Regards,
Gm

4. ## April 2009

Gm,

Attached you may find the April 2009 data. I think the table is rather self-explanatory. Parties in rows, variables in columns.

The only thing that might seem a bit strange is the localities. In Moldova there is a difference between city and municipality, where the latter are considered bigger cities. In practice this means that there are three municipalities in the country; the rest are considered villages and cities. In the July poll, municipalities are included but with a further distinction between "municipality", "city with more than 15k inhabitants", "city with less than 15k inhabitants" and rural. Another variable in the July poll only distinguishes between "rural" and "urban", which probably makes more sense.

Thanks.

SAJ

5. Hi Saj,
I had a look at the Excel file and it seems good!

Please, let me work on it for a while. I believe that it is necessary to scrutinize the literature on CA, since I see that the variables are in some way related to each other. Sorry if my English does not express my thoughts well.
In other words, for example, the group men-women (totalling 5991) is "nested" in the age groups (totalling again 5991), and so forth. So, maybe this issue needs a different approach. I have to think about it more, and we may need to ask other members for some advice.
Or, at least, we could take the groups of variables into account separately. But I guess that it would be more interesting for you to explore the cross-relation between variable groups (that is, e.g., to know what ethnic group "and" what age class "and" what gender voted for a given party). I fear, however, that to accomplish this latter type of analysis, the contingency table has to be built in a different way. I have to think more about this issue.

As a working hypothesis, I tried to take into account only the ethnic groups for now, and the results seem rather interesting.

If you want, I could give you a first preliminary comment on the situation relating to the ethnic groups. I am unsure if I can do this today, since today is my birthday and I am giving a party this afternoon.
But I will contact you as soon as possible.

Let me know what you think.

Kind Regards,
Gm

6. So, Happy Birthday it is then!

Np, take the time you need. This is a rather long-term project of mine and I feel no rush. However, if it turns out that CA is suitable for this kind of analysis, then it would also solve some other matters related to my dissertation.

Yes, you are correct. That was also my point: the analysis would be very interesting if it were possible to see which variables group around a specific party.

I have tried PAST and gotten some results out of it, but I am unsure how solid my results are. Judging from what I know of the empirics, they seem to be ok.

Looking forward to seeing what results you get later when you have the time.

SAJ

7. ## Good news and bad news

Hi SAJ,
sorry for the delay but I had to think a bit about the dataset.

Do not be scared by the title of this reply!!

As I wrote in my previous post, the dataset you posted contains several variables, whereas Correspondence Analysis, in its basic form, analyses a two-way contingency table in which two categorical variables are taken into account.
So, as long as we want to understand the nature of the relationship between Parties and, e.g., nationality, we can use CA. We can also use it when we have three variables. There are some "tricks that allow us to have more than two variables in a contingency table" (quoting the post of Terzi).

In order to achieve this, the three (or more) variables have to be recoded and the table rebuilt in a different way (the options are cleverly summarized by Greenacre (reference quoted in my previous post)).
In essence, the table has to capture the relation between, e.g., nationality and gender or age_class.

Good news.
I tried to analyse the table as far as nationality is concerned. See attached PDF (I attach the "worked" Excel file as well).

As you can see (page 1; output from Minitab), the CA provides a good representation of the "variation" (inertia) present in your data.
The first 2 axes account for about 92% of the total inertia, with the first axis accounting for the greater part of it (about 73%). This means that the majority of the inertia is explained by the first (horizontal) axis.
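
For anyone who wants to see where such percentages come from, here is a rough sketch in Python with NumPy. It uses the standard textbook route (singular value decomposition of the standardized residuals), which may differ in detail from what Minitab or PAST do internally. The counts are made up and are not the April 2009 data, and `principal_inertias` is just an illustrative name.

```python
import numpy as np

# Made-up counts (rows: parties, columns: nationalities); not the thread's data.
N = np.array([
    [120.0, 30.0, 10.0],
    [40.0, 80.0, 20.0],
    [30.0, 20.0, 60.0],
    [90.0, 40.0, 15.0],
])

def principal_inertias(counts):
    """Return the principal inertias (one per CA axis) of a two-way table."""
    P = counts / counts.sum()          # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)          # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    sv = np.linalg.svd(S, compute_uv=False)  # singular values, largest first
    k = min(counts.shape) - 1          # number of non-trivial axes
    return sv[:k] ** 2

inertias = principal_inertias(N)
total = inertias.sum()
for axis, lam in enumerate(inertias, start=1):
    print(f"axis {axis}: inertia {lam:.4f} ({100 * lam / total:.1f}% of total)")
```

The percentage printed for each axis is its inertia divided by the total inertia, which is the figure quoted above for the first and second axes.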

On page 2 you can see the dataset, followed by two tables providing the row and column profiles. The highlighted cells contain the values (percentages) that are greater than the corresponding row/column average.
For example, as far as row profiles are concerned, the first Party (recorded as P_1) has a proportion of Gagauzian (15%) that turns out to be greater than the average (3.4%).
The same applies to the column profiles' table.

These tables inform you about what profile (row and/or column) is above or below (or same as) the average.
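
The profile tables can be reproduced outside a statistics package as well. Below is a small Python sketch, assuming parties in rows and nationalities in columns; the counts and the helper names `row_profiles` and `average_profile` are invented for illustration only.

```python
# Made-up counts; rows are parties, columns are nationalities.
columns = ["Moldovan", "Gagauzian", "Russian"]
table = {
    "P_1": [60, 15, 25],
    "P_2": [90, 2, 8],
}

def row_profiles(tab):
    """Divide each row by its total so that every row sums to 1."""
    return {name: [v / sum(vals) for v in vals] for name, vals in tab.items()}

def average_profile(tab):
    """Column totals divided by the grand total (the average row profile)."""
    col_totals = [sum(col) for col in zip(*tab.values())]
    grand = sum(col_totals)
    return [t / grand for t in col_totals]

profiles = row_profiles(table)
average = average_profile(table)
for name, prof in profiles.items():
    # Flag the categories where this party's share exceeds the average share.
    above = [c for c, p, a in zip(columns, prof, average) if p > a]
    print(name, [f"{p:.1%}" for p in prof], "above average:", above)
```

Column profiles work the same way after swapping rows and columns. The cells flagged as above average correspond to the highlighted cells in the attached tables.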

Page 3 contains the scatterplots of row and column profiles. The two graphs are separately displayed. These are from PAST.

As for the second graph, it is clear that CA is essentially opposing two broad groups along the first axis: one to the right, one to the left.
It is also clear that the group to the left can be further divided into two sub-groups: one lying towards the top of the graph, the other towards the bottom.

As you can see, the Moldova profile-point is closer to the centre than all the other points. This is due to the fact that the Moldova profile is the closest to the average (the centre of the graphs indicates, in fact, the average row/column profile). You can easily see this from the tables with the profile values.
In other words, the Parties differ little as far as the Moldovan votes are concerned; by the same token, the Moldovan votes are nearly "evenly" (roughly speaking) distributed across the parties.

As for the relation between parties and nationality as stemming from the CA, the further a party lies to the right, the higher its relative proportion of Moldovan and Romanian voters.
The further a party lies to the left, the more votes it has from the nationalities lying in the same space. But remember that, as far as this second group of parties is concerned, the higher a party lies in the plot, the more votes it has from Gagauzian and Bulgarian; the lower it lies, the more votes from Ukrainian, Russian and other nationalities.

Page 4 contains a "worked" table: the original table has been sorted on the basis of the CA scores and two broad groups have been devised (A-B). Group A has been further divided into two sub-groups (A1, A2).
In essence, the table tries to reflect the groupings devised by CA.

The little table at the bottom of the same page summarizes the data of the major table.
As you can see from the bottom table, the nationalities in different colours are those characterizing the different groups of parties.

For ease of reading, page 5 provides two tree diagrams (from the cluster analysis on the CA scores) indicating the groupings in a more rigid way.

All this is meant in an Exploratory perspective. You may add the type of graphical representation you want for ease of representation.

As far as testing the statistical validity of the groupings is concerned, this is a question on which I would like to have comments from other members of the forum.

I am going to read a chapter of Greenacre's book dealing with this topic. I will let you know if any idea comes up.

I hope this helps.

If you want and if you manage to recode the data in the way I said before, we could go further with this analysis.

Let me know.

Regards,
Gm

8. Np, your title did not scare me. Thanks again for the time you are putting in; the explanations are especially helpful.

The results you got match my understanding of Moldovan politics very well. Your results on the Moldovan group are also more interesting, since they nuance the picture you get by simply running percentages. And grouping the nationalities as CA does also makes it easier to grasp and understand.

Greenacre's book seems very interesting. At the moment I'm residing in Bucharest, and while I did bring some stat books with me, none of them deals with CA (or any of the other names it goes by). I was able to find Greenacre on Google Books but could read only parts of it. I'll try to see whether it's possible to pick it up somewhere around the city.

Would the recoding you are proposing not just be a table that organises the variables according to the scheme below?

urban rural
men/women men/women
ethnic group 1..x ethnic group 1...x

Something like that?

To make this exercise useful I do need to go further and also check the other variables. I could of course run new CAs, but that would only provide answers on how the variables are related and not how strongly attached they are. Any ideas how such an operation could be carried out?

SAJ

9. ## recoding

Hi SAJ.

I am happy to know that my tentative analysis turned out to be interesting to you.

I have not read Greenacre's chapter yet. I will inform you later if any idea comes up (i.e., about testing the significance of clusterings in CA results).

As for the recoding in the case of three or more variables, you could read on Google Books the pages of Greenacre's chapter on "stacked tables".

In any case, I attach 2 pictures from Google.

Pict.1 shows an example of recoding, when three variables are taken into account. I think it is rather self-explanatory.

Pict.2 shows an example of recoding when one is dealing with more than three variables. This is an example of stacked table.

Try to figure out how to adapt your data in the light of these examples.

It has to be noted, however, that as the number of variables increases, the interpretation of CA results becomes a little trickier. Nonetheless, Greenacre provides several guidelines for this kind of situation and for the interpretation of its results.

Hope this helps.

Regards,
Gm

10. Hi,

Yes, that is pretty much as I pictured it. The Gender-Age table is clear. It combines variables and groups them.

Regarding the stacked table, it would seem that the variables are run separately, but are they combined in the CA or just analysed separately?

It would of course be possible for me to simply run the variables separately, but then I would also have to be able to tell something about the strength of the relationships.

SAJ

11. Hi!

As for the coded age/nationality, the coded table is analysed by means of CA, but the interpretation has to be done from a slightly different perspective.

As for the stacked table, it is analysed by CA as well, taking into account all the variables at the same time. In this case as well, the interpretation differs a little.

Let me know if you want help in running CA and interpreting the results, or if you need general help for some reason.

Regards,
Gm

p.s.
I will inform you about the inferential facets of CA (I am studying the issue at the moment).

12. Hi again,

But is the stacked table then any different from what I provided before and you worked with? How does the stacked table take into account that variables may also be related?

13. Hi SAJ,

the table you provided me (and on which I worked) does not tell us, e.g., how many persons aged 18_29 were Moldovan, how many Russian, etc.

By the same token, we do not know how many Moldovans were male and how many were female, etc.

So we can:

1) use one table for each group of three variables: let's say, one exploring the relation between Parties, Age and Nationality; one exploring the relation between Parties, Gender and Nationality; and so forth.
In this instance we only need one recoded table (of the type of Pict.1 of my preceding post) for each type of analysis.

2) use a stacked table (of the type of Pict.2 of my preceding post).
You are right in your doubt. The analysis with the type of stacked table attached here will reveal the interaction between Nationality and Parties, and Age classes, and Gender, BUT NOT between Parties, Age classes and Gender. Alternatively, this table could be reorganized in order to analyse the relation between Parties (putting them in columns) and Nationality (switching them to rows), and Age classes, and Gender, etc.
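
To make the idea of stacking concrete, here is a minimal Python sketch of one possible layout. It assumes respondent-level records and puts the parties in rows, with one block of columns per variable; the records, the field positions and the helper `stacked_table` are all hypothetical, and the attached picture may arrange rows and columns differently.

```python
from collections import Counter

# Hypothetical respondents: (party, nationality, age_class, gender).
respondents = [
    ("P_1", "Moldovan", "18_29", "male"),
    ("P_1", "Russian", "30_44", "female"),
    ("P_2", "Moldovan", "18_29", "female"),
    ("P_2", "Moldovan", "30_44", "male"),
    ("P_2", "Gagauzian", "30_44", "female"),
]
variables = {"nationality": 1, "age_class": 2, "gender": 3}  # field positions

def stacked_table(cases, fields):
    """Cross-tabulate party against each variable and join the blocks side by side."""
    parties = sorted({case[0] for case in cases})
    header, blocks = [], []
    for name, pos in fields.items():
        levels = sorted({case[pos] for case in cases})
        counts = Counter((case[0], case[pos]) for case in cases)
        header += [f"{name}:{lvl}" for lvl in levels]
        blocks.append([[counts[(p, lvl)] for lvl in levels] for p in parties])
    # Concatenate, for each party, its row from every block.
    rows = {p: [v for block in blocks for v in block[i]] for i, p in enumerate(parties)}
    return header, rows

header, rows = stacked_table(respondents, variables)
print(header)
for party, counts in rows.items():
    print(party, counts)
```

Each respondent is counted once in every block, so each row total is the party's number of respondents multiplied by the number of stacked variables. The blocks are simply placed next to each other, so the table carries no information on how nationality, age and gender combine within a respondent.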

I think these are the best options for your type of analysis.

Regards,
Gm

14. ## Happy New Year!

Gm,

I think the second alternative will be the best one, i.e. a stacked table where each variable is run against the parties. I'm of course grateful for all future help you can provide, but I would also need to run the analysis myself so that I can repeat it later if I need to.

I was also thinking of swapping rows and columns so that the table is transposed (both in Excel and Past).

I picture the process in the following way:

1) Running CA on the stacked table in order to see if there are other variables that also correspond to parties. I would presume locality and age to have some effect, gender less so.

3) Is there any possibility of seeing whether a relation is stronger or weaker? I have understood the answer to be no, but at the same time ending up at the origin, as the Moldovan group tended to in your previous run, indicates a result close to the average. Could such a result also be caused by a larger sample, or does the size of the group not matter?

4) Tables and diagrams to illustrate the results, but let's come back to that later.

The question is, then, at which end to begin. Would Past be enough? On my computer I have SPSS, Excel, and now also Past. If you just "push" me in the right direction here, I will know where to start my next investigations.

I do have rather good hopes that the analysis will turn out pretty interesting and that it should be possible to publish it somewhere. Be sure that I will acknowledge your invaluable help somewhere in the text.

Happy New Year!

SAJ

15. Hi SAJ,
nice to hear from you again and to know that you are managing to find the right path for your analysis.

I hope that you will manage to get interesting results. I am happy to have helped a bit.

1) I think it is good. I do not know whether you have read the warnings about the interpretation of CA on stacked tables but, from what you write, it seems so.

3) I wrote in a preceding post of mine that the issue of assessing the statistical significance of CA clusterings (or, generally speaking, CA results) is a complex one.
I have read Greenacre's Chapter 15, where he gives some hints: some are easy to perform, others are difficult and require specific software. I have written to Prof. Greenacre himself to ask for some advice, and I am waiting for his reply. I wrote to the Past user forum as well as to this forum, but I have not received any reply so far.

As for the statistical significance of the division, e.g., of the Parties into two broad groups, I would act in the following way (but please note that I am not so sure about this): I would perform a non-parametric test (e.g., Mann-Whitney) to test the significance of the median difference in votes between the two broad groups. The same (I guess) could be done in relation to the Nationality (to keep to the dataset I worked on).

More "orthodox" ways (I found them in Greenacre's book) are:

A) to perform a chi-square test on the contingency table, to see if there is a significant association between rows and columns.
This can easily be done with the CA results and Excel.
Steps:
-take the CA results and get the total inertia (it is present in the output, or you can just sum up the inertia of the various axes as provided by the Past output window)
-multiply this total inertia by the sample size (in our case, the table's grand total)
-this gives you the chi-square value for your table
-go into Excel and use the function DISTRIB.CHI() (CHIDIST() in English-language Excel); inside the parentheses put the chi-square value, then ";" (or ",", depending on your regional settings) and then the degrees of freedom of your table. The latter is equal to (number of rows-1)*(number of columns-1).
-you get the probability associated with your chi-square value.
NOTE: Instead of using Excel, you may be able to use any stat package to analyse the table.
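
For those who would rather script option A than use a spreadsheet, the steps could look roughly like this in Python. This is only a sketch: the 3x3 table is made up, and `total_inertia` and `chi2_sf` are hand-written illustrative helpers, not functions from PAST, Excel or any library. In practice the total inertia would be read from the CA output, and a statistics library (e.g. `scipy.stats.chi2.sf`) could replace the hand-written tail probability.

```python
import math

def total_inertia(table):
    """Total inertia of a two-way table: Pearson chi-square divided by the grand total."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = sum(
        (obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    return chi2 / n

def chi2_sf(x, df, terms=500):
    """Upper-tail chi-square probability via the series for the incomplete gamma function."""
    a, h = df / 2.0, x / 2.0
    if h <= 0:
        return 1.0
    term = total = 1.0 / a
    for k in range(1, terms):
        term *= h / (a + k)
        total += term
    lower = total * math.exp(a * math.log(h) - h - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Made-up 3x3 table of counts, not the thread's data.
table = [[30, 10, 20], [15, 25, 20], [5, 15, 40]]
n = sum(map(sum, table))                      # grand total
inertia = total_inertia(table)                # step 1: total inertia
chi2 = inertia * n                            # steps 2-3: inertia times grand total
df = (len(table) - 1) * (len(table[0]) - 1)   # (rows-1)*(columns-1)
p_value = chi2_sf(chi2, df)                   # steps 4-5: associated probability
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.4g}")
```

The series in `chi2_sf` is adequate for moderate chi-square values like this one; for very large values it would need more terms, and it cannot resolve probabilities much below about 1e-15.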

B) to perform a similar analysis applied to the relevant axis:
-take the inertia explained by the first axis
-multiply this by the table's grand total in order to get the chi-square contribution of this axis
-test this value against the table attached here
-if your value is greater than the corresponding value in the table, then that dimension is significant (assuming that your data are statistically valid [e.g., from random sampling]), that is, there is less than a 5% probability that it has arisen by chance.

To be sincere, I am unsure about the effect of different sample sizes on CA results. What I can say is that the Moldovan profile is near the average, that is, its "distribution" does not differ a lot from the average profile.

4) When you come to present your analysis, I think that you could start from the original table, and then perform the CA, providing the scatterplot. Then you may wish to sort the table(s) according to the CA results and provide some descriptive graphs (histograms?) of the groupings you devise.
It could be nice to facilitate (alongside the scatterplot) the eyeballing of groupings by means of dendrograms from the cluster analysis of row/column scores on the relevant axes.
On this latter topic, I attach an interesting article found on the web. I also attach a PDF that explains cluster analysis (it is from the Minitab Guide, but I think it can be useful anyway).

As for programs, I use various ones, since each has its own strong points (SPSS, but mainly PAST, MiniTab, SigmaPlot11).

So, I think that is all.
I hope this helps and that this quite long reply does not confuse you.

I look forward to hearing about your results, and I hope that you will manage to do it all by yourself. In any case, if you have any problem do not hesitate to contact me (here or privately [you can find my e-mail on my website]).

Good luck and happy new year,
Kind regards
Ciao!!

Gm