Analysis of elections data?

SAJ

New Member
#1
Dear All,

I am currently writing an article on two parliamentary elections taking place in one country during 2009. Data consists of two exit-poll samples. It is an interesting case and could provide some interesting insights in relation to the political framework.

What I would like to do is to test whether ethnicity can or cannot be used to explain voters' choice of political party (this being taken for granted in many of the old Pol Sci works). The first thing I would like to do is to check whether ethnicity can be used to predict which party a respondent would likely prefer. The second thing I would like to investigate is whether other variables may have a greater influence on voters' choice.

The dependent variable would thus be "choice of political party" and the independent variables "ethnicity", "gender", "profession", "locality (city/town/rural)", "age group", and "education group".

I do have some earlier experience with statistics, but it has been some years, and although I am reading through my old stat books I am a bit afraid that I will start in the wrong direction or choose a type of analysis that either does not do me any good or simply is wrong...

Any help that could start me in the right direction would be appreciated!

Thank you.
 

terzi

TS Contributor
#2
Hi Saj,

Based on your brief explanation, I would suggest some modeling techniques for your study. This could give you excellent results, especially if you are interested in predictions. My recommendation is a Logistic Regression Model. I don't know how many parties you have in your country, but you can use either a binary model (two parties) or a multinomial model (more than two). The sad part of this is that you may need a complex model to study this situation, since you talk about a survey that is likely based on cluster sampling and/or stratification. So, you must include the information on the sample design in the model, either using DEFF (design effect) corrected standard errors or using a Hierarchical Model, which in my opinion is the best option but also the hardest. So, for modeling, you may need a strong statistical background. Still, the results would be awesome (I'd love to do something like that sometime:))
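To give a rough idea of what a DEFF-corrected standard error means, here is a minimal sketch for a single proportion. The function name and all numbers (share, sample size, design effect) are made up for illustration; the real design effect would have to be estimated from the actual sampling design.

```python
import math

def deff_corrected_se(p, n, deff):
    """Standard error of a proportion, inflated by a design effect.

    p    : estimated proportion (e.g. share voting for one party)
    n    : number of respondents
    deff : design effect (ratio of the design-based variance to the
           variance under simple random sampling); assumed known here
    """
    se_srs = math.sqrt(p * (1 - p) / n)   # simple-random-sampling SE
    return se_srs * math.sqrt(deff)       # variance scales by deff, SE by its root

# Hypothetical values, not taken from the exit polls
p, n, deff = 0.45, 1000, 2.0
se_srs = deff_corrected_se(p, n, 1.0)     # deff = 1 means no correction
se_design = deff_corrected_se(p, n, deff)
print(round(se_srs, 4), round(se_design, 4))
```

With a design effect of 2 the standard error grows by a factor of about 1.41, so confidence intervals that ignore the clustering would be too narrow.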

There are other options available, although you won't get results as interesting as those obtained from a model. One possible analysis is a Correspondence Analysis to study those relationships. This one is a bit easier and is also a valid scientific and statistical approach.

Good luck with your study, I'm sure you'll do some amazing work with that data
 

SAJ

New Member
#3
Re. Correspondence Analysis

Thanks terzi, your ideas were excellent and much appreciated!

The Logistic Regression model does sound like the best choice and I spent some time investigating it. However, I think it will have to wait for now. With my present knowledge of statistics it would take me too long to understand ;( (I will, however, try to dig a bit deeper into it for the analysis I need to do for my dissertation).

Correspondence Analysis (I was actually looking at Factor Analysis but it did not seem right) thus seems to be the best choice, and I hope that you might be able to help me a bit further. I will write my questions to the best of my ability and hope it will not take too much of your time.

In SPSS I'm using the function for Multiple Correspondence Analysis. I will introduce the variables sex, educational group, age group, nationality, occupation, "for which party did you vote?", type of locality (urban/rural).

1. The first thing I would like to do is to reduce the parties to those that actually ended up winning seats in parliament (=6) and move these into a new variable.

2. May I place all variables within "analysis variables" when running the Multiple Correspondence analysis?

3. Do I need to change any of the options in "discretize", "missing", "options", "output", "save"?

4. Is there something in specific I should check in the output?

5. As I understand it, the "Correlations Transformed Index" lets you see how variables relate to one another on a scale from 0 to 1. E.g. a 0.387 for "ethnicity" when related to "for which party did you vote" is not especially strong but, in this case, might be interesting when compared to the other variables.

Thanks once more. With the help you've provided so far I am positive that the analysis of the exit-poll data will actually be quite interesting.
 

gianmarco

TS Contributor
#4
Hi!!
I was thinking about your problem and the use of CA...

Why are you considering using Multiple CA? Maybe I am missing the point, but if you have built a contingency table with, say, your parties in rows and "ethnicity", "gender", etc. in columns, and you want to investigate the data structure and the relation between the choice of a party and the other "variables", I think you should use Simple CA.

If this is the case, I suggest taking a first look at the results by means of a program like PAST (freeware; just search on Google), which asks for far fewer details of any kind.

I hope this helps and, if you have more questions, do not hesitate to reply.

Best Regards,
Gm

p.s.
I have written down a brief primer to CA (with references): see http://xoomer.virgilio.it/gianmarco.alberti/index_file/Page714.htm
 

SAJ

New Member
#5
Thanks gianmarco,

I'm too much of a novice at this, so my answer would simply be: setting them up as simple CA might well work :) As I understand it, this implies running a separate simple CA for each independent variable and seeing how each relates to the dependent variable?

I would love to have a look at that primer you are referring to. Unfortunately the link you provided seems to be broken. I had the same results from trying to google it.

Thanks also for the PAST tip, I will check it out later this week. Could you briefly tell me the advantage of using PAST rather than SPSS?

Best regards,

Andreas
 

gianmarco

TS Contributor
#6
Hi!!

Maybe I did not understand your problem and/or the dataset you have: if you want, you can send it so I can have a look at it and give you better advice.

As for Past: I have never used SPSS for CA so far, but I have used Minitab, Past, Statistica, and Systat. I find that Past is very simple to use: just select the columns and run the analysis with a couple of clicks.

Could you please tell me what kind of problem you found with the link to my primer? I clicked on it and the browser opened the page containing the primer, so I cannot figure out what the problem might be.

regards,
Gm
 

SAJ

New Member
#7
Datasets

Hi again,

I will try to see if I can get to your homepage from home. It might be that the page, somehow, cannot be reached from work.

Sure, I attach the datasets from the Moldovan 2009 elections. Thanks for offering to take a look at them. Files are zipped (in order to be able to post here) and in .sav format. Text is in Romanian, but judging from your name, my guess would be that you read it easily enough :)
 

terzi

TS Contributor
#8
Good night in Mexico:),

I would agree with gianmarco. Using simple correspondence analysis would be easier and it would lead to pretty interesting results. Multiple Correspondence Analysis is a little bit trickier since it is designed to explore relationships between homogeneous variables that are somehow related. You would also need a deeper knowledge of the subject, since there are two main approaches for MCA and you need to select one, so the interpretation may change a little... I would suggest you skip that:)

Now, on another subject, Simple Correspondence Analysis is used to study relationships in a single contingency table, BUT there are some tricks that allow us to have more than two variables in a contingency table. For instance, using interactions will create contingency tables with more than two variables. You can interact gender (2 categories) and age (let's say 5 categories) to create a new variable that will have 10 categories and cross it with voting preference. Voila! You have a two-way contingency table with three variables. Of course, the more variables you introduce, the larger the sample you require, but you can add as many as you want.
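The interaction trick can be sketched in a few lines of Python. The respondent records, category labels and party names below are invented for illustration; the real data would come from the exit-poll file.

```python
from collections import Counter

# Hypothetical respondent records: (gender, age_group, party)
respondents = [
    ("F", "18-29", "Party A"),
    ("M", "18-29", "Party B"),
    ("F", "30-44", "Party A"),
    ("F", "18-29", "Party A"),
    ("M", "45-59", "Party B"),
]

# Interact gender and age into one combined category per respondent,
# then count how often each combined category goes with each party.
table = Counter(
    (f"{gender}_{age}", party) for gender, age, party in respondents
)

rows = sorted({row for row, _ in table})   # combined gender x age categories
cols = sorted({col for _, col in table})   # parties

# Print the resulting two-way contingency table
print("".ljust(10) + "".join(c.ljust(10) for c in cols))
for r in rows:
    print(r.ljust(10) + "".join(str(table[(r, c)]).ljust(10) for c in cols))
```

Each combined category (for example "F_18-29") becomes one row, so a two-way table carries three variables. Sparse cells appear quickly, which is why the sample has to grow with the number of interacted variables.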

This way, you can make an interesting analysis that won't be so complicated. STILL, Simple Correspondence Analysis is not so simple: interpreting the biplots may be complicated, especially the symmetric biplots, where distances between row and column points do not measure association. So, it will also require some research.
 

gianmarco

TS Contributor
#9
Hi SAJ,

sorry for the delay in replying, but I have been away for work and had no web connection.

First, I agree with Terzi about MCA vs CA, and I also agree with him that CA results are not so easy to interpret, at least at first glance. I found M. Greenacre, "Correspondence Analysis in Practice", 2008, very interesting and useful.

Secondly, I will have a look at your data and let you know. Could you please tell me what format .sav is? What program opens this format?

By the way, I am from Italy.

Regards,
Gm
 

gianmarco

TS Contributor
#10
Hi Saj,

I had a look at your data. I got a general idea of your variables, but I have some doubts about some of them (by the way, where are the counts? Are they under the label CODUNIC?).

Could you give me more information about your dataset?

If you want, you can write a private message here in this forum.


Regards,
Gm
 

gianmarco

TS Contributor
#11
Hi SAJ!!

I propose a fictional example hoping to help you a bit with your issue.

Please have a look at the attached JPG picture.

Let us suppose that you can build a contingency table that cross-tabulates the parties you are interested in with some variables like gender, nationality, etc.

I limit the example to 5 parties.

Each cell of the table contains the number of votes that each party received from each category of gender, nationality, etc.

By means of Correspondence Analysis (as stressed in previous posts) you can explore the relationship between the parties and the various variables. You can inspect the table in search of groupings and (if they exist) get a general idea of which parties are similar to one another and which variables drive the similarity.
See scatterplot in the attached pict.

Additionally, you can sort the original contingency table on the basis of the score of data-points on the relevant axes.
See the second table in the attached file.

Additionally and for ease of analysis and/or visualisation, you could distinguish (to keep with my example) two broad groups (labelled A and B). See the third table in the attached file.

Once you have devised groupings of parties and variables, you can further perform the hypothesis tests you prefer to check the statistical significance of the difference detected.
See, for example, the chi-square test performed to check the difference in gender between the two groups of parties (see the bottom picture in the attached file).
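For readers without the attached picture, a chi-square test of this kind can be sketched as follows. The counts are invented and are not the ones in the attachment; the p-value shortcut via `math.erfc` is valid only for a table with one degree of freedom (2 x 2).

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = party groups A and B, columns = men, women
counts = [[120, 80],
          [90, 110]]

stat, df = chi_square(counts)
# For df == 1 the upper-tail probability equals erfc(sqrt(stat / 2))
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat, 3), df, round(p_value, 4))
```

For larger tables the statistic is computed the same way, but the p-value has to come from a chi-square table or a statistics package.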



I hope this can help.
Regards,
Gm
 

SAJ

New Member
#13
Concerning the jpg

Thanks gianmarco, I appreciate the time you are taking to explain this to me. I have had a look at the jpg file you provided and also tried to understand a bit more how CA works.

The contingency table is clear to me. I also understand the basic chi2-logic that goes along with CA. Through CA we may see that variables are related, but we learn nothing about how strongly they are associated.

If I understand your second table correctly we may thus say that

gender2, age_class1 and age_class2 are related to party 5 and party 2.

age_class3, nationality3, nationality2, gender1, age_class4 and nationality 1 are related to parties 3,1,4.

Questions:

how should axis 1 and 2 be interpreted? Axis 1 is the parties? What is then axis 2?

The two broad groups A and B, what do they really mean?

SAJ
 

gianmarco

TS Contributor
#14
Hi SAJ.

I am sorry not to have gone deeper into the details of CA. I attach a Primer on CA that I wrote some time ago.
Since I am an archaeologist, the Primer uses some examples of archaeological interest, but the mechanics of CA remain the same and I believe that the Primer can be understood by a non-archaeologist as well.

The Primer explains how to interpret the scatterplots and provides a minimum of bibliographical references.
If you are interested in CA, I suggest reading the book by M. Greenacre (cited in the Primer). Many websites explain the basics of CA as well.

I had another look at the files you attached in your previous post, but I cannot find the parties. So, if you could provide a contingency table (you can use Excel), or extract from your data a contingency table with parties in rows and variables in columns, I will be happy to help you with CA and its results.

As for CA and the strength of association(s): CA allows you to explore the relations in your dataset, to devise groupings, and to understand the relation between objects and variables; in essence, it reduces the dimensions of your data and facilitates interpretation. The strength of association(s) must be checked at a later stage of the analysis, by means of the hypothesis tests that best fit the data and the hypotheses stemming from the exploration of the dataset.
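As a small illustration of what CA works with, the sketch below computes row profiles (each party's distribution over the column categories), the average profile, and the total inertia, which equals the chi-square statistic divided by the number of cases. The table is invented; a full CA would go on to decompose this inertia into axes, which is not shown here.

```python
# Hypothetical contingency table: parties in rows, ethnic groups in columns
parties = ["Party 1", "Party 2", "Party 3"]
counts = [[50, 30, 20],
          [10, 60, 30],
          [40, 10, 50]]

n = sum(sum(row) for row in counts)
col_totals = [sum(col) for col in zip(*counts)]

# Average profile: share of each column category in the whole table
average_profile = [c / n for c in col_totals]

# Row profiles: each party's counts divided by that party's total
row_profiles = {
    party: [x / sum(row) for x in row]
    for party, row in zip(parties, counts)
}

# Total inertia = chi-square statistic / n
inertia = 0.0
for row in counts:
    row_total = sum(row)
    for j, observed in enumerate(row):
        expected = row_total * col_totals[j] / n
        inertia += (observed - expected) ** 2 / expected / n

for party in parties:
    print(party, [round(x, 2) for x in row_profiles[party]])
print("average", [round(x, 2) for x in average_profile])
print("total inertia", round(inertia, 3))
```

Parties whose profiles differ most from the average profile are the ones that end up far from the origin on a CA map.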

When I was talking about the two broad groups, I was only giving an example. It could be of interest (or not, from your standpoint) to devise groups that differ on certain variables. You might find it interesting that group A (comprising parties x and y) is more related to men aged 24-50, whereas group B (comprising parties z and w) is more related to women aged 50-60. Or that group A is more related to a certain ethnic group than group B is. Etc.

As for your interpretation of the second table, you are right. The same interpretation should stem from the scatterplot. If you look at it, axis 1 (in my example) mainly opposes Parties 2 and 5 to Parties 1 and 4.
However, interpreting the scatterplot is a delicate step. You will find more info in the Primer.

I hope this can help, and I really hope that my hints are not leading you astray.
Any comment from other members is welcome.

Feel free to reply and to ask more.

regards,
Gm
 

SAJ

New Member
#15
Table

Thanks Gianmarco and Merry Christmas! :)

I will construct a contingency table and post it here - or perhaps even one for each election. Just one question before I do that. In order to break up the different categories I will need to recode the variables. Correct me if I'm wrong, but that would basically just mean putting in a "1" for the characteristic I want to count, e.g. for nationality "Moldovan", "Romanian", "Russian" etc. respectively.

However, within some variables there will be missing cases (within age and ethnicity) and I would, moreover, like to include only parties that entered parliament (the other ones are very marginal). For the first categories the cases are relatively few (age=1 missing; nationality=34 refusals), but for parties it will mean dropping some 900 cases out of more than 19,000.

Am I going about it the right way when preparing the contingency table? I cannot really see any other way to do it. Leaving the cases as they are would mean that we would have all the cases accounted for in the other categories, but there would e.g. be more cases for "men" and "women" than there would be for the different parties added together.

All the best,

SAJ
 

gianmarco

TS Contributor
#16
Hi SAJ and Merry Christmas.

As for building the contingency table(s):
I do not know what type of data you have (sorry, but I do not remember the details of the files you posted).
If you have presence/absence data, you can use "0" when there is no match, "1" when there is a match.

Frequency figures may well suit your data: I understand that political data would contain frequencies of votes from, e.g., each age category, sex category, etc.
In other words, I guess that you could have a table that indicates that, say, Party A received 325 votes from men, 400 from women, 200 from graduates, etc.
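If the data are stored one respondent per row, a frequency table of this kind (parties in rows, categories of several variables side by side in columns) could be built roughly as follows. The records and field names are hypothetical, not the ones in the .sav files.

```python
from collections import defaultdict

# Hypothetical respondent-level records (one dict per exit-poll interview)
respondents = [
    {"party": "Party A", "gender": "male", "education": "graduate"},
    {"party": "Party A", "gender": "female", "education": "secondary"},
    {"party": "Party B", "gender": "female", "education": "graduate"},
    {"party": "Party A", "gender": "male", "education": "secondary"},
]
variables = ["gender", "education"]

# table[party][column label] -> number of respondents
table = defaultdict(lambda: defaultdict(int))
for person in respondents:
    for var in variables:
        # One column per category of each variable, e.g. "gender=male"
        table[person["party"]][f"{var}={person[var]}"] += 1

columns = sorted({col for row in table.values() for col in row})
print(columns)
for party in sorted(table):
    print(party, [table[party][col] for col in columns])
```

Every respondent is counted once per variable, so each party's row total is the number of its voters multiplied by the number of variables included.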

As for missing cases, I believe that it would be sensible to build a contingency table with all the data you have, and then "polish" the table in later steps. At least, that is the way I work.
It would be nice if you could build the table in Excel (or, at least, export the data into Excel for ease of handling).

The same holds true for the parties: first, include all the parties you have, and then we will select only the ones you are interested in (or it could be interesting to sum up the minor ones).
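Summing up the minor parties could look something like the sketch below. The party names, the counts, and the set of parties treated as having entered parliament are all placeholders.

```python
# Hypothetical vote counts per party, split by gender: [men, women]
counts = {
    "Party A": [900, 1100],
    "Party B": [700, 650],
    "Party C": [20, 15],
    "Party D": [5, 10],
}
# Assumption for this sketch: only these parties won seats
in_parliament = {"Party A", "Party B"}

merged = {}
for party, row in counts.items():
    # Keep parliamentary parties as they are, pool the rest into "Other"
    key = party if party in in_parliament else "Other"
    previous = merged.get(key, [0] * len(row))
    merged[key] = [a + b for a, b in zip(previous, row)]

for party, row in merged.items():
    print(party, row)
```

Pooling keeps every respondent in the table, whereas dropping the minor parties shrinks the totals; either choice should be stated in the write-up.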

Regards,
Gm
 

SAJ

New Member
#17
Right, got it. In the July poll I do not see a problem keeping all the parties: instead of 5 there will be 8. In the April poll, however, instead of 5 there will be 20. (Interestingly, the electorate consolidated their votes in the new elections on the parties they thought had a chance to enter parliament.) The contingency table will thus become a very tall one.

On polishing the data later: do you not think there would be a problem (for the CA) if the total number of cases were, let's say, 1,000 and we had e.g. 500 men and 500 women, but only 930 members of ethnic groups?

Note: I see now from your example that the cases within "age_class" are actually twice as many as those for men and women taken together. It looks a bit strange, but I think I have answered my own question: for CA it does not really matter how the variables are related within the cases. The thing to do, then, is to state exactly in the article text what the data in the table include.

Yes, I will see that the table is in Excel. I have also started to have a look at PAST so I will see what I can do with that program as well.

SAJ
 

gianmarco

TS Contributor
#18
Hi again!

I think it would be easier for me to inspect your contingency table, get a general and deeper idea of the data from it, see what kind of problems (if any) may exist, and work out a way to circumvent them.

If you want, you can attach the Excel file here or, if you think it is better, send it to me as a private message.


Regards,
Gm
 

SAJ

New Member
#19
April 2009

Gm,

Attached you may find the April 2009 data. I think the table is rather self-explanatory. Parties in rows, variables in columns.

The only thing that might seem a bit strange is the localities. In Moldova there is a difference between city and municipality, where the latter are considered bigger cities. In practice this means that there are three municipalities in the country; the rest are considered villages and cities. In the July poll, municipalities are included but with a further distinction between "municipality", "city with more than 15k inhabitants", "city with less than 15k inhabitants" and "rural". Another variable in the July poll only distinguishes between "rural" and "urban", which probably makes more sense.

I'll await your reply on what you think about this table before I start working on the July poll.

Thanks.

SAJ
 

gianmarco

TS Contributor
#20
Hi Saj,
I had a look at the Excel file and it seems good!!

Please let me work on it for a while. I believe that it is necessary to scrutinize the literature on CA, since I see that the variables are in some way related to each other. Sorry if my English does not express my thoughts well.
In other words, for example, the group men-women (totalling 5991) is "nested" within the age groups (totalling again 5991), and so forth. So maybe this issue needs a different approach. I have to think about it more, and we may need to ask other members for some advice.
Or, at least, we could take groups of variables into account separately. But I guess that for you it could be more interesting to explore the cross-relation between variable groups (that is, e.g., to know what ethnic group "and" what age class "and" what gender voted for a given party). I fear that, to accomplish this latter type of analysis, the contingency table has to be built in a different way. But I have to think more about this issue.
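The "nesting" can be made concrete with a quick check on one row of such a table. The counts and column labels are invented; the point is only that every block of columns (gender, age, ...) adds up to the same set of respondents.

```python
# Hypothetical row of a stacked table for one party
row = {
    "men": 40, "women": 60,
    "age 18-29": 30, "age 30-49": 45, "age 50+": 25,
}
# Which columns belong to which variable (assumed layout)
blocks = {
    "gender": ["men", "women"],
    "age": ["age 18-29", "age 30-49", "age 50+"],
}

# Each block describes the same respondents, so the block totals coincide
block_totals = {name: sum(row[c] for c in cols) for name, cols in blocks.items()}
voters = block_totals["gender"]
row_total = sum(row.values())

print(block_totals)
print("voters:", voters, "row total:", row_total)
```

Here the row total is twice the number of voters because each voter appears once under gender and once under age. The table also cannot say how many of the 40 men are aged 18-29; for that, the variables have to be crossed before tabulating.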

As a working hypothesis, I tried taking only the ethnic groups into account for now, and the results seem rather interesting.

If you want, I could give you a first preliminary comment on the situation relating to the ethnic groups. I am unsure whether I can do this today, since today is my birthday and I am giving a party this afternoon.
But I will contact you as soon as possible.

Let me know what you think.


Kind Regards,
Gm