Missing Data Analysis Question

Hi gurus,

I would really appreciate some advice on handling missing values.

Briefly, subjects rated 21 objects (Obj1...Obj21) on 14 attributes (attr1...attr14). Given the size of the full questionnaire (21*14 = 294 ratings), we split it into two parts:

N1 (about 150) subjects rated Obj1...Obj11 on all 14 attributes.
N2 (about 120) subjects rated Obj12...Obj21 on all 14 attributes.

Both N1 and N2 appear to come from the same population (based on chi-squared tests on different demographic variables).

Our goal: perform exploratory factor analysis and obtain factor scores on all 21 objects.

My questions are:

Q1. I basically need a complete correlation matrix (21 objects * 14 attributes). About 50% of the data is missing. What kind of imputation should I use to generate the other 50% of missing data?

Q2. Can you recommend a text or journal article which discusses modern missing data analysis procedures - I'm OK with linear algebra and basic stat concepts and can handle notation?

Q3. Can you point me to a programming manual/tutorial for SPSS?

thanks in advance,



Phineas Packard
I think I may be the bearer of bad news here. I don't think you will be able to impute the data, because there is no overlap at all between the two groups: no respondent rated objects from both halves, so no missing data model can be built for the relationship between them. Missing by design is common, but not the way you have it. Typically there is always some overlap that can be used to build a missing data model (see Craig Enders' 2010 book, Applied Missing Data Analysis).
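
To make the "no overlap" point concrete, here is a minimal sketch that counts, for every pair of objects, how many respondents rated both. The group sizes are the approximate ones quoted in the question; the `design` dictionary and the `joint_counts` helper are hypothetical names used only for illustration.

```python
from itertools import combinations

# Split design from the question (sample sizes are approximate).
design = {
    "N1": {"n": 150, "objects": range(1, 12)},   # Obj1..Obj11
    "N2": {"n": 120, "objects": range(12, 22)},  # Obj12..Obj21
}

def joint_counts(design):
    """For every pair of objects, count respondents who rated both."""
    all_objects = sorted({o for g in design.values() for o in g["objects"]})
    counts = {pair: 0 for pair in combinations(all_objects, 2)}
    for group in design.values():
        for pair in combinations(sorted(group["objects"]), 2):
            counts[pair] += group["n"]
    return counts

counts = joint_counts(design)
unobserved = [pair for pair, n in counts.items() if n == 0]
print(len(counts), "object pairs in total")
print(len(unobserved), "pairs never rated by the same respondent")
```

Every pair that mixes an object from Obj1...Obj11 with one from Obj12...Obj21 has a joint count of zero, so the data carry no direct information about the correlations in that block of the matrix.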
Thanks for the response.

I do have some common data (about 8 items on demographics) across the two groups. Will that help?



Phineas Packard
It is better than nothing (how much better will depend on how strongly those demographics are associated with the ratings of the objects). I think you could give it a go and see if you are successful, but check the iteration plots (for means and standard deviations) and convergence carefully to see if the results are sensible. Even if the results are OK, you may have to be prepared for extremely large standard errors, given the large amount of uncertainty there will be in your missing data model.

What I would think about doing, if you have the chance, is collecting a third sample that rates objects 5 to 15 or so, which would give you the overlap you need.
Thanks again Lazar,

On reflection, we plan to collect new data.

Based on the "missing by design" phrase in your comment, I started searching for published papers which deal with split-questionnaire designs (SQD) and found some. I am trying to download and read some of the papers.

I came across a specific planned missing design called the 3-form design (Graham, 2006). If you have some insights, I would appreciate it if you could share them.
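
For anyone following along, here is a rough sketch of how I understand the 3-form layout: items are split into blocks X, A, B and C; block X appears on every form and each form leaves out one of the other three blocks. The form names and the `forms_covering` helper are my own illustration, and how the 21 objects would be assigned to blocks is left open.

```python
from itertools import combinations

# 3-form layout: X is on every form, each form omits one of A, B, C.
forms = {
    "form1": {"X", "A", "B"},
    "form2": {"X", "A", "C"},
    "form3": {"X", "B", "C"},
}

def forms_covering(pair, forms):
    """Names of the forms on which both blocks of the pair appear."""
    return sorted(name for name, blocks in forms.items() if set(pair) <= blocks)

# Check every pair of blocks for at least one form that contains both.
coverage = {pair: forms_covering(pair, forms) for pair in combinations("XABC", 2)}
for pair, names in coverage.items():
    print(pair, "->", names)
```

Unlike our original two-part split, every pair of blocks shows up together on at least one form, so each block of the correlation matrix has some respondents behind it.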



Less is more. Stay pure. Stay poor.
Were the two original samples large enough to detect a statistically significant difference between them on the demographics? That is always a point of concern when you want to show that samples are comparable and are hoping for a non-significant result: is there enough power?

However, the listed plan seems like a plausible way to try to bridge the two sets and answer some questions.

The sample sizes were 150 and 120 for the two samples - good enough to conclude that they were not significantly different.

The issue is that the data was collected in 2007 and deals with perceptions/attitudes which change quite a bit. If I collected additional data on a subset of objects overlapping the two samples as Lazar suggested, I guess I will have to make a strong assumption that partial correlations are the same in 2012, even if means have changed.

On more pragmatic grounds, the setup cost for conducting a survey is high, but the marginal cost per subject is not. Thus, I thought I might as well collect a new sample. Still planning.