All items missing for various questionnaires

#1
Hi,
this is actually the first time I'm working on a big dataset and I really hope someone can give me some advice on how to handle missing data. I tried to find information regarding my problem but can't find any blog with the same issue.

I'm working on a dataset that includes several questionnaires. Some measure the participants' levels of anxiety, depression, etc., and some measure the same participants' perception of compassion within their organisation. The questionnaires are ProQOL, SSPS, COQ, etc., and use Likert-type scales.
The questionnaires were filled in by the same participants, the organisation's workforce, but on different days. I have incomplete data because of different working days/hours. For example, participant 1 filled in questionnaires 1, 4 and 5 but not 2 and 3. Participant 2 filled in 2, 3, 4 and 5 but not 1, and so forth. When a questionnaire is missing, all of its items are missing; no respondent answered only some of the items within a questionnaire. Some questionnaires were not answered by 51% of participants. Can anyone advise what the best approach would be in this situation?

I read a lot about MCAR, MAR and MNAR and, as far as I understand, would say the data is MCAR. Please correct me if I'm wrong. I also tested the randomness in SPSS: I went to Analyze -> Missing Value Analysis, selected t-tests with groups formed by indicator variables under the Descriptives option, and clicked EM. I thought that might be a good idea. As I said, I'm completely new to this and would really appreciate some help.
Thank you in advance.
 

Karabiner

TS Contributor
#2
It depends on what you want to do with your data, what you want to use them for, and what the research questions are.

With kind regards

Karabiner
 
#4
I think that there's a lot to unpack here, since you didn't give much information as to what you are planning on doing with the data. As such, I'll just have to speak generally about your situation/data as a whole, and I'm assuming that you will be working with 3+ variables at a time (multivariate statistical analysis).

The way that I would approach this (and not saying that this is the quickest or prettiest way) is by subsetting my variables based upon participants and questionnaires of interest. This is to ensure that whatever association/analyses I conduct, missing data is omitted. Sort of like the data that goes into Multiple Correspondence Analysis (MCA), your variables can be the questionnaires themselves, but those variables have categories (the questions on those questionnaires). If a participant didn't complete a questionnaire, their data can't really be considered in any analyses using that questionnaire, as all of the categorical data within that variable (questionnaire) would be blank. Thus, your sample size would just be decreased upon omission. You would then have to do your own comparison on the effects of other variables before and after missing data omission to determine whether or not it's significant. If the changes are significantly different in some variables, then you'd have to ask yourself whether or not those changes are relevant to your study.
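
A minimal sketch of that subsetting idea, in Python with made-up numbers. It assumes each questionnaire has already been reduced to one total score per participant (None where the questionnaire was not filled in); the names `q1` to `q3`, `complete_cases` and `observed_mean` are illustrative only and do not come from the dataset discussed here.

```python
from statistics import mean

# Hypothetical total scores per participant; None = questionnaire not filled in.
data = [
    {"id": 1, "q1": 30, "q2": None, "q3": 12},
    {"id": 2, "q1": None, "q2": 22, "q3": 15},
    {"id": 3, "q1": 28, "q2": 25, "q3": 11},
    {"id": 4, "q1": 35, "q2": 20, "q3": None},
    {"id": 5, "q1": 31, "q2": 27, "q3": 14},
]

def complete_cases(rows, questionnaires):
    """Keep only participants who filled in every questionnaire of interest."""
    return [r for r in rows if all(r[q] is not None for q in questionnaires)]

def observed_mean(rows, q):
    """Mean of questionnaire q over the rows where it was observed."""
    return mean(r[q] for r in rows if r[q] is not None)

# Analysis of interest uses q1 and q2 only, so q3 is ignored when subsetting.
subset = complete_cases(data, ["q1", "q2"])
print("n before:", len(data), "| n after omission:", len(subset))

# Compare each variable before and after omission.
for q in ("q1", "q2"):
    print(q, "mean, all available:", observed_mean(data, q),
          "| complete cases:", observed_mean(subset, q))
```

Subsetting on only the questionnaires a given analysis needs keeps the sample as large as possible for that analysis; the before/after comparison is the informal check described above, not a formal test.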

If you did still want to use the incomplete data (again, I would not suggest this), then you're really limited in what you can do. Pairwise Deletion, in which each bivariate correlation is estimated on all data available for that pair of study variables, really only performs well with MCAR data, a large sample size, and no more than 5% of the data missing. The problem if you then decide to use your Pairwise-Deleted data is that most multivariate analyses will not run on it. If you are interested as to why: Pairwise Deletion has a much greater tendency than Listwise Deletion (another technique for dealing with multivariate missing data) to produce a non-positive definite variance-covariance matrix.
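
To illustrate the point about pairwise deletion, here is a small Python sketch with deliberately extreme, made-up data in which each pair of variables is observed on a different group of people. The helper names `pairwise_corr` and `det3` are hypothetical. Each pairwise correlation is valid on its own, but together they form a matrix that no complete dataset could produce.

```python
from math import sqrt

# Made-up scores for three variables (x, y, z); None marks a missing value.
# Each pair of variables is observed together on a different set of people.
rows = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # x and y observed
    (None, 1, 1), (None, 2, 2), (None, 3, 3),   # y and z observed
    (1, None, 3), (2, None, 2), (3, None, 1),   # x and z observed
]

def pairwise_corr(rows, i, j):
    """Pearson correlation using only rows where both variables are present."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    mi = sum(a for a, _ in pairs) / len(pairs)
    mj = sum(b for _, b in pairs) / len(pairs)
    sij = sum((a - mi) * (b - mj) for a, b in pairs)
    sii = sum((a - mi) ** 2 for a, _ in pairs)
    sjj = sum((b - mj) ** 2 for _, b in pairs)
    return sij / sqrt(sii * sjj)

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Correlation matrix assembled pair by pair (pairwise deletion).
R = [[1.0 if i == j else pairwise_corr(rows, i, j) for j in range(3)]
     for i in range(3)]

for row in R:
    print(row)
print("determinant:", det3(R))
```

Here x and y correlate perfectly, y and z correlate perfectly, yet x and z correlate perfectly negatively. The determinant is negative, so at least one eigenvalue is negative and the matrix is not positive definite, which is why procedures such as factor analysis can fail on it.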

The most promise for multivariate missing data lies in Multivariate Imputation (MVI). I'd look into it if you really can't eliminate your missing data. Disclaimer: it is sensitive to missing data bias and high measurement error, increasingly so with smaller sample sizes. In this case, your input sample size would be the number of people WITH data, with the intention that MVI will fill in the remainder of your data.
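
For orientation only: multiple imputation is commonly run by creating several completed datasets, analysing each one separately, and then pooling the results with Rubin's rules. The sketch below shows only that pooling step in Python; the imputation itself is not shown, the function name `pool_rubin` is hypothetical, and the estimates and variances are invented for illustration.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m per-dataset estimates and their squared standard errors.

    Returns the pooled estimate and its total variance:
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average sampling variance
    between = variance(estimates)     # sample variance across imputations (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Invented results from m = 3 imputed datasets (e.g. a mean or a regression slope).
estimates = [2.0, 2.2, 1.8]
variances = [0.1, 0.1, 0.1]

pooled, total = pool_rubin(estimates, variances)
print("pooled estimate:", pooled)
print("total variance:", total)
```

The between-imputation term grows when the imputed datasets disagree with each other, which is how the extra uncertainty from heavily missing data ends up in the final standard error.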

Again, just because this method shows the most promise, that does NOT mean that you should use it here. With 51% of your data missing for some of your questionnaires/variables (and thus "x" number of categories within that variable), I would suggest data elimination and analyzing the differences between data before and after missing data elimination! Hope I was able to help in SOME way!
 
#5
Hi Karabiner,

I responded yesterday, but it seems my internet connection broke down and the reply wasn't posted. Thank you for your response. I want to do a Cluster Analysis and a Factor Analysis. Some clinical psychologists worked on the data before and conducted a multiple regression analysis and a factor analysis using only some of the variables. My research question will be formulated once I have a better understanding of the data. The cluster analysis and factor analysis would be useful for finding out where to allocate resources and which questions measure the most important factors for compassion within the organisation and the workforce, which helps prevent symptoms of burnout, compassion fatigue, etc. Thank you for taking the time, and please let me know if you need any more information.
 
#6
There are a wide range of methods such as multiple imputation to handle missing data.
Hi Noetsy,

thank you for your answer. I looked up the possible methods and read the literature to find out which one would be most suitable, but my case is very specific and I'm not sure it's appropriate, as either all items within a questionnaire are missing or none, while other questionnaires were answered by the same participants. Also, over 50% of my data is missing in total.
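
The all-or-none pattern described here can be checked directly rather than assumed. A small Python sketch, using invented participants and questionnaire names (`QA`, `QB`) and a hypothetical helper `missing_pattern`; one partially answered questionnaire is included only for contrast.

```python
# Hypothetical item-level answers: participant -> questionnaire -> item responses.
# None marks an unanswered item; names and values are invented for illustration.
responses = {
    "p1": {"QA": [3, 4, 2], "QB": [None, None, None]},
    "p2": {"QA": [None, None, None], "QB": [5, 1, 2]},
    "p3": {"QA": [2, 2, 4], "QB": [1, None, 3]},  # partial, added for contrast
}

def missing_pattern(items):
    """Classify one participant's answers to one questionnaire."""
    n_missing = sum(v is None for v in items)
    if n_missing == 0:
        return "complete"
    if n_missing == len(items):
        return "all missing"
    return "partial"

def summarise(responses):
    """Count the three patterns per questionnaire across participants."""
    summary = {}
    for answers in responses.values():
        for name, items in answers.items():
            counts = summary.setdefault(
                name, {"complete": 0, "all missing": 0, "partial": 0})
            counts[missing_pattern(items)] += 1
    return summary

summary = summarise(responses)
for name, counts in summary.items():
    share = counts["all missing"] / sum(counts.values())
    print(name, counts, f"share fully missing: {share:.0%}")
```

If every questionnaire shows zero "partial" cases, the missingness is at the questionnaire level rather than the item level, which matters when choosing between the methods discussed in this thread.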
 

noetsi

Fortran must die
#7
50 percent is a lot. But polls, increasingly, have very low response rates and are still used in the academic literature. Whether this is reasonable is beyond my expertise.

Part of the issue of how reasonable it is to use such approaches is whether the people who did not respond differ systematically from those who did, that is, whether the data is MAR, MCAR and so on. I think Multivariate Imputation works best of the methods for handling missing data, but if the reason it's missing is tied to something about the people who did not respond, you are going to have problems. In addition, these approaches are not simple and (I have not worked with this for a long time) I think they work primarily when you have some responses from each person, just not all of them. I could be wrong there... as I said, it's been a long time.
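
One rough way to look at whether non-responders differ from responders, in the spirit of the indicator-variable t-tests mentioned earlier in the thread: split people by whether they answered one questionnaire and compare their scores on another. The Python sketch below uses invented scores and a hypothetical helper `welch_t`; it computes only the Welch t statistic, not a p-value.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical records: score on questionnaire A (answered by everyone here)
# and whether the same person also answered questionnaire B.
records = [
    (20, True), (22, True), (19, True), (24, True), (21, True),
    (21, False), (23, False), (20, False), (22, False), (19, False),
]

def welch_t(a, b):
    """Welch's t statistic for two independent groups with unequal variances."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Groups formed by the missingness indicator for questionnaire B.
answered_b = [score for score, has_b in records if has_b]
skipped_b = [score for score, has_b in records if not has_b]

t = welch_t(answered_b, skipped_b)
print(f"mean A, answered B: {mean(answered_b):.2f}")
print(f"mean A, skipped B:  {mean(skipped_b):.2f}")
print(f"Welch t statistic:  {t:.3f}")
```

A statistic close to zero is compatible with MCAR but does not prove it: the comparison only covers variables that were observed, so missingness tied to something unmeasured would go undetected.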

If you are interested I can try to dig out the tome I created on this topic, but it's not been rewritten and it uses SAS extensively.
 
#8
Hi Noetsy,

thank you for your response. The data is definitely missing completely at random: I ran some tests beforehand to check whether it's missing at random or not. Thank you for offering to upload the information, that would be lovely :)
 
#9
Hi Devin,

thank you a lot for your email, it is very helpful! You're right, I will be working on multiple variables at a time. In fact, I'm trying to come up with fewer items to measure the same outcomes. I will therefore conduct a Factor Analysis and a Cluster Analysis. Your response helped me a lot and I will definitely follow your suggestions. Again, thank you so much for your help.