Study design flaw - no participant IDs.

#1
Hi :wave:,

I have been asked to do the statistical analysis for a mixed-methods study whose quantitative part was supposed to follow a pre/post design.

The study aims to assess:
1) Whether a leadership intervention made a statistically significant difference in two separate cohorts. The pre/post measurement is a questionnaire with two 7-point Likert subscales. The two subscales can be added up to create an 'overall' score per participant.

2) Whether an embedded learning intervention made a statistically significant difference in two separate cohorts. The pre/post measurement is a questionnaire developed by the researchers, which comprises 2 categorical questions and 1 ordinal question.

I was brought onto the project after it had been designed, funded and granted ethics approval, just as data collection started. I was told that the original analysis plan was t-tests.

I noticed that the project does not assign participant IDs. When I asked why, I was told that the data-gathering sites would only allow completely anonymous data gathering. Worse still, when data gathering commenced, the researchers ran it in a pre/post fashion.

Since no IDs were provided at all, I have created my own system for tracking where every piece of data comes from and which stage of data gathering it belongs to, just to be able to enter it into SPSS.

So this no IDs issue raises two problems for me regarding analysis:
1) Without participant IDs from the start, I cannot link Participant 1 PRE to Participant 1 POST, and so on, when entering the data into SPSS (in a Participant column, a PRE column and a POST column). That rules out a paired pre/post analysis, because the related-groups assumption cannot be met, and pushes me towards treating the data as two independent samples.
2) But because the data gathering followed a pre/post method, the groups are not independent samples, which violates the independent-groups assumption. That in turn leads me to think that neither a pre/post analysis nor a two-independent-samples analysis is possible, whether parametric or non-parametric.

So my question is: apart from means/medians, standard deviations/standard errors, and percentages, are there any inferential statistics I could use to test the hypotheses? The Hypo 1 measure would call for parametric tests and the Hypo 2 measure for non-parametric tests, and the sample sizes are unequal.

Many thanks in advance for any advice ye have to offer and hope it all makes sense. :)
 
#2
I would think you could use a repeated measures ANOVA for this type of situation -- it presumes you would be checking the same people for the same dependent variable pre and post. That's what you have, right?
 

hlsmith

#3
Sorry to hear about this. I will think on it, but there are major issues as you point out.


Was there an intervention/exposure that was supposed to change the outcome value?


The example I always supply is that one person could have had a value of 5 pre and gone to 7 post, while the same thing happened in the opposite direction for someone else, resulting in identical pre and post means. Can you give us a feel for how many people there are and how many per study site?
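A minimal sketch of that cancellation, using the hypothetical values 5 and 7 from the example (not study data):

```python
from statistics import mean

# Hypothetical scores for two participants, taken from the example above.
pre = [5, 7]   # participant A, participant B
post = [7, 5]  # same two participants after the intervention

# The group means are identical, so a comparison of means sees no change.
mean_pre = mean(pre)
mean_post = mean(post)
print(mean_pre, mean_post)

# The individual changes are only recoverable when IDs link PRE to POST.
changes = [after - before for before, after in zip(pre, post)]
print(changes)  # [2, -2]
```

Without IDs only the two means are available, so the individual +2 and -2 changes stay hidden.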


I am guessing you do not have any secondary data to triangulate or impute an ID? From my experience the options are very limited. I would be interested to hear if anyone has a suggestion.
 
#4
I would think you could use a repeated measures ANOVA for this type of situation -- it presumes you would be checking the same people for the same dependent variable pre and post. That's what you have, right?
Yes, I am checking the same people in essence, but I think a repeated measures ANOVA wouldn't work because I do not have participant IDs. When entering the data in SPSS you need a column for IDs, and the values in the other columns (e.g. PRE, POST et cetera) need to line up row by row with the participant they belong to. Since no IDs were given, I cannot tell which questionnaire was filled in by, let's say, Participant 1 at PRE or at POST, so I cannot enter it into SPSS to conduct the analysis. I can only differentiate the data by which Hypo measurement it is, which cohort it belongs to, and whether it is PRE or POST. I cannot differentiate it per participant, which is what the SPSS layout requires.
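To illustrate the layout problem, here is a rough sketch of the two data shapes, written in Python rather than SPSS. The field names (`participant_id`, `stage`, `score`) and all values are made up for illustration.

```python
from collections import defaultdict

# Layout a paired (pre/post) analysis needs: one row per participant,
# with PRE and POST scores linked by an ID. Values are hypothetical.
paired_rows = [
    {"participant_id": 1, "pre": 5, "post": 7},
    {"participant_id": 2, "pre": 7, "post": 5},
]

# Layout that is possible without IDs: one row per questionnaire,
# labelled only by cohort and stage.
unlinked_rows = [
    {"cohort": 1, "stage": "PRE", "score": 5},
    {"cohort": 1, "stage": "PRE", "score": 7},
    {"cohort": 1, "stage": "POST", "score": 7},
    {"cohort": 1, "stage": "POST", "score": 5},
]

# With IDs, a per-participant change score can be computed.
changes = {row["participant_id"]: row["post"] - row["pre"] for row in paired_rows}

# Without IDs, scores can only be grouped by stage.
scores_by_stage = defaultdict(list)
for row in unlinked_rows:
    scores_by_stage[row["stage"]].append(row["score"])

print(changes)                # {1: 2, 2: -2}
print(dict(scores_by_stage))  # {'PRE': [5, 7], 'POST': [7, 5]}
```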

The only way I was able to enter the data into SPSS was by creating a coding system that treats them as independent samples, because the row-matching issue between the 3 columns then doesn't come into play. But that violates the independent-groups assumption of a two-independent-samples test, since I know the groups are not independent. That in turn makes me think no inferential analysis can be done.
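For reference, a minimal sketch of what that violation does to the standard error of the difference in means. It assumes every participant answered at both PRE and POST (which is not the case in this data set), and the standard deviations, sample size and correlation are hypothetical.

```python
import math

def se_of_mean_difference(sd_pre, sd_post, n, rho):
    """Standard error of (mean POST - mean PRE) for n participants measured
    twice, where rho is the correlation between a person's two scores."""
    variance = (sd_pre**2 + sd_post**2 - 2 * rho * sd_pre * sd_post) / n
    return math.sqrt(variance)

# Hypothetical values, not study data.
sd_pre, sd_post, n = 1.0, 1.0, 30

# An independent-samples test implicitly takes rho = 0.
se_assumed = se_of_mean_difference(sd_pre, sd_post, n, rho=0.0)

# If the same people are positively correlated with themselves, the true
# standard error is smaller than the one the independent test uses.
se_actual = se_of_mean_difference(sd_pre, sd_post, n, rho=0.5)

print(round(se_assumed, 4), round(se_actual, 4))
```

Under these assumptions, treating positively correlated pre/post data as independent overstates the standard error. With only partial and unknown overlap between PRE and POST respondents, the real situation sits somewhere between the two values.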

I stand to be corrected, as my understanding of statistics is limited. Background-wise, I come from a psychology perspective with limited postgrad studies in statistics.
 
#5
Sorry to hear about this. I will think on it, but there are major issues as you point out.


Was there an intervention/exposure that was supposed to change the outcome value?
Hypo 1 had an intervention that was given to the bosses of the participants to see if the culture in the organisation would change. The data was gathered over 4 weeks PRE intervention and, re-administered, over 4 weeks POST, and each participant filled out the measurement only once. Again, due to the lack of IDs, I cannot track across PRE and POST which participants have answered and which have not.

Hypo 2 had an intervention involving a serious game played by the participants to see if their organisational behaviour would change. Here the study design gets more complicated, since the data was gathered over 4 weeks PRE intervention and 4 weeks POST, with the measurement re-administered WEEKLY in each 4-week period. Again, due to the lack of IDs, I cannot track over the 4 weeks of PRE and POST whether participant replies are missing, or who replied when. That is why I have decided to just pool the data into an 'overall' PRE and an 'overall' POST for Hypo 2.

This was applied to both separate cohorts.


The example I always supply is that one person could have had a value of 5 pre and gone to 7 post, while the same thing happened in the opposite direction for someone else, resulting in identical pre and post means. Can you give us a feel for how many people there are and how many per study site?
I will focus on one cohort's numbers to keep it as simple as possible.
Cohort 1 had an overall PRE of n = 45 and POST of n = 17 for Hypo 1. The same cohort had an overall PRE of n = 100 and POST of n = 49 for Hypo 2 (where the measurements were re-administered weekly). In theory the number of participants should be identical for Hypo 1 and Hypo 2 in Cohort 1, but the figures are hard to reconcile: 45 * 4 is 180, not 100. So over the 4 PRE weeks a total of 80 expected responses were not filled out. It is the same story with the POST, and again it cannot be tracked due to no IDs.
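The arithmetic behind those figures, as a small sketch. It assumes that the Hypo 1 headcount is the number of people who could have responded in each of the 4 weeks for Hypo 2; the function name is made up for illustration.

```python
WEEKS = 4

# Figures quoted above for Cohort 1.
hypo1_n = {"PRE": 45, "POST": 17}   # Hypo 1: one response per participant
hypo2_n = {"PRE": 100, "POST": 49}  # Hypo 2: pooled weekly responses

def expected_and_missing(stage):
    """Return (expected responses, missing responses) for one stage,
    assuming every Hypo 1 participant could respond in each week."""
    expected = hypo1_n[stage] * WEEKS
    return expected, expected - hypo2_n[stage]

for stage in ("PRE", "POST"):
    print(stage, *expected_and_missing(stage))
# PRE 180 80
# POST 68 19
```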


I am guessing you do not have any secondary data to triangulate or impute an ID? From my experience the options are very limited. I would be interested to hear if anyone has a suggestion.
Short answer is yes, but not enough. The only information that allowed me to create IDs was the name of the site, cohort type, PRE/POST, week of data gathering, whether the measurement is new or a repeat, and an ID I assigned myself to differentiate between participants within each week. Having said that, I cannot definitively say that Participant 1 in week 1 is Participant 1 in week 2. That is why I cannot enter the data for a pre/post analysis in SPSS; the number is arbitrary and only lets me tell apart the otherwise identical questionnaires for that week.

Hope it all makes sense and thank you!
 
#6
I think I might have a solution regarding my no IDs & pre/post style data gathering issues. What I haven't mentioned before is that we have two sites where the study was conducted. The reason I didn't mention this before is because currently I am focused on compiling a statistical report for one of the two sites which limited my perspective on the whole study to the 'relevant' parts.

So, long story short, I cannot conduct an independent two-sample analysis because the data violate the assumption of independent groups. What if I instead analysed Site 1 PRE data against Site 2 POST data, and Site 2 PRE data against Site 1 POST data, rather than looking for a solution that compares Site 1 PRE data with Site 1 POST data etc.? I would apply this to both cohorts for each Hypo. That way I would not violate the independent-groups assumption and would be able to conduct inferential statistics.

Now, I know this would make it impossible to create a site-specific statistical report for each site, but at least the data analysis could be done for the overall study. Does anybody see anything wrong with this idea? Would such a large change from the originally proposed analysis plan have any ethical implications, such as claims of 'data mining'?
 