Complex study data analysis

Good afternoon,

I have a long-running experiment on music and emotion for my Ph.D. research. Big picture: there have been nine total ‘versions’ (instantiations of the experiment). Five different ‘designs’ were run in Dublin, Ireland. The fifth design is the one we’ve also run in New York, Manila, Bergen, Singapore, and currently in Taipei City, so the overwhelming majority of our data come from this fifth design.

In this fifth design, subjects sit and are played 3 selections of music randomly drawn from a larger pool of selections. The size and contents of this pool, as well as the number of selections played for each subject, changed across the first five versions but have been the same from the fifth version onward (though the pool has now changed in Taipei). Subjects then answer a number of questions along 5-point Likert scales following each song, including (wording is not exact here):

  • How positive or negative did the music you have just heard make you feel?
  • How active or passive did the music you have just heard make you feel?
  • How in control of your emotions did you feel?
  • How engaged were you with the music you have just heard?
  • Etc.

We also gather physiological recordings (electrodermal activity and pulse) during each playback. Finally, we have begun to gather responses to the abbreviated Big Five personality inventory from each subject.

This is becoming a very large dataset—our current subject count is at about 12,000, and we forecast 25,000+ subjects by the end of next year.

The first question I’m trying to answer right now seems fairly straightforward, but given the design of the study, I suspect there’s more to this analysis than meets the eye. We first want to look at those first two ratings: positivity/negativity and activity/passivity. We want to find out whether, according to these measures, children (≤ 10 years old) respond similarly to adults (> 10 years old). Here are some example questions I’d like to answer:

  • In general (for all songs in the bank), is there a difference between how children and adults rate the songs on these two scales?
  • For a given song, is there a difference between how children and adults rate it on these two scales?
  • Both of these questions, but considering the two scales independently.

For the first question, I figured I would run a MANOVA with age group as my IV and positivity/negativity ratings and activity/passivity ratings for each song as DVs. My gut tells me that this is not the correct approach. Even if it is, though, there are a few issues with this that I can’t seem to figure out:

  • First, for a given song, not all subjects will have listened to it. So the n for children who have listened to song A will differ from the n for adults who have listened to song A, as well as from the n for children who have listened to song B.
  • Second, for a given song, not all subjects who did listen to it will have heard it at the same point in the sequence. We have observed that physiological responses generally become more ‘muted’ over the successive presentation of songs, and we suspect the same may hold for self-report ratings.

To address these issues, previous analyses have considered only the first song played during each session, and further analysis has proceeded, for instance, only between subjects who all listened to the same song first. Obviously, we’re losing a pile of data by doing this, and it would be nice to be able to block by song and by the position in the sequence at which the song appeared. Is this possible? And if so, there’s still the issue of adjusting for the differing n.
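Roughly, the kind of blocked analysis I’m imagining would look something like this: compare children and adults within each song × order cell, then pool the per-cell differences with inverse-variance weights, so the differing n enters through the weights rather than needing to be ‘equalized’. Everything below is simulated stand-in data (all names, effect sizes, and the decision to treat ratings as numeric are assumptions for illustration):

```python
# Sketch of blocking by song and by presentation order: within each
# (song, order) cell, compare children vs. adults, then pool the
# differences with inverse-variance weights.  Simulated data only.
import numpy as np

rng = np.random.default_rng(42)
n_subj, n_songs, trials = 400, 10, 3

song_mean = rng.normal(3.0, 0.4, n_songs)   # song-specific baselines
rows = []                                   # (child, song, order, rating)
for s in range(n_subj):
    child = s % 2                           # half children, half adults
    songs = rng.choice(n_songs, size=trials, replace=False)
    for order, song in enumerate(songs):
        rating = (song_mean[song]
                  + 0.5 * child             # hypothetical group effect
                  - 0.2 * order             # 'muting' across the sequence
                  + rng.normal(0, 0.7))
        rows.append((child, song, order, rating))
data = np.array(rows)

diffs, weights = [], []
for song in range(n_songs):
    for order in range(trials):
        cell = data[(data[:, 1] == song) & (data[:, 2] == order)]
        kids = cell[cell[:, 0] == 1, 3]
        adults = cell[cell[:, 0] == 0, 3]
        if len(kids) < 2 or len(adults) < 2:
            continue                        # skip cells too sparse to use
        diffs.append(kids.mean() - adults.mean())
        var = kids.var(ddof=1) / len(kids) + adults.var(ddof=1) / len(adults)
        weights.append(1.0 / var)           # unequal n handled by the weight

diffs, weights = np.array(diffs), np.array(weights)
pooled = (weights * diffs).sum() / weights.sum()
z = pooled / weights.sum() ** -0.5
print(f"pooled child-adult difference: {pooled:.2f}, z = {z:.1f}")
```

I gather the more standard route would be a mixed-effects model with song and subject as random effects and order as a covariate, but the stratified version above at least shows that blocking by song and order while keeping all trials is possible.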

So, that’s an overview of what I’m working with right now, and there is plenty of other analysis to be done. But, this is the most pressing at the minute. If someone can give me some pointers on where to jump in with this particular analysis, beers are on me! :tup:

Thank you for your time if you've made it to the end of this post!


TS Contributor
This is becoming a very large dataset—our current subject count is at about 12,000, and we forecast 25,000+ subjects by the end of next year.
Does all this really mean that a study is being conducted with 25,000 participants and with quite costly measurements (at least in terms of participants' time consumed), and there has been no research proposal or application that outlined the data analysis beforehand? Nor is it planned to involve an expert for the data analysis?

I am asking this out of curiosity, but also because I might have understood it wrong, and perhaps you could describe which strategies for data analysis have been proposed.

With kind regards


BTW, these are Likert-type items here, not Likert scales. A Likert scale consists of several Likert-type items. That means the well-known discussions about the scale level (interval vs. ordinal) of Likert scales do not apply to your dependent measurements here. So you seemingly have ordinal-scaled DVs.
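If one does treat a single item as ordinal, a rank-based test sidesteps the interval-vs.-ordinal question for a two-group comparison. A minimal sketch with invented 1–5 ratings and deliberately unequal group sizes (Python/scipy; the response distributions are pure assumptions):

```python
# Sketch: comparing children vs. adults on one ordinal 1-5 item with a
# rank-based (Mann-Whitney U) test.  All ratings below are invented.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
# Hypothetical response distributions; group sizes deliberately unequal,
# since a rank test has no problem with differing n.
children = rng.choice([1, 2, 3, 4, 5], size=200,
                      p=[0.05, 0.10, 0.25, 0.35, 0.25])
adults = rng.choice([1, 2, 3, 4, 5], size=300,
                    p=[0.10, 0.25, 0.30, 0.25, 0.10])

u, p = mannwhitneyu(children, adults, alternative="two-sided")
print(f"U = {u}, p = {p:.3g}")
```

For the per-song questions, the same test could be run within a song (or within a song × order block), at the cost of the smaller per-cell n.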

I can't speak with certainty to all of your questions, as I 'inherited' this project after its initial design, but I will do my best.

Previous analyses of these data have been performed side by side with 'analysis experts', and there are plenty around now who would 'like' to be involved. My own work, which uses these data, cannot wait for those who would 'like' to be involved to actually get up and do it. This is why I am seeking guidance on how to proceed in exploring the dataset further. Also, because of the ease with which we are able to recruit participants, and the convenience of participation on their part, gathering more data is actually relatively cheap for both us and them.

That said, I am not entirely convinced that the design is perfect. We are, fortunately, in a position to modify the design at any time, should we see fit. Furthermore, I've taken enough statistics courses to know that analyses of data from a complex design like this are fraught with the potential for plenty of naïve mistakes. I've performed ****ed well enough in those courses to know that I certainly don't know enough to proceed blindly and avoid all of those mistakes. Hence, I'm looking to others more knowledgeable than I for guidance.

About the questions. They're Likert-type items in the sense that they were designed such that choices would be interval-scaled, not ordinal-scaled. On the other hand, they are simple rating scales in that individual items are not meant to be thought of as parallel constructs that combine to form a larger scale. Does that make sense?

Finally, thanks again for your time.