SEM - general applicability of analysis, variable aggregation, and data sufficiency



Dear all,

I just discovered this amazing forum and was wondering whether some of you might have answers to a few questions I could not find clear answers to on the web or in tutorials:

Research Design:
The research I am doing is in the context of a Master Thesis in Organizational Behavior. So concerning my level of statistical knowledge: I don't have a well-founded background in statistics, but I am able to read up on the meaning of indicators and interpret tests in SPSS and AMOS. ;)

I am proposing a model in which four latent 2nd-order variables influence each other consecutively in a circular manner, i.e. a non-recursive model with a feedback loop (V1 ---> V2 ---> V3 ---> V4 ---> V1, etc.). I collected data at two points in time: T1 (June, N=295) and T2 (July, exactly one month later, N=192). I did this to test whether the effects I am proposing are visible in the short term. A MANOVA showed that the effects over one month are non-significant (so it turns out they are not...). However, the bivariate correlations between my study variables at T1 neatly fit my hypotheses, so I am going to continue my analysis using just the T1 data set.

Because I wanted to plan for the possibility of over-time effects, and because several "longitudinal" papers in the field I am analysing (employee engagement) employed this technique, I spent a few days reading up on Structural Equation Modelling. Now that I have read about it, I find it a very neat technique and would like to continue using it even though I am working with cross-sectional data.

However, I have a few doubts, and I would be glad if you could give me some advice:

Measurement Model and manifest variables:
In one paper (p. 238, section 2.3, at the end of the page) that I looked at for advice on how to structure my tests, the authors talk about using "manifest variables" to "reduce the complexity of the SEM model". Specifically, the section says the following:

Due to our relatively small sample size, we reduced the complexity of our hypothesized SEM models (i.e. the number of freely estimated parameters) without paying the price of losing information, by using manifest variables (Jöreskog & Sörbom, 1993). To use scores for our ‘job resources’, ‘personal resources’ and ‘work engagement’ manifest variables that encapsulate the factor loadings of their underlying dimensions, we calculated their weighted factor scores. Specifically, we conducted second-order principal axis factoring (PAF) analysis with varimax rotation on the five job resources, the three personal resources, and the three work engagement dimensions at both measurement times. The advantage of this method is that it takes into account the factor loadings of each sub-dimension, while calculating the factor score. PAF analyses resulted in one job resources factor (42% of explained variance at T1 and 41% at T2), one personal resources factor (32% of explained variance at T1 and 38% at T2), and one work engagement factor (68% of explained variance at both measurement times). Thus, the manifest ‘job resources’ variable represented the factor score of the five job resources scales, the manifest ‘personal resources’ variable represented the factor score of the three personal resources scales, and the manifest ‘engagement’ variable represented the factor score of the three work engagement subscales.
I have two questions about that:
Firstly: If I understood that passage correctly, they simply aggregated their variables (by calculating averages) to have fewer paths in their model, and afterwards ran a factor analysis to find out how much of the first-order variables' variance these factors explained? I am a bit confused about why they did that, because earlier in the paper (in the paragraph before the one I quoted here) they say that their 2nd-order variable structure already resulted from the confirmatory factor analysis they performed while deriving their measurement model. If the exploratory factor analysis afterwards was just there to derive the loadings of the first-order variables on the second-order variables, that could have been said in one sentence... So it would be great if you could tell me whether I misunderstood something that might be relevant to my analysis; I too am aggregating my data into second-order variables.
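To check whether I understand what "weighted factor scores" means (as opposed to a simple average), I tried to sketch it numerically. This is just my own rough reading in NumPy, with invented variable names and simulated data; the paper used PAF in SPSS, and the first-eigenvector approximation below is not identical to that:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 5 "job resources" subscale scores for 295 respondents,
# all driven by one common factor (names and numbers invented for illustration).
common = rng.normal(size=(295, 1))
X = common @ rng.uniform(0.5, 0.9, size=(1, 5)) + rng.normal(scale=0.5, size=(295, 5))

# Standardize, then take the first eigenvector of the correlation matrix
# as approximate loadings of a one-factor solution.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
loadings = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
loadings *= np.sign(loadings.sum())       # fix the sign for readability

# Weighted factor score: each subscale contributes in proportion to its loading,
# instead of the equal weights an unweighted mean would use.
factor_score = Z @ loadings / loadings.sum()
simple_mean = Z.mean(axis=1)

# With one dominant factor the two should correlate highly, but they diverge
# when the subscales load unequally.
print(np.corrcoef(factor_score, simple_mean)[0, 1])
```

So, if my reading is right, the point of their procedure is exactly the weighting step, not the aggregation as such.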

Secondly: Calculators that indicate sufficient sample sizes for SEM (e.g. this one) don't ask about the relationships in the model, but only for the total number of latent and observed variables. All of these latent and observed variables need to be included in the final model anyway, don't they, since this model depends on the measurement model? So no matter how much you aggregate your data, the sample size you need to test your hypotheses should not decrease unless you drop items (observed variables) or first-order variables (latent variables)? Adding additional latent variables (the second-order variables) should actually increase the necessary sample size, shouldn't it?

I have 101 observed variables/survey items, which I aggregate into 21 first-order variables, which I further aggregate into 5 second-order variables. According to the statistics calculator I mentioned above, my sample size of 295 should be sufficient to observe the effects I want to observe (effect sizes larger than .2). However, now I am not so sure any more. It would be great if you could clarify whether my sample size is sufficient, or whether I need to take more factors into account when calculating the necessary sample size.
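For what it's worth, here is the back-of-the-envelope calculation I tried myself, based on my reading of the N:q rule of thumb (roughly 5-10 cases per freely estimated parameter, e.g. Bentler & Chou, 1987). The parameter counts are my own rough guesses for a model using the 21 first-order variables as indicators, not exact AMOS output:

```python
# Very rough free-parameter count for my structural model (my own guesses;
# loadings fixed to 1 for scaling etc. are ignored for simplicity).
n_indicators = 21   # first-order variables used as indicators
n_latent = 5        # second-order latent variables

loadings = n_indicators    # one loading per indicator
residuals = n_indicators   # one residual variance per indicator
paths = n_latent           # ~one structural path per latent (circular model)
variances = n_latent       # one variance/disturbance per latent

q = loadings + residuals + paths + variances
print(f"{q} free parameters -> N between {5 * q} and {10 * q} by the N:q rule")
```

By this crude count my N=295 would sit at the low end of the recommended range, which is part of why I am unsure whether the calculator's answer can be trusted.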

Thanks for reading through all of this and in advance for your answers! I greatly appreciate your help.
