Multi-level paired data: mutliple linear model or linear-mixed-effects model? or other? (with R software)


New Member

I have a stats thinking about how to best analyse my data.

I have sampled forest floor samples in 10 sites, 1 sample per site. In each sample, I quantified the density of 5 different bacteria strains. I want to know if one bacteria strain is more represented on average than others, and also I want to know if one site has significantly more bacteria than others, and finally if there is an interaction density: strain. As far as I understand, I cannot assess any interaction because I only have one observation per bacteria strain per site (1 sample). Furthermore, the data is “paired” because each sample (=each sampling site) is screened for the 5 bacteria strains. I wonder what kind of statistical test would be the most appropriate.

Assuming assumptions are met, if I run one-way anova and posthoc tests, I cannot account for the “multiple paired data” feature (each sample has 5 strains observations, so these observations must account for the potential "sample" effect, eg some samples may have on average much higher densities of bacteria). I cannot compare the strains without account for the sample effect (which is here confounded with the site effect). For example, I think the following would be incorrect:
lm(density~ strain, data=data)

I could run repeated paired t.test or paired wilcoxon but I don’t know then how to correct for multicomparison of such multiple paired test. Is a two-way anova correct to use in such case?

lm(density~ strain + site, data=data) # no interaction possible because just 1 observation per strain per site

Otherwise, I was thinking about linear mixed-effect models, as for example:

lmer(density ~ strain + (1|site), data=dataset)

I wonder if I can split the “site” variable into two modalities (so n=5 per site now as per "west" vs "east" of the forest) , or as a numerical variable (eg distance from the midpoint) and so test the following model:

lmer(density ~ strain*site + (1|sample), data=dataset)

Hence I would account for the “paired” design of multiple strain analysis per sample, and the potential effect of their zone of sampling and the interaction of these. If the model assumption are not met, I wonder what else I could explore, maybe GLM… but I wonder if it is worth looking for more “sophisticated” statistical analysis or conform me with “simpler” stats.

If you have any help or comment I would really appreciate,
Many thanks all!