Stats for linguists

Hi all.

I've been working on my PhD for a year and a half now, and I'm still no further forward in understanding how statistics works.

At the moment, I'm analysing a few vowel sounds, and I want to figure out whether the production of these vowels is influenced by a number of social and linguistic factors.

I have 3 groups:

Group 1 = 3 speakers
Group 2 = 3 speakers
Group 3 = 1 speaker

I have taken 50 tokens per speaker, which I then subdivided into 6 categories according to linguistic environment. Due to the nature of the data, however, there was no way to control the number of tokens per environment per speaker. So, for environment 1 I have (for example):

Speaker 1 = 14 tokens
Speaker 2 = 7 tokens
Speaker 3 = 19 tokens
Speaker 4 = 9 tokens
(and so on...)

In other environments, some speakers have 0 tokens.
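
To make the imbalance concrete, here is a minimal sketch of how the tokens could be tabulated per speaker and environment so the empty cells show up. Only the environment-1 counts for speakers 1 to 4 come from the figures above; the labels S1 to S4, the environment-2 counts and the helper name `empty_cells` are placeholders made up for illustration.

```python
from collections import Counter

# One record per token: (speaker, environment).
# Environment-1 counts are the ones listed above; the environment-2
# counts are invented placeholders, not real data.
tokens = (
    [("S1", 1)] * 14 + [("S2", 1)] * 7 + [("S3", 1)] * 19 + [("S4", 1)] * 9
    + [("S1", 2)] * 5 + [("S3", 2)] * 2
)

speakers = ["S1", "S2", "S3", "S4"]
environments = [1, 2]

# Counter returns 0 for any (speaker, environment) pair that never occurs.
counts = Counter(tokens)


def empty_cells(counts, speakers, environments):
    """Return every (speaker, environment) pair with no tokens at all."""
    return [(s, e) for s in speakers for e in environments if counts[(s, e)] == 0]


# Print one row per speaker, one column per environment.
for s in speakers:
    print(s, [counts[(s, e)] for e in environments])

print("empty cells:", empty_cells(counts, speakers, environments))
```

A table like this shows at a glance which speaker-by-environment combinations cannot be compared at all.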

When I ran a one-way ANOVA on the entire dataset (150 tokens vs. 150 tokens vs. 50 tokens, i.e. all tokens pooled by group), I got significant results, but I'm not sure I was right to do this, especially since I'm not comparing 'like with like'. Short of filling in the gaps in the tokens per environment, is there any way to test for significance with small (really small) numbers of tokens per environment?
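
For reference, this is a rough sketch of what a one-way ANOVA on the pooled data computes, written out by hand so each step is visible. The measurements are random placeholders (not real vowel data), and the function name `one_way_anova_f` is made up for illustration. Only the group sizes of 150, 150 and 50 come from the description above.

```python
import random


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means vs. the grand mean,
    # weighted by group size (so unequal sizes are handled).
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: each token vs. its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Placeholder measurements: 150, 150 and 50 tokens for groups 1 to 3.
rng = random.Random(0)
group1 = [rng.gauss(500, 40) for _ in range(150)]
group2 = [rng.gauss(510, 40) for _ in range(150)]
group3 = [rng.gauss(520, 40) for _ in range(50)]

f_stat, df_b, df_w = one_way_anova_f([group1, group2, group3])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Written this way, every token counts as an independent observation: the within-group degrees of freedom come from 350 tokens, even though only 7 speakers produced them, and speaker and environment do not enter the calculation at all.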

I also need to run a regression model to figure out which factors are most at play in explaining the variation, but I'm not sure how to do that.

I know this is a massive ask, and I've ordered a few books to see if I can get somewhere with this, but I'm totally stuck and I really need to figure this out. Hoping someone can help.


KoG :)