Linear regression - when errors are correlated and from different F distributions


New Member
Hi all,

This question relates to research assessing the hearing of new born babies.

As mentioned in the title, I wish to fit the univariate model y = m*x + c + err, where each error term err_i comes from an F distribution, F(d1, d2_i), with d1 a known constant and d2_i a known linear function of i. I'll give details on how the model arises below if people are interested.

If I ignore the error correlation structure I think I can code an MLE estimator for m, but I'm not sure how to assess the confidence interval of m, or even what distribution the estimate of m will have (fewer than 300 observations, if that's relevant).

So my question: how would I go about testing the hypothesis m = 0 in the uncorrelated-error case (and, if possible, the correlated case)? I can simulate the system to produce empirical data if required. Answers or pointers to references would be much appreciated.
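For concreteness, here's how I'd simulate the system. This is a minimal sketch of the uncorrelated case (independent draws); all parameter values (m, c, d1, and the coefficients of the d2 line) are made up for illustration — in the real setup they come from the hearing test described below.

```python
import numpy as np
from scipy import stats

def simulate_system(n=300, m=0.5, c=1.0, d1=100, d2_offset=0.0,
                    d2_slope=100.0, seed=0):
    """Simulate y_i = m*x_i + c + err_i with err_i ~ F(d1, d2_i),
    where d2_i is a known linear function of i.
    Parameter values are illustrative, not from the real data.
    Errors are drawn independently (the uncorrelated case)."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, n + 1, dtype=float)
    d2 = d2_offset + d2_slope * i            # d2 grows linearly with i
    err = stats.f.rvs(d1, d2, random_state=rng)  # one F draw per i
    y = m * i + c + err
    return i, y

x, y = simulate_system()
```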

Thanks in advance xD.

Problem model (if you’re interested):

After presenting an auditory stimulus, the brain's activity is measured, giving a discrete time series x(t). I'll use x_i(t) for the time series from the ith stimulus presentation. This time series is assumed to be of the form x_i(t) = h(t) + N(0, sigma), where h(t) is the brain's response to the stimulus, and we wish to test whether var(h(t)) = 0 (i.e. whether there's any brain response).

So to do this, we use an F-test comparing the variance of the signal plus noise, var(X), with X = {x_1(t), x_2(t), ..., x_n(t)}, to the variance of the noise alone, var(noise), where noise = {x_1(t) - h(t), x_2(t) - h(t), ..., x_n(t) - h(t)}.

So var(noise) is calculated as expected by estimating h(t) as the mean of X, h_est(t) = E(X), so the jth sample of h_est, h_est[j] = E(x[j]) = E({x_1[j], x_2[j], ..., x_n[j]}).
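In code, this estimation step looks like the following (a sketch, assuming the epochs are stacked into an (n, T) array X with one row per stimulus presentation):

```python
import numpy as np

def noise_variance(X):
    """X: (n, T) array, row i = epoch x_i(t) sampled at T points.
    h_est[j] is the mean over the n epochs at sample j; the noise is
    each epoch minus h_est, with (n-1)*T degrees of freedom since one
    mean is removed at each of the T sample points."""
    n, T = X.shape
    h_est = X.mean(axis=0)                    # h_est[j] = E(x[j])
    resid = X - h_est                         # x_i(t) - h_est(t)
    var_noise = (resid ** 2).sum() / ((n - 1) * T)
    return h_est, var_noise
```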

Instead of the simple estimate of var(X), we use var(X) = n * var(h_est(t)), estimating the population variance from the variance of the sample means. This formula is correct under the null hypothesis that var(h(t)) = 0 (because all samples are then drawn from the same population), but it is an overestimate if var(h(t)) > 0 (since E(x[j]) != E(x[k]) when h[j] != h[k]).

This gives the f statistic:

Fstat = ( n * var(h_est(t)) ) / var(noise)

for the F distribution with d1 = the number of samples in x(t) and d2 = (n-1) * the number of samples in x(t).
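Putting the pieces together, the test is (a sketch following the formulas above; I've used the population variance of h_est over its T samples for var(h_est(t)), to match d1 = T):

```python
import numpy as np
from scipy import stats

def fsp_test(X):
    """X: (n, T) array of epochs. Returns Fstat and its upper-tail
    p-value under H0: var(h(t)) = 0, with d1 = T and d2 = (n-1)*T."""
    n, T = X.shape
    h_est = X.mean(axis=0)
    resid = X - h_est
    var_noise = (resid ** 2).sum() / ((n - 1) * T)
    fstat = n * h_est.var() / var_noise       # n * var of sample means
    p = stats.f.sf(fstat, T, (n - 1) * T)     # upper-tail p-value
    return fstat, p
```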

From the form of Fstat we see that if var(h(t)) > 0, the F statistic grows linearly in the number of stimuli presented (which also means it grows linearly in time), with slope m = var(h(t)) / var(noise), i.e. the SNR. In practice, during testing you can watch the F statistic increase over time, knowing it will eventually reach a value that satisfies the chosen confidence level.

The hope is that by treating the F statistic as a time series, Fstat(n), testing whether m = 0 will prove a more powerful test than the F-test on individual elements of the series (i.e. once you can say with some certainty that the F statistic is increasing with n, you don't have to wait for the F score itself to satisfy the confidence level).
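The naive version I had in mind is just an OLS t-test on the slope of the Fstat(n) series — which ignores both the error correlation and the F-distributed errors, and that is exactly the part I'm unsure about:

```python
import numpy as np
from scipy import stats

def naive_slope_test(fstat_series):
    """OLS fit of Fstat(n) = m*n + c; returns the slope estimate and
    the standard t-test p-value for m = 0. This assumes i.i.d.
    Gaussian errors, which the real series violates -- hence the
    question above."""
    n = np.arange(1, len(fstat_series) + 1)
    res = stats.linregress(n, fstat_series)
    return res.slope, res.pvalue
```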

There's a lot of repeated statistical testing going on in the present form (test Fstat[1], then Fstat[2], then ...), the argument being that for large enough n these F statistics are so correlated (the same test on almost entirely the same data) that corrections for repeated tests aren't needed. In practice the test is never concluded until a minimum number of stimulus presentations has been given.