# Thread: linear model estimation: how many regressors to choose?

1. ## linear model estimation: how many regressors to choose?

Hi,

Suppose I have a timecourse that I need to explain (it's an fMRI brain imaging example). In this timecourse I have two types of conditions, where one condition influences the timecourse twice as strongly as the other (let's say 1 and 0.5). In addition, I have some zero conditions, where nothing happens. I can model my timecourse in two different ways:
1) one regressor, where I put "1" at the points of condition_1 and "0.5" at the points of condition_2. All the rest is "0".
2) two regressors, where in regressor no. 1 I put "1" at the points of condition_1 and in regressor no. 2 I put "1" at the points of condition_2. All the rest is zero.
Should my design choice result in a different residual (unexplained) error? I have hundreds of data points, so adding one more regressor doesn't hurt the power.
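For reference, the two designs can be written out side by side. The notation here is an assumption for illustration, not something from the post: $c_{1,t}$ and $c_{2,t}$ are 0/1 indicators of condition_1 and condition_2 at time point $t$.

```latex
\begin{align}
\text{Design 1:}\quad y_t &= \beta_0 + \beta_1\,(c_{1,t} + 0.5\,c_{2,t}) + \varepsilon_t \\
\text{Design 2:}\quad y_t &= \gamma_0 + \gamma_1\,c_{1,t} + \gamma_2\,c_{2,t} + \varepsilon_t
\end{align}
```

Written this way, Design 1 is Design 2 with the constraint $\gamma_2 = 0.5\,\gamma_1$, so the least-squares residual sum of squares of Design 2 can never be larger than that of Design 1.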

Below please find a MATLAB simulation, where I get comparable results for both.

Thanks a lot,

function RegressionFitSimulation

%Here I define my parameters
N = 100;
N_cond1 = 40;
N_cond2 = 40;
Noise_Sigma = 0.3;

%Here I simulate my Y timecourse
perm_index_arr = randperm(N);

cond1_indexes = perm_index_arr(1:N_cond1);
cond2_indexes = perm_index_arr(N_cond1+1:N_cond1+N_cond2);
cond0_indexes = perm_index_arr(N_cond1+N_cond2+1:N);

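%cond*_indexes already hold positions in 1..N, so index Y with them directly
%(indexing through perm_index_arr a second time would misalign Y and the regressors)
Y = zeros(N,1);
Y(cond1_indexes) = 1;
Y(cond2_indexes) = 0.5;
Y(cond0_indexes) = 0;

Y = Y + normrnd(0,Noise_Sigma,N,1);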

%model with one regressor
x1(1:N) = 0;
x1(cond1_indexes) = 1;
x1(cond2_indexes) = 0.5;
X = [ones(N,1) x1'];
[b,bint,r] = regress(Y,X);
disp(['Beta0 (intercept): ' num2str(b(1)) ' Beta1: ' num2str(b(2)) ' Sum of squared residuals: ' num2str(sum(r.^2))]);

%model with two regressors
x1(1:N) = 0;
x2(1:N) = 0;
x1(cond1_indexes) = 1;
x2(cond2_indexes) = 1;
X = [ones(N,1) x1' x2'];

[b,bint,r] = regress(Y,X);
disp(['Beta0 (intercept): ' num2str(b(1)) ' Beta1: ' num2str(b(2)) ' Beta2: ' num2str(b(3))...
' Sum of squared residuals: ' num2str(sum(r.^2))]);

2. The way I see it, the choice of which to use depends on what kind of scale the conditions are on. Let's say that your conditions are days of stroke rehab therapy received. If:

Condition 0 = no rehab therapy received
Condition 1 = 10 days of rehab therapy received
Condition 2 = 20 days of rehab therapy received

Then it would totally make sense to use one regressor, with multiple levels. If on the other hand you had something like:

Condition 0 = no stroke rehab therapy received
Condition 1 = Conventional stroke rehab received
Condition 2 = Chinese medicine-style stroke rehab received

Then you would need two regressors - there's no way you could justify saying that Chinese medicine stroke rehab is 'twice as much' stroke rehab as conventional rehab. So the question is: do your conditions represent different levels of the same variable, or different variables entirely?
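To make the two codings concrete, here is a minimal MATLAB sketch. The label vector `cond` and the spacing `days_per_level` are made up for illustration; they are not from the original simulation.

```matlab
% Hypothetical condition labels for 6 observations (0, 1 or 2).
cond = [0; 1; 2; 1; 0; 2];
N = numel(cond);

% Option A: one scalar regressor, assuming the levels are evenly spaced
% amounts of the same thing (e.g. 0, 10 and 20 days of therapy).
days_per_level = 10;                 % assumed spacing, for illustration only
X_scalar = [ones(N,1), days_per_level * cond];

% Option B: one 0/1 indicator column per non-baseline condition,
% which makes no assumption about how the conditions are spaced.
X_dummy = [ones(N,1), double(cond == 1), double(cond == 2)];

disp(X_scalar);
disp(X_dummy);
```

Option A spends one slope on the conditions and forces the effect of level 2 to be twice that of level 1; Option B spends two coefficients and lets each condition have its own effect.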

As for the effects on the final analysis: having two regressors rather than one uses one more model degree of freedom (leaving one fewer for the residuals), which reduces statistical power somewhat, but it can never lower R² and will usually raise it (possibly just because chance effects are better captured by the more complex equation, possibly because the effect of condition level on the dependent variable is non-linear).
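The trade-off between fit and degrees of freedom can be looked at directly with a nested-model F test. The following is only a sketch: it reuses the sizes from the simulation above, solves least squares with backslash instead of `regress`, and computes the upper-tail F probability from `betainc` so that no toolbox function is needed. The variable names are mine.

```matlab
N = 100; n1 = 40; n2 = 40; sigma = 0.3;    % same sizes as the simulation above
idx = randperm(N);
c1 = zeros(N,1); c1(idx(1:n1)) = 1;        % indicator of condition 1
c2 = zeros(N,1); c2(idx(n1+1:n1+n2)) = 1;  % indicator of condition 2
Y  = 1*c1 + 0.5*c2 + sigma*randn(N,1);     % simulated timecourse

X1 = [ones(N,1), c1 + 0.5*c2];             % one weighted regressor
X2 = [ones(N,1), c1, c2];                  % two indicator regressors

% Least-squares fits via backslash (used here instead of regress)
rss1 = sum((Y - X1*(X1\Y)).^2);
rss2 = sum((Y - X2*(X2\Y)).^2);
tss  = sum((Y - mean(Y)).^2);
R2_1 = 1 - rss1/tss;
R2_2 = 1 - rss2/tss;

% F test for the one extra parameter in the larger model
df_extra = size(X2,2) - size(X1,2);        % extra parameters in model 2
df_resid = N - size(X2,2);                 % residual df of model 2
F = (max(rss1 - rss2, 0)/df_extra) / (rss2/df_resid);
% Upper-tail F probability via the regularized incomplete beta function
p = betainc(df_resid/(df_resid + df_extra*F), df_resid/2, df_extra/2);

fprintf('R2 one regressor: %.4f, two regressors: %.4f\n', R2_1, R2_2);
fprintf('F(%d,%d) = %.3f, p = %.3f\n', df_extra, df_resid, F, p);
```

Because the data are generated with effects of exactly 1 and 0.5, the extra regressor should only be picking up noise here, so the R² gain is expected to be small; the numbers change from run to run since no random seed is fixed.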

However, in your example for one regressor (10 vs. 20 days of rehab therapy), what we should actually care about is not whether one regressor value is twice the other, but whether the Y values for regressor=20 are twice the Y values for regressor=10. If that's not the case, then my fit will be worse than in the case with two regressors. Am I correct?

In general, it looks to me that unless I have to worry about degrees of freedom, I'm better off using two regressors. I don't see, then, what the advantage of using a single regressor would be.
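The question about the ratio of the Y values can be checked on noise-free data. This is a minimal sketch with hypothetical effect sizes (1 and 0.9 for the case where the 2:1 ratio does not hold), again using backslash rather than `regress`:

```matlab
N = 100; n1 = 40; n2 = 40;
c1 = zeros(N,1); c1(1:n1) = 1;          % condition 1 indicator
c2 = zeros(N,1); c2(n1+1:n1+n2) = 1;    % condition 2 indicator

X1 = [ones(N,1), c1 + 0.5*c2];          % single regressor coded 1 / 0.5
X2 = [ones(N,1), c1, c2];               % two indicator regressors
rss = @(X, Y) sum((Y - X*(X\Y)).^2);    % residual sum of squares of a fit

Y_match    = 1*c1 + 0.5*c2;   % responses in the assumed 2:1 ratio
Y_mismatch = 1*c1 + 0.9*c2;   % hypothetical responses, not in a 2:1 ratio

rss_match    = [rss(X1, Y_match),    rss(X2, Y_match)];
rss_mismatch = [rss(X1, Y_mismatch), rss(X2, Y_mismatch)];

fprintf('2:1 responses:     RSS one reg = %.4g, two regs = %.4g\n', rss_match);
fprintf('non-2:1 responses: RSS one reg = %.4g, two regs = %.4g\n', rss_mismatch);
```

When the responses really are in a 2:1 ratio, both designs fit the noise-free data essentially exactly. When they are not, only the two-regressor design does, and the single regressor leaves systematic residual error.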

> However, in your example for one regressor (10 vs. 20 days of rehab therapy), what we should actually care about is not whether one regressor value is twice the other, but whether the Y values for regressor=20 are twice the Y values for regressor=10. If that's not the case, then my fit will be worse than in the case with two regressors. Am I correct?

Hmm, I don't think that's quite right, though I guess I confused the issue a bit by bringing scaling issues into the discussion! It's the practical meaningfulness and equivalence of intervals between data points on variable x that determine whether you can consider variable x as measured at the interval level - what data points the values of x relate to on variable y isn't really key to answering this question.

Anyway, if you can say to yourself that the different conditions represent different levels of ONE variable, and you've measured that variable on an interval-level scale, go for one scalar regressor. On the other hand, if you reckon that the different conditions represent different levels of one variable, but you can't justify an interval-level measurement assumption, OR the different conditions are best considered as two entirely different variables, you can go for either two regressors or one nominally-specified regressor (these should be equivalent, I think).
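On the "these should be equivalent" point: a nominally-specified regressor is typically expanded into indicator columns before fitting, so the fitted values should not depend on which indicator coding is used. A small sketch with made-up data and two hand-built codings (this does not use any library's categorical-variable support):

```matlab
cond = repmat([0; 1; 2], 10, 1);        % hypothetical labels, 30 observations
N = numel(cond);
% Made-up response: condition effects plus a deterministic wiggle
Y = 0.2 + 1.0*(cond == 1) + 0.7*(cond == 2) + 0.1*sin((1:N)');

% Coding A: intercept plus indicators for conditions 1 and 2
XA = [ones(N,1), double(cond == 1), double(cond == 2)];
% Coding B: one indicator per level and no intercept (cell-means coding)
XB = [double(cond == 0), double(cond == 1), double(cond == 2)];

bA = XA \ Y;
bB = XB \ Y;
fitA = XA * bA;
fitB = XB * bB;
max_diff = max(abs(fitA - fitB));       % largest gap between the two fits

fprintf('Largest difference in fitted values: %.3g\n', max_diff);
```

The two codings span the same column space, so the fitted values and residuals agree up to rounding; only the interpretation of the coefficients differs (differences from baseline in coding A, per-condition means in coding B).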

Time for coffee for me, good luck with everything
