I am currently analyzing the effects that several independent variables have on a dependent variable (a political party's position on immigration, to be more specific). Because the cases (observations) here are political parties, their number is very limited (in fact, I have only 21 parties with a valid position on immigration). To make the OLS results more accurate, I decided to 'expand' the pool of cases (observations) by including parties from two different periods (2010 and 2014), which increased the total number of cases (observations) to 43.

After reviewing the analysis, my professor told me that this may be a problem, because the assumption of independence of cases (observations) could be violated. His argument is that some parties had their positions measured in both 2010 and 2014 (let's say party A had a position on immigration rated at 5 points on a 10-point scale in 2010 and 6 points in 2014), so those cases (observations) are not independent.

However, the Durbin-Watson test suggests that the assumption was not violated (the DW value is 1.975).
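For context, here is a minimal sketch of how the Durbin-Watson statistic mentioned above is computed from a list of residuals. The residual values are hypothetical and are not taken from my analysis. The sketch also shows that the statistic depends on the order of the rows, since it only compares each residual with the one directly before it.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared differences between
    successive residuals, divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, for illustration only (not the real data).
resid = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.2]

# The same residuals in a different row order give a different statistic.
print("original order:", durbin_watson(resid))
print("sorted order:  ", durbin_watson(sorted(resid)))
```

Because the value changes when the rows are reordered, a DW value close to 2 describes adjacent rows in the data file. It does not by itself describe the link between the 2010 and 2014 measurements of the same party, unless those rows happen to be adjacent.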

Could you please suggest a way to proceed? Should I do nothing and state that the assumption was not violated (based on the Durbin-Watson test)? Or should I drop the cases (observations) that were measured in both time periods (but will that not decrease the accuracy of OLS, since so few cases would be analyzed)?

Thank you very much for your help!

Best wishes to everyone!

(1) True/False: Models selected by automated variable selection techniques do not need to be validated since they are ‘optimal’ models.

(2) Compute the Akaike Information Criterion (AIC) value for the linear regression model

Y = b0 + b1*X1 + b2*X2 + b3*X3.

The regression model was fitted on a sample of 250 observations and yielded a likelihood value of 0.18.

(a) 9.49

(b) 11.43

(c) 25.52

(d) 15.55

(3) Compute the Bayesian Information Criterion (BIC) value for the linear regression model

Y = b0 + b1*X1 + b2*X2 + b3*X3.

The regression model was fitted on a sample of 250 observations and yielded a likelihood value of 0.18.

(a) 9.49

(b) 11.43

(c) 25.52

(d) 15.55

(4) True/False: Consider a categorical predictor variable that has three levels denoted by 1, 2, and 3. We can include this categorical predictor variable in a regression model using this specification, where X1 is a dummy variable for level 1, X2 is a dummy variable for level 2, and X3 is a dummy variable for level 3.

Y = b0 + b1*X1 + b2*X2 + b3*X3

True

False

(5) True/False: The model Y = b0 + exp(b1*X1) + e can be transformed to a linear model.

True

False

(6) True/False: A variable transformation can be used as a remedial measure for heteroscedasticity.

True

False

(7) When comparing models of different sizes (i.e. a different number of predictor variables), we can use which metrics?

a. R-Squared and Adjusted R-Squared

b. R-Squared and Mallows' Cp

c. AIC and R-Squared

d. AIC and BIC

(8) True/False: When using Mallows' Cp for model selection, we should choose the model with the largest Cp value.

True

False

(9) True/False: Consider the case where the response variable Y is constrained to the interval [0,1]. In this case one can fit a linear regression model to Y without any transformation to Y.

True

False

(10) True/False: Consider the case where the response variable Y takes only two values: 0 and 1. A linear regression model can be fit to this data.

True

False

1) False

2) b

3) c

4) T

5) T

6) F

7) D

8) F

9) F

10) F
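The answers to (2) and (3) can be checked numerically. This is a minimal sketch under two assumptions that the questions do not state explicitly: that the "likelihood value" of 0.18 is the maximized likelihood L itself (not its logarithm), and that the parameter count k is 4, that is, the coefficients b0, b1, b2 and b3 without the error variance. The function names are my own.

```python
import math


def aic(likelihood, k):
    """AIC = 2k - 2*ln(L)."""
    return 2 * k - 2 * math.log(likelihood)


def bic(likelihood, k, n):
    """BIC = k*ln(n) - 2*ln(L)."""
    return k * math.log(n) - 2 * math.log(likelihood)


L = 0.18   # likelihood value given in the questions
k = 4      # assumption: b0, b1, b2, b3 (error variance not counted)
n = 250    # sample size given in the questions

print(round(aic(L, k), 2))     # 11.43
print(round(bic(L, k, n), 2))  # 25.52
```

Under these assumptions the values match options (b) and (c). If the error variance were also counted as a parameter (k = 5), the results would not match any of the listed options.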

I am a researcher in engineering. I have some background in statistics and in mathematics in general, but I am not a professional like many of you here. I hope you can give me some insight into the problem I am facing.

I have a set of data scattered more or less around the y=x line, but the scatter does not start at the (0,0) point and it seems to increase with "x" (figure below). Up to a certain value of x, the data fit a straight line perfectly. The actual best-fit line is not y=x (let's say it is y=0.98x); however, in terms of the physical phenomenon represented by the data, it is meaningful to relate the statistics to the y=x line.

How would you, as a statistician, describe this distribution in a meaningful and "powerful" manner? For example, can I just calculate R^2 and say the data fit y=x with R^2=...? This does not seem to be enough to me.
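To make the question concrete, here is a minimal sketch of what "R^2 with respect to y=x" could mean. It measures residuals as y - x (the deviation from the identity line), not as deviations from a fitted line. The function name and the data are hypothetical. An R^2 defined this way is not the usual regression R^2, and it can be negative when y=x describes the data worse than the mean of y does.

```python
import math


def agreement_with_identity(x, y):
    """Summaries of how closely the points follow the line y = x.

    Returns (r2_identity, rmse, bias), where residuals are y - x.
    """
    n = len(x)
    resid = [yi - xi for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in resid)
    y_mean = sum(y) / n
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2_identity = 1 - ss_res / ss_tot   # 1 means all points lie on y = x
    rmse = math.sqrt(ss_res / n)        # typical distance from y = x
    bias = sum(resid) / n               # average signed deviation from y = x
    return r2_identity, rmse, bias


# Hypothetical data, for illustration only: the scatter grows with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 2.0, 3.0, 4.2, 4.7, 6.4]

r2, rmse, bias = agreement_with_identity(x, y)
print("R^2 about y=x:", r2)
print("RMSE about y=x:", rmse)
print("mean of (y - x):", bias)
```

A single number such as this R^2 does not capture the fact that the scatter increases with x, which is part of why it does not seem sufficient to me.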

Thank you in advance,

Best regards!