I am a researcher in engineering. I have some background in statistics and mathematics in general, but I am not a professional like many of you here. I hope you can give me some insight into the problem I am facing.

I have a set of data scattered more or less around the line y = x, but the scatter does not start at the point (0,0) and it seems to increase with x (figure below). Up to a certain value of x, the data fit a straight line perfectly. The actual best-fit line is not y = x (let's say it is y = 0.98x); however, given the physical phenomenon the data represent, it is meaningful to relate the statistics to the line y = x.

How would you, as a statistician, describe this distribution in a meaningful and "powerful" manner? For example, can I just calculate R^2 and say the data fit y = x with R^2 = ...? This does not seem to be enough to me.
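To make the question concrete, here is a minimal sketch of statistics measured against the fixed line y = x rather than against a fitted line: the mean deviation (bias), the RMSE about y = x, an R^2-like quantity that uses y = x as the model, and the RMSE within x-intervals to show scatter growing with x. The function names, the bin edges, and the eight data points are all hypothetical, chosen only for illustration; they are not my measurements.

```python
import math

def identity_line_stats(x, y):
    """Summarise agreement of y with the fixed line y = x (nothing is fitted)."""
    n = len(x)
    d = [yi - xi for xi, yi in zip(x, y)]          # deviations from y = x
    bias = sum(d) / n                              # mean deviation
    ss_res = sum(di * di for di in d)              # scatter about y = x
    rmse = math.sqrt(ss_res / n)                   # typical deviation size
    y_mean = sum(y) / n
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)   # scatter about the mean of y
    r2_identity = 1.0 - ss_res / ss_tot            # can be negative for a poor match
    return bias, rmse, r2_identity

def rmse_by_bin(x, y, edges):
    """RMSE about y = x within each interval [edges[i], edges[i+1])."""
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        d = [yi - xi for xi, yi in zip(x, y) if lo <= xi < hi]
        out.append(math.sqrt(sum(di * di for di in d) / len(d)) if d else float("nan"))
    return out

# Hypothetical data: exact agreement for small x, growing scatter afterwards.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 2.0, 3.0, 4.0, 5.2, 5.7, 7.5, 7.4]

bias, rmse, r2_identity = identity_line_stats(x, y)
binned = rmse_by_bin(x, y, [0, 4.5, 9])
print(bias, rmse, r2_identity, binned)
```

Reporting the binned RMSE next to the overall figure is one way to show that the scatter is not constant, which a single R^2 value would hide.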

Thank you in advance,

best regards!

I am currently doing a GIS and statistical project on whether there is an association between access to green space and deprivation at the neighbourhood level (LSOA, MSOA). I ran a linear regression and a geographically weighted regression at both scales and got no significant results, indicating no relationship, which was unexpected. I then computed Moran's I, which showed that deprivation was highly spatially autocorrelated (I = 0.67, p = 0.001). In my discussion, does it make sense to say that the spatial autocorrelation of the deprivation data may have hampered my ability to distinguish an independent association, which is why I didn't find one?

Basically I am just trying to find out statistically why I didn't get the expected results for my discussion. Any help would be greatly appreciated :)
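For readers unfamiliar with the statistic, global Moran's I can be written out directly from its definition. This is only a sketch: the areas are assumed to lie in a line with binary neighbour weights (`chain_weights` is a hypothetical helper), and the two value lists are made up. A real analysis would use the actual LSOA/MSOA contiguity weights and a permutation test for the p-value, neither of which is shown here.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]                 # deviations from the mean
    w_sum = sum(sum(row) for row in weights)         # total weight W
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n)) # weighted cross-products
    return (n / w_sum) * cross / sum(d * d for d in dev)

def chain_weights(n):
    """Binary weights for n areas in a line: 1 for adjacent areas, else 0.
    Hypothetical layout, used only to keep the example small."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

# Made-up values: similar neighbours versus alternating neighbours.
clustered = [1, 2, 3, 4, 5, 6]
alternating = [1, -1, 1, -1]

i_clustered = morans_i(clustered, chain_weights(len(clustered)))
i_alternating = morans_i(alternating, chain_weights(len(alternating)))
print(i_clustered, i_alternating)
```

A positive value means neighbouring areas tend to have similar values and a negative value means they tend to differ. The sketch only illustrates the statistic; it does not by itself show whether autocorrelation explains a non-significant regression.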

(1) True/False: Models selected by automated variable selection techniques do not need to be validated since they are ‘optimal’ models.

(2) Compute the Akaike Information Criterion (AIC) value for the linear regression model

Y = b0 + b1*X1 + b2*X2 + b3*X3.

The regression model was fitted on a sample of 250 observations and yielded a likelihood value of 0.18.

(a) 9.49

(b) 11.43

(c) 25.52

(d) 15.55

(3) Compute the Bayesian Information Criterion (BIC) value for the linear regression model

Y = b0 + b1*X1 + b2*X2 + b3*X3.

The regression model was fitted on a sample of 250 observations and yielded a likelihood value of 0.18.

(a) 9.49

(b) 11.43

(c) 25.52

(d) 15.55

(4) True/False: Consider a categorical predictor variable that has three levels denoted by 1, 2, and 3. We can include this categorical predictor variable in a regression model using this specification, where X1 is a dummy variable for level 1, X2 is a dummy variable for level 2, and X3 is a dummy variable for level 3.

Y = b0 + b1*X1 + b2*X2 + b3*X3

True

False

(5) True/False: The model Y = b0 + exp(b1*X1) + e can be transformed to a linear model.

True

False

(6) True/False: A variable transformation can be used as a remedial measure for heteroscedasticity.

True

False

(7) When comparing models of different sizes (i.e. a different number of predictor variables), we can use which metrics?

a. R-Squared and Adjusted R-Squared

b. R-Squared and Mallows' Cp

c. AIC and R-Squared

d. AIC and BIC

(8) True/False: When using Mallows' Cp for model selection, we should choose the model with the largest Cp value.

True

False

(9) True/False: Consider the case where the response variable Y is constrained to the interval [0,1]. In this case one can fit a linear regression model to Y without any transformation to Y.

True

False

(10) True/False: Consider the case where the response variable Y takes only two values: 0 and 1. A linear regression model can be fit to this data.

True

False

What I tried (my answers):

1) False

2) I tried the formula n*ln(RSS/n) + 2k, but it does not work

3) I tried -2*ln(likelihood) + k*ln(n), but it does not work

4) True

5) True

6) False

7) D

8) False - we need smallest

9) False - not sure why

10) FALSE - Logistic, not linear
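For questions 2 and 3, the likelihood-based definitions AIC = -2 ln(L) + 2k and BIC = -2 ln(L) + k ln(n) can be checked numerically. The sketch below rests on two assumptions that the question does not state explicitly: that 0.18 is the likelihood itself (not the log-likelihood), and that k = 4 counts only the coefficients b0 through b3, not the error variance. The RSS-based form n*ln(RSS/n) + 2k cannot be evaluated here because no RSS is given.

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion from a log-likelihood and parameter count."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian Information Criterion; n is the number of observations."""
    return -2.0 * log_lik + k * math.log(n)

likelihood = 0.18   # value given in the question, assumed to be L, not ln(L)
n = 250             # sample size given in the question
k = 4               # assumption: b0, b1, b2, b3 (error variance not counted)

log_lik = math.log(likelihood)
aic_value = aic(log_lik, k)
bic_value = bic(log_lik, k, n)

print(round(aic_value, 2))  # 11.43
print(round(bic_value, 2))  # 25.52
```

Under these assumptions the values match options (b) and (c) respectively. Counting the error variance as a fifth parameter gives numbers that are not among the listed options.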