PDF was too large so here's the plot in word.
P.S: Forgot to complete the thread title - any way of altering that now?
So I have this data set, and when looking for a relationship between two variables I created the attached plot.
Then I started to fit a model to the data. I was surprised that when I fitted a simple linear model the regression coefficients were significant:
Code:
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     5.1904     0.1347   38.53   <2e-16 ***
nucsomeI$med   -2.1237     0.1090  -19.48   <2e-16 ***
Just looking at the plot satisfies me that there is no linear relationship between the variables - so how do I get such significant regression coefficients? Is it related to the fact there are > 2,500,000 data points?
Eye-balling the plot I thought an exponential model might be a better choice, so I tried:
Code:
Formula: nucsomeI.insert_count ~ exp(a + b * nucsomeI.med)
Parameters:
  Estimate Std. Error t value Pr(>|t|)
a  2.19786    0.02950   74.50   <2e-16 ***
b -1.89507    0.09435  -20.08   <2e-16 ***
Then how to interpret these significance levels - if everything I try is 'significant' surely I can no longer trust this measure to fit models.
That said, I'm probably doing something stupid, so any help is much appreciated.
"Is it related to the fact there are > 2,500,000 data points?"
Yes.
"Then how to interpret these significance levels - if everything I try is 'significant' surely I can no longer trust this measure to fit models."
AFAIK, p-values are not indicators of model fit. They are about the probability of data at least as extreme as those observed, given the null hypothesis.
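To illustrate the sample-size point, here is a minimal sketch in Python (standard library only). Everything in it is an assumption made for illustration: the data are simulated, not the poster's, the true slope of 0.1 and unit-variance noise are arbitrary, and n is kept well below 2,500,000 so it runs quickly. It fits the same weak linear trend at two sample sizes and reports the slope, its t statistic and R^2.

```python
import math
import random


def ols_summary(x, y):
    """Return slope, its t statistic, and R^2 for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    rss = syy - slope * sxy              # residual sum of squares
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return slope, slope / se, 1 - rss / syy


def simulate(n, true_slope=0.1, seed=1):
    """Simulated (hypothetical) data: a weak trend buried in unit-variance noise."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, 1) for xi in x]
    return ols_summary(x, y)


results = {n: simulate(n) for n in (1_000, 200_000)}
for n, (slope, t, r2) in results.items():
    print(f"n={n:>7}  slope={slope:+.3f}  t={t:+.2f}  R^2={r2:.5f}")
```

The standard error shrinks roughly like 1/sqrt(n), so the t statistic for the same underlying slope grows with n, while R^2 stays tiny. A very small p-value with millions of points is therefore compatible with a relationship that explains almost none of the variation.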
With kind regards
K.
So you have count data for the dependent variable that approximates a normal distribution?
If you wonder whether the model is just "over-powered", ask yourself if the interpretation of the coefficient is meaningful given the study context.
Can you tell us more about how these variables are formatted?
Cheers guys.
The dependent variable is count data. I am not sure how it is distributed - about 93% of the data has a count of zero, but then the maximum count is close to 100,000. I've been looking at zero-inflated Poisson/negative binomial models, but I'm not really sure.
The independent variable is a continuous variable that possibly influences the count data. This is the most important question I wish to answer, but then I also thought it might be good to try to describe the relationship between the two variables via a GLM.
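A quick way to see whether a plain Poisson model is even plausible for counts like these is to compare the observed share of zeros with the share a Poisson distribution with the same mean would produce, and to look at the variance-to-mean ratio (which is 1 for a Poisson). The sketch below is Python, standard library only; the counts are simulated under assumed settings (93% structural zeros, a log-normal tail for the rest), so the numbers are illustrative and say nothing about the real data.

```python
import math
import random
import statistics


def poisson_diagnostics(counts):
    """Compare simple summaries of the counts with what a Poisson would imply."""
    mean = statistics.fmean(counts)
    variance = statistics.pvariance(counts, mu=mean)
    observed_zero = sum(1 for c in counts if c == 0) / len(counts)
    return {
        "mean": mean,
        "dispersion": variance / mean,            # equals 1 under a Poisson
        "observed_zero_fraction": observed_zero,
        "poisson_zero_fraction": math.exp(-mean),  # P(Y = 0) for Poisson(mean)
    }


# Hypothetical counts: about 93% zeros, heavy-tailed otherwise (assumed, not real data).
rng = random.Random(42)
counts = [
    0 if rng.random() < 0.93 else 1 + int(rng.lognormvariate(2.0, 1.5))
    for _ in range(100_000)
]

diag = poisson_diagnostics(counts)
for name, value in diag.items():
    print(f"{name:>24}: {value:.3f}")
```

If the observed zero fraction is far above the Poisson value and the dispersion is far above 1, that points towards the zero-inflated or negative binomial family mentioned above rather than an ordinary linear or Poisson model.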
I think you might have the wrong idea about what statistical significance means. (See my FAQ post here).
A small p value simply means that, if the true value of the parameter in the population was exactly zero, then it'd be unlikely that you'd observe a test statistic as large or larger than the one observed in your sample. It does not mean that the relationship you've observed is large or important or "significant" in a common-language sense.
So in the first case, when we fit a linear model to the data, the p-value gives the chance of seeing those regression coefficients assuming the null hypothesis that the slope is 0? (what is the null hypothesis for the intercept - that it goes through the origin?).
Then when fitting the exponential model similar tests of significance are performed (the null being the parameters are equal to zero?).
How then to select the 'best' model? R only returned an AIC for the linear model. Looking at the plot I'd have thought an exponential curve fits better. There are presumably loads of models I could try fitting but never get round to, so I would never know if I had missed a 'better' model.
The null hypothesis for a particular parameter (e.g., a specific slope) is that the true value of that parameter in the population is actually zero. E.g., if we got the full population of data, and fit the model, the specific slope parameter we're looking at would be exactly zero. The p value is the probability of observing a coefficient (estimate) of that parameter at least as far from zero as the one we've seen, if the null hypothesis were true.
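As a worked check of that definition against the output posted earlier in the thread: each "t value" is simply the estimate divided by its standard error, i.e. how many standard errors the estimate lies from the null value of zero. The Python sketch below (standard library only) recomputes them from the printed numbers. The two-sided p-value uses a normal approximation to the t distribution, which is an assumption on my part but a harmless one with more than 2,500,000 observations.

```python
import math

# (estimate, standard error, reported t value), copied from the R output in this thread
reported = {
    "linear (Intercept)": (5.1904, 0.1347, 38.53),
    "linear nucsomeI$med": (-2.1237, 0.1090, -19.48),
    "exponential a": (2.19786, 0.02950, 74.50),
    "exponential b": (-1.89507, 0.09435, -20.08),
}


def wald_test(estimate, std_error):
    """t statistic for H0: parameter = 0, with a two-sided normal-tail p-value."""
    t = estimate / std_error
    p = math.erfc(abs(t) / math.sqrt(2))  # equals 2 * (1 - Phi(|t|))
    return t, p


recomputed = {name: wald_test(est, se) for name, (est, se, _) in reported.items()}
for name, (t, p) in recomputed.items():
    print(f"{name:>20}: t = {t:8.2f} (reported {reported[name][2]:6.2f}), p = {p:.3g}")
```

The recomputed t values agree with the printed ones up to rounding of the displayed estimates and standard errors, and every p-value is far below the 2e-16 floor that R prints. Note that for the intercept the same recipe applies: the null value being tested is zero.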
"(what is the null hypothesis for the intercept - that it goes through the origin?)."
Yes.
"Then when fitting the exponential model similar tests of significance are performed (the null being the parameters are equal to zero?)."
Each null is generally for a specific parameter (aside from omnibus tests and so on).
"How then to select the 'best' model?"
Definitely not via significance testing! Too big a topic for a casual reply, but try a search for regression model selection.
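One commonly used alternative to significance testing is to compare candidate models fitted to the same response with an information criterion such as AIC, which was mentioned earlier in the thread. The sketch below is only an illustration under stated assumptions: it is Python (standard library only), the data are simulated from an exponential mean with Gaussian noise, the exponential curve is fitted by a crude grid search rather than by R's nls, and a Gaussian likelihood is assumed throughout. For real zero-heavy counts a Gaussian AIC is unlikely to be appropriate, so treat this as showing the mechanics, not a recommendation.

```python
import math
import random


def fit_rss(x, y, b_grid):
    """Residual sums of squares for three candidate mean functions."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    rss = {"intercept only": syy, "linear": syy - sxy ** 2 / sxx}
    # Exponential mean c * exp(b * x), with c = exp(a): for each b on the grid
    # the least-squares c has a closed form, so keep the b with the smallest RSS.
    best = math.inf
    for b in b_grid:
        e = [math.exp(b * xi) for xi in x]
        c = sum(yi * ei for yi, ei in zip(y, e)) / sum(ei * ei for ei in e)
        best = min(best, sum((yi - c * ei) ** 2 for yi, ei in zip(y, e)))
    rss["exponential"] = best
    return rss


def gaussian_aic(rss, n, n_coef):
    """Gaussian AIC up to a constant shared by all models fitted to the same y.
    The +1 counts the estimated error variance as a parameter."""
    return n * math.log(rss / n) + 2 * (n_coef + 1)


# Hypothetical data generated from exp(2 - 2x) plus noise (assumed, not the real data).
rng = random.Random(0)
x = [rng.uniform(0, 2) for _ in range(1_000)]
y = [math.exp(2 - 2 * xi) + rng.gauss(0, 0.5) for xi in x]

rss = fit_rss(x, y, b_grid=[-4 + 0.02 * i for i in range(201)])
n_coef = {"intercept only": 1, "linear": 2, "exponential": 2}
aic = {name: gaussian_aic(rss[name], len(x), n_coef[name]) for name in rss}
for name in sorted(aic, key=aic.get):
    print(f"{name:>15}: AIC = {aic[name]:.1f}")
```

Lower AIC is better. The linear and exponential models here have the same number of parameters, so between those two the comparison reduces to which has the smaller residual sum of squares; the penalty term only matters when candidates differ in complexity. AIC also only ranks the models you actually fitted, so it does not resolve the worry about better models that were never tried.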
CowboyBear (04-28-2016)