Significant

#1
So I have this data set, and when looking for a relationship between two variables I created the attached plot.

Then I started to fit a model to the data. I was surprised that when I fitted a simple linear model, the regression coefficients were significant:

Code:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.1904     0.1347   38.53   <2e-16 ***
nucsomeI$med  -2.1237     0.1090  -19.48   <2e-16 ***
Just looking at the plot satisfies me that there is no linear relationship between the variables, so how do I get such significant regression coefficients? Is it related to the fact there are > 2,500,000 data points?


Eyeballing the plot, I thought an exponential model might be a better choice, so I tried:

Code:
Formula: nucsomeI.insert_count ~ exp(a + b * nucsomeI.med)

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
a  2.19786    0.02950   74.50   <2e-16 ***
b -1.89507    0.09435  -20.08   <2e-16 ***
Then how to interpret these significance levels? If everything I try is 'significant', surely I can no longer trust this measure to fit models.

That said, I'm probably doing something stupid, so any help is much appreciated.
 

Karabiner

TS Contributor
#3
Is it related to the fact there are > 2,500,000 data points?
Yes.
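
To see why the sample size matters so much, here is a minimal sketch in Python (standard library only). It uses the textbook identity for a one-predictor regression, t = r * sqrt(n - 2) / sqrt(1 - r^2), where r is the sample correlation. The value r = 0.01 is a hypothetical weak correlation chosen for illustration, not something estimated from the poster's data.

```python
import math

def t_for_correlation(r, n):
    """t statistic of the slope in a simple linear regression
    with sample correlation r and n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r = 0.01  # hypothetical, very weak correlation (assumption for illustration)
for n in (100, 10_000, 2_500_000):
    print(f"n = {n:>9,}  t = {t_for_correlation(r, n):6.2f}")
```

The correlation is identical in every row; only n changes. The t value grows roughly like sqrt(n), so a relationship that is invisible in a plot can still sit far beyond the usual 1.96 cutoff once n is in the millions.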

Then how to interpret these significance levels? If everything I try is 'significant', surely I can no longer trust this measure to fit models.
AFAIK, p-values are not indicators of model fit.
They are about the probability of data at least as extreme as those observed, given the null hypothesis.

With kind regards

K.
 

hlsmith

Omega Contributor
#4
So you have count data for the dependent variable that approximates a normal distribution?


If you wonder whether the model is just "over-powered", ask yourself whether the interpretation of the coefficient is meaningful given the study context.
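
One rough way to check for an "over-powered" result is to convert the reported t value into the share of variance explained. The sketch below (Python, standard library only) uses the identity R^2 = t^2 / (t^2 + n - 2), which holds for a linear regression with a single predictor. The t value is taken from the output in #1; n = 2,500,000 is an assumption, since the post only says "> 2,500,000".

```python
def r_squared_from_t(t_value, n):
    """R^2 of a one-predictor linear regression, recovered from the slope's t value."""
    return t_value ** 2 / (t_value ** 2 + n - 2)

t_slope = -19.48        # t value of the slope, from the output posted in #1
n_assumed = 2_500_000   # ASSUMPTION: the post only states "> 2,500,000"

r2 = r_squared_from_t(t_slope, n_assumed)
print(f"implied R^2 = {r2:.6f}")
```

Under that assumption the implied R^2 is roughly 0.00015, and a larger n would make it smaller still. So the linear fit explains almost none of the variation despite the tiny p-value, which is consistent with the plot showing no visible linear trend.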


Can you tell us more about how these variables are formatted?
 
#5
Cheers guys.

The dependent variable is count data. I am not sure how it is distributed: about 93% of the data has a count of zero, but the maximum count is close to 100,000. I've been looking at zero-inflated Poisson/negative binomial models, but I'm not really sure.

The independent variable is a continuous variable that possibly influences the count data. Whether it does is the most important question I wish to answer, but I also thought it might be good to try to describe the relationship between the two variables via a GLM.
 

CowboyBear

Super Moderator
#6
I think you might have the wrong idea about what statistical significance means. (See my FAQ post here).

A small p value simply means that, if the true value of the parameter in the population were exactly zero, then it'd be unlikely that you'd observe a test statistic as large as or larger than the one observed in your sample. It does not mean that the relationship you've observed is large or important or "significant" in a common-language sense.
 
#7
So the null hypothesis in the first case is if we fit a linear model to the data then the chance of seeing those regression coefficients assuming the null hypothesis of the slope being 0 is given by the p-value? (what is the null hypothesis for the intercept - that it goes through the origin?).

Then when fitting the exponential model similar tests of significance are performed (the null being the parameters are equal to zero?).

How then to select the 'best' model? R only returned an AIC for the linear model. Looking at the plot I'd have thought an exponential curve fits better. There are presumably loads of models I could try fitting but never get round to, so I wouldn't ever know if I missed a 'better' model.
 

CowboyBear

Super Moderator
#8
So the null hypothesis in the first case is if we fit a linear model to the data then the chance of seeing those regression coefficients assuming the null hypothesis of the slope being 0 is given by the p-value?
The null hypothesis for a particular parameter (e.g., a specific slope) is that the true value of that parameter in the population is actually zero. E.g., if we got the full population of data and fit the model, the specific slope parameter we're looking at would be exactly zero. The p value is the probability of observing an estimate of that parameter at least as far from zero as the one we've seen, if the null hypothesis were true.
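
As a rough illustration of that definition, the sketch below (Python, standard library only) recomputes the slope's t value from the estimate and standard error posted in #1 and turns it into a two-sided p value. It assumes the normal approximation to the t distribution, which should be very close with millions of residual degrees of freedom; the function name is made up for this example.

```python
import math

def two_sided_p_from_t(t_value):
    """Two-sided p value for a t statistic, using the normal approximation
    (ASSUMPTION: reasonable only when the degrees of freedom are large)."""
    return math.erfc(abs(t_value) / math.sqrt(2))

estimate = -2.1237   # slope estimate from the output in #1
std_error = 0.1090   # its standard error, same output

t_slope = estimate / std_error
p_slope = two_sided_p_from_t(t_slope)

print(f"t = {t_slope:.2f}")
print(f"p below 2e-16: {p_slope < 2e-16}")
```

The t value is just the estimate divided by its standard error, and the p value is the two-tailed area beyond it under the null. Nothing in this calculation measures how well the line describes the data.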

(what is the null hypothesis for the intercept - that it goes through the origin?).
Yes

Then when fitting the exponential model similar tests of significance are performed (the null being the parameters are equal to zero?).
Each null is generally for a specific parameter (aside from omnibus tests and so on).

How then to select the 'best' model?
Definitely not via significance testing! Too big a topic for a casual reply, but try a search for regression model selection.
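
Since AIC came up in #7, here is a minimal sketch of the mechanics in Python (standard library only): fit two candidate models to the same response, compute AIC = 2k - 2*logLik for each, and prefer the lower value. The data are toy numbers invented for illustration, not the poster's data, and both models assume normally distributed errors, which is probably a poor description of counts that are 93% zeros.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def gaussian_aic(residuals, n_coefficients):
    """AIC = 2k - 2*logLik for a least-squares fit with normal errors."""
    n = len(residuals)
    rss = sum(e * e for e in residuals)
    k = n_coefficients + 1  # +1 because the error variance is estimated too
    log_lik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    return 2 * k - 2 * log_lik

# Toy data, invented purely for illustration (NOT the poster's data).
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [5.0, 4.1, 2.9, 2.2, 0.8, 0.1, -1.2, -1.9]

mean_y = sum(y) / len(y)
intercept, slope = fit_line(x, y)

# Candidate 1: intercept only. Candidate 2: intercept plus slope.
aic_mean_only = gaussian_aic([yi - mean_y for yi in y], 1)
aic_line = gaussian_aic(
    [yi - (intercept + slope * xi) for xi, yi in zip(x, y)], 2
)
print(f"AIC, mean only: {aic_mean_only:.1f}")
print(f"AIC, line:      {aic_line:.1f}")
```

AIC values are only comparable between models fitted to the same response on the same data, and they rank the candidates you actually tried. They cannot tell you whether an untried model would have done better.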