Dropping variables from a model

noetsi

No cake for spunky
#1
I recently read a well-regarded regression book where they strongly argued that one should not drop variables from a model because they were statistically insignificant (or really ever, if there is theory behind them).

But I see this all the time.

"Considering that correlational analyses indicated high correlations between two pairs of variables: the MPAI Emotion scale and PCL-22, as well as the Cognistat Memory scale and CIQ Productivity factor, the first run of logistic regression analyses included seven variables and two interactions as predictors of employment status. Given that interactions did not significantly predict employment, logistic regression was rerun without the interactions"

In this case they are interactions, but they are still dropping variables. My point is: if you should not drop variables, why is the practice so common in the literature?
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
@Dason that post deserves double likes but that just results in no likes. So single like it is.

@noetsi most people running analyses probably only had a couple of stats courses, or never fully committed to best practices.

Just imagine there are lots of scrappy results and practices out there. How is the new job going?
 

noetsi

No cake for spunky
#4
I did not get the new job; it still hurts. It was a dream job, but someone else got it. I really thought I would get it.

I understand both points, hlsmith and Dason. But this is extremely common, and it's in peer-reviewed journals (a lot).

It is not clear to me who is right. But I suspect that this is the norm not the exception in journals.

I guess the logic is that few professors are aware of the view I referenced, that you should not leave variables out of the model. Which is disconcerting, because I read academic journals to find out the right way to do statistics (that is where I ran across this). If you can't rely on academic journals, where does that leave you?

Are there journals where I can reasonably assume the writers are experts in statistics? :p I guess statistical journals, but I generally don't have access to them, and in any case they are not writing on topics that concern me.
 

hlsmith

Less is more. Stay pure. Stay poor.
#5
Yeah - that is the distinction - you are looking at applied journals, not methodological ones. It is typically not appropriate to add a term and then remove it without penalizing yourself or using a holdout set. People perform less-than-ideal stats constantly - you just have to accept that. Published research can be broken, and it is self-policing.
 

noetsi

No cake for spunky
#6
Obviously the self-policing does not work.

It would be easier if I had any theory to build on :) I don't mean methods; I mean VR.

I am giving a lot of thought to just abandoning stats and trying to become a SQL coder. I have doubts that I will ever understand statistics well enough to use it, and I have doubts about whether you should run stats at all when you are not aware of issues like this. You can give wrong answers on something that is not just theory but has real impact on people's lives. About 12 years invested in stats, but at a certain point you have to cut your losses.
 

hlsmith

Less is more. Stay pure. Stay poor.
#7
VR = ?

The theory would be: only put things in your model that are associated with, and occur before, the outcome. Playing around with model building can result in generalizability issues due to finding chance associations. In my class I show that if you have 20 predictors, you should expect about 1 to be a false positive at the 0.05 level. I then look at all 200+ comparisons between the variables, cherry-pick predictors of the DV, and can explain ~20% of its variance via R^2 - even though, in the sample, all of the variables are independent. Thus I can show that it is very easy to p-hack your way to a publishable model.
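The arithmetic in that classroom demonstration is easy to replicate. Below is a toy sketch of my own (not hlsmith's actual class exercise): 20 pure-noise predictors are screened against a pure-noise outcome, any that happen to hit p < 0.05 are kept, and a model fit on only the cherry-picked ones can report a nonzero in-sample R^2 even though the true R^2 is exactly zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 20

# Every predictor and the outcome are independent noise: the true R^2 is zero.
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# Screen each predictor against y, as an "organic" model builder might.
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(k)])
picked = np.where(pvals < 0.05)[0]

# Refit using only the cherry-picked predictors and report in-sample R^2.
Xp = np.column_stack([np.ones(n), X[:, picked]])
beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
resid = y - Xp @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(f"{len(picked)} of {k} noise predictors 'significant'; in-sample R^2 = {r2:.3f}")
```

With more predictors screened, or with interactions added to the pool, the inflated R^2 grows accordingly - which is the whole danger of selecting terms on the same data you report.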

This is why if you have lots of data, it is good to have a validation and holdout set if you are going to do organic model building.
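A minimal sketch of that three-way workflow, under assumed toy data (my own illustration, not a prescribed procedure): candidate models are fit on the training rows, compared on the validation rows, and the winner's error is reported once on the untouched holdout rows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.standard_normal((n, 5))
y = 2.0 * X[:, 0] + rng.standard_normal(n)  # only the first predictor matters

# 60/20/20 split: build on train, choose on validation, report on holdout.
idx = rng.permutation(n)
train, valid, hold = idx[:180], idx[180:240], idx[240:]

def fit(cols, rows):
    A = np.column_stack([np.ones(len(rows)), X[np.ix_(rows, cols)]])
    beta, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
    return beta

def mse(cols, beta, rows):
    A = np.column_stack([np.ones(len(rows)), X[np.ix_(rows, cols)]])
    return float(np.mean((y[rows] - A @ beta) ** 2))

# Two candidate specifications; the validation set arbitrates between them.
candidates = [[0], [0, 1, 2, 3, 4]]
scores = [mse(c, fit(c, train), valid) for c in candidates]
best = candidates[int(np.argmin(scores))]
print("chosen predictors:", best, "| holdout MSE:", mse(best, fit(best, train), hold))
```

The point is that however much cherry-picking happens on the train/validation side, the holdout error is an honest estimate because those rows never influenced any modeling decision.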
 

noetsi

No cake for spunky
#8
VR is vocational rehabilitation. It is the field I work in. They do very little regression - or methods, period. There is essentially no theory in the area I work in; it is a dying academic field, and it was never empirical even at its height.

My crisis of confidence is that I realize I am never going to learn all the complex aspects like this, and sometimes, even when I do know of them, I do not know how to apply the fixes, which require methods or software I don't know. So my basic regression will give wrong answers - how wrong, I don't know.

I am not sure you should run regression this way if you are not an expert. I know it is done all the time, but I don't want to provide results that are wrong, and I no longer have any illusion that I will ever be an expert. Too many times I have read an article saying that the way of doing things I had read about many times, including in regression books, generates totally wrong results.
 

hlsmith

Less is more. Stay pure. Stay poor.
#9
All models are wrong, but some are useful! Don't sweat it. You know a bunch - you just second-guess yourself when you hear conflicting information.