Could someone help me out with this (for you guys) simple question?

jellekol

New Member
Hey guys, I'm currently working on my internship research and got stuck on something I hope you could help me with. I have a database with 150+ variables and I want to find the most important determinants of food security (the dependent variable) in Benin using a regression analysis. Before I can do this I have to select which variables to use in the analysis, so here is my question: how do I select my independent variables properly? Hope you guys can help me with this!

Karabiner

TS Contributor
How large is your sample? What exactly do you mean by the term food security? What do you mean by determinants, i.e. what is your objective, what do you want to achieve (find single predictors? how many of them? or find a good prediction model? or something else?)? What are these 150+ variables, and why are there so many? On which scale were the 150+ variables measured (for example, all on an interval scale)? Is there any pre-existing knowledge about which variables are particularly important for determining food security?

With kind regards

Karabiner

ondansetron

TS Contributor
Also, what database are you using? This might be helpful for understanding the overall structure of your problem.

jellekol

New Member

The dataset is from a demographic health survey and was gathered by interviewing 5625 household members. The purpose of the research is to identify the driving factors behind food security in Benin. Since there is no variable that directly measures food security, I found research supporting the use of stunting (deviation from BMI) as the dependent variable, as it was used in similar studies as well. Examples of variables in the dataset are: access to clean drinking water, owns cattle, hectares of farmland, income, etc. (all kinds of measurement scales). What I am aiming for is to use a small portion of these variables in the final regression analysis, say about 10. I have found similar studies in Sub-Saharan African countries that have found important variables regarding food security. I guess my question is: can I select the variables for my regression simply by presenting the literature, or is there some sort of test or analysis I have to do that determines which variables to include? And if I can select variables on the basis of the literature, is it okay to just dismiss the rest, or do I have to do something with them?

sorry to trouble you btw

Karabiner

TS Contributor
This is an extremely difficult topic, and there are no optimal solutions. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5969114/ With 150+ candidate variables the number of possible combinations of predictors and of possible models is extremely large.

Since the variables are heterogeneous with regard to measurement and content, reducing the predictor variables to a smaller number of dimensions via factor analysis does not seem feasible here.

There are algorithms for linear regression which could be used for variable selection (forward selection, backward elimination, all-subsets regression), but the results are often not considered stable. At the very least you should build the model with one part of the sample and then use the other part of the sample to check whether the results remain stable. The LASSO approach could be an alternative option https://en.wikipedia.org/wiki/Lasso_(statistics) .
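To make the split-sample idea concrete, here is a minimal sketch in plain Python, not a recommendation for production use. Everything in it is an assumption for illustration: the data are synthetic (20 candidate predictors, only the first two truly related to the outcome), the penalty `lam=0.3` is picked by hand, and the LASSO is a small hand-written coordinate descent. For a real analysis an established implementation (for example glmnet in R or scikit-learn in Python) with a cross-validated penalty would be the safer choice.

```python
import random

def column_stats(cols):
    """Mean and standard deviation of each predictor column."""
    stats = []
    for col in cols:
        mean = sum(col) / len(col)
        sd = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
        stats.append((mean, sd if sd > 0 else 1.0))
    return stats

def scale(cols, stats):
    """Standardize columns using the supplied (training) means and sds."""
    return [[(v - m) / s for v in col] for col, (m, s) in zip(cols, stats)]

def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_fit(cols, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*RSS + lam*sum|b_j| (y already centered)."""
    n, p = len(y), len(cols)
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_iter):
        for j, xj in enumerate(cols):
            norm = sum(v * v for v in xj) / n
            if norm == 0.0:
                continue
            rho = sum(v * r for v, r in zip(xj, resid)) / n + norm * beta[j]
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * v for r, v in zip(resid, xj)]
                beta[j] = new
    return beta

def r_squared(cols, y, beta, intercept):
    pred = [intercept + sum(b * col[i] for b, col in zip(beta, cols))
            for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Synthetic data (assumption): only predictors 0 and 1 affect the outcome.
random.seed(0)
n, p = 400, 20
X_cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * X_cols[0][i] - 1.5 * X_cols[1][i] + random.gauss(0, 1)
     for i in range(n)]

# Build the model on one half, check it on the other half.
half = n // 2
train_cols = [c[:half] for c in X_cols]
test_cols = [c[half:] for c in X_cols]
y_train, y_test = y[:half], y[half:]

stats = column_stats(train_cols)          # scaling learned on training half only
Z_train, Z_test = scale(train_cols, stats), scale(test_cols, stats)
intercept = sum(y_train) / half
beta = lasso_fit(Z_train, [v - intercept for v in y_train], lam=0.3)

selected = [j for j, b in enumerate(beta) if b != 0.0]
holdout_r2 = r_squared(Z_test, y_test, beta, intercept)
print("selected predictors:", selected)
print("holdout R^2:", round(holdout_r2, 3))
```

If the selected set or the holdout R² changes a lot when the split (the seed) or the penalty is varied, that is the instability mentioned above.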

I have found similar studies in Sub-Saharan African countries that have found important variables regarding food security. I guess my question is: can I select the variables for my regression simply by presenting the literature, or is there some sort of test or analysis I have to do that determines which variables to include? And if I can select variables on the basis of the literature, is it okay to just dismiss the rest, or do I have to do something with them?
Any kind of preselection based on former studies and on theoretical considerations would be preferable to "technical" or mechanistic variable selection, in my opinion. For example, you could use a dozen or so predictors based on such considerations, build a model, and have a look at whether R² is substantial and what the coefficients look like.
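As a rough illustration of that workflow, the sketch below fits an ordinary least squares model to a handful of preselected predictors and prints the coefficients and R². It is only a sketch under stated assumptions: the data are simulated, the predictor names are borrowed from the examples given earlier in the thread, `outcome` is a made-up stand-in for the dependent variable, and the "true" effects in the simulation are invented. With real survey data a statistics package that also reports standard errors would be more useful.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Least squares with an intercept, via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    coef = solve(xtx, xty)
    pred = [sum(c * v for c, v in zip(coef, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1 - ss_res / ss_tot

# Simulated households (assumption): names follow the examples in the thread.
random.seed(1)
names = ["clean_water", "owns_cattle", "farmland_ha", "income"]
rows, outcome = [], []
for _ in range(500):
    water = random.choice([0, 1])
    cattle = random.choice([0, 1])
    land = random.uniform(0, 5)
    income = random.gauss(10, 3)
    rows.append([water, cattle, land, income])
    # Invented effects; owns_cattle deliberately has none.
    outcome.append(-2.0 + 0.5 * water + 0.2 * land + 0.05 * income
                   + random.gauss(0, 1))

coef, r2 = ols(rows, outcome)
for name, c in zip(["intercept"] + names, coef):
    print(f"{name:12s} {c: .3f}")
print("R^2:", round(r2, 3))
```

The point is only the workflow: choose predictors from the literature first, then inspect R² and the sign and size of each coefficient against what earlier studies reported.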

Just my 2pence

Karabiner