Combining variables prior to performing regression analysis


New Member
Hello to all. I am running a regression analysis to evaluate whether the joint analysis of costs and benefits can explain medicinal plant selection better than benefits alone. My 'benefit' variables are 'perceived efficiency' and 'perceived taste' and my 'cost' variable is 'difficulty of acquisition'. My response variable is 'use'. I chose a health problem (constipation) and asked people in a local community to indicate plants known to treat it. Then I asked them to rank the plants from (a) the most used to the least used, (b) the most efficient to the least efficient, (c) the most difficult to acquire to the least difficult to acquire, and (d) the tastiest to the least tasty. Analyses were based on average ranks. The average rank for 'use' was the dependent variable. Ten plants were used for constipation and their average ranks ranged from 1.8 (most used) to 6.4 (least used). The independent variables were calculated the same way.
After that I did two different things that gave different results:
1) Since all variables are of the same nature and share the same minimum, maximum and mean values, I combined them a priori and ran a simple linear regression. I tried several combinations (using sums to preserve linearity) and found that the combination “efficiency + taste - difficulty of acquisition” gives a higher R² (0.8, whereas the best single variable has an R² of 0.38). I compared the regression lines for this combination and for the individual variables and found significant differences.
2) I performed a traditional multiple linear regression. Although each of the three variables on its own is significantly related to the response variable, when they are together in the model none of them has a significant relationship with “use”. When I use a stepwise approach, only one variable is left in the model (taste), with an R² of 0.38.
I don’t know if I am forcing the data, but to me it is difficult to conclude that only taste explains use when I found an R² of 0.8 with an a priori combination of variables.
The thing is: is there something wrong with what I did in the first set of analyses? What would be the best way to find the combination of variables that best explains “use”? Is there another way to reveal the best combination?
I apologize for my superficial knowledge on the subject. Can anyone help me?
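One way to look at the two approaches described above: the composite “efficiency + taste - difficulty” is the multiple regression with the three slopes constrained to b, b and -b, so on the same data its R² cannot exceed that of the unconstrained three-variable model. The following is a minimal sketch of that comparison in plain Python. The data are synthetic and made up purely for illustration (they are not the values from this study), and the helper names `solve` and `r_squared` are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(predictors, y):
    """R^2 of an ordinary least squares fit of y on the predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic average ranks for 10 hypothetical plants (NOT the study data).
random.seed(0)
efficiency = [random.uniform(1.5, 6.5) for _ in range(10)]
taste = [random.uniform(1.5, 6.5) for _ in range(10)]
difficulty = [random.uniform(1.5, 6.5) for _ in range(10)]
use = [0.4 * e + 0.4 * t - 0.3 * d + random.gauss(0, 0.5)
       for e, t, d in zip(efficiency, taste, difficulty)]

# Approach 1: a single a priori composite predictor.
composite = [e + t - d for e, t, d in zip(efficiency, taste, difficulty)]
r2_composite = r_squared([composite], use)

# Approach 2: all three predictors with free coefficients.
r2_full = r_squared([efficiency, taste, difficulty], use)

print(f"R^2 composite: {r2_composite:.3f}")
print(f"R^2 full model: {r2_full:.3f}")
```

With only ten observations and predictors that may be correlated with each other, the full model can have a high overall R² while none of its individual coefficients reaches significance, which may be part of what is happening in the second analysis. Note also that picking the best of several composites after looking at their R² is itself a form of model selection, so the R² of the winning composite is likely to be optimistic.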


Super Moderator
Your dependent variable is a set of 4 ordered categories, not a continuous variable. So multiple regression isn't really appropriate here. Ordinal logistic regression might be a better choice.


New Member
Actually, the values of my dependent variable are average ranks. I have 10 plant species and each of them has an average rank of use (e.g. 4.1, 5.2, 1.8). If a plant was cited as the most used by many people, its average rank is close to 1. That is why I cannot use logistic regression.
All independent variables were calculated in a similar way.
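To make the calculation concrete, here is a minimal sketch of how an average rank per plant could be computed. The informants, plant names and ranks are hypothetical, and averaging only over the informants who actually ranked a plant is an assumption, since the thread does not say how plants left uncited by an informant were handled.

```python
from statistics import mean

# Hypothetical rankings: each informant ranks only the plants they cited
# (1 = most used). These values are invented for illustration.
rankings = [
    {"plant_a": 1, "plant_b": 2, "plant_c": 3},
    {"plant_b": 1, "plant_a": 2},
    {"plant_a": 1, "plant_c": 2, "plant_b": 3},
]

def average_ranks(rankings):
    """Average each plant's rank over the informants who ranked it (an assumption)."""
    by_plant = {}
    for informant in rankings:
        for plant, rank in informant.items():
            by_plant.setdefault(plant, []).append(rank)
    return {plant: mean(ranks) for plant, ranks in by_plant.items()}

avg = average_ranks(rankings)
for plant, value in sorted(avg.items(), key=lambda item: item[1]):
    print(plant, round(value, 2))
```

The same function would apply to the efficiency, taste and difficulty rankings, giving one averaged value per plant for each variable.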


Fortran must die
How do you create an average of a rank? Since ranks are not formally interval scale, you can't average them (although admittedly this is done fairly often, I would guess, and I do it myself). :p