Positive chi-square test: what to do next?


I'm analysing two categorical variables, Landslides (YES/NO) and Vegetation (grouped into 18 classes), and I'd like to explore and model the relationship between them.

Firstly, I've carried out a chi-square test for independence that has returned a positive result, i.e. the two variables are not independent, rejecting the null hypothesis.

Secondly, I would like to know if one of the variables is dependent on the other, and if so, how can I describe and model this relationship? My first thought here was to carry out a logistic regression analysis of the two variables, testing the two variables against each other to see which returns a more "successful" result.

My doubts about this procedure are as follows.

Firstly, the chi-square contingency table had various cells with expected frequencies of less than 5 (14.7% of the cells, to be exact). From reading around, some authors say that this proportion should be below 20%; however, I've also read that in cases like these Fisher's exact test would be better. Should I run it first to check whether my results are reliable?
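To make the check described above concrete, here is a minimal sketch that computes the expected frequencies from the row and column totals and reports the share of cells below 5. The counts are made up for illustration (they are not the data from this thread), and the helper names expected_counts, share_below and chi_square_statistic are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def share_below(expected, threshold=5.0):
    """Fraction of cells whose expected count is below the threshold."""
    cells = [e for row in expected for e in row]
    return sum(e < threshold for e in cells) / len(cells)


def chi_square_statistic(table, expected):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical counts: rows = vegetation classes, columns = landslide (yes, no).
observed = [
    [30, 970],
    [12, 488],
    [2, 98],
    [1, 19],
]

expected = expected_counts(observed)
low_share = share_below(expected)
statistic = chi_square_statistic(observed, expected)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"cells with expected count < 5: {low_share:.1%}")
print(f"chi-square = {statistic:.2f} on {dof} degrees of freedom")
```

In this toy table the two sparsest classes push the share of low-expectation cells above 20%, which is the kind of situation where merging classes or using an exact test is usually considered.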

Secondly, the Vegetation variable is grouped in 18 classes. Is this too many? I could combine some classes if need be.

Thirdly, if the variables are not independent, what should be the next step? How can I know which variable is dependent and how should I test this?

Many thanks in advance!!


ps. While I wait for an answer, I'll start looking at the Fisher test.


TS Contributor
Not sure what you want to achieve. The chi-square test tells you that vegetation types are
distributed differently across areas with and without landslides. What do you mean by asking
which variable is dependent? Are you asking whether there is a test that can determine whether
landslides influence vegetation, or whether vegetation influences landslides?

By the way, perhaps you should describe your study design and your data a little bit more,
maybe the Chi square was inappropriate from the beginning.

With kind regards


Firstly, thanks for the quick reply. I'll try to explain myself a little better here and provide more details.

The study we are carrying out looks at the different factors that might influence landslides, vegetation being one of them. To do this we have different maps (or layers) that represent environmental factors (e.g. vegetation, lithology, slope, altitude, soil type, etc.) and a map of the landslides in our study area.

Concentrating solely on vegetation, by sampling the vegetation and landslide layers we can produce a table with two columns: landslides (yes/no) and vegetation (one of 18 classes). The table has 21,262 records and served as input to the chi-square test (see the attached contingency table). Assuming the chi-square result is reliable (see my previous doubts about this: should I carry out a Fisher exact test?), it confirms the hypothesis that landslides are distributed differently across vegetation types. But here, as you asked, I'd also like to know if there's a test that can tell me which variable is dependent and which is independent.
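For illustration, the step from the two-column sample table to a contingency table could look roughly like this. The records and class names below are invented placeholders, not the actual 21,262 samples, and cross_tabulate is a hypothetical helper.

```python
from collections import Counter

# Hypothetical sample records: (vegetation class, landslide present?).
records = [
    ("forest", "no"), ("forest", "no"), ("forest", "yes"),
    ("scrub", "yes"), ("scrub", "yes"), ("scrub", "no"),
    ("pasture", "no"), ("pasture", "no"),
]


def cross_tabulate(records):
    """Return {vegetation class: (landslide yes count, landslide no count)}."""
    counts = Counter(records)
    classes = sorted({veg for veg, _ in records})
    return {veg: (counts[(veg, "yes")], counts[(veg, "no")]) for veg in classes}


table = cross_tabulate(records)
for veg, (yes, no) in table.items():
    share = yes / (yes + no)
    print(f"{veg:10s} yes={yes:3d} no={no:3d} landslide share={share:.2f}")
```

Printing the landslide share per class alongside the counts gives a quick view of which vegetation types differ, which the chi-square statistic on its own does not show.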

Clearly, the distribution of landslides is probably influenced by more than vegetation alone; however, we thought it might be best to focus on each factor individually to start with (using chi-square) before analysing the role of all the factors together. This would provide us with a set of factors that potentially influence the distribution of landslides and, by logistic regression analysis, we could model the relative influence and significance of each. Maybe I'm getting ahead of myself here, but we identified logistic regression as a possibility because of the nature of the variables involved, which are a mix of categorical and continuous measurements.

Anyway, I hope that explains things a little better. If you want to know more details, please let me know. This is the first time I've posted here so I'm a bit of a novice!

Many thanks, again!




TS Contributor
Maybe I'm getting ahead of myself here, but we identified logistic regression as a possibility because of the nature of the variables involved, which are a mix of categorical and continuous measurements.
Yes, you could do logistic regression with vegetation as predictor, while controlling for other factors.
Categorical measures can be used as predictors. As you mentioned before, you could perhaps consider
collapsing categories of your vegetation variable.
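
As a rough sketch of both suggestions: when vegetation is the only predictor, a dummy-coded logistic regression simply reproduces the observed log-odds of each class, so its coefficients can be written down without a fitting routine. The counts, the class names, the choice of "forest" as reference class and both helper functions are assumptions made for illustration. Once other factors are added as controls, a proper fitting routine from a statistics package is needed, which this sketch does not cover.

```python
import math

# Hypothetical (yes, no) landslide counts per vegetation class.
counts = {
    "forest": (30, 970),
    "scrub": (12, 488),
    "pasture": (2, 98),
    "bare": (1, 19),
}


def collapse_classes(counts, groups):
    """Merge classes by summing counts; `groups` maps old class -> new class."""
    merged = {}
    for veg, (yes, no) in counts.items():
        new_name = groups.get(veg, veg)
        old_yes, old_no = merged.get(new_name, (0, 0))
        merged[new_name] = (old_yes + yes, old_no + no)
    return merged


def dummy_coded_coefficients(counts, reference):
    """Intercept and coefficients of a logistic model with one categorical predictor.

    The intercept is the log-odds of a landslide in the reference class; each
    coefficient is the log odds ratio of a class against the reference.
    Every class needs non-zero yes and no counts, otherwise the log-odds
    is undefined.
    """
    def log_odds(yes, no):
        return math.log(yes / no)

    intercept = log_odds(*counts[reference])
    coefs = {
        veg: log_odds(*c) - intercept
        for veg, c in counts.items()
        if veg != reference
    }
    return intercept, coefs


# Collapse the two sparsest classes into one before computing coefficients.
merged = collapse_classes(counts, {"pasture": "open", "bare": "open"})
intercept, coefs = dummy_coded_coefficients(merged, reference="forest")

print(f"intercept (log-odds in forest): {intercept:.3f}")
for veg, beta in coefs.items():
    print(f"{veg}: coefficient={beta:.3f}, odds ratio vs forest={math.exp(beta):.2f}")
```

An odds ratio below 1 means landslides are less likely in that class than in the reference class, and above 1 means they are more likely.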

With kind regards

OK, thanks. I'll start with a logistic regression then. As the chi-square test is only a preliminary step, I can always go back and reclassify the vegetation at a later stage. My main doubt was whether I should worry about the low expected frequencies.

Many thanks,