# Statistical test to find association between two variables

#### Denis

##### New Member
I'm dealing with ecological data. Broadly speaking, I've counted plant abundance (a discrete variable) at a number of points (small blocks, one count per point). There were about 50 such points in total. For each point (block) we determined a substrate type (a nominal variable with two levels, e.g. substrate A and substrate B). We need to test whether there is a statistical dependence between substrate type and plant abundance, e.g. to be able to say that the plant is usually more abundant on substrate A. In addition, it's worth mentioning that the first half of my points (points 1 to 25) were collected in one location and points 26 to 50 in another location, i.e. not all of my points are independent. Which statistical test can I use in my case?

This question was initially posted at:

https://stats.stackexchange.com/que...est-to-find-association-between-two-variables

#### Miner

##### TS Contributor
A few clarifying questions: 1) What do you mean by "not all of my points are independent"? 2) What is the relative magnitude (min/max) of your count data per point (e.g. < 10, 100's, etc.)?

#### Denis

##### New Member
1) In total I have about 50 points. For each point I have one number corresponding to the plant's abundance (discrete variable) and one nominal variable (substrate type). One half of my points (points = blocks = samples) are in one geographical location and the other half are in another geographical location. Points from the same location are close to each other and are therefore not independent. Points from different locations are independent. Sorry if that was not clear from the description above.

2) From 0 to 1000000.

#### Miner

##### TS Contributor
Treat location as a blocking factor and substrate type as your treatment factor. If I understand your description correctly, points would be 25 replicates. So, this would be a 2^2 full factorial design with 25 replicates. Others may disagree with the following recommendation, but I have found that it works well in my field of industrial statistics. When you are dealing with large count data, you can safely treat it as if it were continuous data (see normal approximation to the Poisson). Or you can use the Freeman-Tukey transformation for count data.
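For readers who want to try the transformation mentioned above, here is a minimal sketch in Python. It assumes the form of the Freeman-Tukey transformation usually quoted for Poisson-like counts, sqrt(x) + sqrt(x + 1). The function name `freeman_tukey` and the example counts are made up for illustration and are not from this thread's data.

```python
import math


def freeman_tukey(count):
    """Freeman-Tukey square-root transform, sqrt(x) + sqrt(x + 1).

    Assumes `count` is a non-negative count from a Poisson-like process.
    """
    if count < 0:
        raise ValueError("counts must be non-negative")
    return math.sqrt(count) + math.sqrt(count + 1)


# Hypothetical abundance counts, invented for illustration only.
example_counts = [0, 3, 120, 1_000_000]
transformed = [freeman_tukey(c) for c in example_counts]

for raw, value in zip(example_counts, transformed):
    print(f"{raw:>9} -> {value:.3f}")
```

The transformed values, rather than the raw counts, would then go into the ANOVA or regression. Whether that is adequate for counts spanning 0 to 1000000 is something to check on the actual data, for example by looking at the residuals.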

#### Denis

##### New Member
Thanks a lot! Yes, I have 25 replicates for one location and 25 for the other location. What is a 2^2 full factorial design? I guess the first 2 means two factors, location and substrate. Right? What does the second 2 mean? Do you recommend that I use linear regression (the lm function in R)?

#### Miner

##### TS Contributor
See this article on factorial experiments, particularly section 4 Notation. 2^2 means a 2 level design with 2 factors. I would use a 2-way ANOVA, but you could use linear regression. I am not an R user, so I cannot make any recommendations on R functions.
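To make the 2^2 notation concrete, here is a small Python sketch that lists the cells of a design with 2 factors at 2 levels each. The level labels ("location 1", "location 2", "A", "B") are placeholders chosen for illustration.

```python
from itertools import product

# Two factors, each with two levels (labels are illustrative placeholders).
factors = {
    "location": ["location 1", "location 2"],
    "substrate": ["A", "B"],
}

# A full factorial design contains every combination of factor levels.
design_cells = [
    dict(zip(factors, levels)) for levels in product(*factors.values())
]

for cell in design_cells:
    print(cell)

# 2 levels raised to the power of 2 factors gives 4 cells.
print(len(design_cells))
```

Each observed point falls into exactly one of these four cells, and the points within a cell act as the replicates for that combination.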

#### Denis

##### New Member
Thank you for the explanation and the useful link. I see, though, that my explanation of the dependence between replicates within the same location was not good. Let's forget about plants and substrates for a moment. Suppose we've measured the height of multiple people in two countries (e.g. 50 replicates = people in each country). We would like to know whether people's height in one country is statistically different from that in the other country. So far it's just a one-way ANOVA. But let's introduce one additional experimental detail. The replicates in each country are not random people, but members of one family. Obviously, in that case ANOVA is not applicable.

In my experiment the first half of my points (points 1 to 25), collected in one location, are to some extent just members of the same family, and the same is true for points 26 to 50 in the other location. They are not random points within one location. I hope it is clear now. Sorry again for the poor explanations in my previous posts.

#### Miner

##### TS Contributor
Understood. Blocking on location addresses that concern. Designed experiments started in agriculture, and the concepts of blocking and split-plot experiments were developed to address that specific issue. The terminology of plots comes from that agricultural origin.
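As a rough illustration of what blocking on location does, the Python sketch below compares substrates A and B separately inside each location and then averages those within-location differences, so that a location that is richer overall does not distort the substrate comparison. The records, the location labels, and the helper name `within_block_differences` are all invented for illustration. This only estimates the effect; it is not a significance test and does not replace the 2-way ANOVA suggested above.

```python
from statistics import mean

# Hypothetical records (location, substrate, count); all values are invented.
records = [
    ("loc1", "A", 120), ("loc1", "A", 150), ("loc1", "B", 80), ("loc1", "B", 95),
    ("loc2", "A", 900), ("loc2", "A", 1100), ("loc2", "B", 700), ("loc2", "B", 650),
]


def within_block_differences(rows):
    """Return mean(A) - mean(B), computed separately inside each location."""
    diffs = {}
    for loc in sorted({row[0] for row in rows}):
        counts_a = [y for l, s, y in rows if l == loc and s == "A"]
        counts_b = [y for l, s, y in rows if l == loc and s == "B"]
        diffs[loc] = mean(counts_a) - mean(counts_b)
    return diffs


diffs = within_block_differences(records)

# Averaging the per-location differences gives a substrate effect
# that is not driven by overall differences between locations.
blocked_effect = mean(list(diffs.values()))

print(diffs)
print(blocked_effect)
```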