I was hoping someone might be able to assist. I have discussed this with two people who understand a great deal about stats, but I have since read something that makes me want to question their advice without insulting them, so I thought I would post here. OK, here goes:

1. I visited 6 geographical locations. Each one had 50 individuals investigated for one disease. The prevalence was determined for said disease at each site.
2. A representative sediment sample was taken at each location, consisting of 3x replicates. Each sample was analysed for an array of individual chemicals belonging to three classes. For simplicity's sake in this post, we shall discuss the sum of all chemicals belonging to one of these classes, e.g. ∑PCBs. NB: the mean of the 3x replicates was used for the analyses.

The aim is to investigate whether an increase in ∑PCBs within the sediment sample correlates with an increase in the prevalence of the disease condition. Firstly, let's recognise that there are essentially six data points, where (a) x = the concentration of chemical measured, (b) y = the prevalence of the condition, and (c) the six points represent the 6 locations.
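To make the setup concrete, here is a minimal sketch of how the six (x, y) points are assembled. The function name `site_point` and every number in `sites` are hypothetical placeholders for illustration, not my real measurements.

```python
from statistics import mean

N_EXAMINED = 50  # individuals examined per site, as described above


def site_point(replicates, n_diseased, n_examined=N_EXAMINED):
    """Return (x, y) for one site: mean of the replicates and disease prevalence."""
    return mean(replicates), n_diseased / n_examined


# Hypothetical placeholder values, NOT the real data:
# (three replicate sum-PCB concentrations, number of diseased individuals out of 50)
sites = [
    ([1.1, 0.9, 1.0], 5),
    ([2.2, 1.9, 1.9], 8),
    ([3.0, 3.1, 2.9], 12),
    ([4.2, 3.8, 4.0], 15),
    ([5.1, 4.9, 5.0], 20),
    ([6.0, 6.2, 5.8], 24),
]

# One (x, y) pair per location, six pairs in total.
points = [site_point(reps, cases) for reps, cases in sites]
for x, y in points:
    print(f"sum PCBs = {x:.2f}, prevalence = {y:.2f}")
```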

Now, I was under the impression that correlation was the way forward, because correlations are based on the assumption that there is a linear relationship between X and Y. However, I was told that, on balance, linear regression was probably best, although I needed to arcsine transform my dependent variable first because it was proportional data (and logistic regression was arguably overkill on 6 data points).
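For anyone who wants to see what that advice amounts to in practice, here is a rough sketch of the arcsine square-root transform followed by an ordinary least-squares fit. I am assuming "arcsine transform" means asin(sqrt(p)), which is the usual form for proportions. The helper names and all the numbers are hypothetical placeholders, not my real data.

```python
import math


def arcsine_sqrt(p):
    """Arcsine square-root transform of a proportion p in [0, 1], in radians."""
    return math.asin(math.sqrt(p))


def least_squares(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical placeholder values, NOT the real data.
sum_pcbs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # mean sum-PCB concentration per site
diseased = [5, 8, 12, 15, 20, 24]           # diseased individuals out of 50 per site

prevalence = [d / 50 for d in diseased]
transformed = [arcsine_sqrt(p) for p in prevalence]

slope, intercept = least_squares(sum_pcbs, transformed)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

The fitted slope is on the transformed scale (radians per unit of concentration), so it is not a change in prevalence per unit of ∑PCBs.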

Problem is that I was under the impression that linear regression usually relies on the independent variable being experimentally manipulated, e.g. time, or, as in toxicology studies, exposure to a controlled series of chemical concentrations. In my example, both variables are being measured. I am measuring chemical concentration, but it is not being experimentally manipulated prior to exposure. Rather, it is being measured in the environment and then used to investigate a relationship.

So my thought is that I should actually revert to my original hunch of "correlation", although I should still arcsine transform the disease-condition data (as advised) because it is proportional data.
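A minimal sketch of that correlation approach, again with hypothetical placeholder numbers rather than my real data. It computes Pearson's r on the transformed prevalence and the usual t statistic with n - 2 degrees of freedom. As far as I understand it, with a single predictor this t statistic is the same one you would get from testing the regression slope against zero, and r is unchanged if x and y are swapped.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: no linear association, on n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical placeholder values, NOT the real data.
sum_pcbs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
diseased = [5, 8, 12, 15, 20, 24]  # out of 50 per site

# Arcsine square-root transform of each prevalence.
transformed = [math.asin(math.sqrt(d / 50)) for d in diseased]

r_xy = pearson_r(sum_pcbs, transformed)
r_yx = pearson_r(transformed, sum_pcbs)  # swapping the variables gives the same r
t = t_statistic(r_xy, len(sum_pcbs))     # compare against t with 4 degrees of freedom

print(f"r = {r_xy:.4f}, t = {t:.3f}, df = {len(sum_pcbs) - 2}")
```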

As an aside, I read that more people are moving to logistic regression over the arcsine transformation, but this is surely overkill on 6 data points. Furthermore, the data fall approximately along the linear phase of both the logistic and arcsine transformations, so arguably, do they even need transforming in the first place? However, for accuracy I chose to transform them, i.e. because it's proportional data.
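To illustrate what I mean by the "linear phase", here is a rough sketch that checks how close to a straight line the two transformations are over a mid-range of proportions. The range 0.30 to 0.70 is an assumption chosen for illustration, not the range of my actual data, and `linearity` is a hypothetical helper that simply returns Pearson's r between the raw and transformed values.

```python
import math


def arcsine_sqrt(p):
    """Arcsine square-root transform of a proportion p in [0, 1]."""
    return math.asin(math.sqrt(p))


def logit(p):
    """Log-odds of a proportion p strictly between 0 and 1."""
    return math.log(p / (1 - p))


def linearity(xs, ys):
    """Pearson's r between xs and ys; values near 1 mean a near-linear relation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Assumed mid-range of proportions: 0.30, 0.35, ..., 0.70 (illustrative only).
props = [0.30 + 0.05 * i for i in range(9)]

r_arcsine = linearity(props, [arcsine_sqrt(p) for p in props])
r_logit = linearity(props, [logit(p) for p in props])

print(f"arcsine sqrt vs raw proportion: r = {r_arcsine:.5f}")
print(f"logit vs raw proportion:        r = {r_logit:.5f}")
```

Both values come out very close to 1 over this range, which is the sense in which the transformation changes little for mid-range proportions. Nearer 0 or 1 the curves bend much more.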

Anyone care to comment? I would appreciate feedback.

Many thanks,