Odds ratio or other technique


(I have also posted this at https://stats.stackexchange.com/questions/535284/odds-ratio-or-other-technique)

Before I start: I am not well versed in posting on forums, so please be patient if I'm going against convention. I am also on a steep learning curve with statistics.

I have been presented with some data from a survey filled in by workers in chemical factories (made-up data is attached in an Excel spreadsheet to illustrate the question; in reality the sample size is larger).


I am being asked to analyse whether there is a statistically significant association between being diagnosed with cancer whilst working in the chemical environment and whether the workplace has designated clean and dirty areas.

1) Am I correct in saying that I can calculate an odds ratio with confidence intervals and a p-value, as below? I would have to class 'Y - not well adhered to' as 'N' in this case. Is this recommended?

Capture odds ratio table.JPG

2) Would odds ratio be the recommended approach or are there more robust methods?

3) It may be that I have to exclude some participants who were diagnosed with cancer whilst working with chemicals, but have declined to answer. Does this need taking into account in some way?

Thank you for your time.





Less is more. Stay pure. Stay poor.
Yes, your approach would be fine. If you think Clean Place may be protective, then you may switch the ordering of the rows to get an odds ratio on the positive side.

Not sure of the origins of your data, but you could have survivor bias.

As for persons not collected, if there is reason to suspect a systematic bias (selective loss based on a certain exposure and outcome group), there are quantitative bias analyses as well as probabilistic quantitative bias analyses that you can conduct. To do these you would need a validation sample or assumptions about the proportion of subjects that may be in these scenarios. An additional approach may be to just play around with the numbers to show how many missed persons it would take to nullify the results, given that you have non-null results. There is an approach called E-values, by Tyler VanderWeele, that can be used to quantify the impact of selection bias on results similar to this.
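
A minimal Python sketch of the "play around with the numbers" idea. It assumes the counts posted later in this thread (217 cases / 6334 non-cases without clean/dirty areas, 95 / 4015 with them) and a deliberately extreme, hypothetical scenario in which every missed person is a cancer case from a workplace with clean/dirty areas. The function names are made up for illustration, and this is not a substitute for a formal quantitative bias analysis.

```python
import math


def wald_lower_bound(a, b, c, d, z=1.96):
    """Lower 95% Wald bound of the odds ratio (a/b) / (c/d)."""
    log_or = math.log((a / b) / (c / d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se)


def missed_cases_to_nullify(a, b, c, d, max_extra=10_000):
    """Smallest number of extra cases added to cell c (cases WITH clean/dirty
    areas) at which the lower confidence bound drops to 1 or below.
    Returns None if that never happens within max_extra."""
    for extra in range(max_extra + 1):
        if wald_lower_bound(a, b, c + extra, d) <= 1.0:
            return extra
    return None


# a, b = cases / non-cases without clean/dirty areas
# c, d = cases / non-cases with clean/dirty areas
tipping_point = missed_cases_to_nullify(217, 6334, 95, 4015)
print("Missed cases needed to make the interval include 1:", tipping_point)
```

Other hypothetical scenarios (for example, missed non-cases in the group without clean/dirty areas) can be explored by changing which cell receives the extra persons.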
Thank you hlsmith - that is really helpful and reassuring.

The origins of the data are a survey of all workers in that industry - so it is their choice whether to complete the survey. Does this have any bearing on the analysis?

I have a couple more questions if you don't mind:

1) Is there any reason why I shouldn't/couldn't use the Risk Ratio instead?

2) We have another field (smoker/non-smoker) not included in my example. If we were to target these workers with some further survey questions in the future, would we then need to take this into account when analysing the data returned, because in effect we would be taking a sample from within a sample?

3) I am planning on using RStudio to calculate the odds ratio as mentioned in the original post. Do you happen to know which is the best package to use for this?




Less is more. Stay pure. Stay poor.
I would be transparent when reporting results about the threat of survivor or selection bias.

The ideal estimate to report is the risk difference. However, the default in survey data is to report odds ratios, since the outcome and exposure were collected at the same time (cross-sectionally), so you may not necessarily know the ordering of the exposure and outcome.
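
To see how the three measures compare on the same table, here is a short Python sketch. It assumes the counts posted further down this thread (cases / non-cases: 217 / 6334 without clean/dirty areas, 95 / 4015 with them); the variable names are only illustrative.

```python
# Counts taken from the second table described later in this thread.
cases_no_clean, noncases_no_clean = 217, 6334
cases_clean, noncases_clean = 95, 4015

# Risk = cases / everyone in the group; odds = cases / non-cases.
risk_no_clean = cases_no_clean / (cases_no_clean + noncases_no_clean)
risk_clean = cases_clean / (cases_clean + noncases_clean)

risk_difference = risk_no_clean - risk_clean
risk_ratio = risk_no_clean / risk_clean
odds_ratio = (cases_no_clean / noncases_no_clean) / (cases_clean / noncases_clean)

print(f"Risk difference: {risk_difference:.4f}")
print(f"Risk ratio:      {risk_ratio:.3f}")
print(f"Odds ratio:      {odds_ratio:.3f}")
```

Because the outcome is rare in both groups (roughly 2 to 3%), the odds ratio and risk ratio come out close to each other here.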

You can incorporate smoking status via multiple logistic regression, or by using your contingency table approach separately for smokers and then non-smokers (stratification: run two tables).

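In R, glm with family = binomial(link = "logit") could be sufficient for multiple logistic regression (controlling for both variables at the same time); exponentiating the coefficients gives adjusted odds ratios. Using link = "log" instead fits a log-binomial model, whose exponentiated coefficients are risk ratios.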
Hello again,

I am in a position where I think I have calculated an odds ratio and interpreted it with 95% CI. I would really appreciate it if someone could cast their eyes over it to see if I have calculated it correctly (I have done it by hand to try to get to grips with the calculation).

The first of the two tables shows the number of workers who were diagnosed with cancer whilst working in this environment and whether or not there is a designated clean area. The answer could be yes, no, or yes but the area is not adhered to.

I have decided to group this middle group in with the group that does NOT have a designated clean area. This is reflected in the second table.



a: Odds of being diagnosed with cancer after starting in fire service without clean/dirty areas = 217/6334 = 0.03426

b: Odds of being diagnosed with cancer after starting in fire service with clean/dirty areas = 95/4015 = 0.02366

Odds Ratio (OR) = a/b = 1.4480
(This suggests that the odds of being diagnosed with cancer where there are no designated clean/dirty areas are 1.448 times the odds where there are designated clean/dirty areas.)

To calculate the 95% confidence intervals

Convert the OR to ln(OR) = ln(1.448) = 0.3702

95% Conf Int = ln(OR) +/- 1.96 x SE(ln(OR))

SE (Standard Error - Wald method) = square root ((1/95)+(1/217)+(1/4015)+(1/6334))

So, SE = 0.12467

And, Conf Int = 0.3702 +/- 1.96x(0.12467)

= 0.3702 +/- 0.24435 = (0.12585, 0.61455)

So ln(OR) = 0.3702 with 95% CI (0.12585, 0.61455)

…and using exponential to convert back from ln:

OR = 1.4480 with 95% CI (1.13, 1.85)

i.e. as the interval doesn’t cross 1, we can be 95% sure that the odds of being diagnosed with cancer where there are no designated clean/dirty areas are between 1.13 and 1.85 times the odds where there are designated clean/dirty areas.
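
For anyone wanting to check the hand calculation, the same steps can be reproduced in a few lines of Python (standard library only). This is just a sketch of the Wald method described above; the function name is made up for illustration.

```python
import math

# Counts from the second table (middle group merged into "no clean area").
cases_no_clean, noncases_no_clean = 217, 6334
cases_clean, noncases_clean = 95, 4015


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio (a/b) / (c/d) with a Wald interval built on the log scale."""
    or_ = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, se, lower, upper


or_, se, lower, upper = odds_ratio_wald(
    cases_no_clean, noncases_no_clean, cases_clean, noncases_clean
)
print(f"OR = {or_:.3f}, SE(ln OR) = {se:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Carrying full precision through every step gives an upper limit that rounds to 1.85.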

Any comments would be helpful,




Less is more. Stay pure. Stay poor.
Seems fine. Your CI interpretation is a little off.

Upon repeated sampling of the population, you would expect 95% of the CIs to include the true OR. Frequentist interpretations read weird :)
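
The repeated-sampling idea can be illustrated with a small simulation. The sketch below uses entirely hypothetical inputs (true risks of 20% and 15%, 500 people per group, 1000 repeated surveys) that have nothing to do with the real data; it simply counts how often the Wald interval contains the true OR.

```python
import math
import random


def wald_ci(a, b, c, d, z=1.96):
    """95% Wald interval for the odds ratio (a/b) / (c/d)."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def simulate_coverage(p_exposed, p_unexposed, n_per_group, n_surveys, seed=1):
    """Fraction of simulated surveys whose interval contains the true OR.
    Assumes no cell count is ever zero, which is practically certain for
    the hypothetical inputs used below."""
    rng = random.Random(seed)
    true_or = (p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))
    covered = 0
    for _ in range(n_surveys):
        # Draw the number of cases in each group for one simulated survey.
        a = sum(rng.random() < p_exposed for _ in range(n_per_group))
        c = sum(rng.random() < p_unexposed for _ in range(n_per_group))
        lower, upper = wald_ci(a, n_per_group - a, c, n_per_group - c)
        covered += lower <= true_or <= upper
    return covered / n_surveys


# Hypothetical population values, chosen only for illustration.
coverage = simulate_coverage(0.20, 0.15, n_per_group=500, n_surveys=1000)
print(f"Share of intervals containing the true OR: {coverage:.3f}")
```

The share should land near 0.95, which is what the "95%" refers to: the long-run behaviour of the procedure, not the probability that any single interval is right.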