# Modeling cancer incidence rates

#### kiton

##### Member
Hello dear forum members,

Using county-level panel data (N = 2500+, T = 3), I aim to examine the associations between multiple biological, socioeconomic, and psychological factors and cancer incidence. Two outcome measures are available to me:

(1) cancer incidence rate, IR = (New cancers / Population) × 100,000
M = 197.19, SD = 51.12, Min = 0, Max = 610.6, Skewness = .1, Kurtosis = 7.25

(2) count of new occurrences
M = 241.4, SD = 674.11, Min = 0, Max = 17,742, Variance = 454,427.7
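
To make the two outcome definitions concrete, here is a minimal sketch of the rate formula and of the raw variance-to-mean ratio implied by the count summary above. The function name and the example county are hypothetical. Note that this is only the marginal ratio: much of it reflects differing county populations, so it is not a test of overdispersion conditional on covariates.

```python
def incidence_rate(new_cases, population, per=100_000):
    """Incidence rate as defined in (1): new cases per `per` residents."""
    return new_cases / population * per

# Hypothetical county: 50 new cancers among 25,000 residents
example_rate = incidence_rate(50, 25_000)

# Summary statistics of the count outcome quoted in (2)
mean_count = 241.4
var_count = 454_427.7

# A Poisson variable has variance equal to its mean, so a ratio far
# above 1 means a plain Poisson model for the raw counts is too tight.
dispersion_ratio = var_count / mean_count

print(f"example rate per 100,000: {example_rate:.1f}")
print(f"marginal variance / mean: {dispersion_ratio:.0f}")
```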

My initial approach is to model (continuous) rates using OLS. However, although its distribution is relatively normal, there is some variation in the tails:

[Figure: histogram and quantile plot of the incidence rate]

As a result, the OLS residuals are far from perfect:

[Figure: normal quantile plot of the OLS residuals]

Question 1: What modeling approach would you recommend to address such variation? I realize quantile regression is one option (with its ups and downs), but perhaps there are other "standard" ways to model rates?

Question 2: Are there any reasons why I should use specifically rates, or instead counts for the purpose of analysis? Is there any general consensus on this?

Your feedback would be greatly appreciated.

#### kiton

##### Member
There is an interesting article on the topic: Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23(1), 151-169.

But I feel I am too old school and still believe in BLUE for trustworthy inferences.

#### ondansetron

##### TS Contributor
The normality assumption refers to the distribution of the errors, not of the dependent variable. You should look into Poisson and negative binomial regression for rates.
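
One common way a count model handles a rate is to model the count with the log of the population as an offset, so the coefficients describe the rate. Below is a minimal pure-Python sketch of that idea for a Poisson model with one covariate, fitted by Newton-Raphson. The function name, the covariate `x`, and the county data are all hypothetical, and this is an illustration under those assumptions, not a replacement for a proper GLM routine (it gives no standard errors and does not handle overdispersion, which is where negative binomial comes in).

```python
import math

def fit_poisson_rate(counts, population, x, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of log(E[count]) = log(population) + b0 + b1 * x."""
    # Start from the pooled crude rate and no covariate effect.
    b0 = math.log(sum(counts) / sum(population))
    b1 = 0.0
    for _ in range(n_iter):
        # Expected counts: population times the modeled rate.
        mu = [p * math.exp(b0 + b1 * xi) for p, xi in zip(population, x)]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(y - m for y, m in zip(counts, mu))
        g1 = sum((y - m) * xi for y, m, xi in zip(counts, mu, x))
        # Fisher information matrix (2 x 2, symmetric).
        i00 = sum(mu)
        i01 = sum(m * xi for m, xi in zip(mu, x))
        i11 = sum(m * xi * xi for m, xi in zip(mu, x))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2 x 2 system information * step = score.
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical counties: population, one covariate, observed new cases
population = [12_000, 45_000, 150_000, 8_000, 60_000, 300_000]
x = [0.2, 1.1, 0.7, 1.8, 0.4, 1.5]
counts = [20, 110, 310, 25, 105, 900]

b0, b1 = fit_poisson_rate(counts, population, x)
print(f"baseline rate per 100,000: {math.exp(b0) * 100_000:.1f}")
print(f"rate ratio per unit of x:  {math.exp(b1):.2f}")
```

Because of the offset, `exp(b0)` is a rate (cases per person at `x = 0`) and `exp(b1)` is a rate ratio, so the count and the rate analyses are the same model here.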

#### kiton

##### Member
> The normality assumption refers to the distribution of the errors, not of the dependent variable. You should look into Poisson and negative binomial regression for rates.

Thank you for the response. That's absolutely correct, and that's exactly why I included a residual plot to assess the model fit. Relatedly, if the outcome's distribution is abnormal, the residuals would not be "normal" either.

Let me ask a clarification question on the second part though. If I were using a count as an outcome, then I'd definitely use either of those non-linear models with no reservations. However, what's the logic behind using those models for a continuous nearly normal outcome? Perhaps I am missing something here.

#### ondansetron

##### TS Contributor
> Thank you for the response. That's absolutely correct, and that's exactly why I included a residual plot to assess the model fit. Relatedly, if the outcome's distribution is abnormal, the residuals would not be "normal" either.
>
> Let me ask a clarification question on the second part though. If I were using a count as an outcome, then I'd definitely use either of those non-linear models with no reservations. However, what's the logic behind using those models for a continuous nearly normal outcome? Perhaps I am missing something here.

It is not generally true that a nonnormal Y variable necessitates nonnormally distributed errors.
And sorry, I somehow missed earlier that you said a residual plot was provided.
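
A small simulation can illustrate the point about a nonnormal Y with normal errors. This is a sketch on made-up data, not the poster's data: Y is strongly bimodal because the predictor takes only two values, yet the OLS residuals look normal. Kurtosis is used as a rough yardstick (a normal distribution has kurtosis 3); the helper name and all numbers are assumptions for illustration.

```python
import random
import statistics

def kurtosis(values):
    """Fourth standardized moment; equals 3 for a normal distribution."""
    m = statistics.fmean(values)
    var = statistics.fmean([(v - m) ** 2 for v in values])
    m4 = statistics.fmean([(v - m) ** 4 for v in values])
    return m4 / var ** 2

random.seed(1)
n = 5000

# Predictor takes two values, so Y = 1 + 2x + error is bimodal overall,
# even though the errors themselves are drawn from a normal distribution.
x = [0.0 if i % 2 == 0 else 10.0 for i in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# Simple one-predictor OLS fit.
mx = statistics.fmean(x)
my = statistics.fmean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"kurtosis of Y:         {kurtosis(y):.2f}")      # far from 3
print(f"kurtosis of residuals: {kurtosis(resid):.2f}")  # close to 3
```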

For the second part, I think “it depends” on a few things whether a count is reasonably handled by OLS as opposed to Poisson or negative binomial regression.

#### ondansetron

##### TS Contributor
Actually, are all 3 plots of residuals?

#### kiton

##### Member
> It is not generally true that a nonnormal Y variable necessitates nonnormally distributed errors.

Your argument is surely correct. It's just that a nonnormal outcome going together with nonnormal residuals is a relatively common scenario in my field.

> For the second part, I think “it depends” on a few things whether a count is reasonably handled by OLS as opposed to Poisson or negative binomial regression.

Okay, I understand that if a count is reasonably handled by OLS in comparison to Poisson and NB, then it's plausible to use OLS for a count outcome. Let me double-check, though: does the same hold for a continuous outcome (rate) handled by, say, Poisson?

The first plot is a histogram of the continuous outcome rate, the second one is a quantile plot for the same outcome, and the third one is a plot of quantiles of residuals against quantiles of normal distribution.

#### ondansetron

##### TS Contributor
Can you also post a histogram of the residuals? I am not fantastic at reading normal probability plots yet, so it would help me look at the tails (and learn with real data). But I suspect it wouldn't be too great a concern for the assumption of normally distributed errors.

What is the sample size?

#### ondansetron

##### TS Contributor
As far as the assumption regarding normally distributed errors goes, the provided histogram doesn't worry me at all. The sample size should also mitigate any "concerns" someone may have had.

As for the other assumptions or just the overall appropriateness for your question, that is something you would have to decide, though.