Modeling cancer incidence rates

#1
Hello dear forum members,

Using county-level panel data (N = 2500+, T = 3), I aim to examine the associations between multiple biological, socioeconomic, and psychological factors and cancer incidence. Two outcome measures are available to me:

(1) cancer incidence rate, IR = (New cancers / Population) × 100,000
M = 197.19, SD = 51.12, Min = 0, Max = 610.6, Skewness = .1, Kurtosis = 7.25

(2) count of new occurrences
M = 241.4, SD = 674.11, Min = 0, Max = 17,742, Variance = 454,427.7

My initial approach is to model the (continuous) rates using OLS. However, although the rate's distribution is roughly normal, the tails depart from normality:

[Attachments: histogram of the incidence rate; quantile plot of the incidence rate]

As a result, the OLS residuals are far from perfect:

[Attachment: quantiles of the OLS residuals plotted against quantiles of the normal distribution]

Question 1: What modeling approach would you recommend to address such variation? I realize quantile regression is one option (with its ups and downs), but perhaps there are other "standard" ways to model rates?

Question 2: Is there any reason to prefer rates over counts, or counts over rates, for this analysis? Is there a general consensus on this?

Your feedback would be greatly appreciated.
 
#2
There is an interesting article on the topic: Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23(1), 151-169.

But I feel I am too old school and still believe in BLUE for trustworthy inferences :)
 
#3
The normality assumption refers to the distribution of the errors, not of the dependent variable. You should look into Poisson and negative binomial regression for rates.
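
One common way to use a count model for a rate is to model the count and include log(population) as an offset, so that the coefficients describe the log of the rate. The following is only a minimal sketch of that idea in plain Python, on simulated data rather than the data from this thread. The county sizes, the single covariate x, the true coefficients, and the helper names draw_poisson and fit_poisson_offset are all assumptions made for illustration. It fits a plain Poisson model by Newton-Raphson and does not cover the negative binomial case or the panel structure.

```python
import math
import random


def draw_poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method; adequate for means of a few hundred)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def fit_poisson_offset(counts, x, pop, max_iter=50):
    """Newton-Raphson for the model log E[count] = log(pop) + b0 + b1 * x."""
    b0 = math.log(sum(counts) / sum(pop))  # start at the pooled log rate
    b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for y, xi, n in zip(counts, x, pop):
            mu = n * math.exp(b0 + b1 * xi)  # expected count; log(n) is the offset
            g0 += y - mu                     # score for b0
            g1 += (y - mu) * xi              # score for b1
            h00 += mu                        # information matrix entries
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    return b0, b1


# Simulated (not real) counties: baseline rate of 200 per 100,000, log rate ratio 0.3.
rng = random.Random(0)
TRUE_B0 = math.log(200 / 100_000)
TRUE_B1 = 0.3
pop = [rng.randint(10_000, 100_000) for _ in range(500)]
x = [rng.uniform(-1.0, 1.0) for _ in range(500)]
counts = [draw_poisson(n * math.exp(TRUE_B0 + TRUE_B1 * xi), rng)
          for n, xi in zip(pop, x)]

b0_hat, b1_hat = fit_poisson_offset(counts, x, pop)
base_rate_per_100k = math.exp(b0_hat) * 100_000
print(f"baseline rate per 100,000: {base_rate_per_100k:.1f}")
print(f"log rate ratio for x: {b1_hat:.3f}")
```

Because the offset enters with a fixed coefficient of 1, exp(b0) is the rate at x = 0 and exp(b1) is a rate ratio, even though the fitted outcome is the count. In practice one would normally use a statistics package's Poisson or negative binomial routine with an offset or exposure option rather than hand-rolled code.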
 
#4
Thank you for the response. That's absolutely correct, and that's exactly why I included a residual plot to assess the model fit. Relatedly, if the outcome's distribution is abnormal, residuals would not be "normal" either.

Let me ask a clarification question on the second part though. If I were using a count as an outcome, then I'd definitely use either of those non-linear models with no reservations. However, what's the logic behind using those models for a continuous nearly normal outcome? Perhaps I am missing something here.
 
#5
It is not generally true that a nonnormal Y variable necessitates nonnormally distributed errors.
And sorry, I somehow missed earlier that you had provided the residual plot.
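
To illustrate the first point with made-up numbers: if a predictor is itself skewed, the outcome can be clearly nonnormal while the errors are exactly normal. The sketch below simulates that situation in plain Python. The exponential predictor, the coefficients, and the helper skew_and_excess_kurtosis are assumptions chosen for illustration and have nothing to do with the cancer data.

```python
import math
import random


def skew_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis (both near 0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


rng = random.Random(1)
n = 5000

# Skewed predictor, normal errors: y inherits the skew of x.
x = [rng.expovariate(1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Simple OLS fit of y on x.
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar
residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

skew_y, kurt_y = skew_and_excess_kurtosis(y)
skew_res, kurt_res = skew_and_excess_kurtosis(residuals)
print(f"outcome:   skewness {skew_y:.2f}, excess kurtosis {kurt_y:.2f}")
print(f"residuals: skewness {skew_res:.2f}, excess kurtosis {kurt_res:.2f}")
```

The outcome comes out clearly right-skewed, while the residuals look normal, which is why the residuals rather than the raw outcome are the thing to inspect.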

For the second part, I think “it depends” on a few things whether a count is reasonably handled by OLS in comparison to Poisson or Neg Binomial.
 
#7
"It is not generally true that a nonnormal Y variable necessitates nonnormally distributed errors."
Your argument is surely correct. It's just somehow this is a relatively common scenario in my field.

"For the second part, I think “it depends” on a few things whether a count is reasonably handled by OLS in comparison to Poisson or Neg Binomial."
Okay, I understand that if a count is reasonably handled by OLS in comparison to Poisson and NB, then it's plausible to use OLS for a count outcome. Let me double-check, though, whether the same holds for a continuous outcome (rate) handled by, say, Poisson.

The first plot is a histogram of the continuous outcome rate, the second one is a quantile plot for the same outcome, and the third one is a plot of quantiles of residuals against quantiles of normal distribution.

I appreciate your feedback, ondansetron.
 
#8
Can you also post a histogram of the residuals? I am not fantastic at reading normal probability plots yet, so it would help me look at the tails (and learn with real data). But I suspect it won't be much of a concern for the assumption of normally distributed errors.

What is the sample size?
 
#9
I apologize for the late response; I was away for a while. The residual histogram is attached. The sample is 2,500 US counties observed over 3 years.

[Attachment: histogram of the OLS residuals]
 
#10
As far as the assumption of normally distributed errors goes, the provided histogram doesn't worry me at all. The sample size should also mitigate any "concerns" someone may have had.
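
A rough way to see why sample size helps is a small simulation: generate data with clearly skewed errors, fit OLS many times, and count how often the nominal 95% interval for the slope covers the true value. The sketch below does this in plain Python. The sample size of 200, the 1,000 replications, the exponential error distribution, and the helper slope_ci are all assumptions for illustration, and the result says nothing about the actual county data.

```python
import math
import random


def slope_ci(x, y):
    """OLS slope with a normal-approximation 95% confidence interval."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope - 1.96 * se, slope + 1.96 * se


rng = random.Random(2)
TRUE_SLOPE = 0.5
N_OBS = 200
N_REPS = 1000

covered = 0
for _ in range(N_REPS):
    x = [rng.gauss(0.0, 1.0) for _ in range(N_OBS)]
    # Errors are exponential minus 1: mean zero, strongly right-skewed, not normal.
    y = [1.0 + TRUE_SLOPE * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
    lower, upper = slope_ci(x, y)
    covered += lower <= TRUE_SLOPE <= upper

coverage = covered / N_REPS
print(f"share of intervals covering the true slope: {coverage:.3f}")
```

If the share lands near 0.95 despite the skewed errors, that is the central limit theorem at work on the slope estimate. With thousands of observations the approximation is typically better still, although this simulation keeps the errors independent and homoskedastic, which panel data may not satisfy.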

As for the other assumptions or just the overall appropriateness for your question, that is something you would have to decide, though.