A question from a very amateur epidemiologist, I'm afraid, and a little long-winded.

I am wondering how to produce the most valid confidence intervals when designing and applying logistic regression models to calculate the Standardized Mortality Ratio (sum of observed deaths / sum of predicted deaths).

One way to calculate the standard error is to use the Bernoulli variance p*(1 - p), where p is the predicted probability (p-hat) for each subject; summing these across subjects gives the variance of the expected deaths, and an interval around the predicted deaths can then be constructed with the formula

sum(p) ± z * sqrt(sum(p * (1 - p)))
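To make sure I have the calculation right, here is a minimal sketch of that formula in Python; the probabilities are made-up illustrative values, not real data:

```python
import math

# Hypothetical predicted probabilities (p-hat) for five subjects --
# illustrative numbers only.
p = [0.10, 0.05, 0.20, 0.15, 0.08]

expected = sum(p)                              # sum of predicted deaths
variance = sum(pi * (1 - pi) for pi in p)      # sum of Bernoulli variances p*(1-p)
z = 1.96                                       # 95% normal quantile

lower = expected - z * math.sqrt(variance)
upper = expected + z * math.sqrt(variance)
```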

I have seen a program that calculates the confidence intervals in this manner but applies the standard error to the numerator, rather than the denominator, of the SMR equation - does this make sense, or is it an error? Since this standard error is derived from the predicted probabilities, I would have assumed it should be applied to the expected deaths, sum(p), rather than the observed deaths, sum(y) - that is, to the denominator. Would that indeed be more valid?
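To make the two placements concrete, here is a sketch contrasting them; O, E, and the standard error are invented aggregate numbers for illustration only, and I am not claiming either variant is correct:

```python
import math

# Hypothetical aggregates: observed deaths O, expected deaths E,
# and the Bernoulli-based standard error of E -- all made up.
O, E, se = 60, 48.2, 6.1
z = 1.96

smr = O / E

# Variant seen in the program: interval applied to the numerator
num_lo, num_hi = (O - z * se) / E, (O + z * se) / E

# Variant I would have assumed: interval applied to the denominator
den_lo, den_hi = O / (E + z * se), O / (E - z * se)
```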

The other method to calculate the confidence intervals is of course to use the variance-covariance matrix to obtain the standard error of the linear predictor for each subject (via the design matrix X), giving upper and lower confidence bounds on each subject's logit; the inverse logit is then applied, and these individual upper and lower bounds are summed.
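As I understand it, that method would look something like the following sketch; the coefficients, variance-covariance matrix, and design matrix are all hypothetical stand-ins for a fitted model:

```python
import numpy as np

# Hypothetical fitted coefficients (intercept + one covariate) and their
# variance-covariance matrix -- illustrative values, not from a real model.
beta = np.array([-2.0, 0.8])
vcov = np.array([[ 0.04, -0.01],
                 [-0.01,  0.02]])

# Design matrix X for three subjects (intercept column + covariate)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.5]])

eta = X @ beta                                          # linear predictor (logit)
se_eta = np.sqrt(np.einsum('ij,jk,ik->i', X, vcov, X))  # SE of each subject's logit
z = 1.96

# Bounds on the logit scale, then inverse logit back to probabilities
lo = 1 / (1 + np.exp(-(eta - z * se_eta)))
hi = 1 / (1 + np.exp(-(eta + z * se_eta)))

# Summing the per-subject bounds, as described above
lower_total, upper_total = lo.sum(), hi.sum()
```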

In particular, does the above method have any validity when forecasting with the model, i.e. applying it to a data set different from the one on which the model was built? If so, I would assume the variance of each coefficient should be taken from the original model (which would usually involve back-calculating from the published confidence intervals on the odds ratios for each covariate, although confidence intervals for the intercept are rarely published), rather than recomputing the matrix - i.e. inverse(X'VX) - on the new data set?
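The back-calculation I have in mind is the usual one: since a published 95% CI for an odds ratio is symmetric on the log scale as beta ± 1.96 * SE, the coefficient SE can be recovered from the interval width. A sketch, with a made-up odds ratio and CI:

```python
import math

# Hypothetical published result: OR 1.50 (95% CI 1.20 - 1.88)
or_hat, ci_lo, ci_hi = 1.50, 1.20, 1.88

beta = math.log(or_hat)                            # coefficient on log-odds scale
se = (math.log(ci_hi) - math.log(ci_lo)) / (2 * 1.96)  # SE from CI width
```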

If this is not valid, should we instead use the standard error of the forecast (rather than the standard error of the predictor) on the linear predictor to obtain the upper and lower confidence bounds, before applying the inverse logit?

Thank you for reading and any replies.
