Evaluating a regression model.


No cake for spunky
This is an important issue that I have historically done little work on.

"Variables entered in the development model were selected using stepwise backward-elimination approach, starting with all previously reported significant predictors found theoretically and practically to be associated with 90-day employment outcome. All variables significant at the p < 0.05 level were included in the model."

OK, so stepwise is bad (the people who wrote this are smart, and their article shows it, so their use of stepwise puzzles me). We could use lasso instead. They also drop variables out of the model, as I think most practitioners do. But they split their data into two pieces and used the second piece to test the predictions of the model built on the first piece. Is dropping variables this way valid when you do that?

"To examine the performance and goodness of fit of the model, we evaluated measures of overall performance, calibration and discrimination. Overall performance was evaluated using predictive accuracy, Nagelkerke R2 and Brier score statistics. Predictive accuracy assessed how well the model predicted the likelihood of an outcome for an individual client. The Nagelkerke R2 quantified the percentage of the outcome variable (90-day employment) explained by predictors in the model. The Brier score quantified differences between actual outcomes and their predicted probabilities, that is, the mean square error (Steyerberg et al., 2010). The Brier score ranges from 0 to 0.25, values close to 0 indicate a useful model and values close to 0.25 a non-informative or worthless model (Steyerberg et al., 2010)."

I have heard doubtful things about the use of R squared in logistic regression and am not familiar with the Brier score at all (I never heard about it before last night). What do others think about using this form of R square or Brier score in evaluating a model?
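
For anyone else meeting the Brier score for the first time, here is a minimal Python sketch of what it computes. The outcomes and predictions are made up for illustration and have nothing to do with the article's data. It suggests where the 0.25 in the quote comes from: that is the score of a model that says 0.5 for everyone, while confidently wrong predictions can push the score as high as 1.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# Made-up outcomes (1 = employed at 90 days, 0 = not); purely illustrative
y = [1, 0, 1, 1, 0, 0, 1, 0]

perfect = [float(v) for v in y]        # predicts the truth with certainty
coin_flip = [0.5] * len(y)             # says 0.5 for everyone
always_wrong = [1.0 - v for v in y]    # confidently wrong every time

print(brier_score(y, perfect))       # 0.0
print(brier_score(y, coin_flip))     # 0.25
print(brier_score(y, always_wrong))  # 1.0
```

More generally, a model that predicts the outcome prevalence p for everyone scores p(1 - p), which is at most 0.25, so 0.25 is better read as the "non-informative" benchmark at 50% prevalence than as a hard upper bound.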

I am going to ask about calibration and discrimination next :)


No cake for spunky
Are these valid approaches, and what shortfalls do they have? This is the type of practical issue I don't find in many regression texts: how you know your model actually works.

"Calibration (goodness of fit) refers to the agreement between observed outcomes and prediction (Steyerberg et al., 2010). As recommended by Steyerberg et al., we used the calibration plot (Cox, 1958) to graphically assess model goodness-of-fit. The calibration plot is characterized by an intercept, which indicates the extent that predictions are consistently too low or too high (‘calibration-in-the-large’) and which should be 0, and a calibration slope, which should be 1 (Steyerberg et al., 2014; Cox, 1958), indicating good calibration and thus, model goodness of fit. The commonly used Hosmer–Lemeshow test produced statistically significant lack of fit due to the large sample size in our study. The Hosmer–Lemeshow test tends to fail even for good models when sample size is greater than 25,000 (Yu et al., 2017). For these reasons, we did not rely on the Hosmer–Lemeshow test of goodness of fit.

Discrimination refers to the ability of the model to discriminate between employed and not employed clients at closure and was determined from the area under the curve (AUC) of the Receiver Operator Characteristic (Royston et al., 2009). The ROC curve is a plot of the true positive rate (sensitivity) versus the false positive rate (1-specificity) evaluated at an optimal cutoff point for the predicted probability. A useless predictive model, such as a coin flip, would generate an AUC of 0.5. When the AUC is 1.0, the model discriminates outcomes perfectly. Therefore, a good AUC statistic is closer to 1.0"
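
As a rough illustration of the AUC described in the quote, here is a minimal Python sketch (made-up data, hypothetical function name). It uses the pairwise interpretation: the AUC equals the probability that a randomly chosen employed client gets a higher predicted probability than a randomly chosen non-employed client. Note that this is computed over all pairs, so no single cutoff is involved; the ROC curve is traced across every possible cutoff.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs in which the positive case scores higher.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up outcomes and predicted probabilities, purely illustrative
y = [1, 1, 1, 1, 0, 0, 0, 0]
model_p = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

print(auc(y, model_p))                  # 0.875 (14 of 16 pairs ordered correctly)
print(auc(y, [0.5] * len(y)))           # 0.5, no discrimination at all
print(auc(y, [float(v) for v in y]))    # 1.0, perfect separation
```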


Less is more. Stay pure. Stay poor.
I don't think I knew the Brier score went from 0 to 0.25; I thought it was bounded by 0 and 1, I guess. It looks at pretty much the same thing as calibration, but turns it into a score. The calibration plot with confidence intervals is important for visualizing the fit. Discrimination is also important, but not just the crude AUC: you also need to think about the outcome prevalence and the components of discrimination (sensitivity and specificity), so you can optimize for minimizing either false positives or false negatives.
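
A calibration plot boils down to comparing mean predicted probability with the observed event rate within groups of cases. Below is a minimal Python sketch under stated assumptions: the data are simulated so that the predictions are well calibrated by construction, the equal-size binning is just one common choice, and the function name is hypothetical. The difference between the overall observed rate and the mean prediction is used as a crude stand-in for calibration-in-the-large, not as the fitted calibration intercept.

```python
import random

def calibration_table(y_true, p_pred, n_bins=5):
    """Sort cases by predicted probability, cut into equal-size bins, and
    return (mean predicted, observed rate, bin size) for each bin."""
    pairs = sorted(zip(p_pred, y_true))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        start = b * size
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate, len(chunk)))
    return rows

# Simulated data (an assumption for illustration): each outcome is drawn with
# exactly its predicted probability, so the points should sit near the diagonal.
random.seed(1)
p = [random.random() for _ in range(5000)]
y = [1 if random.random() < pi else 0 for pi in p]

table = calibration_table(y, p)
for mean_pred, obs_rate, n in table:
    print(f"predicted {mean_pred:.2f}   observed {obs_rate:.2f}   n={n}")

# Crude calibration-in-the-large: overall observed rate minus mean prediction
citl = sum(y) / len(y) - sum(p) / len(p)
print(f"observed minus predicted overall: {citl:.3f}")
```

Plotting observed against predicted for each bin, with a 45-degree reference line, gives the usual picture; bins that fall below the line are over-predicted and bins above it are under-predicted.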

Yeah, R^2 is likely not bringing anything to the table.

Data splits are great when you have the data. LASSO isn't perfect, but it addresses collinearity. A bigger question, though, is why there is collinearity in the first place. Just dropping variables is troublesome, since you don't know whether you are dropping backdoor paths (confounders) or variables that simply don't contribute. I hate to tell you this, but the model has to be built based on theory. Theory coupled with data splits would be the best approach.

Mediators, colliders, confounders, instruments, interactions, and moderators all need to be understood, because stepwise selection and regularizers (LASSO, etc.) don't know these relationships; they just optimize some criterion. And once you learn about the Table 2 fallacy, model building gets even more perplexing without content knowledge.

You just need to use your knowledge and judgement - the above presented approach isn't terrible and is likely similar to what many people do and are taught.


No cake for spunky
hlsmith, any suggestions on where I can read about calibration? It is new to me. I know of AUC, although I have not used it. Strange that the literature I have read over the years rarely mentions this at all. Must be reading the wrong books. :p That is what I meant by "practical", and it is what I did not see in the regression books we discussed (maybe I just missed it).

They assume false positives are equally as important as false negatives, which probably explains their AUC views.

I was not suggesting LASSO to deal with collinearity. I was suggesting it as an alternative to their use of stepwise, which I don't see as valid. That is, I saw lasso as useful for selecting variables only. I think if collinearity were an issue (which they don't address), ridge regression would be better. In your world, hlsmith, there is theory to build on. In my world, the world of these authors, there is no theory at all. You either use empirical data or you guess (in practice I think most base their decisions on their personal observations of clients, but this is not written down anywhere I have found).

This is a field where statistics are rare, there is little theory, and as far as I can tell few pay attention to the (weak in my opinion) academic literature. With the exception of me no one in my fairly large agency tests anything (not through statistics anyhow). It is not a data based world in the sense you would mean it (data is used a lot, it is just that it is a descriptive rather than statistical approach).

They had 60,000 data points.


No cake for spunky
"A simple calculation for evaluating the usefulness of prognostic tools in clinical practice is the Net Benefit analysis. We used net benefit analysis to evaluate the potential clinical consequences of using our model in counseling decision making. Net benefit attempts to quantify potential harms and benefits of classification error – false positive and false negative classifications.

In calculating error rates, we classified clients as positive when their predicted probability of the employment exceeds 0.515 (the optimal cut-off – the value that maximized sensitivity and specificity on the ROC curve) (Lalkhen & McCluskey, 2008) and as negative otherwise. This implies approximately an equal weighting of false positive and false-negative classifications. The basic interpretation of a decision curve is that the model with the highest net benefit at a particular threshold probability has the highest clinical value."
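
A minimal Python sketch of the net benefit calculation as it is usually defined in decision curve analysis: true positives per case, minus false positives per case weighted by the odds of the threshold probability. The data and function name are made up for illustration, and this is not the authors' code. With a threshold near 0.5 the odds are near 1, which is the "approximately equal weighting" of false positives and false negatives that the quote refers to.

```python
def net_benefit(y_true, p_pred, threshold):
    """Net benefit at a given threshold probability.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    A case is classified positive when its predicted probability exceeds
    the threshold.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_pred) if p > threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p > threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # odds of the threshold
    return tp / n - (fp / n) * weight

# Made-up outcomes and predicted probabilities, purely illustrative
y = [1, 1, 1, 1, 0, 0, 0, 0]
model_p = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
treat_all = [1.0] * len(y)   # strategy: classify everyone as positive

nb_model = net_benefit(y, model_p, 0.5)   # TP=3, FP=1, weight=1
nb_all = net_benefit(y, treat_all, 0.5)   # TP=4, FP=4, weight=1
print(nb_model, nb_all)  # 0.25 0.0

# The cutoff reported in the quoted passage
print(net_benefit(y, model_p, 0.515))
```

A decision curve repeats this over a range of thresholds and compares the model with the "classify everyone as positive" and "classify no one as positive" strategies; the latter has a net benefit of 0 by definition.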

I have never heard of net benefit analysis. Nor seen it applied to a ROC curve.

"The Nagelkerke R2 indicated that predictors in the development model explained 29.3% of the variance in 90-day employment outcome, which indicates strong effect (Cohen, 1988). The discriminative ability of the model was evaluated by the ROC curve, showing an AUC of 0.78 (SE = 0.003), indicating a strong effect size (Rice & Harris, 2005). The model demonstrated good calibration (calibration slope = 1; intercept = 0.00)."

"The development model was validated in the randomized validation dataset (n = 26,815) using the bootstrap technique."

I have no idea what the bootstrap technique is in this context. Do you, hlsmith?

"The predictive accuracy, discriminative ability, and calibration quality of the development model was assessed prior to testing in the validation sample. The overall predictive accuracy of the model was good. The development model correctly classified employment outcome for 72% of clients compared to 54% in the null model."

What the heck is the null model? One that assumes half the values are zero? :p I don't think they are talking about a chi-square test here.
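
One common reading, offered here as an assumption since the quoted passage does not define the term: the null model is the intercept-only logistic model. It gives every client the same predicted probability (the overall employment rate), so its best classification is "everyone is in the larger group" and its accuracy is just the share of that group. Under that reading, 54% would be the proportion of clients in the more common outcome category. A small Python sketch with made-up data:

```python
def null_model_accuracy(y_true):
    """Classification accuracy of an intercept-only model.

    With no predictors, everyone gets the same predicted probability, so
    every case is assigned to the majority class.
    """
    rate = sum(y_true) / len(y_true)
    return max(rate, 1.0 - rate)

# Made-up data: 54 of 100 clients employed (illustrative, not the article's data)
y = [1] * 54 + [0] * 46
print(null_model_accuracy(y))  # 0.54
```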


Less is more. Stay pure. Stay poor.
Net benefit analysis type stats are interesting and can be used to address the tradeoff between FP and FN. But it is an all-or-nothing rule. I have always wondered whether, once you select the best cutoff, there is a way to rerun the analysis (e.g., logistic regression) to get estimates for all of the covariates, or confidence intervals from the regression; that is, to tell the model to use 0.64 as the cutoff instead of the default 0.5. Nomograms can kind of be used in this setting, but I would like to use the actual original model with a set cutoff value. The process usually incorporates a cost matrix as well.
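
To make the cost matrix idea concrete, here is a minimal Python sketch of choosing a cutoff by minimizing average misclassification cost. It is not the lecture example mentioned in this thread; the data, costs, cutoff grid, and function names are all made up for illustration.

```python
def average_cost(y_true, p_pred, cutoff, cost_fp, cost_fn):
    """Average misclassification cost when cases with p >= cutoff are called positive."""
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= cutoff and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_pred) if p < cutoff and y == 1)
    return (cost_fp * fp + cost_fn * fn) / len(y_true)

def best_cutoff(y_true, p_pred, cost_fp, cost_fn):
    """Search a grid of cutoffs (0.01 to 0.99) and return the cheapest one."""
    candidates = [i / 100 for i in range(1, 100)]
    return min(candidates,
               key=lambda c: average_cost(y_true, p_pred, c, cost_fp, cost_fn))

# Made-up outcomes and predicted probabilities, purely illustrative
y = [1, 1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

cut_fn_heavy = best_cutoff(y, p, cost_fp=1, cost_fn=5)  # false negatives hurt more
cut_fp_heavy = best_cutoff(y, p, cost_fp=5, cost_fn=1)  # false positives hurt more
print(cut_fn_heavy, cut_fp_heavy)
```

Penalizing false negatives pulls the chosen cutoff down (more clients flagged positive), and penalizing false positives pushes it up. The fitted model itself does not change; only the rule applied to its predicted probabilities does.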

Below is an example I use in a lecture where I over penalize either FN or FP and score the data based on the respective matrix.



No cake for spunky
One thing I understand in theory, but don't know how to do in practice, is using the development data set to build a model and then testing it with the holdout data set. Do you manually, say in Excel, use the slopes estimated from the development data to predict points and then compare them with the actual (observed) values? Or will SAS do this for you? Is there a statistic to test how good the predictions are relative to the observed values?
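
A minimal Python sketch of the workflow asked about here, under loud assumptions: the data are simulated, there is a single predictor, and the logistic model is fitted with a plain gradient-ascent loop written out by hand so the example has no dependencies. The point is only the sequence: fit on the development half, apply those coefficients unchanged to the holdout half, then compare predictions with observed outcomes. In SAS, PROC LOGISTIC has a SCORE statement intended for applying a fitted model to a new data set, so the scoring step does not have to be done by hand.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Estimate intercept and slope by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Simulated data (assumption for illustration): true intercept -0.5, slope 1.5
random.seed(0)
data = []
for _ in range(2000):
    x = random.gauss(0, 1)
    y = 1 if random.random() < sigmoid(-0.5 + 1.5 * x) else 0
    data.append((x, y))

random.shuffle(data)
dev, holdout = data[:1000], data[1000:]

# 1. Build the model on the development half only
b0, b1 = fit_logistic([x for x, _ in dev], [y for _, y in dev])

# 2. Apply the development coefficients, unchanged, to the holdout half
hold_p = [sigmoid(b0 + b1 * x) for x, _ in holdout]
hold_y = [y for _, y in holdout]

# 3. Compare predictions with what was actually observed
brier = sum((p - y) ** 2 for p, y in zip(hold_p, hold_y)) / len(hold_y)
accuracy = sum((p >= 0.5) == (y == 1) for p, y in zip(hold_p, hold_y)) / len(hold_y)
mean_pred = sum(hold_p) / len(hold_p)
obs_rate = sum(hold_y) / len(hold_y)

print(f"slope fitted on development data: {b1:.2f}")
print(f"holdout Brier score: {brier:.3f}")
print(f"holdout accuracy at 0.5 cutoff: {accuracy:.3f}")
print(f"holdout mean predicted {mean_pred:.3f} vs observed rate {obs_rate:.3f}")
```

The same holdout predictions can be fed into whichever performance measures are of interest (Brier score, AUC, calibration by bin), so the "statistic" is really a choice among the measures the quoted article lists.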


No cake for spunky
Ok. I have to go back and look again at that code (which I got to run, Wicklin uses PROC IML so much I figured I would not be able to run it),