Reverse Engineering a calculator

#1
Hi

A colleague in a different department posed an interesting question that I've been giving more thought to lately. Basically, a scientist here has developed an online calculator for predicting the risk of an event. You can go in and enter attributes (age, sex, etc.) and it will spit out a risk score.

I'm convinced on the backend it's using some sort of a multivariable logistic regression. If I wanted to reverse engineer the regression equation, how might I do that (theoretically)? I have no real interest in determining it, and in fact I can probably locate the published paper the calculator was created with and view the estimates, but conceptually, if I didn't have access to it, how might I do that?

Assume there are 4 inputs: age (categorical), sex (binary), height (continuous), and prior history of some disease (binary).

I imagine using the calculator several times under a variety of situations, possibly varying one particular input at a time while holding the other 3 at zero (or at some fixed reference value). Can anyone help formulate how I might recover the regression parameters? I'm really curious now!
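To make the idea concrete, here is a minimal sketch of the one-input-at-a-time approach. It assumes the calculator really is a plain logistic regression with no interactions or transformations, so that logit(risk) is linear in the inputs. The `calculator` function and all of its coefficients are invented stand-ins for the real web tool, and the reference height of 170 is arbitrary (a real calculator probably won't accept a height of zero).

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

# Hypothetical stand-in for the online calculator. The coefficients are
# invented for illustration; in practice this would be a manual or scripted
# query against the real tool.
def calculator(age_group, male, height_cm, prior_history):
    age_effects = {0: 0.0, 1: 0.4, 2: 0.9}  # age group 0 is the reference
    z = (-3.0 + age_effects[age_group] + 0.5 * male
         + 0.01 * height_cm + 0.8 * prior_history)
    return 1.0 / (1.0 + math.exp(-z))

# Reference profile: every input at its baseline level.
baseline = dict(age_group=0, male=0, height_cm=170.0, prior_history=0)

def query(**changes):
    """Logit of the calculator's risk with some inputs changed from baseline."""
    return logit(calculator(**{**baseline, **changes}))

base = query()

# Change one input at a time; the shift in log-odds is that coefficient.
recovered = {
    "age_group_1": query(age_group=1) - base,
    "age_group_2": query(age_group=2) - base,
    "male": query(male=1) - base,
    "height_per_cm": (query(height_cm=180.0) - base) / 10.0,
    "prior_history": query(prior_history=1) - base,
}
# The intercept is the baseline log-odds with the reference height removed.
recovered["intercept"] = base - recovered["height_per_cm"] * baseline["height_cm"]

# Additivity check: if changing two inputs together does not equal the sum
# of the two single-input shifts, the model has an interaction term.
interaction_gap = (query(male=1, prior_history=1) - base
                   - recovered["male"] - recovered["prior_history"])

for name, value in recovered.items():
    print(f"{name}: {value:.4f}")
print(f"interaction gap (male x prior_history): {interaction_gap:.2e}")
```

With a real calculator the displayed risk is usually rounded, so the recovered values would only be approximate, and a nonzero interaction gap or a nonlinear response to height would mean the simple additive assumption doesn't hold.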

Cheers,
 

hlsmith

Omega Contributor
#2
I did this same thing about two years ago, to apply new risk guidelines to historic patients and forecast how many might need to change their pharmaceutical use. You end up scoring new data: building the model and checking it against different possibilities. My model had a very small error compared to the calculator, and I couldn't determine whether it was a rounding or perhaps a floating point issue (the models had many terms, so it was difficult to figure out). The error only affected a few people, at the hundredths place in risk. So after I built it, I could run data through their calculator and then through mine to compare.
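For the comparison step, a rough sketch of one way to check whether small discrepancies could be explained by rounding alone. The risk values below are made up for illustration, and the assumption that the calculator displays risk to 2 decimal places is mine, not something taken from any particular calculator.

```python
def consistent_with_rounding(calc_risks, model_risks, decimals=2):
    """True if every difference is within half a unit of the last displayed
    digit, i.e. small enough to be explained by rounding alone."""
    tol = 0.5 * 10 ** -decimals + 1e-12  # small slack for float error
    return all(abs(c - m) <= tol for c, m in zip(calc_risks, model_risks))

# Hypothetical values: what the calculator displays vs. what a rebuilt
# model produces for the same three patients.
calc_risks = [0.12, 0.07, 0.31]
model_risks = [0.1234, 0.0651, 0.3149]

max_diff = max(abs(c - m) for c, m in zip(calc_risks, model_risks))
print(f"largest absolute difference: {max_diff:.4f}")
print("explained by rounding:", consistent_with_rounding(calc_risks, model_risks))
```

If any difference exceeds that tolerance, rounding of the displayed value can't be the whole story, and a missing term or transformation is the more likely suspect.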


In particular, I did it for this heart risk calculator: http://www.cvriskcalculator.com/
I used their model coefficients and had to know the model's structure, e.g., what data transformations they used and interaction terms. Their calculator is actually built from 4 logistic regressions if I remember correctly, since they stratified patients by risk due to interactions. Luckily for me they published a committee document on the model that was like 26 pages long.


This was our paper: https://www.ncbi.nlm.nih.gov/pubmed/27026635