Positive predictive value of biased data


I am currently working on predicting binary outcomes using structural alerts. This means that if a substructure (the alert) is present in the query structure, the structure is labelled as positive.
To evaluate the predictivity of the individual alerts I have datasets of >1000 structures each. The problem is that in most of the sets approx. 60-75% of the structures are classified as positive.

Now of course I am getting a high positive predictive value (PPV), calculated as true positives/(true positives + false positives), because the probability that a structure picked at random is positive is sometimes well above 50%. This makes it hard for me to compare the outcomes for the different datasets.
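
To make the comparison problem concrete, here is a minimal sketch that puts the PPV next to the share of positives in each set. The counts and the names `ppv`, `chance_ppv`, `set_A` and `set_B` are made up for illustration and do not come from real data; the only assumption is that the number of positives in each dataset is known.

```python
def ppv(true_pos: int, false_pos: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    flagged = true_pos + false_pos
    if flagged == 0:
        raise ValueError("the alert matched no structures, so PPV is undefined")
    return true_pos / flagged


def chance_ppv(n_positive: int, n_total: int) -> float:
    """PPV expected when structures are flagged at random, i.e. the prevalence."""
    return n_positive / n_total


# Hypothetical counts, chosen only for illustration (not from a real dataset).
datasets = {
    "set_A": {"tp": 90, "fp": 10, "n_pos": 750, "n_total": 1000},
    "set_B": {"tp": 80, "fp": 20, "n_pos": 600, "n_total": 1000},
}

for name, d in datasets.items():
    observed = ppv(d["tp"], d["fp"])
    baseline = chance_ppv(d["n_pos"], d["n_total"])
    print(f"{name}: PPV={observed:.2f}, chance baseline={baseline:.2f}, "
          f"difference={observed - baseline:+.2f}")
```

With these invented numbers the alert has the higher raw PPV in set_A, but the larger margin over the chance baseline in set_B, which is the kind of distortion described above.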

Is there any method (other than reducing the original datasets) by which I could account for the bias of the dataset and see the actual predictivity?
Or am I missing something completely here?

I would really appreciate your help here.


Omega Contributor
So your sample has a higher prevalence than most samples and you think your PPV estimates may be misleading?

If so, you have selection bias in your sampling, since the sample does not appear to be drawn at random from the population at large. Yes, prevalence can mess up the horizontal calculations such as PPV. Can you just report this, or are you using these results yourself? Depending on how the results are used or reported, I wonder if you could adjust the constant term in a logistic regression model to account for this, though I am not sure you can easily get PPV out of a logistic model.

Thanks for your reply!

Yes, that's basically my problem. I have samples with different prevalences, and the only value I can get is the PPV: because I am matching chemical structures, I only get information about which substructure matches which structure, but no information about the negatives.

Now if I have, for example, 76% prevalence, then simply guessing which structures are positive would give me a PPV of 76% by chance alone, which is very high.
Or is there some big mistake in my thinking?
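
The chance-baseline reasoning above can be checked with a quick simulation. This is only a sketch under stated assumptions: the function name `simulated_chance_ppv`, the 50% flag rate and the sample size are arbitrary choices for illustration, and "guessing" is modelled as flagging structures independently of their true label.

```python
import random


def simulated_chance_ppv(prevalence, flag_rate=0.5, n=100_000, seed=0):
    """PPV of an 'alert' that flags structures independently of their true label."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    true_pos = false_pos = 0
    for _ in range(n):
        is_positive = rng.random() < prevalence
        flagged = rng.random() < flag_rate  # assumption: guess ignores the label
        if flagged:
            if is_positive:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos / (true_pos + false_pos)


# 76% prevalence, as in the example above.
print(f"PPV from random flagging: {simulated_chance_ppv(0.76):.3f}")
```

Under these assumptions the simulated PPV comes out close to the prevalence regardless of how many structures are flagged, so a PPV of about 76% on such a set says little by itself.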

I'm sorry, but I can't see how a regression could help me here.
I should say that I have never used statistics much, so I am pretty lost on how to interpret the results correctly.