Suppose a traditional medical test indicates that a randomly sampled patient tests positive for disease Y with probability 5%, but we know that the test is not accurate at identifying positive cases. My assumption is that the real prevalence in the population might be closer to 30%. We have been collecting data with the traditional test for the last 5 years, and for next year my team has developed a new experimental system to identify the incidence of Y. On a positive note, the new system is great: it identifies 80% of the positive cases (sensitivity), while its false positive rate, i.e. false positives as a share of all truly negative cases, is only 5%. On a negative note, this experimental system is quite expensive, so next year we will not be able to roll it out to the whole patient population, only to a subsample (150 out of 300 patients). Finally, all of my patients next year will also be checked with the traditional test, so I will be able to compare how the traditional and the experimental systems perform against each other.
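
To make these figures concrete, here is a minimal sketch of what they imply, assuming the 30% prevalence guess holds and that the 80% and 5% rates for the experimental system are exact. With those inputs the experimental system would flag about 27.5% of patients, and the standard Rogan-Gladen correction maps that observed rate back to the underlying prevalence. The function and variable names are my own, chosen for illustration only.

```python
def apparent_positive_rate(prevalence, sensitivity, false_positive_rate):
    """Share of patients the test flags as positive (true + false positives)."""
    return sensitivity * prevalence + false_positive_rate * (1.0 - prevalence)


def corrected_prevalence(apparent_rate, sensitivity, false_positive_rate):
    """Rogan-Gladen estimate: invert the relation above to recover prevalence."""
    return (apparent_rate - false_positive_rate) / (sensitivity - false_positive_rate)


# Figures stated in the question for the experimental system.
sensitivity = 0.80
false_positive_rate = 0.05

# Assumption: my guess for the true prevalence, not a measured value.
assumed_prevalence = 0.30

# Size of the subsample that receives the experimental system next year.
subsample_size = 150

apparent = apparent_positive_rate(assumed_prevalence, sensitivity, false_positive_rate)
recovered = corrected_prevalence(apparent, sensitivity, false_positive_rate)
expected_flagged = apparent * subsample_size

print(f"Apparent positive rate: {apparent:.3f}")
print(f"Prevalence recovered by the correction: {recovered:.3f}")
print(f"Expected flagged patients in the subsample: {expected_flagged:.2f}")
```

This only covers the population-level arithmetic; it does not say how to carry the correction into a patient-level model, which is the part I am asking about.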

I would like to build a classification model, likely a logistic regression, that predicts the probability of disease Y for each patient given the patient's characteristics. The idea is to train on the data from the previous 5 years and then calibrate the estimator using the information collected through the new experimental system.

Any suggestions/resources on how to approach this classification task would be great!