R caret package: problems understanding the functions in confusionMatrix

#1
Hey,

I posted earlier about the problematic interpretation of PPV (positive predictive value). As there seems to be no easy solution for that, I tried to understand what the PPV can tell me and, especially, what it cannot.

Now I was calculating confusion matrices for predictions of the model I built. To do that I came across the caret package, which easily gives you, along with the confusion matrix, measures like sensitivity, specificity and PPV, among others.

The problem I have now is with the formulas for calculating PPV, precision, sensitivity and recall. (The actual formulas are quoted at the end of the post.)

The first problem is with PPV and Precision:

I learned the calculation of the PPV as

true positives/ (true positives+false positives)

The package calls this formula precision (which I always thought was similar to PPV) and calculates the PPV from sensitivity, specificity and prevalence.

In my case the prevalence is taken from the model data itself, so PPV and precision seem to be identical.
So it looks to me like precision and PPV only differ if I use the actual prevalence of the population rather than the prevalence of my sample.
Is this always the case when someone refers to these two measures?

The second problem is with sensitivity and recall:

The formulas for sensitivity and recall seem to be the same. Is there a difference between these two measures?

Could someone maybe help me get out of this confusion?
I never really got the hang of statistics, and the longer I try to sort this problem out, the more confusing it gets for me.

Thanks a lot!


From documentation of caret package:

The functions requires that the factors have exactly the same levels.
For two class problems, the sensitivity, specificity, positive predictive value and negative predictive
value is calculated using the positive argument. Also, the prevalence of the "event" is computed
from the data (unless passed in as an argument), the detection rate (the rate of true events also
predicted to be events) and the detection prevalence (the prevalence of predicted events).

Suppose a 2x2 table with notation

                      Reference
Predicted             Event    No Event
Event                 A        B
No Event              C        D

The formulas used here are:

Sensitivity = A/(A + C)
Specificity = D/(B + D)
Prevalence = (A + C)/(A + B + C + D)
PPV = (sensitivity * prevalence)/((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))
NPV = (specificity * (1 - prevalence))/(((1 - sensitivity) * prevalence) + ((specificity) * (1 - prevalence)))
Detection Rate = A/(A + B + C + D)
Detection Prevalence = (A + B)/(A + B + C + D)
Balanced Accuracy = (sensitivity + specificity)/2
Precision = A/(A + B)
Recall = A/(A + C)
F1 = (1 + beta^2) * precision * recall/((beta^2 * precision) + recall)
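
To check how the quoted formulas relate to each other, here is a minimal sketch in plain Python (not caret itself). It assumes a made-up table with A=40, B=10, C=20, D=30 and a hypothetical helper named confusion_measures; the prevalence is taken from the table, as in the default case described in the documentation.

```python
def confusion_measures(a, b, c, d):
    """Compute the quoted measures from a 2x2 table.

    a = predicted Event / reference Event     (true positives)
    b = predicted Event / reference No Event  (false positives)
    c = predicted No Event / reference Event  (false negatives)
    d = predicted No Event / reference No Event (true negatives)
    """
    n = a + b + c + d
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    prevalence = (a + c) / n  # prevalence computed from the sample itself

    # PPV and NPV written in the sensitivity/specificity/prevalence form
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    )
    npv = (specificity * (1 - prevalence)) / (
        (1 - sensitivity) * prevalence + specificity * (1 - prevalence)
    )

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "prevalence": prevalence,
        "ppv": ppv,
        "npv": npv,
        "precision": a / (a + b),
        "recall": a / (a + c),
    }


# Made-up counts, for illustration only
m = confusion_measures(a=40, b=10, c=20, d=30)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With the prevalence taken from the table, ppv and precision agree up to floating-point rounding, and recall is the same expression as sensitivity.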
 

hlsmith

Omega Contributor
#2
Please rephrase your question a little if I don't exactly answer it.


Yeah, I believe you have everything right. The terms recall, precision, etc. come from the classification and machine learning fields. They are just synonyms for the terms you are used to, and I believe the caret package has other machine learning procedures, so it defaults to those terms.


Not sure why they don't just use A / (A + B) in the PPV formula, but just superficially looking at the above formula it seems fine.
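
One possible reason for the longer form is that it accepts a prevalence other than the one in the sample (the quoted documentation says the prevalence is computed from the data "unless passed in as an argument"). A small sketch in plain Python, assuming the same made-up table (A=40, B=10, C=20, D=30), a hypothetical helper ppv_from_rates, and an assumed population prevalence of 5%:

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV in the sensitivity/specificity/prevalence form."""
    true_pos_share = sensitivity * prevalence
    false_pos_share = (1 - specificity) * (1 - prevalence)
    return true_pos_share / (true_pos_share + false_pos_share)


# Rates from the made-up table A=40, B=10, C=20, D=30
sens = 40 / (40 + 20)
spec = 30 / (10 + 30)
sample_prev = (40 + 20) / 100

# Sample prevalence: matches A / (A + B)
ppv_sample = ppv_from_rates(sens, spec, sample_prev)

# Assumed population prevalence of 5% (illustrative value only)
ppv_rare = ppv_from_rates(sens, spec, 0.05)

print(f"PPV with sample prevalence:     {ppv_sample:.4f}")
print(f"PPV with 5% assumed prevalence: {ppv_rare:.4f}")
```

With the same sensitivity and specificity, the PPV drops sharply when the event is rarer in the target population than in the sample, which A / (A + B) alone cannot express.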
 
#3
Hey,

thanks a lot. Sorry for the late reply, I was on holiday ;)

Yes, it is actually a machine learning package, so it seems that I have to get used to the different terms ;)

The only thing I am still wondering about is the formula for the PPV and precision.

I learned the formula (apart from the two being the same) as:

PPV = A/(A + B)

but it seems that the formula they use is also commonly used to calculate the PPV:

PPV = (sensitivity * prevalence)/((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))

Is there any rule for when to use which formula?

Thank you for your reply!
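
The two PPV formulas can be connected by substitution. The following is a short derivation, assuming the prevalence is computed from the same 2x2 table (A, B, C, D as in the quoted notation, with N = A + B + C + D introduced here as shorthand):

```latex
\begin{align*}
\text{sensitivity} \cdot \text{prevalence}
  &= \frac{A}{A + C} \cdot \frac{A + C}{N} = \frac{A}{N} \\[4pt]
(1 - \text{specificity}) \cdot (1 - \text{prevalence})
  &= \frac{B}{B + D} \cdot \frac{B + D}{N} = \frac{B}{N} \\[4pt]
\text{PPV}
  &= \frac{A/N}{A/N + B/N} = \frac{A}{A + B}
\end{align*}
```

So under that assumption the longer formula reduces to A/(A + B). The two only give different numbers when the prevalence plugged in comes from somewhere other than the table, for example a known population prevalence.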