Calculating t-test and correlation/importance of feature with aggregated dataset

#1
Hi,

Important note: I will carry out the steps below with ready-made functions/packages in my tools, and I am not a professional statistician, so I would really appreciate a simplified explanation for the question below.

I have a dataset with two features, "Product Risk Level" and "Customer Risk Score", and an outcome, "Customer Default", as shown below.
I want to calculate the significance or correlation of "Product Risk Level" and "Customer Risk Score" with respect to the outcome "Customer Default".

The sample size is 1,300 aggregated rows in total (600 rows for the "Yes" outcome and 700 rows for the "No" outcome). The example below uses made-up numbers.

I am planning first to run a t-test to check the independence between each feature and the outcome. Then I will either calculate the correlation between each feature and the outcome, or calculate the importance of each feature using a classification model.

However, my dataset is unfortunately not "per-event". It contains aggregated values: the number of samples in each feature-outcome pair.

I believe that I need to take the "Number of samples" into account when I do the t-test, the correlation, and/or the feature importance. The question is: how?
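One common way to handle this (a sketch with made-up column names and counts, not your actual data) is to expand the aggregated table back to one row per customer using the "Number of samples" column, and then run an ordinary test on the expanded data. Note the caveat: this assumes every customer in a cell has exactly that cell's score, so all variability comes from the differences between cells.

```python
import numpy as np
from scipy import stats

# Hypothetical aggregated data: one row per (risk score, outcome) cell,
# where n is the number of customers in that cell.
risk_score = np.array([10, 20, 30, 10, 20, 30])
outcome    = np.array([1, 1, 1, 0, 0, 0])   # 1 = default "Yes", 0 = "No"
n          = np.array([50, 120, 430, 300, 250, 150])

# Expand the table back to one row per customer using the counts,
# so every customer is weighted equally.
score_per_customer   = np.repeat(risk_score, n)
outcome_per_customer = np.repeat(outcome, n)

# Ordinary two-sample (Welch) t-test on the expanded per-customer data.
t_stat, p_value = stats.ttest_ind(
    score_per_customer[outcome_per_customer == 1],
    score_per_customer[outcome_per_customer == 0],
    equal_var=False,
)
print(t_stat, p_value)
```

The expansion via `np.repeat` is what brings the "Number of samples" into the test: a cell with 430 customers contributes 430 observations, not one.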


[attached screenshot of the example data table]

Karabiner

TS Contributor
#2
For a t-test, you need the variability of the scores (the standard deviation in each group).
As you do not have this, you'll have to analyse the data at the aggregate level.
But since your sample size for such an analysis is small (n = 7 groups), a t-test for the
risk score, or a U-test for the risk level (which is clearly ordinal-scaled and would not
permit a t-test), does not seem very useful. They would have extremely low statistical power
to detect any effect.

Of course, you can do descriptive statistics, for example calculate the weighted mean
(or median, respectively) of "outcome: yes" versus "outcome: no".
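As a sketch of those weighted descriptives (all numbers made up), the weighted mean is a one-liner with `np.average`, and a weighted median can be read off the cumulative counts:

```python
import numpy as np

# Hypothetical aggregated rows for one outcome group ("Yes"):
scores = np.array([10, 20, 30])   # score of each cell
counts = np.array([50, 120, 430]) # customers in each cell

# Mean weighted by the number of customers per cell.
weighted_mean = np.average(scores, weights=counts)

# Weighted median: sort by score, then find where the cumulative
# count first reaches half of the total.
order = np.argsort(scores)
cum = np.cumsum(counts[order])
weighted_median = scores[order][np.searchsorted(cum, cum[-1] / 2)]
print(weighted_mean, weighted_median)
```

Computing these separately for "outcome: yes" and "outcome: no" gives the descriptive comparison Karabiner describes.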

Just my 2pence

Karabiner
 
#3
Sorry @Karabiner, I forgot to mention that my sample size is actually 1,300 aggregated rows in total (600 with "Yes" and 700 with "No"); I just shared a made-up sample here. In that case, is there a way to calculate the feature importance and to run a test like the t-test?
 

Karabiner

TS Contributor
#4
So, as far as I can see, you can perform a t-test on the group-level data with the risk score, and a U-test or a Chi² test with the risk level.
I do not know whether it would be useful and possible to take group size into account; maybe someone else does.
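For the Chi² test specifically, aggregated counts are actually the natural input: `scipy.stats.chi2_contingency` takes the contingency table of counts directly, so the group sizes are automatically taken into account. A sketch with an invented table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = Product Risk Level (Low/Med/High),
# columns = outcome (Yes/No); entries are the aggregated sample counts.
table = np.array([[ 80, 300],
                  [200, 250],
                  [320, 150]])

# Chi-square test of independence between risk level and outcome.
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)
```

A small p-value here would suggest the default rate differs across risk levels.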

It would be possible to use both measurements to jointly predict the outcome, but that would be a bit more complicated
(binary logistic regression).

With kind regards

Karabiner
 
#5
A correlation can be estimated between two numerical variables (for example, age and salary) or between two categorical variables (for example, type of product and profession). In practice, a firm would often want to calculate correlations between several types of variables.
One way to relate a numerical variable to a categorical variable is to convert the numerical variable into categories. Age, for example, could be divided into ranges (or buckets) such as 18 to 30, 31 to 40, and so on.
The covariance of two variables is frequently calculated in addition to the correlation. Unlike the correlation, which must lie between -1 and +1, the covariance can take any real value. The covariance represents the degree of synchronization between the two variables' variation (or volatility).
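A small sketch of all three ideas with invented numbers: covariance and correlation via numpy, and binning a numerical variable into the age buckets mentioned above:

```python
import numpy as np

# Hypothetical numerical variables.
age    = np.array([23, 35, 41, 29, 52, 47])
salary = np.array([30, 48, 55, 40, 70, 62])  # in thousands

cov  = np.cov(age, salary)[0, 1]        # unbounded, in units of age * salary
corr = np.corrcoef(age, salary)[0, 1]   # always between -1 and +1

# Binning age into buckets: 1 = 18-30, 2 = 31-40, 3 = 41-50, 4 = 51+.
bins   = [18, 30, 40, 50, np.inf]
labels = np.digitize(age, bins)
print(cov, corr, labels)
```

The binned `labels` column could then be crossed with a categorical variable in a contingency table, as in the Chi² approach discussed earlier in the thread.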