In my dataset, I have a set of metrics for each code block. To compare them, I classified the code blocks into two groups: one group uses a specific programming-language feature and the other does not. The null hypothesis in each case is that there is no difference in the metric between the two groups.

Note that the two groups differ considerably in size (~200k vs. ~800k), since most code blocks do not use a given feature. This also means that the larger group tends to settle around the overall average for each metric.

At the moment, I use a Mann-Whitney U test to compare each metric between the two groups for each feature. With SciPy I get p-values that round to zero (around 1e-120, or literally 0.0 in Python) and very large values for the U statistic.
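A minimal sketch of what I am doing, with synthetic data standing in for my metrics (the distributions and group sizes are placeholders mirroring my ~200k vs. ~800k split):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical metric values; only the group sizes reflect my real data.
with_feature = rng.normal(loc=5.2, scale=2.0, size=200_000)
without_feature = rng.normal(loc=5.0, scale=2.0, size=800_000)

u, p = mannwhitneyu(with_feature, without_feature, alternative="two-sided")
# The U statistic is bounded by n1 * n2, so values on the order of 1e11
# are expected at these sample sizes.
print(u, p)
```

At these sample sizes even a tiny distributional shift drives the p-value below floating-point resolution, which matches what I observe.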

What I have already tried is drawing random samples of equal size from the two groups, which produced similar values when using the Wilcoxon signed-rank test.
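The subsampling step looks roughly like this (again with placeholder data; the subsample size is arbitrary). Note that `scipy.stats.wilcoxon` is a paired test, so passing it two independently drawn subsamples treats them as matched pairs, which they are not here:

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon

rng = np.random.default_rng(0)
# Hypothetical metric values standing in for the two groups.
with_feature = rng.normal(loc=5.2, scale=2.0, size=200_000)
without_feature = rng.normal(loc=5.0, scale=2.0, size=800_000)

n = 10_000  # equal subsample size drawn from each group
a = rng.choice(with_feature, size=n, replace=False)
b = rng.choice(without_feature, size=n, replace=False)

u, p_u = mannwhitneyu(a, b, alternative="two-sided")
# wilcoxon() assumes paired observations; used here only to reproduce
# what I tried, not because the pairing is meaningful.
w, p_w = wilcoxon(a, b)
print(p_u, p_w)
```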