Dear community members,

I have a statistics problem involving protein data. I am working with the complete set of ~30K proteins for a species (its proteome).

1. Each protein is a text string over a 20-letter alphabet, and the proteins differ in length across this ~30K dataset.

2. EACH protein has >= 1 window(s) of type 'X' (data already computed)

3. EACH protein ALSO has >= 0 window(s) of type 'Y' (data already computed)

Window = a region of a protein more than 20 letters long (the point being that a window is NOT a single letter but a longer stretch of letters, and it is always shorter than the protein in which it is found).

Null hypothesis: When 'Y' type windows are observed, they are statistically under-represented in protein regions containing 'X' type windows (at the whole-proteome level).
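
To make the setup concrete, here is a minimal Python sketch of how the data could be laid out and how the overlap between 'X' and 'Y' windows could be summarised. Everything in it is an assumption for illustration only: the toy proteins, the window coordinates, the half-open [start, end) convention, and the names `proteome`, `covered` and `overlap_summary` are all made up and are not my real data. It only computes descriptive fractions and is not a statistical test.

```python
from typing import Dict, List, Set, Tuple

# A window is a half-open interval [start, end) in residue coordinates (assumed convention).
Window = Tuple[int, int]

# Hypothetical toy proteome: protein id -> (length, X windows, Y windows).
# Every protein has >= 1 X window and >= 0 Y windows; every window is > 20 letters long.
proteome: Dict[str, Tuple[int, List[Window], List[Window]]] = {
    "protA": (200, [(10, 60)], [(40, 70)]),
    "protB": (150, [(0, 30), (100, 130)], []),
    "protC": (300, [(50, 100)], [(200, 230), (90, 120)]),
}


def covered(windows: List[Window]) -> Set[int]:
    """Return the set of residue positions covered by at least one window."""
    return {pos for start, end in windows for pos in range(start, end)}


def overlap_summary(data: Dict[str, Tuple[int, List[Window], List[Window]]]) -> Tuple[float, float]:
    """Pool all proteins and return (fraction of residues in X, fraction of Y residues in X)."""
    total = x_total = y_total = xy_total = 0
    for length, x_windows, y_windows in data.values():
        x_pos = covered(x_windows)
        y_pos = covered(y_windows)
        total += length
        x_total += len(x_pos)
        y_total += len(y_pos)
        xy_total += len(x_pos & y_pos)  # residues that are in both an X and a Y window
    frac_x = x_total / total
    frac_y_in_x = xy_total / y_total if y_total else float("nan")
    return frac_x, frac_y_in_x


frac_x, frac_y_in_x = overlap_summary(proteome)
print(f"fraction of all residues inside X windows: {frac_x:.3f}")
print(f"fraction of Y residues inside X windows:   {frac_y_in_x:.3f}")
```

If 'Y' windows were under-represented in 'X' regions, I would expect the second fraction to be clearly smaller than the first. What I do not know is how to decide whether such a difference is statistically significant.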

What sort of statistical approach should I use to accept or reject my null hypothesis?