Multiple comparison problem (one-to-many): protein sequences

Dear All,
I am new here and this is my first post.
I have a problem where I am comparing a set of 9 protein amino acid (AA) sequences (the proteins I am interested in) against 1000 random sets of 9 protein sequences each. Basically, I am trying to confirm that some AAs (e.g. Alanine) are enriched in my set of interest relative to the other sets (let's say the "normal" proteins). The question I am asking is: is the number of Alanines in my set (9 proteins of interest) indeed higher than would be expected?

Can someone help me with that?
Thanks in advance,
Could you please elaborate on your design? I mean, how many Alanines are there in your protein sequence? How were the random sequences allocated? If they are random, why do you call them "normal"? I think a normal protein sequence cannot be random, right?

Then what is your question? The one stated in your title (a multiple comparison problem), or the one stated in your post (is the number of Alanines in your protein greater than expected)?

And do you have 9 amino acids or 9 proteins? I see you have said both.


As a suggestion, I think you don't have a multiple comparison case if you are comparing a single sequence with 1000 sequences.

If you care only about the number of Alanines in your protein, you are comparing a single ratio with 1000 other ratios. Let's say your sequence has 3 Alanines, so your Alanine ratio is 33.3%. You then check whether this ratio differs from the ratios in those 1000 normal sequences. You can use a chi-square test. For example, you have 3 Alanines in your 9-AA sequence and 1430 Alanines in your 9000-AA collective random AA bank. Your chi-square test would then compare 3 / 6 (Alanines versus non-Alanines in your sequence) with 1430 / (9000 - 1430) (Alanines versus non-Alanines in the bank). You won't need to correct for multiple comparisons.
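A minimal sketch of that 2x2 comparison, using only the example counts above (3 of 9 versus 1430 of 9000). The function name `chi_square_2x2` is made up for illustration, and the statistic is the plain Pearson chi-square without continuity correction. Note that the expected Alanine count for the 9-AA sequence comes out below 5, a range where the chi-square approximation is usually considered unreliable, so an exact test (e.g. Fisher's) may be a safer choice for counts this small.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Example counts from the post: 3 Ala / 6 non-Ala in the 9-AA sequence,
# 1430 Ala / (9000 - 1430) non-Ala in the random bank.
stat, p = chi_square_2x2(3, 6, 1430, 9000 - 1430)

# Expected Alanine count in the 9-AA sequence under the null hypothesis;
# a value below 5 is the usual warning sign for the chi-square approximation.
expected_ala_in_query = 9 * (3 + 1430) / (9 + 9000)

print(f"chi-square = {stat:.3f}, p = {p:.3f}, expected Ala = {expected_ala_in_query:.2f}")
```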

You can also use a chi-square goodness of fit test.

Also, instead of having 1000 random sequences, you can calculate the expected ratio from the probability of Alanine and the other amino acids. You have 9 positions, each of which can take any of 20 amino acids, only one of which is Alanine. This gives a finite number of possible sequences, and a limited number of them contain each possible number of Alanines. You can count those Alanine-containing sequences and take the average Alanine count as the expected value under randomness.
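The counting argument above amounts to a binomial distribution, which can be sketched as follows. This assumes independent positions and an equal probability of 1/20 for each amino acid; the helper name `alanine_count_distribution` is hypothetical. Since real amino acid frequencies are not uniform, `p_ala` could instead be set to the genome-wide Alanine frequency.

```python
from math import comb


def alanine_count_distribution(length=9, p_ala=1 / 20):
    """Return P(k Alanines) for k = 0..length, assuming independent positions."""
    return [
        comb(length, k) * p_ala**k * (1 - p_ala) ** (length - k)
        for k in range(length + 1)
    ]


pmf = alanine_count_distribution()

# Average Alanine count over all possible sequences; equals length * p_ala.
expected_count = sum(k * prob for k, prob in enumerate(pmf))

# Probability of seeing 3 or more Alanines in a random 9-AA sequence.
p_at_least_3 = sum(pmf[3:])

print(f"expected Alanines = {expected_count:.2f}, P(>= 3) = {p_at_least_3:.4f}")
```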


Is the chi-square still the best option? Or do I need something to correct for the 1 vs 1000 comparisons?
As long as you are concerned with the number of Alanines in your 9-AA sequence, you have actually only one single comparison.
Dear Victor,
thanks a lot for your reply. My main question is the one stated in my post: is the number of Alanines in my set (9 proteins of interest = 9 amino acid sequences of interest) indeed higher than would be expected? However, I thought I also had a multiple comparisons problem because I am comparing one set of 9 proteins against 1000 sets of 9 proteins.

I will try to clarify the problem a bit further. I performed an experiment and found 9 proteins. When I checked their AA sequences I noticed that some AAs were apparently more abundant than would be expected from their abundances on a genome scale. The accumulation of certain AAs would then be an indication of those proteins' charge, hydrophilicity and consequently their function. With that in mind, I would like to know whether the AA abundances I see for those 9 proteins are indeed different from what I would find for all other proteins in the genome. To test that, I retrieved 30000 protein sequences (the whole genome) from a databank. After that, because I have 9 proteins of interest, I selected random sets of 9 protein sequences from these 30000. By doing that I expect to have a measure of AA abundance in that genome. For example: we know plants have 20 AAs, but we also know that certain AAs are more or less abundant due to their physicochemical properties. We also know that different plants might even have a preference for certain AAs. Because of this I need a way to calculate whether what I see in my 9 proteins of interest is just the way it should be or is indeed something special. I would like to do it for all 20 AAs. I am treating the 9 proteins as a group. So, for each sequence I calculated the AA ratio for each of the 20 AAs and then I averaged the AA ratios over the 9 sequences.
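The resampling design described above can be sketched as an empirical (Monte Carlo) test. This is only an illustration under assumptions that are not in the post: the function names, the toy "genome" of random strings, and the one-sided "greater than or equal" comparison are all made up here. In this framing the 1000 random sets together form a single null distribution, so each amino acid gives one comparison; repeating the test for all 20 AAs is where a multiple-testing correction (Bonferroni is used here as one simple option) would come in.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mean_fraction(seqs, aa):
    """Average per-sequence fraction of amino acid `aa` across a set of sequences."""
    return sum(s.count(aa) / len(s) for s in seqs) / len(seqs)


def empirical_p_values(interest, genome, n_sets=1000, seed=0):
    """One-sided empirical p-value per amino acid for enrichment in `interest`."""
    rng = random.Random(seed)
    # Draw random sets of the same size as the set of interest.
    random_sets = [rng.sample(genome, len(interest)) for _ in range(n_sets)]
    p_values = {}
    for aa in AMINO_ACIDS:
        observed = mean_fraction(interest, aa)
        hits = sum(mean_fraction(rs, aa) >= observed for rs in random_sets)
        # Adding 1 to both counts keeps the estimate away from exactly zero.
        p_values[aa] = (hits + 1) / (n_sets + 1)
    return p_values


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return {aa: min(1.0, p * m) for aa, p in p_values.items()}


# Toy data only (hypothetical): random 200-residue sequences stand in for the genome,
# and the 9 "proteins of interest" are made artificially Alanine-rich.
toy_rng = random.Random(1)
genome = ["".join(toy_rng.choice(AMINO_ACIDS) for _ in range(200)) for _ in range(300)]
interest = ["A" * 60 + s[60:] for s in genome[:9]]

p_vals = empirical_p_values(interest, genome, n_sets=200)
adjusted = bonferroni(p_vals)
print(f"Ala: raw p = {p_vals['A']:.4f}, Bonferroni-adjusted p = {adjusted['A']:.4f}")
```

With real data, `genome` would hold the 30000 retrieved sequences and `interest` the 9 proteins from the experiment. The smallest p-value this approach can report is 1 / (n_sets + 1), so more random sets are needed if the result has to survive a correction over 20 amino acids.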

Is the chi-square still the best option? Or do I need something to correct for the 1 vs 1000 comparisons?