Best Method to Calculate P-Values for These Data?

Hello. I am a first-year chemistry Ph.D. student with very little statistics experience. I am working with biophysical data and performing statistical tests to explore a hypothesis about the relationship between codon translation rates and the domain boundaries of multidomain proteins.

I have two populations of codon translation rate data. They are all in terms of translation time. The first population is the codon translation time for each codon of the transcriptome. The second population is the codon translation time for each codon at a given position from the domain boundary. All translation times are divided by the average synthesis time of the transcript. More or less, population two is a smaller subset of population one, relative to a position on the transcript.

My advisor and I were discussing the best way to calculate p-values for each of the codons in the second population. Our goal is to see if there is a significant difference between population two and population one. Some methods that we considered were two-sample (unpaired) t-tests, paired t-tests, the Wilcoxon signed-rank test, and the Mann-Whitney U test. The paired t-test and the Wilcoxon signed-rank test will not work because population two is smaller than population one, so the observations cannot be matched into pairs. The two-sample t-test probably would not be ideal either, because we have a relatively small sample size for the second population, so it's unclear to me whether we could assume a normal distribution. The Mann-Whitney U test seems like the best test to apply to these data, but given my limited statistics background, I do not feel qualified to make this judgment. Any input would be greatly appreciated! I can provide more information as it becomes necessary.
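For concreteness, here is a minimal sketch of the Mann-Whitney U test mentioned above, written with only the Python standard library and the normal approximation (which is usually considered reasonable once both groups have a few dozen values). The function name, variable names, and the simulated numbers are all made up for illustration; they are not our real data.

```python
import random
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, p-value). Tied values get average ranks and
    the variance is tie-corrected. No continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        tie_size = j - i
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += tie_size ** 3 - tie_size
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # every value identical: no evidence of a difference
    z = (u_x - mean_u) / var_u ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, p_value


# Simulated, made-up data standing in for normalized translation times.
rng = random.Random(0)
background = [rng.lognormvariate(0.0, 0.5) for _ in range(500)]
at_position = [rng.lognormvariate(0.2, 0.5) for _ in range(40)]
u, p = mann_whitney_u(at_position, background)
print(f"U = {u:.1f}, p = {p:.4f}")
```

If SciPy is available, scipy.stats.mannwhitneyu does the same job and also offers an exact method for small samples, so this hand-rolled version is only meant to show what the test computes.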

Thank you!


"I have two populations of codon translation rate data."
I think it's important to note that they are both samples, not populations. Tests of statistical significance are only needed if you're dealing with at least one sample. To compare two populations, you would just directly compare their means - no fancy tests needed. (I only make this point because, as you are googling and reading about different tests, having the terminology straight will make things easier).

"More or less, population two is a smaller subset of population one"
Assuming I understand your experiment, an independent-samples t-test will work, but in order to use it, you would need to structure your data like this:

Sample 1 - All codons of a particular distance from the domain boundary
Sample 2 - All OTHER codons measured

In other words, although you may think of the sample 1 codons as a subset of sample 2, you could NOT include the measurements of these codons in both sample 1 and sample 2 if you go with independent-samples t-test.
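To make that split concrete, here is a rough sketch in Python. I'm assuming (hypothetically) that each measurement is stored as a (relative_position, time) pair, and I'm using Welch's version of the independent-samples t-test, which does not assume equal variances; that choice is mine, not something your data dictate. All names here are made up for illustration.

```python
from statistics import mean, variance


def split_disjoint(records, position):
    """Split (relative_position, time) records into two non-overlapping samples.

    Sample 1: times of codons at the given position.
    Sample 2: times of all OTHER codons (nothing is counted twice).
    """
    sample_1 = [t for pos, t in records if pos == position]
    sample_2 = [t for pos, t in records if pos != position]
    return sample_1, sample_2


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Tiny made-up example: position None marks codons far from any boundary.
records = [(0, 1.4), (0, 1.1), (0, 1.6), (1, 0.9), (-1, 1.0), (None, 0.8), (None, 1.2)]
sample_1, sample_2 = split_disjoint(records, position=0)
t, df = welch_t(sample_1, sample_2)
print(f"t = {t:.3f}, df = {df:.1f}")
```

The p-value then comes from the t distribution with df degrees of freedom; with SciPy, scipy.stats.ttest_ind(sample_1, sample_2, equal_var=False) returns it directly.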

There may be some type of test out there that treats one sample as a subset of the other sample, but if so I'm not aware of it.

Exactly how small is the group that I'm calling sample 1? Are there ~25 codons or more in that group?

Thank you so much for your prompt reply! I apologize for my delayed response. With orientation beginning this week, I haven't had much time to spare for my research.

I apologize, because I realize that my description of this research project was not very detailed. We have a series of mRNA transcripts. For each transcript, we have information regarding the placement of the domain(s), as well as codon translation times for each codon along the transcript.

The first data set I was describing is the set of all codon translation times. There are 16,210 values in this list, with a minimum of 0.0020185, a maximum of 30.0112087, and an average of 1.00.

The second data set depends on the position relative to the domain boundary. Essentially, the domain boundary is defined to be i=0, and we then iterate i from -200 to +200. For every transcript that has a codon at the given position, that codon's translation time is added to the data set for that i. For this reason, any given i-value may have anywhere from ~30 to ~130 values. For each i-value, we want to determine if this data set differs significantly from the data set of all codon translation times.
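In case it helps to see the bookkeeping, here is a small Python sketch of how the per-position data sets get built. The layout is an assumption for illustration only: each entry is a (times, boundary_index) pair, where times is the list of normalized translation times along one transcript and boundary_index is the 0-based codon index of one domain boundary. A transcript with several boundaries would simply appear once per boundary.

```python
from collections import defaultdict


def times_by_offset(transcripts, max_offset=200):
    """Group translation times by offset i from the domain boundary.

    transcripts: iterable of (times, boundary_index) pairs (hypothetical layout).
    Returns a dict mapping i in [-max_offset, max_offset] to the list of times
    found at that offset. Transcripts too short to reach an offset are skipped
    for that offset, which is why group sizes differ from one i to the next.
    """
    by_offset = defaultdict(list)
    for times, boundary in transcripts:
        for i in range(-max_offset, max_offset + 1):
            idx = boundary + i
            if 0 <= idx < len(times):
                by_offset[i].append(times[idx])
    return dict(by_offset)


# Two tiny made-up transcripts.
transcripts = [
    ([0.5, 1.0, 1.5, 2.0], 1),
    ([0.9, 1.1], 0),
]
groups = times_by_offset(transcripts, max_offset=2)
for i in sorted(groups):
    print(i, groups[i])
```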

We initially did this without p-values. We calculated the average for each i-value data set, plotted these averages, and computed the 95% confidence interval with the BCa bootstrapping method. If these error bars did not cross the value of 1.00 (the average of the first data set), we labeled them as "statistically significant."
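Roughly, the interval calculation looks like the sketch below. This is a simplified standard-library version of a BCa bootstrap for the mean, not our actual analysis script; the function name, the resample count, the seed, and the simulated data are all assumptions for illustration.

```python
import random
from statistics import NormalDist


def bca_interval(data, n_resamples=2000, alpha=0.05, seed=0):
    """BCa (bias-corrected and accelerated) bootstrap interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    norm = NormalDist()
    theta_hat = sum(data) / n

    # Bootstrap distribution of the mean (resampling with replacement).
    boot = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_resamples))

    # Bias correction: share of bootstrap means below the observed mean,
    # clamped away from 0 and 1 so the inverse normal CDF is defined.
    share = sum(b < theta_hat for b in boot) / n_resamples
    share = min(max(share, 1 / (n_resamples + 1)), n_resamples / (n_resamples + 1))
    z0 = norm.inv_cdf(share)

    # Acceleration from jackknife (leave-one-out) means.
    total = sum(data)
    jack = [(total - x) / (n - 1) for x in data]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den else 0.0

    def adjusted_quantile(q):
        z = norm.inv_cdf(q)
        adj = norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        idx = min(max(int(adj * n_resamples), 0), n_resamples - 1)
        return boot[idx]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)


# Simulated, made-up data for one i-value (about 60 transcripts).
rng = random.Random(1)
times_at_i = [rng.lognormvariate(0.1, 0.6) for _ in range(60)]
low, high = bca_interval(times_at_i)
excludes_one = not (low <= 1.0 <= high)  # our "significant" label
print(f"95% CI: ({low:.3f}, {high:.3f}); excludes 1.00: {excludes_one}")
```

SciPy's scipy.stats.bootstrap supports method='BCa' as well, which is probably the safer choice over a hand-written version like this one.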

Now, we would like to perform multiple-hypothesis correction, which requires a p-value for each i-value so that the p-values can be corrected. I hope that clarifies the question a bit! Thanks again for your reply; it has been very helpful. I do apologize for my delayed response. Please let me know if you need any more information!
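To show what I mean by correcting the p-values, here is a small sketch of one common procedure, Benjamini-Hochberg, which controls the false discovery rate. We have not settled on a method, so picking Benjamini-Hochberg here is only an assumption for illustration (Bonferroni would be a stricter alternative), and the p-values in the example are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Each raw p-value is scaled by m / rank (rank 1 = smallest), then a running
    minimum from the largest rank downward keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


# Made-up p-values for a handful of positions (the real analysis would have
# one p-value per i from -200 to +200, i.e. 401 of them).
raw = [0.001, 0.20, 0.03, 0.04, 0.60]
adj = benjamini_hochberg(raw)
for p_raw, p_adj in zip(raw, adj):
    print(f"raw = {p_raw:.3f}  adjusted = {p_adj:.3f}  significant at 0.05: {p_adj < 0.05}")
```

If statsmodels is installed, statsmodels.stats.multitest.multipletests with method='fdr_bh' gives the same kind of adjustment.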