I am a plant molecular biologist doing some bisulfite sequencing. I am currently using a t-test on my data, but I feel like that doesn't take all of the information into account. I have done some googling, read some stats textbooks/lab handbooks, and talked to my friendly neighborhood statistician. However, I am still a little unsure if I am using the right stats method for my data.
For the unfamiliar, bisulfite sequencing is a method of DNA sequencing that assays whether individual DNA bases are modified with a methyl group. Specifically, it detects whether each cytosine in a region of DNA is a 5-methylcytosine or an unmodified cytosine. For our purposes, this is done on a relatively small region (200 to 1000 base pairs). We want to compare a control (wildtype) plant with a mutant plant. However, cells differ from one another, so the cells from a single plant have varying amounts of 5-methylcytosine. We therefore end up with sequences from many cells in each population (wildtype and mutant).
I hope a picture of the data will make things clear. Attached is a 'dot plot,' the visually appealing way that we look at this data. Each green dot represents a cytosine (the other colors are of no interest to me, they represent a different pathway). Filled dots are methylated cytosine and empty dots are unmethylated cytosine. Each column represents a single cytosine by position in the region that we are looking at. Each row is a result from an individual sequencing reaction, that is, each row represents the methylation of the cytosines in a single cell.
For an actual analysis, there would be at least one more 'dot plot' with data from another plant or plants. What I have been doing is taking the number of methylated cytosines in each row and dividing it by the total number of cytosines in that row (the maximum that could be methylated) to get a percentage. I then take all of the percentages from each population and perform a t-test (unpaired, two-tailed) between the two populations.
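In case it helps to see the procedure spelled out, here is a minimal Python sketch of what I am doing. The 0/1 data are made up for illustration, the function names are hypothetical, and I am assuming the equal-variance (Student's) form of the unpaired t-test. It computes only the t statistic and degrees of freedom; a two-tailed p-value would come from the t-distribution (for example via `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance


def methylation_fraction(row):
    """Fraction of methylated cytosines in one row of the dot plot.

    `row` is a list of 0/1 values, one per cytosine (1 = methylated).
    """
    return sum(row) / len(row)


def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom.

    Assumption: equal variances in both groups (Student's t-test).
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Made-up example data: each inner list is one sequenced row.
wildtype = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [1, 1, 0, 0]]
mutant = [[0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]

# One fraction per row, then one t-test between the two sets of fractions.
wt_frac = [methylation_fraction(r) for r in wildtype]
mut_frac = [methylation_fraction(r) for r in mutant]

t_stat, df = student_t(wt_frac, mut_frac)
print(f"t = {t_stat:.3f}, df = {df}")
```

Note that once each row has been reduced to a single fraction, the number of cytosines in the row no longer appears anywhere in the calculation.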
The problem is that this completely ignores how many cytosines are in the sequence that I assay. Each row could have 5 cytosines or 50, and this would not really be factored into the stats (except that it could make the variance smaller?). Is there an easy way to factor this into a t-test or another test?
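To make the concern concrete, here is a small Python sketch with two hypothetical rows that give the same percentage from very different numbers of cytosines. As a rough guide to how precise each percentage is, it uses the binomial standard error, sqrt(p(1-p)/n). That treats each cytosine as an independent site, which is a simplifying assumption on my part and may well not hold for real methylation data.

```python
from math import sqrt


def binomial_se(p, n):
    """Standard error of a proportion p observed over n independent sites."""
    return sqrt(p * (1 - p) / n)


# Two hypothetical rows: (methylated cytosines, total cytosines).
short_row = (4, 5)
long_row = (40, 50)

for methylated, total in (short_row, long_row):
    p = methylated / total
    se = binomial_se(p, total)
    print(f"{methylated}/{total}: fraction = {p:.2f}, SE = {se:.3f}")
```

Both rows enter the t-test as the same value, even though the percentage from the 50-cytosine row is a much more precise estimate under this assumption.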
I should probably mention that I haven't seen anybody in my field use any kind of stats for this type of data, but that obviously doesn't make it right. I'm pretty new to statistics, so any help is greatly appreciated.