# Thread: test to use when comparing two un-even data sets, non-parametric?

1. ## Re: test to use when comparing two un-even data sets, non-parametric?

Originally Posted by incubusfanclm
50 times. which is maybe why the undetermined values are set to 50.
I'm guessing that's probably why.

Any ideas on why sequence x is apparently more likely to not be present than sequence y (or at least present at detectable amounts)?

2. ## Re: test to use when comparing two un-even data sets, non-parametric?

It is more randomly spread through the genome.

So is there any way to replace the undetermined values with some number that won't skew the data? I was thinking some kind of censoring algorithm. Maybe "MLE" or something like that? I'm not experienced enough with stats to do this on my own. Thanks.

3. ## Re: test to use when comparing two un-even data sets, non-parametric?

Originally Posted by incubusfanclm
So is there any way to replace the undetermined values with some number that won't skew the data? I was thinking some kind of censoring algorithm.
Is it guaranteed that if you tried for long enough you would eventually get a hit on sequence X? Say you were allowed to go indefinitely and the machine could just keep duplicating the DNA: are you guaranteeing that sequence X will be recognized eventually? If so, then modeling it as censored would work, but a censored model treats those values as lying somewhere above 50, so it would only make things look worse than just treating them as 50. It might make sense to use some mixture model, but I wouldn't feel comfortable proposing a model without better understanding the underlying mechanics of the situation.

Maybe "MLE" or something like that? I'm not experienced enough with stats to do this on my own. Thanks.
MLE stands for maximum likelihood estimation (or estimate, depending on what exactly we're talking about), but that's an estimation procedure, not a fix in itself. It doesn't magically solve problems unless we can specify a model of interest (and even then it doesn't just magically solve problems).
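
To make the "you need a model first" point concrete, here is a minimal sketch of what MLE with censoring could look like. Everything in it is an assumption for illustration: the readings are hypothetical, the detected values are modeled as normal, and each undetermined run is treated as right-censored at 50 (all we know is "the value is above 50"). The names `censored_loglik` and `fit_censored_normal` are made up, and the crude grid search stands in for a proper optimizer.

```python
import math
from statistics import NormalDist

def censored_loglik(mu, sigma, detected, n_censored, limit):
    """Log-likelihood of a normal model with right-censoring at `limit`."""
    dist = NormalDist(mu, sigma)
    # Detected readings contribute their density (floored to avoid log(0)).
    ll = sum(math.log(max(dist.pdf(x), 1e-300)) for x in detected)
    # Each undetermined run only says "value > limit", so it contributes
    # the upper-tail probability P(X > limit).
    tail = 1.0 - dist.cdf(limit)
    ll += n_censored * math.log(max(tail, 1e-300))
    return ll

def fit_censored_normal(detected, n_censored, limit, step=0.2):
    """Crude grid search for the (mu, sigma) pair maximizing the likelihood."""
    best_ll, best_mu, best_sigma = -math.inf, None, None
    mu_lo = min(detected)
    n_mu = int((limit + 20 - mu_lo) / step) + 1   # search mu up to limit + 20
    n_sigma = int(25 / step)                      # search sigma up to 25
    for i in range(n_mu):
        mu = mu_lo + i * step
        for j in range(1, n_sigma + 1):
            sigma = j * step
            ll = censored_loglik(mu, sigma, detected, n_censored, limit)
            if ll > best_ll:
                best_ll, best_mu, best_sigma = ll, mu, sigma
    return best_mu, best_sigma, best_ll

# Hypothetical readings: six detected values and four undetermined runs.
detected = [31.2, 33.5, 35.1, 36.8, 38.4, 40.0]
n_censored = 4
limit = 50

mu_hat, sigma_hat, _ = fit_censored_normal(detected, n_censored, limit)
naive_mean = (sum(detected) + n_censored * limit) / (len(detected) + n_censored)

print("mean with 50 plugged in:", naive_mean)
print("censored-model mean:    ", round(mu_hat, 1))
```

Under this model the fitted mean comes out above the mean you get by plugging in 50, because the censored runs are treated as lying somewhere beyond 50. It is also only meaningful if the sequence really would be detected eventually, which is exactly the question above.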

4. ## Re: test to use when comparing two un-even data sets, non-parametric?

It is NOT guaranteed that some value would show up for X even if it were allowed to duplicate infinitely, because, as I said, the sequence is randomly spread. You might have picked up a piece of the genome in that particular repeat that does not contain the sequence.

Altering our techniques/parameters for the assay is out of the question. I'd really just like to know if there is something mathematical that you can do with the 50s. What number can I replace them with? Perhaps the average of the Y values (since it's unlikely that any sequence would be detected higher than that)? Maybe two standard deviations above the average of the Y values?
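
One thing that may be relevant to the "what number" question: a rank-based (non-parametric) statistic such as the Mann-Whitney U only uses the ordering of the values, so the exact number substituted for the undetermined runs does not change it, as long as that number is larger than every detected value. Below is a minimal sketch under stated assumptions: the readings are hypothetical, `None` marks an undetermined run, and `mann_whitney_u` and `substitute` are made-up helper names written in plain Python rather than taken from a stats library.

```python
def mann_whitney_u(x, y):
    """U statistic for x: count pairs with x_i > y_j, counting ties as 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

def substitute(values, fill):
    """Replace undetermined readings (None) with a chosen fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical readings; None means "undetermined".
x_raw = [38.1, 40.3, 41.7, None, None, None, None]
y_raw = [29.5, 30.8, 31.2, 32.6, 33.0, 34.4, 35.9, 36.3, None]

y_detected = [v for v in y_raw if v is not None]
mean_y = sum(y_detected) / len(y_detected)

# Compare fills: 50, an arbitrary huge value, and the mean of detected Y.
for fill in (50, 1000, mean_y):
    u = mann_whitney_u(substitute(x_raw, fill), substitute(y_raw, fill))
    print("fill =", fill, "U =", u)
```

With 50 or 1000 the U statistic is identical, because the undetermined runs are ranked last either way. Plugging in the average of the Y values moves the undetermined runs into the middle of the ordering and so changes the statistic. Whether "ranked last" is a fair description of an undetermined run depends on the assay, so treat this as a sketch and not a recommendation.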