- Thread starter cpp_coder

This is what the curve looks like

Knowing what type of data you have will help in choosing a transformation to make it more "normal". Right now it looks lognormal (a distribution whose logarithm is normally distributed) or even Poisson (a distribution for discrete count data).

reciprocal transformation (1/x)

natural log (ln x)

In addition, you can add or subtract constants within a transformation, for example sqrt(1 + x) or sqrt(1 - x). But be sure to examine the dataset first: the square root isn't defined for negative numbers, and the natural log isn't defined for zero or negative numbers.
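A quick way to "play around" with these is to compute the skewness before and after each transformation. The sketch below is only an illustration: the `delays` list is made up (it is not the data from this thread), and the `skewness` helper and the choice of adding 1 before transforming are my own assumptions.

```python
import math
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form); 0 means symmetric."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical delays in seconds; NOT data from this thread.
delays = [1, 2, 2, 3, 3, 4, 5, 7, 9, 12, 20, 45, 90]

# Adding 1 first keeps the log and the reciprocal defined if a delay is 0.
transforms = {
    "raw": lambda x: x,
    "sqrt(1 + x)": lambda x: math.sqrt(1 + x),
    "ln(1 + x)": lambda x: math.log(1 + x),
    "1 / (1 + x)": lambda x: 1 / (1 + x),
}

results = {name: skewness([f(x) for x in delays]) for name, f in transforms.items()}
for name, value in results.items():
    print(f"{name:12s} skewness = {value:+.2f}")
```

The transformation whose skewness lands closest to zero is a reasonable first candidate, though a histogram is still worth a look.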

Play around with those and keep us posted!

jamesmartinn: I have included the sqrt-transformed data.

TheEcologist: the data are counts (y) of delays (x). Each delay is the time span between a question and its answer, expressed in seconds.


However, if your data are discrete (whole seconds), this might cause problems. See the Poisson attachment, where I log-transformed a dataset of counts (like whole seconds): the data become more "normal", but you are left with big gaps on the left side. This is to be expected when log-transforming discrete data. If the data were continuous it would be less of a problem (see the lognormal attachment).

Now it all depends on what you want to do with your data. If you are only interested in statistics like the mean or standard deviation, use the formulas for the mean and standard deviation of the log-normal or the Poisson, whichever you think comes closest to your data.

For the log-normal: http://en.wikipedia.org/wiki/Log-normal_distribution

Poisson: http://en.wikipedia.org/wiki/Poisson_distribution

(and go to Maximum likelihood estimation of parameters)

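For the lognormal it is as simple as log-transforming your data and calculating the mean and variance as you always would, then back-transforming (taking the exponent) when you present your results. Keep in mind that the exponent of the mean of the logs is the geometric mean, which for a lognormal is the median rather than the arithmetic mean.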
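As a rough sketch of that recipe, assuming the delays are strictly positive and roughly lognormal: the `delays` values below are invented for illustration, and I use the sample variance (n - 1 divisor) where the maximum likelihood estimate would divide by n.

```python
import math
import statistics

# Hypothetical positive delays in seconds; NOT data from this thread.
delays = [2.0, 3.5, 4.0, 5.5, 7.0, 9.0, 12.0, 18.0, 30.0, 60.0]

logs = [math.log(x) for x in delays]
mu = statistics.fmean(logs)         # mean on the log scale
sigma2 = statistics.variance(logs)  # sample variance on the log scale

# Back-transformed quantities, using the standard lognormal formulas.
geometric_mean = math.exp(mu)               # equals the lognormal median
lognormal_mean = math.exp(mu + sigma2 / 2)  # arithmetic mean implied by the fit
lognormal_var = (math.exp(sigma2) - 1) * math.exp(2 * mu + sigma2)

print(f"geometric mean (median): {geometric_mean:.2f} s")
print(f"implied arithmetic mean: {lognormal_mean:.2f} s")
print(f"implied std deviation:   {math.sqrt(lognormal_var):.2f} s")
```

The gap between the first two numbers grows with the log-scale variance, so it matters which one you report as the "average" delay.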

If your only goal is to make it look more normal, well then just transform the data like jamesmartinn said and post back.

The goal of this exercise is not to make the data look more normal but rather to establish what constitutes average/expected results, and to compare this with the thousands of different samples to determine whether any of these samples are different. The hypothesis is that through statistical analysis it might be possible to isolate those that do not behave like the others, and this difference might be tied to unusual behavior like cheating.

If anyone has any idea, I'm open to suggestion.

I'm not trying to be picky but it would help if you are clear on your objectives so that we can efficiently help you out and avoid multiple-page rambling threads.

...post #8 above from TheEcologist suggests using the mean/std dev calculations for the distribution closest to your actual data distributions - I think that will do it, yes?

...another easy way may be to compare the medians and inter-quartile ranges rather than the means and standard deviations. The median is not as sensitive to skewness as the mean.
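A minimal sketch of the median/IQR comparison, using only the Python standard library. The sample names and values are hypothetical, and the `median_and_iqr` helper is my own; the "inclusive" quantile method is just one of several reasonable conventions.

```python
import statistics

def median_and_iqr(sample):
    """Return (median, inter-quartile range) for one sample of delays."""
    q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    return q2, q3 - q1

# Hypothetical samples of delays in seconds; NOT data from this thread.
samples = {
    "sample_a": [3, 4, 4, 5, 6, 7, 8, 9, 12, 15, 40],
    "sample_b": [2, 3, 5, 5, 6, 8, 8, 10, 11, 14, 35],
    "sample_c": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4],
}

summaries = {name: median_and_iqr(values) for name, values in samples.items()}
for name, (median, iqr) in summaries.items():
    print(f"{name}: median = {median:.1f} s, IQR = {iqr:.1f} s")
```

A sample whose median or IQR sits far from the bulk of the other samples would be a candidate for a closer look, without any normality assumption.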

Johnm: the goal of the exercise has not changed. We have samples consisting of delays (10-30 data points each). The assumption is that we can identify which samples are different through statistics. The problem is finding an approach to do just that.

Originally a lot of energy was spent on the bell curve, computing the std and avg. Any data point outside two std was flagged, and the samples with the most flags got targeted. Unfortunately the data doesn't conform to a normal curve. We also tried counting delays on both sides of the median split, quartile comparisons, etc.
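For reference, the two-std flagging rule could look roughly like the sketch below. This is a guess at the approach, not the code actually used: I am assuming the mean and std are computed over all samples pooled together, and the sample names and values are hypothetical.

```python
import statistics

def count_flags(samples, k=2.0):
    """Count, per sample, the delays lying more than k std from the pooled mean."""
    pooled = [x for values in samples.values() for x in values]
    mean = statistics.fmean(pooled)
    sd = statistics.stdev(pooled)
    low, high = mean - k * sd, mean + k * sd
    return {
        name: sum(1 for x in values if x < low or x > high)
        for name, values in samples.items()
    }

# Hypothetical samples of delays in seconds; NOT data from this thread.
samples = {
    "player_a": [10, 11, 12, 9, 10, 11, 10, 12, 9, 10],
    "player_b": [10, 11, 9, 10, 100],
}

flags = count_flags(samples)
print(flags)
```

With right-skewed delays the lower bound often falls below zero, so only long delays can ever be flagged, which is one way the normal assumption breaks down here.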

Today I split the data into ranges and compare the difference between the avg for each range and the number of data points in each range of each sample.
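One possible reading of the range approach, sketched under my own assumptions: the boundaries in `EDGES` are hypothetical, the baseline is the share of all pooled delays in each range, and each sample is compared with that baseline. Samples are assumed to be non-empty.

```python
from bisect import bisect_right

# Hypothetical range boundaries in seconds: <5, 5 to <15, 15 to <60, >=60.
EDGES = [5, 15, 60]

def range_counts(sample, edges=EDGES):
    """Count how many delays fall in each range defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        counts[bisect_right(edges, x)] += 1
    return counts

def range_shares(sample, edges=EDGES):
    """Fraction of the sample's delays in each range."""
    return [c / len(sample) for c in range_counts(sample, edges)]

def share_differences(samples, edges=EDGES):
    """Per sample, the difference between its range shares and the pooled shares."""
    pooled = [x for values in samples.values() for x in values]
    baseline = range_shares(pooled, edges)
    return {
        name: [s - b for s, b in zip(range_shares(values, edges), baseline)]
        for name, values in samples.items()
    }

# Hypothetical samples of delays in seconds; NOT data from this thread.
samples = {
    "sample_a": [3, 4, 8, 9, 12, 20, 25, 40, 70, 90],
    "sample_b": [1, 2, 2, 3, 3, 4, 4, 4, 6, 7],
}

differences = share_differences(samples)
for name, diffs in differences.items():
    print(name, [f"{d:+.2f}" for d in diffs])
```

Large positive or negative entries mark ranges where a sample has noticeably more or fewer delays than the pooled data would suggest.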

