bell curve question

#1
I have data distributed unevenly on both sides of the average. How do I go about normalizing the data so that it makes a perfect curve? I don't have a background in statistics, which is why I'm looking for pointers on how to proceed. Ultimately, I need to find out if a sample (which has on average 15-25 data points) deviates from what is "normal". Any suggestions would be appreciated.
 
#2
If you could post an image of what your graph looks like, it would be helpful. No distribution is ever perfectly normal in the field, so if it's a bit off, don't worry too much. But if there are obvious problems, there are likely methods of remedying them (e.g., transformations, handling outliers, skew, etc.).
 
#4
This is what the curve looks like
The data seems bounded at zero and it looks discrete (but this might be caused by the graphing algorithm). So what type of data do you have?

Knowing what type of data you have will help with a possible transformation (to make it more "normal"). Right now it looks lognormal (a skewed relative of the normal bell curve) or even Poisson (a distribution for discrete count data).
 
#6
jamesmartinn: I have attached the sqrt-transformed data.

TheEcologist: the data are counts (y) of delays (x). Each delay is the time span between the question and the answer, expressed in seconds.
 
#7
From the looks of it, it doesn't seem like the sqrt transformation did much. You can investigate others, such as:

reciprocal transformation (1/x)
natural log (ln x)

In addition, you can add or subtract constants within transformations, so maybe sqrt(1 + x) or sqrt(1 - x) or something to that effect. But be sure to examine the dataset, as the sqrt and natural log aren't defined for negative numbers.

Play around with those and keep us posted!
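
If it helps, here is a rough Python sketch of trying those transformations (the delay values and the +1 constants are made up just for illustration; adjust to your own data):

Code:
import numpy as np

delays = np.array([3, 7, 12, 45, 130, 600, 2400])  # made-up delays in seconds

sqrt_t  = np.sqrt(delays)        # square root
recip_t = 1.0 / (delays + 1)     # reciprocal; the +1 avoids dividing by zero
log_t   = np.log(delays + 1)     # natural log; the +1 keeps zero delays defined

# compare mean vs. median to see how much skew is left after each transform
for name, t in [("sqrt", sqrt_t), ("reciprocal", recip_t), ("log", log_t)]:
    print(name, round(float(t.mean()), 3), round(float(np.median(t)), 3))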
 
#8
When you have long tails on the right side of your data, the transformation of choice is the log transformation, as jamesmartinn said: ln(x+1).

However, if your data are discrete (whole seconds), this might give problems. See the Poisson attachment: when I log-transformed a dataset of counts (like whole seconds), the data became more "normal", but you are left with big breaks on the left side. This is to be expected with the log transformation on discrete data. If the data were continuous, it would be less of a problem (see the lognormal attachment).

Now it all depends on what you want to do with your data. If you are only interested in statistics like the mean or standard deviation, use the mean/standard-deviation calculations for the log-normal or the Poisson (whichever you think comes closest to your data).

For the log-normal: http://en.wikipedia.org/wiki/Log-normal_distribution
Poisson: http://en.wikipedia.org/wiki/Poisson_distribution
(and go to Maximum likelihood estimation of parameters)
For the lognormal it is as simple as log-transforming your data, calculating the mean and variance as you always would, and then back-transforming (taking the exponent) when you present your mean or variance.
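
As a rough sketch of that (assuming the delays sit in a NumPy array; the values and the +1 are just for illustration):

Code:
import numpy as np

delays = np.array([3, 7, 12, 45, 130, 600, 2400])  # made-up delays in seconds

# Log-normal route: work on the log scale, then back-transform for reporting.
log_delays = np.log(delays + 1)        # +1 in case of zero-second delays
mu_hat     = log_delays.mean()         # mean on the log scale
sigma_hat  = log_delays.std(ddof=1)    # standard deviation on the log scale

# exp() of the log-scale mean is the geometric mean (the median of the
# fitted log-normal); this is what "back-transforming the mean" gives you.
print("back-transformed mean:", np.exp(mu_hat))

# Poisson route (for discrete counts): the maximum likelihood estimate of
# lambda is simply the sample mean, and the variance of a Poisson equals lambda.
lambda_hat = delays.mean()
print("Poisson lambda:", lambda_hat)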

If your only goal is to make it look more normal, well then just transform the data like jamesmartinn said and post back.
 
#9
I have included in the attachments the log (113) and sqrt (114) transformed data. That doesn't seem to really help in finding a valid avg and std.

The goal of this exercise is not to make the data look more normal, but rather to establish what constitutes average/expected results and to compare this to thousands of different samples to determine if any of those samples are different. The hypothesis is that, through statistical analysis, it might be possible to isolate those that do not behave like the others, and that this difference might be tied to unusual behavior like cheating.

If anyone has any ideas, I'm open to suggestions.
 

JohnM

#10
OK, but your original post specifically says that you want to "normalize" the data, produce a "perfect curve", and find out which samples deviate from what is "normal". Now you've changed the goal to an exercise in comparing average/expected results....

I'm not trying to be picky but it would help if you are clear on your objectives so that we can efficiently help you out and avoid multiple-page rambling threads.

...post #8 above from TheEcologist suggests using the mean/std dev calculations for the distribution closest to your actual data distributions - I think that will do it, yes?

...another easy way may be to compare the medians and inter-quartile ranges rather than the means and standard deviations. The median is not as sensitive to skewness as the mean.
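
For example, a quick sketch of that comparison in Python (the two samples here are invented just to show the calculation):

Code:
import numpy as np

# invented samples of delays in seconds, just to show the calculation
samples = {
    "sample_A": np.array([4, 9, 15, 22, 40, 85, 300]),
    "sample_B": np.array([2, 3, 5, 6, 8, 9, 11, 14]),
}

for name, s in samples.items():
    q1, q3 = np.percentile(s, [25, 75])   # quartiles
    print(name, "median:", np.median(s), "IQR:", q3 - q1)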
 
#11
JohnM: the goal of the exercise has not changed. We have samples consisting of delays (10-30 data points each). The assumption is that we can identify which samples are different through statistics; the problem is finding an approach to do just that.

Originally, a lot of energy was spent on the bell curve, computing the std and avg. Any data point outside two std would be flagged, and the samples with the most flags got targeted. Unfortunately, the data don't conform to a normal curve. We also tried counts of delays on both sides of the median split, quartile comparisons, etc.

Today I split the data into ranges and compare, for each range, the average number of data points expected in that range against the number of data points each sample actually has in that range.
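
That range-based comparison could look something like this sketch (the bin edges and the data are made up; only the idea of comparing per-range counts against the pooled shape comes from the approach described above):

Code:
import numpy as np

# pooled delays from all samples define the "expected" shape per range
pooled = np.array([2, 3, 5, 8, 13, 21, 34, 55, 90, 150, 400, 900])
sample = np.array([3, 4, 4, 5, 6, 200, 950])       # one sample to check

edges = [0, 10, 60, 300, 3600, np.inf]             # ranges in seconds

pooled_counts, _ = np.histogram(pooled, bins=edges)
expected_share   = pooled_counts / pooled_counts.sum()   # proportion per range

observed, _ = np.histogram(sample, bins=edges)
expected    = expected_share * observed.sum()      # scaled to this sample's size

# simple per-range difference score; larger means the sample departs more
# from the pooled shape
print("difference score:", np.abs(observed - expected).sum())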
 