
Thread: bell curve question

  1. #1 - cpp_coder

    bell curve question

    I have data distributed unevenly on both sides of the average. How do I go about normalizing the data so that it makes a perfect curve? I don't have a background in statistics, which is why I'm looking for pointers on how to proceed. Ultimately, I need to find out whether a sample (which has on average 15-25 data points) deviates from what is "normal". Any suggestions would be appreciated.

  2. #2 - jamesmartinn
    If you could post an image of what your graph looks like, it would be helpful. No distribution is ever perfectly normal in the field, so if it's a bit off, don't worry too much. But if there are obvious problems, there are likely methods of remedying them (e.g., transformations, handling outliers, correcting skew).
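
    A quick way to eyeball that in Python - a minimal sketch, where the lognormal draw is only stand-in data since we haven't seen yours:

    Code:
    import numpy as np
    from scipy import stats

    # Stand-in data; replace with your own observations.
    rng = np.random.default_rng(0)
    data = rng.lognormal(mean=2.0, sigma=0.7, size=1000)

    print("mean:    ", np.mean(data))
    print("median:  ", np.median(data))
    print("skewness:", stats.skew(data))  # clearly > 0 means a long right tail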

  3. #3 - cpp_coder
    This is what the curve looks like
    Attached Images  

  4. #4 - TheEcologist (R purist)
    Quote Originally Posted by cpp_coder
    This is what the curve looks like
    The data seem bounded at zero, and they look discrete (but this might be caused by the graphing algorithm). So what type of data do you have?

    Knowing what type of data you have will help in choosing a transformation (to make it more "normal"). Right now it looks lognormal (a skewed relative of the normal bell curve) or even Poisson (the Poisson is a distribution for discrete data).
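
    If you want a rough numerical check on which comes closer, here is a minimal Python sketch. The simulated whole-second delays are stand-ins, and comparing a discrete and a continuous model this way is only indicative, not a formal test:

    Code:
    import numpy as np
    from scipy import stats

    # Stand-in data: whole-second delays; replace with the real values.
    rng = np.random.default_rng(1)
    delays = rng.poisson(lam=8, size=500)
    pos = delays[delays > 0]  # use the same strictly positive values for both fits

    # Poisson: the maximum-likelihood estimate of lambda is the sample mean.
    lam = pos.mean()
    ll_pois = stats.poisson.logpmf(pos, lam).sum()

    # Lognormal: fit shape and scale with the location fixed at zero.
    shape, loc, scale = stats.lognorm.fit(pos, floc=0)
    ll_lnorm = stats.lognorm.logpdf(pos, shape, loc, scale).sum()

    print("Poisson   log-likelihood:", round(ll_pois, 1))
    print("lognormal log-likelihood:", round(ll_lnorm, 1))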
    The true ideals of great philosophies always seem to get lost somewhere along the road..

  5. #5 - jamesmartinn
    Try taking the square root of the data, then graphing it.
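
    In Python, that could look like this (a minimal sketch; the simulated skewed data stands in for yours):

    Code:
    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in right-skewed data; substitute your delays.
    rng = np.random.default_rng(2)
    delays = rng.lognormal(mean=2.0, sigma=0.8, size=1000)

    # Plot the raw and square-root-transformed histograms side by side.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(delays, bins=30)
    ax1.set_title("raw")
    ax2.hist(np.sqrt(delays), bins=30)
    ax2.set_title("square root")
    plt.tight_layout()
    plt.show()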

  6. #6 - cpp_coder
    jamesmartinn: I have included the sqrt-transformed data.

    TheEcologist: the data are counts (y) of delays (x). Each delay is the timespan between a question and its answer, expressed in seconds.
    Attached Images  

  7. #7 - jamesmartinn
    From the looks of it, the sqrt transformation didn't do much. You can investigate others, such as:

    reciprocal transformation (1/x)
    natural log (ln x)

    In addition, you can add or subtract constants within transformations, so maybe sqrt(1 + x) or sqrt(1 - x) or something to that extent. But be sure to examine the dataset first, as the sqrt and natural log aren't defined for negative numbers (and ln isn't defined at zero).
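
    Here's a minimal Python sketch to compare candidates side by side (stand-in data; the transform whose skewness lands closest to zero is the most symmetric):

    Code:
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    delays = rng.lognormal(mean=2.0, sigma=0.8, size=1000)  # stand-in data

    candidates = {
        "raw":       delays,
        "sqrt(x)":   np.sqrt(delays),
        "ln(1 + x)": np.log1p(delays),         # defined at x = 0, unlike ln(x)
        "1 / x":     1.0 / delays[delays > 0], # reciprocal needs x != 0
    }
    for name, values in candidates.items():
        print(f"{name:10s} skewness = {stats.skew(values):+.3f}")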

    Play around with those and keep us posted!

  8. #8 - TheEcologist (R purist)
    Quote Originally Posted by cpp_coder
    jamesmartinn: I have included the sqrt-transformed data.

    TheEcologist: the data are counts (y) of delays (x). Each delay is the timespan between a question and its answer, expressed in seconds.
    When you have long tails on the right side of your data, the transformation of choice is the log transformation, as jamesmartinn said: ln(x + 1).

    However, if your data is discrete (whole seconds), this might cause problems. See the Poisson attachment, where I log-transformed a dataset of counts (like whole seconds): the data becomes more "normal", but you are left with big gaps on the left side. This is to be expected with the log transformation on discrete data. If the data were continuous, it would be less of a problem (see the lognormal attachment).

    Now it all depends on what you want to do with your data. If you are only interested in statistics like the mean or standard deviation, use the mean/standard deviation calculations for the lognormal or Poisson (whichever you think comes closest to your data).

    For the lognormal: http://en.wikipedia.org/wiki/Log-normal_distribution
    For the Poisson: http://en.wikipedia.org/wiki/Poisson_distribution
    (and go to the maximum likelihood estimation of parameters)
    For the lognormal it is as simple as log-transforming your data and calculating the mean and variance as you always would, then back-transforming (taking the exponent) when you present your mean or variance.
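
    A minimal Python sketch of that recipe (stand-in data; note that exponentiating the mean of the logs gives the geometric mean, i.e. the lognormal median, while the lognormal mean needs an extra variance term):

    Code:
    import numpy as np

    rng = np.random.default_rng(4)
    delays = rng.lognormal(mean=2.0, sigma=0.6, size=1000)  # stand-in data

    logs = np.log(delays)  # work on the log scale
    mu, sigma = logs.mean(), logs.std(ddof=1)

    print("geometric mean (median):", np.exp(mu))
    print("lognormal mean:         ", np.exp(mu + sigma**2 / 2))

    # Poisson case: the maximum-likelihood estimate of lambda is just the
    # sample mean, so no transformation is needed there.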

    If your only goal is to make it look more normal, well then just transform the data like jamesmartinn said and post back.
    Attached Thumbnails: lognormal.jpeg, poisson.jpeg
    The true ideals of great philosophies always seem to get lost somewhere along the road..

  9. #9 - cpp_coder
    I have included in the attachments the log (113) and sqrt (114) transformed data. That doesn't seem to really help in finding a valid average and standard deviation.

    The goal of this exercise is not to make the data look more normal, but rather to establish what constitutes average/expected results and to compare those against thousands of different samples to determine whether any of them are different. The hypothesis is that through statistical analysis it might be possible to isolate the samples that do not behave like the others, and that this difference might be tied to unusual behavior like cheating.

    If anyone has any ideas, I'm open to suggestions.
    Attached Images

  10. #10 - JohnM (TS Contributor)
    OK, but your original post specifically says that you want to "normalize" the data, produce a "perfect curve", and find out which samples deviate from what is "normal". Now you've changed the goal to an exercise in comparing average/expected results...

    I'm not trying to be picky, but it would help if you were clear on your objectives, so that we can help you efficiently and avoid multi-page rambling threads.

    ...post #8 above from TheEcologist suggests using the mean/std dev calculations for the distribution closest to your actual data distribution - I think that will do it, yes?

    ...another easy way may be to compare the medians and interquartile ranges rather than the means and standard deviations. The median is not as sensitive to skewness as the mean.
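
    A minimal Python sketch of that comparison (everything simulated; the 1.5 x IQR cut-off is only a common rule of thumb, not a significance test):

    Code:
    import numpy as np

    rng = np.random.default_rng(5)
    # Hypothetical: five typical samples of 15-25 delays, plus one shifted one.
    samples = [rng.lognormal(2.0, 0.6, rng.integers(15, 26)) for _ in range(5)]
    samples.append(rng.lognormal(3.0, 0.6, 20))

    # Pool everything to establish the "expected" median and IQR.
    pooled = np.concatenate(samples)
    q1, med, q3 = np.percentile(pooled, [25, 50, 75])
    iqr = q3 - q1

    for i, s in enumerate(samples):
        flag = "  <-- unusual?" if abs(np.median(s) - med) > 1.5 * iqr else ""
        print(f"sample {i}: median = {np.median(s):7.2f}{flag}")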

  11. #11 - cpp_coder

    JohnM: the goal of the exercise has not changed. We have samples consisting of delays (10-30 data points each). The assumption is that we can identify which samples are different through statistics; the problem is finding an approach to do just that.

    Originally, a lot of energy was spent on the bell curve, computing the average and standard deviation. Any data point outside two standard deviations would be flagged, and the samples with the most flags got targeted. Unfortunately, the data doesn't conform to a normal curve. We also tried counting delays on both sides of the median split, quartile comparisons, etc.

    Today I split the data into ranges and compare the average number of data points per range against the number of data points each sample puts in each range.
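
    A minimal Python sketch of that range comparison, formalized as a chi-square goodness-of-fit test. Everything below is stand-in data, and with only ~20 points per sample the expected counts of 5 per bin are at the low end of what the test tolerates:

    Code:
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    pooled = rng.lognormal(2.0, 0.6, 5000)  # stand-in for all delays combined
    sample = rng.lognormal(2.6, 0.6, 20)    # one sample to test

    # Use the pooled quartiles as range boundaries, so a "typical" sample
    # should put about a quarter of its points in each of the four ranges.
    cuts = np.percentile(pooled, [25, 50, 75])
    observed = np.bincount(np.digitize(sample, cuts), minlength=4)
    expected = np.full(4, len(sample) / 4)

    chi2, p = stats.chisquare(observed, expected)
    print("counts per range:", observed, " p-value:", round(p, 4))

    A small p-value would flag the sample as not following the pooled pattern.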
    Last edited by cpp_coder; 05-30-2008 at 12:01 PM.
