Comparing data to a theoretical discrete distribution

noetsi

Fortran must die
#1
With continuous data you can test whether your data fit a theoretical distribution with a Q-Q plot. However, SAS won't generate these for discrete distributions such as the Poisson (with good reason, as the author of the link below notes: the results are dubious for such distributions). The author presents an alternative in the link. He does not say how close the theoretical PMF needs to be to the actual distribution; as with Q-Q plots, there is probably no definite answer to that.

Note I do not know if the author's interpretations are correct. But he is a statistician who wrote a book for SAS, so I am guessing this is reasonable. :p

http://blogs.sas.com/content/iml/2012/04/04/fitting-a-poisson-distribution-to-data-in-sas/
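The blog post does this in SAS with PROC GENMOD and a histogram overlay. For readers outside SAS, here is a minimal Python sketch of the same idea; the rate (4) and sample size (500) are made-up values for illustration, not from the post:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the data: 500 draws from a Poisson(4) distribution
rng = np.random.default_rng(42)
sample = rng.poisson(lam=4.0, size=500)

lam_hat = sample.mean()   # the MLE of the Poisson rate is the sample mean
values, counts = np.unique(sample, return_counts=True)

# Compare observed relative frequencies with the fitted PMF, as the blog
# post does graphically with a histogram overlay
for k, c in zip(values, counts):
    print(f"k={k:2d}  observed={c / sample.size:.3f}  "
          f"fitted={stats.poisson.pmf(k, lam_hat):.3f}")
```

If the data really are Poisson, the observed and fitted columns should track each other closely.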
 

noetsi

Fortran must die
#3
As is usually the case I am simply learning new methods rather than running a specific exercise. I am trying to come up with a method (which this apparently provides) of testing whether a random sample I created from a Poisson distribution actually fits a theoretical Poisson distribution with the same parameters as the resulting sample. This, of course, is what a Q-Q plot does, but one can't be used for the Poisson.

Put another way, I want to know whether a data set [in this case randomly created] actually looks the way you would expect a sample from a Poisson distribution to look. This is what the link shows you how to do [assuming it is correct, of course].

I can't send you the data set, of course; I generated a random sample from a Poisson distribution with SAS code [having 500 cases]. If you want the SAS code to generate that data set I could send it, but I assume you would run it in R instead.
 

hlsmith

Not a robit
#4
I would just run PROC GENMOD like he did; I'm not too sure how sensitive it may be in the presence of large samples. Then also look at the histogram plot like the one he provided. I might also run the following:

Code:
proc means data=mydata mean var n;
var n;
run;
to see if the mean and variance are roughly equal
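For anyone following along in another language, a rough Python analogue of that check; the data and rate are hypothetical stand-ins, since the thread's actual data set isn't available:

```python
import numpy as np

# Hypothetical stand-in for the variable "n" in mydata: 500 Poisson(4) draws
rng = np.random.default_rng(2024)
n = rng.poisson(lam=4.0, size=500)

# For Poisson data the mean and variance should be roughly equal
print(f"mean = {n.mean():.3f}  variance = {n.var(ddof=1):.3f}  N = {n.size}")
```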
 

noetsi

Fortran must die
#6
Have you reviewed this article at http://www.cs.ucr.edu/~mart/177/QQ/QQ_Plots.html? It covers Q-Q plots for discrete distributions.
Yes, although I understood essentially nothing they said (my math is not good enough).

This comment is one that always bemuses me. It reflects the practical reality that all, or nearly all, actual data sets are discrete, even though we commonly treat them as interval if they meet certain criteria. Only in theoretical distributions will the data be continuous. That suggests it is impossible to match real-world data sets to theoretical distributions exactly, as is done in Q-Q plots. The most you can expect is that the actual data will come close to the theoretical distribution.

Note, in particular, that the PDF for X is always discrete, because it is based on a finite number of measurements
I was under the impression, possibly wrong, that you use a PMF rather than a PDF for discrete data.

A Poissonness plot is apparently another good alternative for Poisson data (an alternative to the Q-Q plot). I have not found a link to it; I am waiting for a book by Michael Friendly that covers it.
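For what it's worth, the Poissonness plot (due to Hoaglin, and covered in Friendly's work) is straightforward to sketch: plot the "count metameter" ln(k!·n_k/N) against k, and Poisson data should fall close to a line with slope ln(λ) and intercept -λ. A Python sketch with simulated data (λ = 4 and n = 500 are illustrative choices, not from the thread):

```python
import numpy as np
from math import lgamma, log

rng = np.random.default_rng(1)
sample = rng.poisson(lam=4.0, size=500)
N = sample.size
ks, cnts = np.unique(sample, return_counts=True)

# Count metameter: phi(k) = ln(k! * n_k / N). Under Poisson(lam) this is
# approximately k*ln(lam) - lam, i.e. a straight line in k.
phi = np.array([lgamma(k + 1) + log(c / N) for k, c in zip(ks, cnts)])

slope, intercept = np.polyfit(ks, phi, 1)
print(f"slope = {slope:.2f} vs ln(lambda_hat) = {np.log(sample.mean()):.2f}")
print(f"intercept = {intercept:.2f} vs -lambda_hat = {-sample.mean():.2f}")
```

In a real Poissonness plot you would draw phi against k and judge linearity by eye; the fitted slope and intercept here are just a quick numerical summary of that line.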
 

hlsmith

Not a robit
#7
Yeah, I usually ascribe PDF to continuous distributions and PMF to discrete ones; however, we don't immediately know the context for the quote without going to the link.
 

Miner

TS Contributor
#8
Note, in particular, that the PDF for X is always discrete, because it is based on a finite number of measurements
While that may be true from a theoretical perspective, from an applied perspective it only seems to matter when the measurement system has such poor resolution that you get chunky data.


http://www.qualitydigest.com/inside/quality-insider-article/what-chunky-data.html has a good explanation of what chunky data are and how they impact one statistical tool used in industrial statistics.
 

noetsi

Fortran must die
#9
A similar discussion involves interval data. As I understand it, interval data should formally be continuous, although, in the sense that continuous means infinitely many possible values, no real-world data set will ever be that. The discussion then turns to how many unique values a variable must have to be "interval-like." This comes up a lot in the context of linear versus other forms of regression.

You deal with some interesting stuff miner. Makes me realize how limited my Six Sigma training really was :p
 

Miner

TS Contributor
#10
In many ways industrial statistics can be easier because you can immediately run verification trials. You can also do some things that might not be generally accepted elsewhere. For example, if I have Poisson data with a mean greater than 10, I have no qualms about treating it as pseudo-continuous and analyzing it using continuous methods. I will use the Freeman-Tukey transform to stabilize the variances. Again verification trials provide quick feedback and we rarely publish in journals.
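The Freeman-Tukey transform for counts is just y = √x + √(x + 1). A quick simulation check (a sketch, not Miner's actual code) shows that it stabilizes the variance of Poisson data to roughly 1, whatever the mean:

```python
import numpy as np

rng = np.random.default_rng(0)

for lam in (4.0, 9.0, 16.0):
    x = rng.poisson(lam=lam, size=10_000)
    y = np.sqrt(x) + np.sqrt(x + 1)   # Freeman-Tukey transform
    # var(x) tracks lam, while var(y) stays near 1 for every rate
    print(f"lam = {lam:4.1f}  var(x) = {x.var():6.2f}  var(y) = {y.var():.2f}")
```

With the variance roughly constant across groups, standard continuous methods (ANOVA, regression) become more defensible on the transformed scale.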
 

noetsi

Fortran must die
#11
If you do anything practical a journal would not be interested (giving academics a hard time again). :p

In practice many academics, if not mathematicians, bend the rules as well. They generate means for ordinal data, they use linear regression when the data have at least 12 distinct levels [which is called "interval-like"], and so on. And they disagree with each other massively, including in journals that publish fundamentally contrary positions on these types of issues. This is part of the reason I am learning simulation methods in the first place.

As you say it helps having a closed system in which you can test your methods.

The author of the link above, who is a statistician working at SAS, says that Poisson data with a mean of 7 is approximately normally distributed. With enough levels [12 maybe?] I would think it should be effectively possible to analyze it with linear regression or ANOVA, say. I am not sure how this works out with the variance; a Poisson distribution has a specific variance, equal to its mean.
 

Miner

TS Contributor
#12
The author of the link above, who is a statistician working at SAS, says that Poisson data with a mean of 7 is approximately normally distributed. With enough levels [12 maybe?] I would think it should be effectively possible to analyze it with linear regression or ANOVA, say. I am not sure how this works out with the variance; a Poisson distribution has a specific variance, equal to its mean.
In practice, it does work well for the applications in which I have used it. I do use the Freeman-Tukey transformation for count data to stabilize the variances. See http://support.minitab.com/en-us/mi...alculator-functions/transform-count-function/
 

noetsi

Fortran must die
#13
That is interesting; I would never have guessed that. I don't know the Freeman-Tukey transformation; I will have to read up on it.
 

BGM

TS Contributor
#14
If you want to test whether certain data fit a theoretical discrete distribution, why not try the chi-square goodness-of-fit test?
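For what it's worth, a sketch of that test in Python (the data, rate, and sample size are simulated and illustrative; one degree of freedom is subtracted because λ is estimated from the sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.poisson(lam=4.0, size=500)   # hypothetical data

lam_hat = sample.mean()
kmax = sample.max()
observed = np.bincount(sample).astype(float)
expected = stats.poisson.pmf(np.arange(kmax + 1), lam_hat) * sample.size

# Pool the sparse right tail so every expected count is at least ~5
while expected[-1] < 5:
    expected[-2] += expected[-1]
    observed[-2] += observed[-1]
    expected, observed = expected[:-1], observed[:-1]
# Fold the remaining upper-tail probability into the last cell
expected[-1] += sample.size - expected.sum()

# ddof=1 because one parameter (lambda) was estimated from the data
chi2, p = stats.chisquare(observed, expected, ddof=1)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

The tail pooling matters: the chi-square approximation is unreliable when expected cell counts are small, which is exactly the situation in the right tail of a Poisson fit.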
 

noetsi

Fortran must die
#15
If you want to test whether certain data fit a theoretical discrete distribution, why not try the chi-square goodness-of-fit test?
Because the discussions I have seen of tests of whether a distribution fits the data emphasize that they have significant issues (aka problems) tied to power, and that graphical methods such as Q-Q plots are generally preferred. That is, these tests tend to reject the null far more than they should when you have large samples [I know power is an issue generally, but it is raised more often in the context of such tests].

The commentary is so negative, in my experience, that I generally avoid such tests. Note that I don't know whether the tests I have seen criticized are the chi-square test you mention, but chi square has well-known problems with power, based on my SEM classes :p
 

Dason

Ambassador to the humans
#16
You really need to watch your language. The tests don't reject more often than they should with large samples. Some people get pissy when they have large samples because the test has high power, but they are running a test that they hope won't reject the null. This isn't a fault of the test. It is a fault of the person running the test and their misunderstanding of the results.
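That point is easy to demonstrate by simulation (a sketch, not from the thread): when the null is exactly true, the chi-square goodness-of-fit test rejects at about its nominal rate no matter how large the sample. Here λ is treated as known, so no parameters are estimated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n, lam, alpha, n_sims = 5000, 4.0, 0.05, 200
rejections = 0

for _ in range(n_sims):
    x = rng.poisson(lam=lam, size=n)   # the null is exactly true
    observed = np.bincount(x).astype(float)
    expected = stats.poisson.pmf(np.arange(x.max() + 1), lam) * n
    # Pool the sparse right tail, then absorb the leftover tail probability
    while expected[-1] < 5:
        expected[-2] += expected[-1]
        observed[-2] += observed[-1]
        expected, observed = expected[:-1], observed[:-1]
    expected[-1] += n - expected.sum()
    _, p = stats.chisquare(observed, expected)   # lambda known: no ddof
    rejections += p < alpha

rate = rejections / n_sims
print(f"rejection rate at n = {n}: {rate:.3f} (nominal {alpha})")
```

The rejection rate hovers near alpha even at n = 5000; large samples only make the test quick to flag data that genuinely depart from the null, which is the high power Dason describes.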
 

noetsi

Fortran must die
#17
All I can say, Dason, is that these tests are regularly panned by commentators as being deceptive, with graphical alternatives stressed instead because of this issue. If you are saying this criticism is invalid, that is interesting to know.

To me it is somewhat concerning that with the exact same sample distribution I will sometimes reject and sometimes not, exclusively because of sample size. And yes, I know this is an issue with other methods, but from what I have read it is more of an issue with these tests than with others. :p

An interesting take on this subject.

http://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless
 

Miner

TS Contributor
#18
This brings up an issue that appears to be rampant in a lot of fields. I teach my Six Sigma students to determine a practically significant effect size in advance, then use that to calculate the appropriate sample size for a given power/alpha risk. I define practically significant as the effect size that would prompt their boss to authorize the funds necessary to make a permanent change or to fund additional experiments. Otherwise, any effect, even if statistically significant, is of no practical benefit. In industry, it's mostly about the money, not about publications in journals.
 

noetsi

Fortran must die
#19
A related concept, which I run into at work all the time, is that much of the focus in the literature [at least in social sciences] is on test of statistical signficance. Not on whether the effect size is meaningful - an answer that really can't be determined by statistics. I get asked all the time - is this signficant [when they mean does it really matter not is it statistically signficant which is a totally different issue where sample size etc -aka power- can be far more important than effect size]. I think test of the null end up being a cul de sac generated for the best of reasons, but which create serious problems in dealing with an audience that 1) does not know stats at all and 2) wants you to decide if the effect is important rather than they [with the magic of stats].