Which optimal bin width formula to choose for a histogram?

#1
For histogram construction, I learned from [Scott 79] that for Gaussian-distributed samples, a bin width of 3.49*sigma*N^(-1/3) is optimal. However, for more general distributions with less regular shapes, such as a mixture of Gaussians, or a pdf of approximately generalized Gaussian type, is there an optimal bin width formula to use?
 
#2
There's no hard-and-fast answer to this one. The default in R is Sturges' rule: k = 1 + log2(n), where k is the desired number of equally-spaced bins and n is the sample size. It works surprisingly well for a lot of samples but can also break badly in some cases. The Wikipedia page gives a list of alternatives:

http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width

FWIW, I've found Scott's rule (which is also an option in R's hist() function) to be best overall, but YMMV.
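
For readers who want to compare the rules mentioned above by hand, here is a minimal Python sketch (not R's own code). The function names and the mixture sample are made up for illustration. Freedman-Diaconis is not discussed in this thread; it is included as one of the alternatives on the linked Wikipedia page.

```python
import numpy as np

def sturges_bins(n):
    """Number of bins from Sturges' rule, k = 1 + log2(n), rounded up."""
    return int(np.ceil(1 + np.log2(n)))

def scott_width(x):
    """Scott's normal-reference bin width, 3.49 * sigma * n^(-1/3)."""
    x = np.asarray(x, dtype=float)
    return 3.49 * x.std(ddof=1) * x.size ** (-1.0 / 3.0)

def freedman_diaconis_width(x):
    """Freedman-Diaconis bin width, 2 * IQR * n^(-1/3)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)

rng = np.random.default_rng(0)
# Hypothetical test sample: a two-component Gaussian mixture
sample = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.5, 200)])

print("Sturges bins:", sturges_bins(sample.size))
print("Scott width:", scott_width(sample))
print("Freedman-Diaconis width:", freedman_diaconis_width(sample))
```

Sturges' rule depends only on n, while the other two scale with the spread of the data. That is one reason the rules can disagree strongly on skewed or multimodal samples.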
 
#3
Thanks a lot. I don't have access to the original paper on Scott's rule. I wonder if there are any analytical results on the accuracy of this or other methods, such as the MSE of the estimated pdf?
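
On the accuracy question, a sketch of the standard asymptotic argument may help. The notation is assumed here: f is the true pdf, n the sample size, h the bin width, and f must be smooth enough that the integral of f'(x)^2 is finite. The criterion is the asymptotic mean integrated squared error (AMISE), which is the sum of a variance term and a squared-bias term:

```latex
\mathrm{AMISE}(h) = \frac{1}{nh} + \frac{h^{2}}{12}\int f'(x)^{2}\,dx,
\qquad
h^{*} = \left(\frac{6}{n\int f'(x)^{2}\,dx}\right)^{1/3}

% Gaussian case: \int f'(x)^2 dx = 1 / (4 \sqrt{\pi} \sigma^3)
h^{*} = \left(24\sqrt{\pi}\right)^{1/3}\sigma\, n^{-1/3} \approx 3.49\,\sigma\, n^{-1/3}
```

At h* the AMISE decays like n^(-2/3). For a non-Gaussian pdf the same formula applies with its own integral of f'(x)^2, which is unknown in practice. That is why the Gaussian value is used as a reference.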
 

Dason

Ambassador to the humans
#4
Is there a reason you would prefer a histogram over an estimated density using something like a kernel density estimate?
 
#5
I am not familiar with kernel density estimation yet. Is KDE always better than a histogram? If I use KDE without knowing the distribution, how do I know which kernel function to choose and what bandwidth to use?

Thanks
 

Dason

Ambassador to the humans
#6
The Gaussian or Epanechnikov kernels are typically used. The bandwidth question is similar to the optimal bin width question. It depends partially on what you think the underlying density is like but there are some algorithms that give nice properties asymptotically. What software are you using?
 
#7
Then, comparing histogram and KDE, is there any condition that indicates which method is better to use?

I am using MATLAB. Is there any built-in function to determine the bandwidth for KDE?

Does using a Gaussian kernel mean decomposing the pdf as a mixture of Gaussians?
 

Dason

Ambassador to the humans
#8
Then, comparing histogram and KDE, is there any condition that indicates which method is better to use?
I'm partially under the impression that KDE is almost always better. If you're trying to estimate a continuous density then it makes sense to me to use one of the default kernels used in KDE, because they work pretty well.
I am using MATLAB. Is there any built-in function to determine the bandwidth for KDE?
I don't know. Maybe?
Does using a Gaussian kernel mean decomposing the pdf as a mixture of Gaussians?
Well - I wouldn't say you're decomposing the pdf as a mixture of Gaussians, because you don't have a density to decompose in the first place. You're estimating the density as a mixture of Gaussians, yes. But with a histogram you're estimating the density as a mixture of... uniforms. I think in most cases it makes more sense to use something like a mixture of Gaussians (although the Epanechnikov kernel has been shown to have some really nice properties as well).
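
To make the "mixture of Gaussians" reading concrete, here is a minimal Python sketch. The function name, the sample, and the hand-picked bandwidth are all assumptions for illustration. The estimate is an equal-weight average of Gaussian densities, one centred on each observation, all sharing the bandwidth as their standard deviation.

```python
import numpy as np

def gaussian_kde_pdf(grid, data, bandwidth):
    """Evaluate a Gaussian KDE on `grid`: the average of N(x_i, bandwidth^2) densities."""
    grid = np.asarray(grid, dtype=float)[:, None]   # shape (m, 1)
    data = np.asarray(data, dtype=float)[None, :]   # shape (1, n)
    z = (grid - data) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)  # equal weight 1/n per mixture component

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 200)        # hypothetical sample
grid = np.linspace(-6.0, 6.0, 1201)
pdf = gaussian_kde_pdf(grid, data, bandwidth=0.4)  # bandwidth picked by hand here

# Riemann-sum check that the estimate integrates to roughly 1 over the grid
area = np.sum(pdf) * (grid[1] - grid[0])
print("approximate area under the estimate:", area)
```

Swapping the Gaussian kernel for another kernel (for example Epanechnikov) only changes the `kernels` line. The choice of bandwidth usually matters more than the choice of kernel.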
 

noetsi

Fortran must die
#10
The only reason to use a histogram is that it is commonly used in industry and is intuitively obvious, including to many who won't be interested in (or understand) KDE. :p
 

Dason

Ambassador to the humans
#11
Well not really because you aren't centering the uniform around the observed values. But it is attempting to do the same thing.
 
#13
If it matters there are different rules built into software to determine the number of bins and width. I can find those.
I am trying to estimate a pdf from a few hundred data points, or even fewer. For the optimal bandwidth of KDE, is there any simple formula to use, like Scott's rule for histograms, so that no online search is needed? Thanks for any hints.
 
#14
Well not really because you aren't centering the uniform around the observed values. But it is attempting to do the same thing.
Then, for KDE with a uniform kernel function, is there a good bandwidth formula that one could say in most cases performs no worse than a histogram using Scott's rule?
 

noetsi

Fortran must die
#15
I am trying to estimate a pdf from a few hundred data points, or even fewer. For the optimal bandwidth of KDE, is there any simple formula to use, like Scott's rule for histograms, so that no online search is needed? Thanks for any hints.
Sorry, I only know about histograms, not KDE.
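
For the simple-formula question quoted above, one commonly cited candidate is Silverman's rule of thumb for a Gaussian kernel: h = 0.9 * min(sigma, IQR/1.34) * n^(-1/5). It is not mentioned elsewhere in this thread. Like Scott's histogram rule it is a normal-reference rule, so it tends to oversmooth strongly multimodal data. A minimal Python sketch, with a made-up function name and sample:

```python
import numpy as np

def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth for a Gaussian kernel:
    0.9 * min(sigma, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    sigma = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(sigma, (q75 - q25) / 1.34)  # robust against heavy tails
    return 0.9 * spread * x.size ** (-0.2)

rng = np.random.default_rng(3)
sample = rng.normal(0.0, 1.0, 300)  # hypothetical sample of a few hundred points
print("Silverman bandwidth:", silverman_bandwidth(sample))
```

Note the n^(-1/5) scaling, compared with n^(-1/3) for the histogram bin width: the KDE bandwidth shrinks more slowly as the sample grows.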
 

Dason

Ambassador to the humans
#16
Then, for KDE with a uniform kernel function, is there a good bandwidth formula that one could say in most cases performs no worse than a histogram using Scott's rule?
I don't know. I did a research paper on kernel density estimation a while back but I don't remember too much about choosing the bandwidth. I think I explored choosing the bandwidth through cross validation.
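
The post above does not say which form of cross-validation was used. As one possible reading, here is a minimal Python sketch of leave-one-out likelihood cross-validation for a Gaussian kernel. The function names, the candidate grid, and the sample are assumptions for illustration.

```python
import numpy as np

def loo_log_likelihood(data, bandwidth):
    """Leave-one-out log-likelihood of a Gaussian KDE with the given bandwidth."""
    x = np.asarray(data, dtype=float)
    n = x.size
    z = (x[:, None] - x[None, :]) / bandwidth
    k = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(k, 0.0)                      # drop each point's own kernel
    loo_density = k.sum(axis=1) / (n - 1)         # density at x_i from the other points
    return np.sum(np.log(loo_density + 1e-300))   # tiny offset guards against log(0)

def select_bandwidth(data, candidates):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    scores = [loo_log_likelihood(data, h) for h in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 200)        # hypothetical sample
candidates = np.logspace(-2, 1, 40)     # assumed search grid from 0.01 to 10
best_h = select_bandwidth(data, candidates)
print("cross-validated bandwidth:", best_h)
```

This builds an n-by-n matrix for each candidate, which is fine for a few hundred points but scales poorly beyond that.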
 
#17
From the Wikipedia page on KDE, it seems KDE is not as robust as a histogram, although it produces functions that are analytically more convenient. I'm not sure if this is true.