Confidence interval from st. error when the mean must be positive

#1
Hello all,

(Statistical newbie here - learning some basic concepts)

Some background (you can probably skip to the 'important bit', but just in case...): I am calculating the mean number of stratospheric sudden warmings (SSWs) per year. An SSW is a winter atmospheric event which happens roughly 6 times a decade. We don't have many of them on record. The dataset I have has 29 SSWs over 45 years. This gives a mean of 0.64 per year. For a given winter, there are zero events about half the time, one event about 40% of the time and two events about 10% of the time. But there is a lot of variability in this. For example, there were no SSWs at all between 1990 and 1998!

This means the sampling error on the estimate of the mean is quite large. I can calculate the standard error simply enough. I have a table going from the year 1958 to 2002, with the number of events in each year. A typical string of event counts per winter is: 2 1 0 2 1 1 2 0 1 0 0 0 0 0 1. I calculate the standard error of the mean by calculating the standard deviation and dividing it by the square root of the number of independent observations (the number of years).
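
A minimal sketch of that calculation in Python, using the 15-year string above purely as example data:

```python
import numpy as np

# The 15-year string quoted above, used purely as example data.
events = np.array([2, 1, 0, 2, 1, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1])

n = len(events)                # number of independent winters
mean = events.mean()           # mean events per year
sd = events.std(ddof=1)        # sample standard deviation
se = sd / np.sqrt(n)           # standard error of the mean
print(f"mean = {mean:.2f}, SE = {se:.2f}")
```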

So far so good.

* Important bit *: I understand how the standard error (SE) of the mean relates to the confidence interval, so that the 95% interval is approx. +/- 2*SE. All well and good. But depending on the dataset I use, I can obtain a standard error where this confidence interval crosses zero. I interpret this as an indication that the sampling error is so large we can't even be sure whether the true mean is positive!
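
Continuing the sketch above, the approximate interval is just:

```python
# Naive 95% interval: mean +/- 2*SE (continuing the sketch above).
lo, hi = mean - 2 * se, mean + 2 * se
print(f"naive 95% CI: ({lo:.2f}, {hi:.2f})")
# For a sparse subset of the data (e.g. counts for a single month),
# lo can easily come out negative.
```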

The problem is, this is physical nonsense. We can't have a negative number of events per year. Realistically (for physical reasons that aren't too important) we will probably only ever see 0, 1, 2 or 3 events in each year.

Even though the estimate of the mean can be assumed to be normally distributed, we have an additional piece of information: that it must be positive. How do I work this into my confidence interval? Otherwise I have error bars crossing zero, which makes no physical sense. Any advice would be much appreciated.
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
These seem like count data; have you actually tested for normality? If it is not feasible for the CI to be negative, some people will just put zero, in which case the interpretation would be a positively skewed dispersion, perhaps with a heavy zero count. Hopefully others chime in; otherwise, verify this in the literature or with how others in your field present their data.
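
E.g. the crude version is just clipping at zero (a sketch with made-up numbers):

```python
# Crude fix some people use: clip the naive lower bound at zero.
mean, se = 0.64, 0.35                  # illustrative values only
lo, hi = max(0.0, mean - 2 * se), mean + 2 * se
print(f"clipped 95% CI: ({lo:.2f}, {hi:.2f})")
```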
 
#3
The data look somewhat like a positive-only normal distribution. I've attached some plots to help:

ssws - this just shows the raw data, i.e. the number of SSW events in each year.
ssws2 - this shows the number of SSWs occurring in the period 1958-2002 split by month. I basically want to put a confidence interval on this bar chart.
ssws3 - this shows the frequency with which we observe 0, 1 and 2 SSW events in a given year. I know it's only 3 data points, but it could be said to resemble a positive-only normal distribution.

Unfortunately, the literature isn't a great help here. Usually the standard error and confidence interval are calculated using the method above. Authors shy away from putting confidence interval bars on a plot like 'ssws2' because, for example, the sample size for November events is so small that the interval would cross zero. I'm trying to work out whether there is a more appropriate way of obtaining a confidence interval.
 
#4
I think this is an interesting thread.

Suppose the data are Poisson distributed.

Does anybody have a suggestion for the confidence interval?
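
For concreteness, one textbook option under that assumption is the exact (chi-square-based) interval for a Poisson total; a sketch using the 29 events over 45 years from post #1:

```python
from scipy.stats import chi2

k, n = 29, 45                      # total SSWs and number of years (post #1)
alpha = 0.05

# Exact (Garwood) interval for the yearly Poisson rate;
# by construction the lower bound cannot go below zero.
lo = chi2.ppf(alpha / 2, 2 * k) / (2 * n)
hi = chi2.ppf(1 - alpha / 2, 2 * (k + 1)) / (2 * n)
print(f"rate = {k / n:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```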
 

jpkelley

TS Contributor
#5
How do I work this into my confidence interval?
You and others who have responded have happened upon a phenomenon that is downright epidemic in almost every field. Pick up almost any peer-reviewed journal and you'll see that confidence intervals often cross zero (e.g. in my field, for counts of particular behaviors or for bounded distributions). Authors often try to get around this by reporting standard errors instead. So, you're exactly right that it makes no sense at all.

During my recent Ph.D. dissertation defense, an old-school field ecologist complained that he didn't believe how my plotted raw data (repeated measures of individuals--with unbalanced data--which were plotted for convenience only using connected lines) ended up showing a strong effect when modeled using a generalized additive mixed-effects model. My rather snarky response involved something about the ability of people to take plotted raw data and do complex math in their heads. OK, my poor form aside, I made the point then--and I'm making it now--that most complex data (data with structure, such as within-level replicates, etc.) shouldn't be plotted outright. This leads to the problem you noticed. There are easy ways of handling non-normal distributions (other than log-transforming the data outright), such as GLMs and GLMMs, which let you specify the distribution of your response data. So, in order to generate good confidence intervals, all you need to do is run an intercept-only model (i.e. no fixed effects), and you'll end up with good estimates for your distribution. For plotting purposes, I often simulate values from this distribution so I can then generate a boxplot.

Anyway, we can discuss this more. Unless your data are very clean (not a usual feature of natural systems), plotting raw data is often misleading. Model the data, get the output, and visualize from simulations. In short, you hit the nail on the head with this issue.
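
To make the intercept-only idea concrete, here is a minimal sketch using Python's statsmodels, with made-up yearly counts standing in for the real record:

```python
import numpy as np
import statsmodels.api as sm

# Made-up yearly counts standing in for the real SSW record.
y = np.array([2, 1, 0, 2, 1, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1])

# Intercept-only Poisson GLM: the intercept estimates log(rate).
res = sm.GLM(y, np.ones_like(y), family=sm.families.Poisson()).fit()

# Back-transform the intercept and its CI to the rate scale; the
# interval is built on the log scale, so it can never cross zero.
rate = np.exp(res.params[0])
lo, hi = np.exp(res.conf_int().ravel())
print(f"rate = {rate:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")

# Simulate from the fitted distribution for plotting (e.g. a boxplot).
sims = np.random.poisson(rate, size=10_000)
```

The asymmetry of the back-transformed interval is the point: it respects the positivity constraint automatically.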
 

jpkelley

TS Contributor
#6
Just in case...others here will likely have other suggestions about how to handle such confidence interval estimation. It's an important issue, since even a simple description of the distribution of the data is of critical importance.
 
#7
Suppose that the data are Poisson distributed.

I agree that this is a common problem, and not only in the journals jpkelley is referring to.

If I understand jpkelley correctly, one method would be to estimate a model for November with an intercept only.

But suppose the model is a second- or third-order polynomial over the five months November through March, so that the curve looks like an upside-down U (just like the graph attached above).

How would one calculate a confidence interval for November in that case (based on the Poisson distribution)?

Does anybody have any suggestions?
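
To make the question concrete, here is a sketch of the setup (made-up monthly totals; the interval below uses the usual normal approximation on the log scale and exponentiates, which at least keeps the bounds positive):

```python
import numpy as np
import statsmodels.api as sm

# Made-up monthly SSW totals for Nov..Mar, standing in for the real data.
months = np.array([1, 2, 3, 4, 5])     # Nov=1 ... Mar=5
counts = np.array([2, 6, 9, 8, 4])     # hypothetical upside-down-U shape

# Second-order polynomial Poisson GLM: log(mu) = b0 + b1*m + b2*m^2.
X = np.column_stack([np.ones_like(months), months, months**2])
res = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

# Interval for November: build it on the log (linear-predictor) scale
# from the coefficient covariance matrix, then exponentiate.
x_nov = np.array([1.0, 1.0, 1.0])      # intercept, m=1, m^2=1
eta = x_nov @ res.params
se_eta = np.sqrt(x_nov @ res.cov_params() @ x_nov)
lo, hi = np.exp(eta - 1.96 * se_eta), np.exp(eta + 1.96 * se_eta)
print(f"November: fit = {np.exp(eta):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```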