Are small-n significant results "less reliable"?

I've heard this claim a lot in online discussions, but it's usually less than clear what the person means, or even if they know what they are talking about.

As far as I know, there are two clear disadvantages of small-n:

1. If it's small enough, then your test might not be robust against violations of a normality assumption.

2. You'll need a large effect size to achieve reasonable power.
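To put rough numbers on #2, here is a minimal sketch of the power calculation. It assumes a two-sided one-sample z-test with known standard deviation, which only approximates a t-test at these sample sizes. The function names `power_z` and `d_for_power` are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, sd 1


def power_z(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test when the true
    standardized effect is d (assumption: known sd, normal data)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)  # expected value of the test statistic
    # Probability of landing in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def d_for_power(n, power=0.80, alpha=0.05):
    """Approximate true effect size needed to reach the target power.
    Ignores the negligible chance of rejecting in the wrong direction."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) / sqrt(n)


for n in (30, 300):
    print(f"n={n}: d needed for 80% power = {d_for_power(n):.3f}, "
          f"power at d=0.2 = {power_z(0.2, n):.3f}")
```

Under these assumptions the required effect size shrinks in proportion to 1/sqrt(n), so going from n=30 to n=300 cuts it by a factor of about 3.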

Assuming a decent-enough n (say n=30) that #1 is not a major worry, then I always assumed that statistical significance is statistical significance--regardless of your n. It's not somehow more impressive to achieve p < .05 with a large n. (If anything, the opposite, because if your n is large enough, statistical significance doesn't preclude a minuscule effect size, whereas with a moderate or small n, statistical significance does clearly imply a decent effect size.)

So is there any sense in which significant results are "less reliable" with a small sample? Like, is a study with n=30 and p=.03 less likely to successfully replicate than a study with n=300 and p=.03?
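One way to make that replication question concrete is sketched below. It leans on a strong assumption: the true effect is exactly equal to the effect observed in the original study. It also uses a z-test approximation and counts a replication as successful only if it is significant in the same direction. The function name `replication_prob` is hypothetical. This says nothing about what happens when the true effect differs from the original estimate.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def replication_prob(p_orig, n, alpha=0.05):
    """Chance that a same-n replication is significant in the same direction,
    ASSUMING the true effect equals the originally observed effect."""
    # Test statistic implied by the original two-sided p-value.
    z_obs = STD_NORMAL.inv_cdf(1 - p_orig / 2)
    # Observed standardized effect (z approximation): d = z / sqrt(n).
    d_obs = z_obs / sqrt(n)
    # Expected test statistic in a replication with the same n.
    shift = d_obs * sqrt(n)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(shift - z_crit)


for n in (30, 300):
    print(f"n={n}, original p=.03: P(replication significant) = "
          f"{replication_prob(0.03, n):.3f}")
```

Under that assumption n cancels out, and both studies come out a bit under 0.6. The smaller study simply implies a larger observed effect for the same p.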

I've been trying to think of a way this could be true, and I can only do so by thinking about confidence intervals. Certainly your estimate of the population effect size is less precise when you have a small n. You can see this pretty clearly from the fact that a small n produces a wider confidence interval. If you have some test value you're interested in testing, then you could convert the upper and lower bounds of your CI into a standardized measure of effect size like Cohen's d, and produce a confidence interval on effect size. And yes, this interval estimating effect size would be wider for a small-n experiment.
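Here is a rough sketch of that confidence interval on effect size for the two studies from my example (n=30 and n=300, both with p=.03). It assumes a z approximation with the test value at d=0, and `d_interval` is just an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def d_interval(p_orig, n, level=0.95):
    """Approximate confidence interval for Cohen's d, given the two-sided
    p-value and sample size of a one-sample test (z approximation)."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_orig / 2)
    d_obs = z_obs / sqrt(n)  # observed standardized effect
    z_level = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    half_width = z_level / sqrt(n)  # standard error of d is about 1/sqrt(n)
    return d_obs - half_width, d_obs + half_width


for n in (30, 300):
    lo, hi = d_interval(0.03, n)
    print(f"n={n}: 95% CI for d = ({lo:.3f}, {hi:.3f}), width = {hi - lo:.3f}")
```

Both intervals exclude zero, as they must with p=.03, but the n=30 interval is about sqrt(10) times wider.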

But the question is, what if you take a small-n confidence interval and a large-n confidence interval that both exclude a given test value. Then you replicate both the small-n and large-n experiments to produce a second confidence interval for each n. Each replication would have a chance of producing a confidence interval that now includes the test value. But would the small-n replication be more likely to show this result than the large-n replication?

And if so, how? Doesn't a 95% level of confidence mean that the true value of the parameter would be contained in 95% of confidence intervals created from identical experiments to the one just conducted? In other words, any 95% confidence interval -- however wide or narrow it may be -- is equally "reliable" at including the true parameter value. Which leads me to think that any 95% CI would also be equally "reliable" at excluding any given test value. And whatever is true of a 95% CI, is true of an alpha=.05 hypothesis test. Thus I fail to see how a statistically significant result from a small-n experiment could be less "reliable" than one from a large-n experiment if they used the same alpha.
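The coverage part of that reasoning is easy to check by simulation. The sketch below assumes normal data with a known standard deviation of 1 and an arbitrary true mean of 0.3 (both are my choices for illustration), and only checks how often the interval contains the true value. It does not check how often the interval excludes a test value.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(n, true_mean=0.3, sigma=1.0, level=0.95, reps=2000, seed=0):
    """Fraction of simulated known-sigma z intervals that contain true_mean."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        sample_mean = fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / reps


for n in (30, 300):
    print(f"n={n}: simulated coverage = {coverage(n):.3f}")
```

Both sample sizes should land close to 0.95, up to simulation noise.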

Basically I'm just trying to understand if there's any substance to this "less reliable" claim or if it's just people on the internet who learned that "small n is bad" from a teacher who was lecturing on power, and now they've mangled that to "small n results are less reliable."