Sample Size Calc for Right Skewed Data


Less is more. Stay pure. Stay poor.
I have a request from someone for a sample size calculation. When I asked about the potential study data, they provided two means and standard deviations from a prior study. Given the values (x1=7; s1=6 and x2=0.5; s2 = 2), it appears the data are right skewed. I can simulate those data to work toward a simulation study, though the values actually represent before and after measurements, so the observations are paired. That means I will have to simulate two variables that are correlated. The data also have a lower bound of zero.

I guess I can do this, but I have no idea how correlated the paired data are. Any suggestions?

Also, are there any basics that I am missing, e.g., does the difference of two paired skewed variables equal...?

My current plan is to simulate two skewed variables and correlate them, I guess, since a bigger pre-value may allow a greater decrease than lower pre-values, which have little wiggle room to decrease. So any suggestions would be appreciated. Or could I simulate one variable and subtract a constant from it? That sounds good too.

Plan: simulate data, normalize, run say 10,000 t-tests, and play around with sample sizes. I can also do the same thing but run Wilcoxon signed-rank tests on the unnormalized data. If the former seems feasible, it may be the better approach, because the final study analyses may require controlling for covariates, though that was not done in the flimsy example those parameters came from.



Less is more. Stay pure. Stay poor.
If I simulate a skewed sample using a very large n-value and the parameters align with my target parameters, I am guessing I can then shrink the n-value and assume the smaller sample is a realization of my target.

So I can test sample sizes for the Wilcoxon signed-rank test, straightforward.

What about a one-sample t-test (vs. 0) of the differences of two lognormal variables? Zeros may be in the sample, so a log transformation may require adding a constant. Any advice?

When I backtransform I will be in the median realm, but how does the constant come into play?
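
A small sketch of the back-transformation question, in R. It assumes lognormal data moment-matched to the pre values (mean 7, SD 6) and a hypothetical shift constant `shift = 0.5`; neither the distribution nor the constant comes from the study.

```r
set.seed(1)

# Assumed lognormal parameters matching mean 7 and SD 6 (moment matching)
sdlog   <- sqrt(log(1 + 6^2 / 7^2))
meanlog <- log(7) - sdlog^2 / 2

x <- rlnorm(1e5, meanlog, sdlog)

# Back-transforming the mean of the logs gives the geometric mean,
# which estimates the median of a lognormal, not its arithmetic mean
geo_mean <- exp(mean(log(x)))

# With a shift constant (hypothetical value), the back-transform is the
# geometric mean of (x + shift); subtracting the shift returns to the
# original scale but no longer equals the geometric mean of x itself
shift       <- 0.5
geo_shifted <- exp(mean(log(x + shift))) - shift

c(mean = mean(x), median = median(x),
  geo_mean = geo_mean, geo_shifted = geo_shifted)
```

So the constant moves the back-transformed estimate away from the median of the raw data, and the size of that move depends on the constant chosen, which is one reason to try a couple of values.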


Can't make spagetti
hlsmith said:
Given the values (x1=7; s1=6 and x2=0.5; s2 = 2) it appears the data are right skewed.
Hi. I don't quite follow how you can deduce that the data are right-skewed just from those two pieces of information. Or did they show you some histograms or some other stuff?


Less is more. Stay pure. Stay poor.
They are left bounded by zero, and with x2 = 0.5 and s2 = 2 the mean sits only a quarter of an SD above that bound. So am I just making **** up, I mean, making too big of an assumption? The study used a t-test; also, the sample size was 20.

Not too much to work from, right? What approach would you take?


TS Contributor
Going from (7,6) to (0.5,2) looks like a huge effect. A simple permutation test with a quite low sample size might be sufficient, no? Simulating that should be quite easy.



Less is more. Stay pure. Stay poor.
Thanks. Agreed. And the permutation test is considered non-parametric, so skewness isn't a concern? Of note, I plan to look at the differences between the pre and post values, so is there a one-sample permutation test? With only one group, you wouldn't just be switching the group assignment of observations between two groups.
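
One common way to get a one-sample (paired) permutation test, not spelled out in this thread, is to randomly flip the sign of each pre/post difference: under a null hypothesis of differences symmetric about zero, each sign is equally likely. A minimal sketch in R with made-up data; `sign_flip_test` is a hypothetical helper, not a library function, and the lognormal parameters are illustrative only.

```r
set.seed(42)

# Hypothetical helper: Monte Carlo sign-flip test for paired differences
sign_flip_test <- function(d, n_perm = 10000) {
  observed <- mean(d)
  # Each replicate multiplies every difference by a random +1 or -1
  perm_means <- replicate(
    n_perm,
    mean(d * sample(c(-1, 1), length(d), replace = TRUE))
  )
  # Two-sided p-value, counting the observed statistic as one permutation
  (sum(abs(perm_means) >= abs(observed)) + 1) / (n_perm + 1)
}

# Made-up skewed pre/post values for 12 subjects (illustration only)
pre  <- rlnorm(12, meanlog = 1.7, sdlog = 0.7)
post <- rlnorm(12, meanlog = -2,  sdlog = 1.7)

p_value <- sign_flip_test(pre - post)
p_value
```

Note that the symmetry assumption under the null still matters here, so "non-parametric" does not mean skewness is ignored entirely.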

A lingering issue in my mind was that the pre and post measures should be correlated, not just two independent samples, but I don't know by how much. A generic workaround, if I don't get the covariance structure right, may be to simulate two sets, sort them individually, and then match them based on order. However, I would imagine that would be too optimistic a version of the actual scenario.
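
To see why the sort-and-match idea is optimistic, here is a rough R sketch with two independent skewed samples (the lognormal parameters are illustrative assumptions, not study values). Pairing by sorted order forces a rank correlation of 1, and among all possible pairings it gives the smallest spread of differences.

```r
set.seed(7)

# Two independent right-skewed samples (parameters are illustrative)
pre  <- rlnorm(20, meanlog = 1.7, sdlog = 0.7)
post <- rlnorm(20, meanlog = -2,  sdlog = 1.7)

# Independent pairing versus pairing by sorted order
rho_independent <- cor(pre, post, method = "spearman")
rho_sorted      <- cor(sort(pre), sort(post), method = "spearman")

# Sorting forces a perfect rank correlation, the extreme case, which
# minimizes the spread of the differences and so inflates power
sd_independent <- sd(pre - post)
sd_sorted      <- sd(sort(pre) - sort(post))

c(rho_independent = rho_independent, rho_sorted = rho_sorted,
  sd_independent = sd_independent, sd_sorted = sd_sorted)
```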


Can't make spagetti
Well, if I learned anything from CBear's blog:

it was that the whole non-normality brouhaha is mostly overblown, particularly for simple (and, usually, quite robust) tests such as the t-test and whatnot.

I honestly wouldn't freak out too much if people are using known-and-tried power analysis methods (like those from G*Power).

As far as the correlation aspect goes, maybe you can try a few, say 0, .1, .3, .5 for "independence", "small", "medium" and "large" effect sizes a la Cohen, and see how bad things can get?
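
A quick way to see how much the unknown correlation matters, sketched in base R. It uses the means and SDs from the first post and the correlation grid above; the calculation is normal-theory (`power.t.test`), so it ignores the skewness and is only a rough starting point. The 80% power and two-sided alpha of 0.05 are assumptions.

```r
# Means and SDs from the thread; the correlation grid is the one suggested above
m1 <- 7;   s1 <- 6
m2 <- 0.5; s2 <- 2
rho <- c(0, 0.1, 0.3, 0.5)

# SD of the paired differences: sqrt(s1^2 + s2^2 - 2 * rho * s1 * s2)
sd_diff <- sqrt(s1^2 + s2^2 - 2 * rho * s1 * s2)

# Normal-theory pairs needed for 80% power at alpha = 0.05 (two-sided)
n_pairs <- sapply(sd_diff, function(s) {
  power.t.test(delta = m1 - m2, sd = s, sig.level = 0.05,
               power = 0.80, type = "paired")$n
})

data.frame(rho, sd_diff = round(sd_diff, 2), n_pairs = ceiling(n_pairs))
```

A higher correlation shrinks the SD of the differences, so the required number of pairs goes down; treating the pairs as uncorrelated is the conservative end of this grid.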

It's easy to simulate correlated, non-normal data in something like lavaan or semTools. And I know SAS has a macro out there somewhere that uses the same method as lavaan, in case you need it.

I'd provide R code but I'm not sure if it would be particularly useful to you (?)


Ambassador to the humans
Ehhhh I wouldn't be so quick to dismiss the non-normality here.

hlsmith said:
also the sample size was 20

The linked article said:
That technical note aside, the net effect is that the headline figure of a Type I error rate of 17% is based on a tiny sample size (18) and an extremely unusual degree of non-normality
So depending on the severity of the skew it could have a decent impact with sample sizes this small.


Can't make spagetti
OMG 20!?!?!!

I totally missed that part. Yeah, then it seems like you have a case of the ugly here (where ugly means small N :))


Less is more. Stay pure. Stay poor.
I feel like this is a silly question and Dason alluded to it in a post I had a couple of years ago, but alas without my Advance Search button, here I am.

If I am doing a power simulation, are the following correct:

-sample size: is whatever I am using in the simulation
-alpha: is the level of significance I am using for cut off in the simulation
-power: is the number of times the null is rejected given the above parameters

So if I am doing this with a t-test, for example, I set my sample size and alpha, then I get my "power" from the number of times, out of the number of samples, that I rejected the null (e.g., p-value <= 0.05), correct?


Ambassador to the humans
You *estimate* your true power based on the proportion of trials in which you reject the null at your chosen alpha level. So you had the gist of it right, but make sure you're talking about a proportion, because it doesn't make sense to say the power is 838. And if you're going to specify an alpha, then you can't just always compare against 0.05 ... unless you *always* use 0.05 ;)
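
Because the power is an estimate, it comes with Monte Carlo error. A small R sketch: the 838 rejections are the count quoted above, while the total of 1000 simulated trials is an assumption for illustration.

```r
# 838 rejections is the count quoted above; 1000 replicates is an assumed total
rejections <- 838
n_sims     <- 1000

# Power is estimated by the proportion of rejections, not the raw count
power_hat <- rejections / n_sims

# Monte Carlo standard error and an exact binomial 95% interval
mc_se <- sqrt(power_hat * (1 - power_hat) / n_sims)
ci    <- binom.test(rejections, n_sims)$conf.int

c(power = power_hat, mc_se = round(mc_se, 4), lower = ci[1], upper = ci[2])
```

More simulated trials narrow the interval, which is a practical way to decide how many replicates are enough.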


Can't make spagetti
Well, it really is very simple. It would look like this:

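library(lavaan)  # simulateData()
library(psych)   # describe()

# Covariance 6 with variances 36 and 4 gives a correlation of 6 / (6 * 2) = 0.5
mod <- "x1 ~~ 6*x2
        x1 ~~ 36*x1
        x2 ~~ 4*x2
        x1 ~ 7*1
        x2 ~ 0.5*1"
N <- 100
skew <- c(2, 2)  # target univariate skewness
kurt <- c(7, 7)  # target univariate kurtosis
data <- simulateData(mod, sample.nobs = N, skewness = skew, kurtosis = kurt)

> describe(data)
   vars   n mean   sd median trimmed  mad   min   max range skew kurtosis   se
x1    1 100 6.58 5.83   6.32    6.33 1.50  3.64 13.48  9.85 1.46     2.57 0.19
x2    2 100 0.47 1.87  -0.18   -0.04 0.75 -1.31  7.30  8.61 2.71    10.45 0.13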
Notice that I had to square your 6 and your 2 in the SD section because lavaan takes in variances to create the variance-covariance matrix.

So the mod part specifies a correlation of 0.5 (a covariance of 6 with SDs of 6 and 2) and the means and standard deviations that you mentioned, while the skew and kurt arguments set univariate skewnesses of 2 and kurtoses of 7.

Then you could do something like:

> t.test(data$x1,data$x2)

        Welch Two Sample t-test

data:  data$x1 and data$x2
t = 9.8325, df = 125.79, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.380432 6.588089
sample estimates:
mean of x mean of y 
6.0769192 0.5926587
And I guess repeat that a gazillion times to see what the power looks like.
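
The "repeat it a gazillion times" step could be sketched as below. Two things differ from the lavaan example and are my assumptions, not the method used above: the data come from correlated normals on the log scale that are then exponentiated (lognormal margins, which also keeps values above zero), and the tests are paired, since the call above runs a Welch two-sample test that ignores the pairing. `rho_log` is the correlation on the log scale, not the raw-scale Pearson correlation, and `ln_par` and `sim_power` are hypothetical helpers.

```r
set.seed(2024)

# Lognormal parameters moment-matched to a given mean and SD
ln_par <- function(m, s) {
  sdlog <- sqrt(log(1 + s^2 / m^2))
  c(meanlog = log(m) - sdlog^2 / 2, sdlog = sdlog)
}

# Hypothetical helper: estimated power of paired t and signed-rank tests
sim_power <- function(n, rho_log, n_sims = 2000, alpha = 0.05) {
  p1 <- ln_par(7, 6)
  p2 <- ln_par(0.5, 2)
  reject <- replicate(n_sims, {
    z1 <- rnorm(n)
    z2 <- rho_log * z1 + sqrt(1 - rho_log^2) * rnorm(n)  # correlated normals
    pre  <- exp(p1[["meanlog"]] + p1[["sdlog"]] * z1)
    post <- exp(p2[["meanlog"]] + p2[["sdlog"]] * z2)
    c(t_test   = t.test(pre, post, paired = TRUE)$p.value <= alpha,
      wilcoxon = wilcox.test(pre, post, paired = TRUE)$p.value <= alpha)
  })
  rowMeans(reject)  # proportion of rejections = estimated power
}

power_n10 <- sim_power(n = 10, rho_log = 0.5)
power_n10
```

Calling `sim_power` over a grid of `n` and `rho_log` values then gives a power curve for each test.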
What is the "delta" that your client wishes to detect with power = 0.80 and significance level 0.05?

Why not just assume log-normal and get a sample size based on that (and on, say, the two SDs)?

Possibly assume a gamma distribution and simulate from that. (The gamma and the log-normal are relatively similar.)
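
If a log-normal or gamma is assumed, its parameters can be backed out of a mean and SD by the method of moments rather than by trial and error. A sketch in R for the first group (mean 7, SD 6); `lognormal_par` and `gamma_par` are hypothetical helper names, and the choice of either distribution is an assumption, not something the prior study supports.

```r
set.seed(11)

# Method-of-moments parameters for a target mean m and SD s
lognormal_par <- function(m, s) {
  sdlog <- sqrt(log(1 + s^2 / m^2))
  c(meanlog = log(m) - sdlog^2 / 2, sdlog = sdlog)
}
gamma_par <- function(m, s) c(shape = m^2 / s^2, rate = m / s^2)

ln1 <- lognormal_par(7, 6)
ga1 <- gamma_par(7, 6)

# Check against a large simulated sample
x_ln <- rlnorm(1e6, ln1[["meanlog"]], ln1[["sdlog"]])
x_ga <- rgamma(1e6, shape = ga1[["shape"]], rate = ga1[["rate"]])

rbind(lognormal = c(mean = mean(x_ln), sd = sd(x_ln)),
      gamma     = c(mean = mean(x_ga), sd = sd(x_ga)))
```

For the second group (mean 0.5, SD 2) the implied log-normal is very heavy-tailed, so its sample SD converges slowly even in large simulated samples.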


Less is more. Stay pure. Stay poor.
Thanks Greta. I have since resolved the questions related to the project. They actually wanted to do a test of two independent samples. They had given me an example study that they wanted to emulate, which was a before and after study, so I presumed that is what they wanted. I was able to bang something out for both scenarios in SAS. For the correlated simulation, though, I just kept trying values for location, etc. in a huge sample until I got two distributions that were close enough.

I had wondered if I could use the gamma. I had also wondered if I could surmise a delta and dispersion parameter from the means and SDs of the two possibly skewed samples.

So is there any rule about the difference of two lognormals equaling something like a lognormal, with or without dependency between the values? I had thought it would be easier to just work with the differences, though I was only given the two means and SDs to work from. Though, as Miner pointed out, the two groups were very different in values, and it would actually require fewer than 20 patients to test the hypothesis.
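
On the lognormal question: products and ratios of jointly lognormal variables are lognormal, but a difference is not, since it can be negative and there is no simple closed form in general. A quick R check, assuming independent lognormals moment-matched to the two reported means and SDs (`ln_par` is a hypothetical helper):

```r
set.seed(3)

# Independent lognormals moment-matched to the two reported means and SDs
ln_par <- function(m, s) {
  sdlog <- sqrt(log(1 + s^2 / m^2))
  c(meanlog = log(m) - sdlog^2 / 2, sdlog = sdlog)
}
p1 <- ln_par(7, 6)
p2 <- ln_par(0.5, 2)

pre  <- rlnorm(1e5, p1[["meanlog"]], p1[["sdlog"]])
post <- rlnorm(1e5, p2[["meanlog"]], p2[["sdlog"]])
d    <- pre - post

# A lognormal variable is strictly positive, so any negative difference
# rules out an exact lognormal distribution for d
prop_negative <- mean(d < 0)
prop_negative
```

So simulating the two variables and differencing them is safer than trying to simulate the differences directly from a lognormal.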