How to make my approach to Monte Carlo simulation and random sample generation more statistically robust?

#1
I'm building MC simulation models in Excel VBA and R in parallel. Below is my current approach. Please let me know where the weaknesses lie. How could I make my approach more statistically robust without giving up too much simplicity? I am particularly concerned with (3) and (5) below.
  1. I have a large historical data set for the population being analysed. The population comprises 5,460 elements (N) that migrate over time through various states. I'm only interested in modelling their final states, after the complete expiration of time for all elements. Terminal values are attached for the state I'm examining.
    • I ran various tests for population normality (K-S, Shapiro-Wilk, Anderson-Darling, etc.) and rejected the null hypothesis of normality based on the low p-values. Also, the population standard deviation is close to the population mean, another indication of non-normality.
    • The element outcomes I'm studying range from 0% to 100%, with outcomes clustering around 0% and 100% and relatively few values in between. Said differently, elements tend to move together mostly in one of two directions (towards a 0% final state or towards a 100% final state). The histogram shows a U-shaped beta distribution (fitted alpha and beta are 0.13 and 0.17, respectively).
  2. Now for the MC part. I'm generating random samples from a beta distribution based on the fitted population parameters. I use the random samples to generate a frequency (probability) plot, from which I derive conclusions based on the right tail, such as “there's a 5% probability that the mean exceeds x%, a 10% probability that the mean exceeds y%”, etc. (A rough R sketch of (1)-(2) is included after this list.)
  3. Based on the population study, I'm drawing samples from a beta distribution in the model. I wrestled with the alternative of generating two levels of random samples instead: (i) the first sampling would generate one of two possible outcomes, a final state of 0% or a final state > 0%; and (ii) the second sampling would take the > 0% elements from the first stage and generate random outcomes ranging from 1% to 100%. I think the distribution in (ii) would be closer to a normal distribution with a right skew. But I opted for the beta distribution for now, thinking it may be simpler to implement. (The two-stage alternative is also sketched after the list.)
  4. Running simulations based solely on the population-fitted alpha and beta values feels, intuitively, like overfitting.
  5. To broaden the range of sampling outcomes, I'd like to randomize the alpha and beta values for the beta distribution (so much for simplicity). The input parameters for the inverse cumulative beta distribution (quantile) computation are probability, alpha, and beta. (The last sketch after this list randomizes the parameters this way.)
    • I turned back to the population and randomly grouped all elements into 52 groups of 105 elements. I calculated alpha and beta values for these groups of “randomly sorted” elements, then calculated the means and standard deviations of those alpha and beta values. I haven't run the tests yet, but they look roughly normal with wide tails, or even somewhat uniform.
    • Let's assume the normality tests pass. I'd use the normal distribution to generate random alpha and beta values, embedded within the random number generator for the beta distribution referenced above, replacing the population-fitted alpha/beta parameters of 0.13/0.17 with these randomly generated values.
    • Or, if the tests suggest the distribution is closer to uniform, I'll simply generate random alpha/beta values within a range and embed those in the random number generator for the beta distribution.
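For concreteness, here is a minimal R sketch of (1)-(2) as described above. The names are placeholders: `x` is assumed to hold the 5,460 terminal values rescaled to proportions on [0, 1], and the fitdistrplus package is one way to get the maximum-likelihood alpha/beta. Note that rbeta() draws the same distribution as feeding uniform probabilities through the inverse cumulative beta distribution (qbeta), i.e. the probability/alpha/beta setup mentioned in (5).

```r
# Minimal sketch of (1)-(2): fit a beta distribution to the terminal values and
# simulate the sampling distribution of the mean.
# Assumes `x` holds the 5,460 terminal values rescaled to proportions on [0, 1].
library(fitdistrplus)
set.seed(123)

# The beta likelihood is undefined at exactly 0 or 1, so nudge the endpoints inward.
eps   <- 1e-4
x_adj <- pmin(pmax(x, eps), 1 - eps)

fit   <- fitdist(x_adj, "beta")        # MLE; estimates are named shape1 (alpha), shape2 (beta)
a_hat <- fit$estimate["shape1"]
b_hat <- fit$estimate["shape2"]

# Monte Carlo: draw many samples of size n and record each sample mean.
# rbeta() draws the same distribution as qbeta(runif(n), a, b), i.e. the inverse-CDF route.
n_sims    <- 10000
n         <- 5460
sim_means <- replicate(n_sims, mean(rbeta(n, a_hat, b_hat)))

# Right-tail statements such as "5% probability that the mean exceeds x%".
quantile(sim_means, probs = c(0.90, 0.95, 0.99))
hist(sim_means, breaks = 50, main = "Simulated distribution of the mean")
```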
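And a sketch of the two-stage alternative described in (3). The zero-probability and the shape of the conditional distribution for the > 0% outcomes are placeholders here; both would need to be estimated from the historical data.

```r
# Two-stage alternative from (3): first draw whether an element ends at exactly 0%,
# then draw a value in (0%, 100%) for the non-zero elements.
# p_zero, a_pos and b_pos are placeholders to be estimated from the data.
set.seed(123)

p_zero <- 0.40   # placeholder: share of elements ending at exactly 0%
a_pos  <- 2.0    # placeholder shape parameters for the conditional > 0% outcomes
b_pos  <- 1.2

r_two_stage <- function(n) {
  ends_at_zero <- rbinom(n, size = 1, prob = p_zero) == 1
  out <- numeric(n)                                        # zeros stay at 0%
  out[!ends_at_zero] <- rbeta(sum(!ends_at_zero), a_pos, b_pos)
  out
}

sim_means_2s <- replicate(10000, mean(r_two_stage(5460)))
quantile(sim_means_2s, probs = c(0.90, 0.95, 0.99))
```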
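Finally, a sketch of the parameter-randomization idea in (5): fit alpha/beta within the 52 random groups of 105 elements, then draw a fresh alpha/beta pair for each simulation run from the group-level estimates. The normal case is shown; the uniform case would swap the rnorm() lines for runif() between the observed minimum and maximum of the group fits.

```r
# Sketch of (5): randomize alpha/beta across runs using group-level fits.
# Uses `x_adj` from the first sketch (5,460 = 52 groups x 105 elements).
library(fitdistrplus)
set.seed(123)

groups     <- split(sample(x_adj), rep(1:52, each = 105))
group_fits <- t(sapply(groups, function(g) fitdist(g, "beta")$estimate))
# group_fits has 52 rows and columns shape1 (alpha), shape2 (beta)

mu_ab <- colMeans(group_fits)
sd_ab <- apply(group_fits, 2, sd)

sim_means_rand <- replicate(10000, {
  # Draw this run's alpha/beta; floor at a small positive value so they stay valid.
  a <- max(rnorm(1, mu_ab["shape1"], sd_ab["shape1"]), 0.01)
  b <- max(rnorm(1, mu_ab["shape2"], sd_ab["shape2"]), 0.01)
  mean(rbeta(5460, a, b))
})
quantile(sim_means_rand, probs = c(0.90, 0.95, 0.99))
```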
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
1.) With large enough samples normality almost always gets rejected. It is better to assess normality visually with histograms and q-q plots (quick sketch below). Thanks for sharing the assumed beta dist info.

2/3.) I imagine that the beta dist is better and could converge to normal :)
Breaking data into two pieces based on artificially ascribed thresholds is usually a bad idea.
4.) The beta dist will put more draws where its density is highest and spread them according to its dispersion; it may not overfit unless the selection of its parameters was incorrect, not generalizable, or too empirical. You can take other information into account when selecting them.
5.) With a large enough sample things may converge to a normal dist, if that is the true underlying data generating process.
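
Re 1.), the visual check can be as quick as this (assuming your terminal values sit in a vector `x`):

```r
# Quick visual check of the terminal values instead of relying on test p-values.
hist(x, breaks = 50, main = "Terminal values", xlab = "Final state")
qqnorm(x); qqline(x)   # a strong S-shape here is consistent with the U-shaped, non-normal data
```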

Comment: is there a way to compare your model to real data held out of the building process, to test its bias, variance, and applicability?
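
Roughly what I have in mind (placeholder names; fit on part of the history and see where the held-out mean lands in the simulated band):

```r
# Rough holdout check (placeholder names): fit the beta on 70% of the history,
# then see where the held-out mean falls within the simulated distribution.
set.seed(123)
idx     <- sample(length(x), size = round(0.7 * length(x)))
train   <- pmin(pmax(x[idx], 1e-4), 1 - 1e-4)   # nudge off 0/1 for the fit
holdout <- x[-idx]

est <- fitdistrplus::fitdist(train, "beta")$estimate
sim <- replicate(10000, mean(rbeta(length(holdout), est["shape1"], est["shape2"])))

mean(holdout)                      # observed held-out mean
quantile(sim, c(0.05, 0.5, 0.95))  # simulated band it should fall inside
```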

P.S., Welcome to the forum!