I'm building Monte Carlo (MC) simulation models in Excel VBA and R in parallel. Below is my current approach. Please let me know where the weaknesses lie. How could I make my approach more statistically robust without giving up too much simplicity? I am particularly concerned with (3) and (5) below.
- I have a large historical data set for the population being analysed. The population comprises 5,460 elements (N) that migrate over time through various states. I'm only interested in modelling their final states, after time has fully expired for all elements. Terminal values are attached for the state I'm examining.
- I ran various normality tests on the population (Kolmogorov-Smirnov, Shapiro-Wilk, Anderson-Darling, etc.) and rejected the null hypothesis of normality based on the low p-values. The population standard deviation is also close to the population mean, which I take as a further sign of non-normality.
- The element outcomes I'm studying range from 0% to 100%. They congregate around 0% and 100%, with fewer values in between. Said differently, elements mostly move in one of two directions (towards a 0% final state or towards a 100% final state). The histogram looks like a U-shaped beta distribution (fitted alpha and beta are 0.13 and 0.17 respectively).
- Now for the MC part. I'm generating random samples from the beta distribution, using the population parameters. From these samples I build a frequency (probability) plot and draw conclusions from its right tail, such as “there's a 5% probability that the mean exceeds x%, a 10% probability that the mean exceeds y%”, etc.
- Based on the population study, I'm drawing samples from a beta distribution in the model. I wrestled with the alternative of generating two levels of random samples instead: (i) the first sampling would generate one of two possible outcomes, a final state of 0% or a final state > 0%, and (ii) the second sampling would take the > 0% cases from the first sampling and generate random outcomes ranging from 1%-100%. I think the distribution in (ii) would be closer to a normal distribution with a right skew. But I opted for the beta distribution for now, thinking it may be simpler to implement.
- Intuitively, running the simulations with the alpha and beta values fitted to the whole population amounts to overfitting.
- To broaden the range of sampling outcomes, I'd like to randomize the alpha and beta values for the beta distribution (so much for simplicity). The input parameters for the inverse cumulative beta distribution function are probability, alpha and beta.
- I turned back to the population and randomly assigned all elements to 52 groups of 105 elements each. I fitted alpha and beta values for each of these randomly formed groups and then calculated the means and standard deviations of those alpha and beta values. I haven't run the tests yet, but the alpha and beta values look roughly normal with wide tails, or even somewhat uniform.
- Let's assume the normality tests are passed. I'd then use the normal distribution to generate random alpha and beta values and feed them into the random number generator for the beta distribution referenced above, replacing the population-fitted alpha/beta parameters of 0.13/0.17 with these randomly generated values.
- Or perhaps based on the tests the distribution proves more uniform, in which case I’ll simply generate random values within a range for alpha/beta values and embed those in the random number generator for the beta distribution.
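
To make the steps above concrete, here is a minimal sketch of the approach as I understand it, written in Python (standard library only) rather than VBA or R. Several things in it are assumptions and not part of my actual model: the simulated sample size of 105, the number of runs, the method-of-moments fit (the helper `fit_beta_moments` is hypothetical), the redraw rule for non-positive normal draws, and above all the `population` list, which is a synthetic stand-in for the real 5,460 terminal values.

```python
import random
import statistics

ALPHA, BETA = 0.13, 0.17   # population-fitted values quoted above
SAMPLE_SIZE = 105          # assumption: elements per simulated sample
N_RUNS = 2_000             # assumption: number of Monte Carlo runs


def fit_beta_moments(values):
    """Method-of-moments estimate of (alpha, beta) for data on [0, 1]."""
    m = statistics.fmean(values)
    v = statistics.pvariance(values)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def simulate_means(draw_params, rng, n_runs=N_RUNS, size=SAMPLE_SIZE):
    """Return n_runs sample means; draw_params() supplies (alpha, beta) per run."""
    means = []
    for _ in range(n_runs):
        a, b = draw_params()
        means.append(statistics.fmean(rng.betavariate(a, b) for _ in range(size)))
    return means


def upper_quantile(values, tail_prob):
    """Empirical value exceeded with probability of roughly tail_prob."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int((1.0 - tail_prob) * len(ordered)))
    return ordered[idx]


rng = random.Random(42)

# Synthetic stand-in for the real terminal values (assumption, not real data).
population = [rng.betavariate(ALPHA, BETA) for _ in range(5460)]

# Randomly assign elements to 52 groups of 105 and fit each group.
rng.shuffle(population)
groups = [population[i:i + 105] for i in range(0, len(population), 105)]
fits = [fit_beta_moments(g) for g in groups]
a_mu = statistics.fmean(a for a, _ in fits)
a_sd = statistics.stdev(a for a, _ in fits)
b_mu = statistics.fmean(b for _, b in fits)
b_sd = statistics.stdev(b for _, b in fits)


def fixed_params():
    """Population-fitted parameters, the same for every run."""
    return ALPHA, BETA


def random_params():
    """Independent normal draws for alpha and beta, redrawn until both are > 0.

    Assumption: the redraw rule is only one way to handle invalid draws, and
    drawing independently ignores any correlation between alpha and beta.
    """
    while True:
        a = rng.gauss(a_mu, a_sd)
        b = rng.gauss(b_mu, b_sd)
        if a > 0 and b > 0:
            return a, b


fixed_means = simulate_means(fixed_params, rng)
random_means = simulate_means(random_params, rng)

for p in (0.05, 0.10):
    print(f"P(mean exceeds x) = {p:.0%}: "
          f"fixed x = {upper_quantile(fixed_means, p):.3f}, "
          f"randomized x = {upper_quantile(random_means, p):.3f}")
```

Comparing the two printed columns shows how much the right-tail thresholds move once alpha and beta are randomized. The uniform variant would only change `random_params` to draw from `rng.uniform` over a chosen range.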