I can build a model in two ways:

First approach: use 100 observations as training (derivation) data and 50 as the testing (validation) set. This means fitting 8 nonlinear regression models, which in turn yields an interaction network involving all the variables.

Second approach: use the whole data set for modeling. This also gives me an interaction network. However, some of the interactions differ from those obtained with approach one.
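To make the two data setups concrete, here is a minimal sketch of how the rows would be partitioned. It assumes 150 rows in total (100 + 50) and a purely random split; the function name `holdout_split` and the seed are made up for illustration, and the model fitting itself is left out.

```python
import random

def holdout_split(n_total, n_train, seed=0):
    """Randomly partition row indices 0..n_total-1 into training and testing sets."""
    rng = random.Random(seed)   # fixed seed (an assumption) so the split is reproducible
    idx = list(range(n_total))
    rng.shuffle(idx)
    return sorted(idx[:n_train]), sorted(idx[n_train:])

# Approach one: fit on 100 rows, validate on the remaining 50.
train_idx, test_idx = holdout_split(150, 100)

# Approach two: every row is used for fitting, so nothing is held out.
all_idx = list(range(150))
```

With approach one the network is estimated from `train_idx` only, which is one plausible reason some interactions differ from those found using `all_idx`.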

Which one is the better way?

I think approach one seems promising because it gives good validation results. However, in doing so we are losing information that could be used during model development.

I think approach one could be useful for simulated data where one knows the final structure.

Since the exact interactions are not known, wouldn't it be better to use the second approach, so that most of the information in the data can be used?

It would be great if anyone could direct me to related research articles or case studies.

Due to budget constraints, we have 5 companies that are willing to participate in my survey of 42 questions. Within each business, employees will be randomized to one of three intervention modes (internet, mail, and phone). Our goal is to calculate the difference in survey responses due to the three modes across all five businesses. We will use this information to create mode effect weights to adjust the survey responses.

I am currently calculating the sample size for each mode using the formula for comparing two proportions from two independent samples. For example, I calculated that I need 2000 completed surveys within each intervention mode for about 80% power. Dividing that by the 5 businesses means I will need 400 completed surveys per intervention mode in each business.
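For reference, the kind of calculation I mean can be sketched as below. This is the standard normal-approximation formula for a two-sided test of two independent proportions with equal group sizes. The two proportions (0.50 and 0.55) and the 5% significance level are placeholder assumptions and not my actual inputs; only the 80% power, the 2000 total, and the 5 businesses come from my setup.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Completed surveys needed per group to detect p1 vs p2
    (two-sided z-test, equal group sizes, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical proportions, for illustration only.
n_hypothetical = n_per_group(0.50, 0.55)

# Even split of the 2000 completes per mode across the 5 businesses.
per_business = math.ceil(2000 / 5)  # 400
```

Note that this treats each pairwise mode comparison on its own and ignores the business fixed effects and any adjustment for comparing three modes.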

I am quite confused about whether this is a multi-stage or a multiphase design. The selection of businesses will not be random, but the assignment to the three intervention modes will be. When we analyze the data and create scores from the survey, we will be using facility fixed effects. I want to determine the proper sample size for each mode within each facility. How do I approach this problem? Thank you.