Are these dependent samples?

New Member
Hello everyone,

I have a dataset with 6,000 respondents, and I will run a regression analysis on it. I will then split the dataset into three separate datasets of ~2,000 cases each (A, B, C), based on a specific variable, and run a regression on each of the three. That means the cases in A, B and C are the same respondents as in the original dataset.

Does this type of sample have a name? Would these be considered dependent samples? I would like to know what this is called so I can look up how I can correct for issues of multicollinearity, etc.

Miner

TS Contributor
A, B and C are subsets of the original data set. Provided that these subsets are mutually exclusive, they should be independent. Note: this does NOT mean that they are random nor representative, only that they are independent. You should also consider running the regression on the complete data set, but include that "specific variable" as an Indicator/Dummy variable in the regression. That will allow you to test the significance of the "specific variable".
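A rough sketch of the indicator-variable approach, in case it helps. Everything here is made up for illustration: the simulated data, the names `group`, `x` and `y`, and the effect sizes are assumptions, not the poster's dataset. It fits one regression on all 6,000 cases with dummies for B and C (A is the reference level) and computes ordinary t-statistics for each coefficient.

```python
import numpy as np

# Simulated stand-in for the real data (all values are hypothetical).
rng = np.random.default_rng(0)
n = 6000
group = rng.integers(0, 3, size=n)          # 0 = A, 1 = B, 2 = C
x = rng.normal(size=n)
true_shift = np.array([0.0, 0.5, -0.3])     # assumed group effects
y = 1.0 + 2.0 * x + true_shift[group] + rng.normal(size=n)

# Design matrix: intercept, x, dummy for B, dummy for C (A is the reference).
X = np.column_stack([np.ones(n), x, group == 1, group == 2]).astype(float)

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical standard errors, assuming constant error variance.
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / se

for name, b, t in zip(["intercept", "x", "dummy_B", "dummy_C"], beta, t_stats):
    print(f"{name:10s} coef = {b: .3f}   t = {t: .2f}")
```

The t-statistics on `dummy_B` and `dummy_C` test each group against A. Testing the "specific variable" as a whole would need a joint F-test on both dummies, which this sketch does not do.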

Dason

I know this is an approach that is typically advocated, but I'm not sure it's always the best one. It makes slightly different assumptions about the problem than fitting separate regressions does. Keep in mind that multiple regression assumes constant variance. Compare "separate regressions" with "multiple regression using a dummy variable and the interaction of the dummy with all other variables", which basically allows the different groups to have completely different regression lines. Even these two aren't exactly the same: the first approach doesn't assume equal variance across the different regressions, but the second does.
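To make the comparison concrete, here is a small simulation sketch. The data, the helper `ols`, and the noise levels are all assumptions chosen for illustration. It fits three separate regressions and one pooled regression with dummies plus dummy-by-x interactions, then compares the fitted lines and the error-variance estimates.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit; returns coefficients and the residual variance estimate."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta, sigma2

rng = np.random.default_rng(1)
n_per = 2000
noise_sd = {"A": 0.5, "B": 1.0, "C": 2.0}   # deliberately unequal (hypothetical)

data = {}
for i, (g, sd) in enumerate(noise_sd.items()):
    x = rng.normal(size=n_per)
    y = (1.0 + i) + (2.0 - 0.5 * i) * x + rng.normal(scale=sd, size=n_per)
    data[g] = (x, y)

# Approach 1: a separate regression per subset, each with its own variance.
separate = {g: ols(np.column_stack([np.ones(n_per), x]), y)
            for g, (x, y) in data.items()}

# Approach 2: one pooled regression with dummies and dummy-by-x interactions.
x_all = np.concatenate([data[g][0] for g in "ABC"])
y_all = np.concatenate([data[g][1] for g in "ABC"])
is_b = np.repeat([0.0, 1.0, 0.0], n_per)
is_c = np.repeat([0.0, 0.0, 1.0], n_per)
X_pooled = np.column_stack([np.ones(3 * n_per), x_all,
                            is_b, is_c, is_b * x_all, is_c * x_all])
beta_p, sigma2_pooled = ols(X_pooled, y_all)

# Rebuild each group's intercept and slope from the pooled coefficients.
pooled_lines = {
    "A": np.array([beta_p[0], beta_p[1]]),
    "B": np.array([beta_p[0] + beta_p[2], beta_p[1] + beta_p[4]]),
    "C": np.array([beta_p[0] + beta_p[3], beta_p[1] + beta_p[5]]),
}

for g in "ABC":
    beta_s, sigma2_s = separate[g]
    print(g, "separate:", beta_s, "pooled:", pooled_lines[g],
          "sigma2 separate:", sigma2_s, "sigma2 pooled:", sigma2_pooled)
```

The fitted lines agree between the two approaches, but the pooled model has a single variance estimate where the separate fits have three, so the standard errors (and any tests built on them) can differ.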

New Member
Hm, this leads me to two more questions.

1. What about the relationship between the main dataset and subset A? The two samples are clearly not independent, but they're not really dependent either, since they overlap. Right?
2. I am interested in testing the difference between the coefficients of variable X in both the main dataset and in subset A. Is it possible to do this? Would the dependent t-test work here?

Thanks for your feedback, guys. I appreciate it.

Miner

TS Contributor
True. The use of indicator variables does assume equal variances, but a diagnostic review of the residuals should show whether this assumption is violated. If it is, separate regressions may be run instead.
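One simple way to do that residual review, sketched on simulated data (the data, names, and noise levels are assumptions for illustration only): fit the indicator-variable model, then compare the residual variance within each group. A formal test such as Levene's or Bartlett's could replace the crude ratio used here.

```python
import numpy as np

# Simulated data with unequal error variance across groups (hypothetical values).
rng = np.random.default_rng(2)
n = 6000
group = rng.integers(0, 3, size=n)                  # 0 = A, 1 = B, 2 = C
x = rng.normal(size=n)
noise_sd = np.array([0.5, 1.0, 2.0])[group]
y = (1.0 + 2.0 * x + 0.5 * (group == 1) - 0.3 * (group == 2)
     + rng.normal(size=n) * noise_sd)

# Indicator-variable regression on the full dataset.
X = np.column_stack([np.ones(n), x, group == 1, group == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Residual variance within each group, and the largest-to-smallest ratio.
resid_var = {name: resid[group == k].var(ddof=1) for k, name in enumerate("ABC")}
variance_ratio = max(resid_var.values()) / min(resid_var.values())

for name, v in resid_var.items():
    print(f"group {name}: residual variance = {v:.3f}")
print(f"largest / smallest variance = {variance_ratio:.2f}")
```

A ratio far from 1 suggests the equal-variance assumption is doubtful for the pooled model. In that case separate regressions, or a method that allows group-specific variances, may be the safer choice.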

Miner

TS Contributor
What is your objective? Is it to test the significance of this "factor" or are you trying to validate your model using subsets of data?