# How to select the training dataset

#### pepsico007

##### New Member
I have a dataset with a dependent variable Y and several independent variables X1, ..., X10. Y has 5 possible values: A, B, C, D, E. The proportion of each value is: A: 80%, B: 10%, C: 5%, D: 3%, E: 2%.

I want to build a model to predict the value of Y. My plan is to use 70% of the dataset for training and 30% for testing. But if I split it this way, the samples of C, D, and E seem too small. I have also seen people build a training set in which A, B, C, D, and E appear in equal proportions, so that the model can learn to distinguish all five classes.

Now I need some advice: should I just take 70% of the dataset as the training set, or should I control the proportions of A, B, C, D, and E, i.e. use an equal proportion for each value of Y?

Many thanks.

#### bryangoodrich

##### Probably A Mammal
Check out cross validation. I'd just do 5- and 10-fold CV.
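For anyone unfamiliar with the idea, here is a rough sketch of how k-fold cross validation partitions the data. It uses only the Python standard library; the function name `k_fold_indices` and the sample count of 100 are made up for illustration, and the model fitting itself is left out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Everything outside the held-out fold is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

# Hypothetical dataset of 100 rows, split into 5 folds.
splits = list(k_fold_indices(n_samples=100, k=5))
```

Each row ends up in the test set exactly once, so every observation contributes to the error estimate instead of only a fixed 30%.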

#### pepsico007

##### New Member
> Check out cross validation. I'd just do 5 and 10 fold CV.
Do I need to control the proportions of A, B, C, D, and E, given that the proportions of C, D, and E are so small?
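One common way to control the proportions is stratification: each fold keeps the class mix of the full dataset instead of being forced to equal proportions. Nobody in this thread has endorsed this approach; the sketch below only illustrates the idea. The helper `stratified_folds` and the 1000-row label list are hypothetical, built to match the 80/10/5/3/2 mix from the question.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign every sample index to one of k folds, one class at a time."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal this class's samples round-robin across the folds.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

# Hypothetical data: 1000 labels with the 80/10/5/3/2 mix from the question.
labels = ["A"] * 800 + ["B"] * 100 + ["C"] * 50 + ["D"] * 30 + ["E"] * 20
folds = stratified_folds(labels, k=5)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
```

With these counts every fold holds 160 A, 20 B, 10 C, 6 D, and 4 E, so the rare classes show up in every train/test split. When class counts are not multiples of k, the round-robin deal gives the earlier folds the extra samples, which a more careful version would balance.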