I have a model to build that employs a reference table of 5000 standard covariates that are transformed by the intersection of that reference table with the actual data table. These are mostly CATEGORICAL covariates. Better yet, they are BINARY ("Yes","No" or 1,0).

The software I use only allows for 999 covariates plus one target.

Here is the technique I propose:

1) inner join the actual data to the covariates reference table.

2) count the occurrence of each of the joined covariates.

3) eliminate the bottom x% of the covariates based on a frequency analysis.
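The three steps above can be sketched in plain Python. This is a toy illustration, not the actual join against the 5000-row reference table; the covariate names and the per-row list representation are hypothetical.

```python
from collections import Counter

# Hypothetical reference table of standard covariates.
reference_covariates = {"cov_a", "cov_b", "cov_c", "cov_d"}

# Hypothetical actual data: each row lists the binary covariates it triggers.
data_rows = [
    ["cov_a", "cov_b"],
    ["cov_a", "cov_c"],
    ["cov_a"],
    ["cov_b", "cov_d"],
]

# Step 1: "inner join" -- keep only covariates that appear in the reference table.
joined = [c for row in data_rows for c in row if c in reference_covariates]

# Step 2: count the occurrence of each joined covariate.
counts = Counter(joined)

# Step 3: eliminate the bottom x% of covariates by frequency (here x = 25).
x = 25
ranked = sorted(counts, key=counts.get, reverse=True)
cutoff = max(int(len(ranked) * (1 - x / 100)), 1)
kept = ranked[:cutoff]
print(kept)
```

In real use the cutoff could also be an absolute frequency threshold rather than a percentage, which avoids keeping rare covariates just because many covariates are rare.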

The thinking here is that low-frequency covariates cannot be effective predictors simply because they occur so rarely. Even if one of them were a good predictor, the number of false negatives introduced by dropping it would be small, since it applies to few records.

It seems standard Principal Components Analysis does not work with categorical data, especially binary categorical variables, so dimension reduction by PCA is not an option here.

Does this technique have any validity? I am trying to avoid the more complex approach of running the model iteratively, determining the significant covariates from the model output, and then reassembling them into a final model. Whew, a lot of data manipulation.
