Basic questions from newbie

chakravarty456

New Member
Hi,
I am currently learning ML algorithms and implementing them in R. I have a couple of basic questions.
1.) Is dimensionality reduction the same as feature selection? I know that in R, specifying the importance=T parameter in the randomForest function gives you the important features based on info gain. I was reading up on PCA and learned that it is a dimensionality reduction technique which transforms your feature space to new dimensions. How does PCA calculate attribute importance? How do I get the subset of important features using PCA in R?
2.) One of the assumptions of ML algorithms is that the attributes must be i.i.d. (independent and identically distributed) before building the model. How do I check this assumption in R?
Do I need to run t.test() among all the attributes?
I may be wrong in many possible ways. Please correct me if I am wrong.
Thanks,
chakravarty

trinker

ggplot2orBust
1.) Is dimensionality reduction the same as feature selection? I know that in R, specifying the importance=T parameter in the randomForest function gives you the important features based on info gain. I was reading up on PCA and learned that it is a dimensionality reduction technique which transforms your feature space to new dimensions. How does PCA calculate attribute importance? How do I get the subset of important features using PCA in R?
No, they're not the same, but they are somewhat related in that you can use both to provide fewer variables to the model. Feature selection basically says "these are the variables that seem to be most important to my model's ability to predict, ergo I will select them and discard all others." Dimensionality reduction takes the original variables and combines them into new uncorrelated variables (this isn't exactly correct statistically, but it is nonetheless the way I think about it). In a sense, these new variables are what you then use in the model.
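
To make the difference concrete, here is a minimal sketch in base R using the built-in iris data. It assumes prcomp() from the stats package as the PCA routine, and the two columns picked for the "feature selection" half are chosen arbitrarily for illustration. Note that PCA does not rank the original attributes the way a random forest importance score does. What it gives you is the share of variance carried by each new component and the loadings, i.e. how much each original variable contributes to each component.

```r
# Feature selection vs. dimensionality reduction on the built-in iris data.
X <- iris[, 1:4]                      # the four numeric measurements

# Feature selection: keep a subset of the ORIGINAL columns.
# (This particular choice of columns is arbitrary, for illustration only.)
selected <- X[, c("Petal.Length", "Petal.Width")]

# Dimensionality reduction: PCA builds NEW columns (PC1, PC2, ...),
# each a linear combination of all four original columns.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: the weight of each original variable in each component
print(round(pca$rotation, 3))

# The component scores are uncorrelated with each other
# (off-diagonal entries are zero up to rounding error)
print(round(cor(pca$x), 3))

# Keep only the first two components as model inputs
reduced <- pca$x[, 1:2]
```

The object `selected` still contains original measurements, whereas `reduced` contains derived variables that no longer correspond to any single attribute. So PCA on its own does not return a "subset of important features"; the loadings only hint at which original variables drive each component.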

I think it's better to link to a video that explains PCA, since words alone are an inefficient way to teach complex concepts: https://www.youtube.com/watch?v=kw9R0nD69OU

2.) One of the assumptions of ML algorithms is that the attributes must be i.i.d. (independent and identically distributed) before building the model. How do I check this assumption in R?
Do I need to run t.test() among all the attributes?

Often rescaling using min-max normalization takes care of feature dominance. It is likely that the data will violate IID. Looking at the residuals will help you understand whether there is an issue that needs to be addressed. Additionally, knowledge about the variables themselves will help you understand whether you have an issue. As an extreme example, if you have height in inches and also in feet, these two variables are perfectly correlated. Knowing this, you could remove one, ignore the correlation (if you think it's not extreme), do some sort of dimensionality reduction, etc. In any event, violation of this assumption may not be such a big issue. You generally care about your model's ability to predict. If it is doing so within reason, then do you have a problem?
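
The checks mentioned above can be sketched in base R as follows. The data is made up, and the names (`dat`, `min_max`, `height_in`, `height_ft`, `weight_lb`) and the linear model are hypothetical, chosen only to mirror the inches/feet example. On the t.test() question: t.test() compares the means of two samples, so it does not say anything about whether two attributes are correlated; cor() is the more direct tool for that.

```r
# Made-up data: the same height recorded in inches and in feet,
# plus an unrelated weight column.
set.seed(1)
height_in <- rnorm(50, mean = 67, sd = 3)
dat <- data.frame(
  height_in = height_in,
  height_ft = height_in / 12,
  weight_lb = rnorm(50, mean = 160, sd = 20)
)

# Min-max normalization: rescale every column to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(dat, min_max))
print(sapply(scaled, range))

# Pairwise correlations: height_in vs. height_ft is perfectly correlated
print(round(cor(dat), 3))

# Residual check for a simple linear model (the model is illustrative only)
fit <- lm(weight_lb ~ height_in, data = dat)
print(summary(resid(fit)))

# In an interactive session, plot residuals against fitted values
if (interactive()) {
  plot(fitted(fit), resid(fit),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}
```

Rescaling puts the columns on a common range but leaves the correlations between them unchanged, so the redundant height column still has to be handled by dropping it or by some form of dimensionality reduction.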