Hello everyone,
I have 17,000 variables (SNV frequencies, many of them zero) for 40 patients. Each patient is labelled by their response to a treatment: 13 responses, 27 non-responses. I want to extract a subset of SNVs with strong predictive power.
Because of the large number of variables there are strong correlations among them, which is why I'm considering the adaptive lasso. I used the glmnet R package, with ridge estimates as the initial coefficients, and the following R code:
Is it good to do an external cross-validation like this to evaluate the adaptive lasso's predictive power on my data?
Code:
library(cvTools)
library(glmnet)

err.test.response <- c()
err.test.noresponse <- c()
nbiters <- 50
for (i in 1:nbiters) {
  ## k folds
  kflds <- 8
  flds <- cvFolds(length(y), K = kflds)
  pred.test <- c()   ## predicted classes
  class.test <- c()  ## true classes
  for (j in 1:kflds) {
    ## Train / test split for this fold
    x.train <- x[flds$which != j, ]
    y.train <- y[flds$which != j]
    x.test <- x[flds$which == j, ]
    y.test <- y[flds$which == j]
    ## Adaptive weights vector from ridge coefficients (gamma = 1)
    cv.ridge <- cv.glmnet(x.train, y.train, family = "binomial", alpha = 0,
                          standardize = FALSE, parallel = TRUE, nfolds = 7)
    w3 <- 1 / abs(matrix(coef(cv.ridge, s = cv.ridge$lambda.min)[, 1][2:(ncol(x) + 1)]))^1
    w3[w3[, 1] == Inf] <- 999999999
    ## Adaptive lasso
    cv.lasso <- cv.glmnet(x.train, y.train, family = "binomial", alpha = 1,
                          standardize = FALSE, parallel = TRUE,
                          type.measure = "class", penalty.factor = w3, nfolds = 7)
    ## Prediction on the held-out fold
    pred.test <- c(pred.test, predict(cv.lasso, x.test, s = "lambda.1se", type = "class"))
    class.test <- c(class.test, as.character(y.test))
  }
  ## Per-class prediction errors for this repetition
  err.test.noresponse <- c(err.test.noresponse,
    1 - sum(pred.test == "noresponse" & class.test == "noresponse") / sum(class.test == "noresponse"))
  err.test.response <- c(err.test.response,
    1 - sum(pred.test == "response" & class.test == "response") / sum(class.test == "response"))
}
mean(err.test.noresponse)  ## mean no-response prediction error
mean(err.test.response)    ## mean response prediction error
My results are not conclusive at all: I get mean(err.test.noresponse) = 0.15 and mean(err.test.response) = 0.88, so my model fails to identify the responses. Do you have any idea why my results are so bad, and how I could improve them?
Thanks for your help and your ideas,
Corentin
Hi,
as a quick idea: have you considered principal component analysis ?
regards
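For illustration, here is a minimal base-R sketch of that idea on toy data. The sizes and variables are invented (not the poster's data); the point is just that with n = 40 rows, PCA can compress thousands of correlated predictors into at most n components:

```r
## Toy PCA sketch: compress many correlated predictors into a few components.
## Sizes are hypothetical, chosen to mimic p >> n as in the question.
set.seed(1)
n <- 40; p <- 200
x <- matrix(rnorm(n * p), n, p)
x[, 2] <- x[, 1] + rnorm(n, sd = 0.1)  # build in a strong correlation
pc <- prcomp(x, center = TRUE, scale. = TRUE)
## With n = 40 rows, prcomp returns min(n, p) = 40 components
## (the last one is degenerate because centering removes one degree of freedom)
dim(pc$x)                              # 40 x 40 score matrix
summary(pc)$importance[2, 1:5]         # proportion of variance per component
```

The component scores in `pc$x` could then replace the raw predictors in a classifier, at the cost of interpretability of individual SNVs.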
I can't recall what SNV stands for, but I am guessing something like single nucleotide ..., or some gene marker. So you have 17,000 markers and you want to see if any are related to your binary outcome, with the lesser outcome group being 13 cases, or 33%. Lasso is a shrinkage/selection procedure, so it is better suited than ridge for your purpose. Its prior belief is that every variable has zero predictive value, and the data have to move it away from that.
You did CV, which is great, but each fold trains on 35 observations and tests on only 5, so at 33% the lesser outcome group would contribute only about 1.7 cases per test fold, and how often was the explanatory variable present in those few cases? You should probably check your training folds to make sure they contain representation of the lesser outcome group. I am guessing sample size is your hindrance. Is lasso a common approach in similar gene studies with sparse binary vectors (sparse data combinations)? Also, how common were the gene markers: were most of them zeros? Is that what you meant? I don't know the answer, but are elastic nets better with this?
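To check that point concretely, here is a base-R sketch of stratified fold assignment that guarantees the smaller class appears in every fold. The 13/27 split mirrors the original post; as far as I know cvFolds in cvTools does not stratify, so this is a hand-rolled alternative, not the poster's code:

```r
## Stratified k-fold assignment: spread each class evenly over the folds,
## so the minority class (13 "response" cases) reaches every fold.
set.seed(42)  # for reproducible shuffling
y <- factor(c(rep("response", 13), rep("noresponse", 27)))  # split from the post
k <- 8
folds <- integer(length(y))
for (cls in levels(y)) {
  idx <- which(y == cls)
  ## assign this class's indices to folds 1..k as evenly as possible
  folds[idx] <- sample(rep(1:k, length.out = length(idx)))
}
table(folds, y)  ## every fold contains at least one "response" case
```

With 13 minority cases over 8 folds, `rep(1:k, length.out = 13)` puts at least one in each fold by construction, which plain random folds cannot promise.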
Also, I think most regularization models were developed for continuous outcomes and predictors, so their application to binary data is still a little developmental. Were all of your betas pretty much zero? Side note: you are ruling out interaction terms, but is that reasonable?
I would say, look for a comparable research question and see how they addressed it. I am interested in your question, in that I need to do regularization in a couple of months using a binary dependent variable, so update this thread as appropriate to help others. My project won't be as sparse as yours, in that I won't be looking at a mountain of binary predictors.
Did glmnet automate the selection of the penalization parameter, or do you need to specify different values, rerun the CV-fold procedure, and examine misclassification? Ah, I now see the lambda.1se line. Another possibility is that there simply is no relationship.
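For reference, cv.glmnet does automate it: it fits a whole lambda path and cross-validates each value, and lambda.1se is then the largest lambda whose CV error is within one standard error of the minimum. A toy base-R illustration of that rule (the numbers are invented, not from the poster's run):

```r
## One-SE rule as applied by cv.glmnet, on made-up CV results.
lambda <- c(1.0, 0.5, 0.25, 0.1, 0.05)    # decreasing penalty path
cvm    <- c(0.45, 0.40, 0.34, 0.33, 0.35) # mean CV misclassification error
cvsd   <- c(0.04, 0.04, 0.03, 0.03, 0.04) # standard error of cvm
i.min  <- which.min(cvm)
lambda.min <- lambda[i.min]
## largest (most penalized) lambda whose error is within one SE of the minimum
lambda.1se <- max(lambda[cvm <= cvm[i.min] + cvsd[i.min]])
c(lambda.min = lambda.min, lambda.1se = lambda.1se)  # 0.10 and 0.25 here
```

Using lambda.1se instead of lambda.min deliberately trades a little CV accuracy for a sparser, more conservative model, which may matter with only 40 observations.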