# Adding a new column in R data frame with values conditional on another column

#### econlearner

##### New Member
Suppose I have the data frame:

    table <- data.frame(population = c(100, 300, 5000, 2000, 900, 2500), habitat = c(1, 2, 3, 4, 5, 6))

Now I want to add a new column table$size with the values 1 if population < 500, 2 if 500 <= population < 1000, 3 if 1000 <= population < 2000, 4 if 2000 <= population < 3000, and 5 if 3000 <= population <= 5000.

I only know how to create a column with a binary TRUE/FALSE outcome conditional on the values in another column, e.g.

    table$size <- (table$population < 1000)

But I'm not sure how to do it to get different numbers for different conditions. Can anyone provide help on this?

#### ledzep

##### Point Mass at Zero
Here are two solutions. Take your pick.

Code:

    ## your data
    table <- data.frame(population = c(100, 300, 5000, 2000, 900, 2500), habitat = c(1, 2, 3, 4, 5, 6))

    ## Solution 1
    table$size[table$population < 500] <- 1
    table$size[table$population >= 500 & table$population < 1000] <- 2
    table$size[table$population >= 1000 & table$population < 2000] <- 3
    table$size[table$population >= 2000 & table$population < 3000] <- 4
    table$size[table$population >= 3000 & table$population <= 5000] <- 5

    ## Solution 2
    table$size1 <- ifelse(table$population < 500, 1,
                   ifelse(table$population >= 500 & table$population < 1000, 2,
                   ifelse(table$population >= 1000 & table$population < 2000, 3,
                   ifelse(table$population >= 2000 & table$population < 3000, 4, 5))))

    > table
      population habitat size size1
    1        100       1    1     1
    2        300       2    1     1
    3       5000       3    5     5
    4       2000       4    4     4
    5        900       5    2     2
    6       2500       6    4     4

#### Dason

##### Ambassador to the humans
findInterval seems like a more appropriate function for this particular task, I think.

Code:

    table <- data.frame(population = c(100, 300, 5000, 2000, 900, 2500), habitat = c(1, 2, 3, 4, 5, 6))
    table$size <- findInterval(table$population, c(0, 500, 1000, 2000, 3000, 5000), rightmost.closed = TRUE)

which gives

Code:

    > table
      population habitat size
    1        100       1    1
    2        300       2    1
    3       5000       3    5
    4       2000       4    4
    5        900       5    2
    6       2500       6    4

#### Dason

##### Ambassador to the humans
Aww man - not only do I find out that you crossposted at SO, but you accepted quite possibly the worst answer for this problem...

#### bryangoodrich

##### Probably A Mammal
Never knew about this findInterval function. Nice! It does seem to be a bit easier than cut at times.

#### econlearner

##### New Member
My apologies, I only just learned that crossposting isn't allowed. I am very new to this. I changed the accepted answer to the findInterval() solution.

#### trinker

##### ggplot2orBust
econlearner,

We all start out new. I learned the same lesson myself (cross posting). Some people were very nasty about this unwritten, or in some cases written, rule and made me feel about 2 inches high (Dason was comical in his rebuke). Let me explain to you the general convention I've seen used so you don't make the mistakes I've made.

1) post your question on the site you find most appropriate for your question
2) life happens and sometimes people can't help you or you realize the question is better suited elsewhere
3) put a link in both places stating you've done this and why

The reasoning for this is so people don't waste time solving a question that's been solved elsewhere. It also keeps things together for future searchers with a similar problem.

Here's an example of where I've posted in 2 places and made it clear I've done so: http://stackoverflow.com/questions/9305471/zip-file-error-in-reading-in-an-https-url

You'll notice there's a link at both locations to the other and I've told everyone what I'm doing and why. Hopefully, this is helpful.

========================

To Dason, didn't know about findInterval. +1

#### ledzep

##### Point Mass at Zero
Lovely one Dason, and very elegant too. I am going to be using findInterval a lot in the future. I also like the fact that it gives you the option to include or not include the boundaries.

#### Dason

##### Ambassador to the humans
> I am going to be using the findInterval a lot in the future. I also like the fact they allow the option to include or not include the boundaries.

By default the left boundary is included and the right boundary is not included. What I did with the rightmost.closed=TRUE parameter was to tell it that the largest bin should have its rightmost boundary closed. It wouldn't make sense to have all of the boundaries be closed, because then what happens when something falls on a boundary? It needs to be able to decide if it should go with the lower bin or the higher bin.

#### tmai

##### New Member
> Here are two solutions. Take your pick.
>
> Code:
>
>     ## your data
>     table <- data.frame(population = c(100, 300, 5000, 2000, 900, 2500), habitat = c(1, 2, 3, 4, 5, 6))
>
>     ## Solution 1
>     table$size[table$population < 500] <- 1
>     table$size[table$population >= 500 & table$population < 1000] <- 2
>     table$size[table$population >= 1000 & table$population < 2000] <- 3
>     table$size[table$population >= 2000 & table$population < 3000] <- 4
>     table$size[table$population >= 3000 & table$population <= 5000] <- 5
>
>     ## Solution 2
>     table$size1 <- ifelse(table$population < 500, 1,
>                    ifelse(table$population >= 500 & table$population < 1000, 2,
>                    ifelse(table$population >= 1000 & table$population < 2000, 3,
>                    ifelse(table$population >= 2000 & table$population < 3000, 4, 5))))
>
>     > table
>       population habitat size size1
>     1        100       1    1     1
>     2        300       2    1     1
>     3       5000       3    5     5
>     4       2000       4    4     4
>     5        900       5    2     2
>     6       2500       6    4     4

I tried this code and it returned an error: "Unknown or uninitialised column:". I had to create the column first to fix this error. In my case, I created a column named "Hour" and assigned the expected values to it first. Then I followed Solution 1 to set these values according to what's in the reference column.
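
For anyone hitting the same message: "Unknown or uninitialised column" looks like the warning a tibble gives when a column that doesn't exist yet is accessed with `$`, so the data here was probably a tibble rather than a plain data frame. That is a guess, since the failing code wasn't posted. Below is a minimal sketch of the create-the-column-first workaround, using a hypothetical data frame `tbl` filled with the example values from this thread (the name `tbl` and the `NA_integer_` placeholder are my own choices, not from the posts above):

```r
# Hypothetical data, same values as the thread's example
tbl <- data.frame(population = c(100, 300, 5000, 2000, 900, 2500),
                  habitat = 1:6)

# Create the column up front, filled with missing values,
# so it already exists before any subset assignment touches it
tbl$size <- NA_integer_

# Then fill each band by logical indexing, as in Solution 1
tbl$size[tbl$population < 500] <- 1L
tbl$size[tbl$population >= 500 & tbl$population < 1000] <- 2L
tbl$size[tbl$population >= 1000 & tbl$population < 2000] <- 3L
tbl$size[tbl$population >= 2000 & tbl$population < 3000] <- 4L
tbl$size[tbl$population >= 3000 & tbl$population <= 5000] <- 5L
```

Any row that falls outside all of the bands (say, a population above 5000) stays `NA`, which makes gaps in the conditions easy to spot.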

Thank you for making my life easier with these codes!

#### trinker

##### ggplot2orBust
Can you provide the code you tried and the error message you got?