Stubborn Characters! Need help converting CHR variable to Numeric (in R)

#1
Hi, I am an R beginner, still getting my bearings. Please help.
I read in a CSV as follows:
defense <- read.csv("defense.csv", header = TRUE, stringsAsFactors = FALSE)
My issue is that the dollar "amount" field was assigned the "character" type, as follows:
str(defense)
'data.frame': 5024 obs. of 11 variables:
$ amount: chr "$548,277.00 " "$1,063,790.35 " "$2,988.72 "

I want to change the "amount" field to numeric so I can run meaningful aggregate functions on it, e.g. the mean of the vector. I tried "as.numeric" and "type.convert" without success; see the following:
as.numeric ("defense$amount")
type.convert(defense$amount, na.strings = "NA", as.is = FALSE, dec = ".")
The str function tells me that the "amount" type is still "chr".
Any ideas? Thank you. Eli
 

Dason

Ambassador to the humans
#2
The problem is that you do have non-numeric characters in there. "," and "$" are neither digits nor the decimal point, so as.numeric doesn't know how to deal with them. The easiest fix is to remove those characters and then convert to numeric.

Code:
> x <- c("$1,000.2", "$3", "2.54")
> x
[1] "$1,000.2" "$3"       "2.54"    
> gsub("[$,]", "", x)
[1] "1000.2" "3"      "2.54"  
> as.numeric(gsub("[$,]", "", x))
[1] 1000.20    3.00    2.54
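
As a side note on the attempts in post #1, here is a minimal sketch of why they returned nothing useful. It assumes a small stand-in data frame built from the three values shown in the str() output (the real file is not available here). The names quoted, raw and cleaned are only illustrative.

```r
# Stand-in for the poster's data frame (assumption: only these three values)
defense <- data.frame(
  amount = c("$548,277.00 ", "$1,063,790.35 ", "$2,988.72 "),
  stringsAsFactors = FALSE
)

# Quoted: R sees the literal text "defense$amount", not the column itself
quoted <- suppressWarnings(as.numeric("defense$amount"))
quoted

# Unquoted, but the "$" and "," still block the conversion
raw <- suppressWarnings(as.numeric(defense$amount))
raw

# Strip the symbols first, then convert; trailing spaces are tolerated
cleaned <- as.numeric(gsub("[$,]", "", defense$amount))
cleaned
```

The first two calls give NA (with a coercion warning if it is not suppressed); only the last one yields numbers.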
 
#3
Dason, thank you. The first line of code you recommended (gsub...) did as you indicated:
[5021] "279383.90 " "529923.40 " "32912.20 " "8008.80 "

However, when I tried the as.numeric... I got the following response:
[5021] 2.793839e+05 5.299234e+05 3.291220e+04 8.008800e+03
Warning message:
NAs introduced by coercion

And then when I checked the field with the str command, I seemed to be back where I started, with the $ and commas:
> str (defense2$amount)
chr [1:5024] "$548,277.00 " "$1,063,790.35 " "$2,988.72 " ...


Any ideas? Much thanks. Eli
 

Dason

Ambassador to the humans
#4
I don't know why you are getting NAs for some values - I would need to see the data for that. But the way it is displayed is perfectly understandable - it uses scientific notation for large numeric values.
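
Without the data, one way to track down the NAs is to list the original strings that failed to convert. The sketch below is only an illustration: the last two entries of the sample vector are invented examples of values that would fail, not values known to be in the file, and the names converted and bad are hypothetical.

```r
# Sample values; "($1,200.00)" and "N/A" are made up for illustration
amount <- c("$548,277.00 ", "$2,988.72 ", "($1,200.00)", "N/A")

cleaned   <- gsub("[$,]", "", amount)
converted <- suppressWarnings(as.numeric(cleaned))

# Entries that had a value before conversion but are NA afterwards
bad <- is.na(converted) & !is.na(amount)

# The distinct offending strings, as they appear in the raw data
unique(amount[bad])
```

Running the same three steps on defense$amount should show which entries as.numeric could not parse.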

I think the second issue is that my code does the conversion but doesn't write over the previous data. It's the difference between computing a result and assigning it back to the variable:

Code:
> x <- 3 # original data
> x
[1] 3
> # conversion
> x + 2
[1] 5
> x # x is unchanged though
[1] 3
> # write the changes back to the variable
> x <- x + 2
> x
[1] 5
Hopefully the change you need to make becomes obvious now.
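
Applied to the data frame from post #1, the same idea might look like the sketch below. It assumes a stand-in data frame with only the three values visible in the str() output, so the real column may need extra cleaning.

```r
# Stand-in for the poster's data frame (assumption: only these three values)
defense <- data.frame(
  amount = c("$548,277.00 ", "$1,063,790.35 ", "$2,988.72 "),
  stringsAsFactors = FALSE
)

# Clean, convert, and assign the result back to the same column
defense$amount <- as.numeric(gsub("[$,]", "", defense$amount))

# The column is now numeric, so aggregates work
str(defense$amount)
mean(defense$amount)
```

If some entries still come out as NA, mean() returns NA unless na.rm = TRUE is passed.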