While working on a mapping project of mine, I wanted to add normalization functionality. For instance, ranking each county in the US into bins (say, quintile classes) by a raw count is misleading because not every county covers the same amount of area. When mapping these rankings to each enumeration unit (county), it is more apt to standardize/normalize the value by dividing it by the county's area, and then rank these "per area" density values. This produces something like "persons per square kilometer" instead of just "persons per county". The latter tends to be higher for a larger county simply because of its size.
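To make the point concrete, here is a minimal sketch with made-up numbers (the units, counts, and areas are invented, and the helper name quintile is my own, not something from the maps package). It bins the same five units into quintile classes twice, once by raw count and once by density:

```r
# Toy data: five made-up enumeration units (not real counties)
toy <- data.frame(
  name    = c("A", "B", "C", "D", "E"),
  persons = c(1000, 1200, 5000, 200, 300),
  area    = c(10, 100, 1000, 1, 50)      # e.g. square kilometers
)

# Hypothetical helper: assign each value to a quintile class (1 = lowest)
quintile <- function(x) {
  breaks <- quantile(x, probs = seq(0, 1, by = 0.2))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

# Normalize by area, then classify both the raw and the normalized values
toy$density       <- toy$persons / toy$area
toy$raw_class     <- quintile(toy$persons)
toy$density_class <- quintile(toy$density)

print(toy)
```

In this toy set, unit C has the most persons but also by far the largest area, so it lands in the top class by raw count and the bottom class by density. Note that cut() needs unique breaks, so heavily tied data would need extra handling.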
Since I'm using the maps package, I want to stick with what it has available. This has proven to be harder than anticipated.
First, you can run this function to obtain the data for any supported year (1990 to 2010). We'll work with 2010.
Code:
# These are for later processing
library(maps)
library(mapproj)
importBLS <-
  # Import county unemployment data from the BLS FTP site
  #
  # Arguments:
  #   year - Integer or character. Long-format (yyyy) year value. Valid for years 1990 to 2010.
  #   ...  - additional arguments passed to read.fwf/read.table
  #
  # Returns:
  #   A data.frame containing cleaned up BLS data
  function(year, ...) {
    # Validation on year input
    isValid <- year %in% paste(1990:2010)
    if (!isValid)
      stop("Year not supported.")

    # Initialize variables to be used on import
    year    <- substr(year, 3, 4)  # Files identified by last 2 digits
    infile  <- paste("ftp://ftp.bls.gov/pub/special.requests/la/laucnty", year, ".txt", sep = "")
    WIDTHS  <- c(8, 5, 8, 53, 4, 14, 13, 11, 9)
    CLASSES <- c("factor", rep("character", 3), "factor", rep("character", 4))
    FIELDS  <- c("series_id", "sFIPS", "cFIPS", "name", "year", "labor", "emp", "unemp", "unrate")

    # Import data from FTP into data frame
    x <- read.fwf(url(infile), skip = 6, strip.white = TRUE, ...,
                  col.names = FIELDS, colClasses = CLASSES, widths = WIDTHS)

    # Clean up imported data
    x <- transform(x,
      fips   = as.numeric(paste(sFIPS, cFIPS, sep = "")),  # Combined state + county code
      labor  = as.numeric(gsub(",", "", labor)),           # Remove formatted strings that include ","
      emp    = as.numeric(gsub(",", "", emp)),
      unemp  = as.numeric(gsub(",", "", unemp)),
      unrate = as.numeric(unrate)
    ) # end transform

    return(x)
  } # end function
df <- importBLS(2010)
Now that we have the data, we need two more things: a county map object (for the boundary polygons) and the county FIPS lookup table.
Code:
data(county.fips)
cnty <- na.omit(county.fips) # record 2395 is NA in FIPS "south dakota,x"
m <- map('county', fill = TRUE, plot = FALSE, projection = 'bonne', param = 39)
We can use the map object to obtain the area of every county.
Code:
cnty <- transform(cnty, area = area.map(m, polyname, exact = TRUE))
The problem now is this: we have an area for every county listed by fips, and those fips match the fips in df. However, cnty has 10 more rows than there are matching fips in df. Why? Because some counties are split into subregions. You'll find names like 'state,county:subregion' instead of just 'state,county'.
I need to somehow aggregate these subregions into a 'state,county' form (really, only the fips matters since that's what links it to the data). If I can do that, I can merge cnty to df, transform df to include a density column (= unrate / area, say). Then I can run my mapping function on the density field, as desired.
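For reference, the merge-and-transform step I have in mind would look roughly like this. It is only a sketch: areas and bls are made-up stand-ins for the (already aggregated) cnty and for df, with invented values, and only the column names fips, area and unrate follow the code above.

```r
# Made-up stand-ins; the real tables would be the aggregated cnty and df
areas <- data.frame(fips = c(1001, 1003, 1005),
                    area = c(600, 1600, 900))
bls   <- data.frame(fips   = c(1001, 1003, 1005),
                    unrate = c(8.9, 10.0, 12.3))

# Link the two tables on the shared fips code
merged <- merge(bls, areas, by = "fips")

# Add the normalized "per area" column to rank and map
merged <- transform(merged, density = unrate / area)

print(merged)
```

This only works cleanly once each fips appears exactly once in the area table. With split counties still present, merge() would repeat the df row for every subregion.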
I'm at a loss at the moment on how to approach this aggregation. If you want to know which records are of interest, they are these:
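One way to pull up such records, assuming the split counties are exactly the rows whose polyname contains a ":". The sketch runs on a small made-up table (cnty_toy, with invented areas) that mimics the layout of cnty; on the real data you would swap in cnty.

```r
# Made-up rows mimicking the cnty layout (fips, polyname, area); areas are invented
cnty_toy <- data.frame(
  fips     = c(12091, 12091, 12093),
  polyname = c("florida,okaloosa:main", "florida,okaloosa:spit", "florida,okeechobee"),
  area     = c(900, 30, 800),
  stringsAsFactors = FALSE
)

# Keep only the rows whose polyname has a subregion suffix
split_rows <- cnty_toy[grepl(":", cnty_toy$polyname, fixed = TRUE), ]

print(split_rows)
```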
Edit: The only logic I can think of at the moment is this:
For each polyname that includes a ":", aggregate based on what is to the left of the ":" by summing the areas. Return only what is to the left of the ":". How do I turn that into an R function I can apply to this data frame?
Such clarity. I should spend more time on the throne.
The problem with the solution I was anticipating is that I saw it purely as an aggregation problem. Instead, I first need to transform 'cnty' by replacing each polyname that contains a ":" with what is to the left of the separator. Some polynames will then be identical, and a standard aggregation call with summation does the rest. This should be sufficient!
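A minimal sketch of that two-step idea, using sub() to strip the subregion and aggregate() to sum the areas. I'm running it on a small made-up table (cnty_toy, with invented areas) rather than the real cnty, so treat it as an illustration of the approach rather than tested output on the full county data.

```r
# Made-up rows mimicking the cnty layout (fips, polyname, area); areas are invented
cnty_toy <- data.frame(
  fips     = c(12091, 12091, 12093),
  polyname = c("florida,okaloosa:main", "florida,okaloosa:spit", "florida,okeechobee"),
  area     = c(900, 30, 800),
  stringsAsFactors = FALSE
)

# Step 1: drop the ":" and everything after it; names without a ":" are unchanged
# (sub() converts factors to character, so this also works if polyname is a factor)
cnty_toy$polyname <- sub(":.*$", "", cnty_toy$polyname)

# Step 2: sum the areas of rows that now share the same fips and polyname
cnty_agg <- aggregate(area ~ fips + polyname, data = cnty_toy, FUN = sum)

print(cnty_agg)
```

Since only the fips links back to the data, aggregating with area ~ fips alone would also do. Keeping polyname in the formula just preserves the readable name.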