An R learning project (feel free to learn with me)

Dason

Ambassador to the humans
#42
Code:
'data.frame':	7345 obs. of  14 variables:
 $ adminstrator: chr  "ACTING SUPERINTENDENT - DR. RAYMOND COLUCCIELLO" "PRINCIPAL - MR. KEN LEIN" "PRINCIPAL - MS. VIBETTA  SANDERS" "PRINCIPAL - MR. THOMAS GIGLIO" ...
 $ school      : chr  "ALBANY CITY SD" "MONTESSORI MAGNET SCHOOL" "PINE HILLS ELEMENTARY SCHOOL" "DELAWARE COMMUNITY SCHOOL" ...
 $ beds        : chr  "010100010000" "010100010014" "010100010016" "010100010018" ...
 $ address     : chr  "ACADEMY PARK" "65 TREMONT ST" "41 N ALLEN ST" "43 BERTHA ST" ...
 $ city        : chr  "ALBANY" "ALBANY" "ALBANY" "ALBANY" ...
 $ state       : chr  "NY" "NY" "NY" "NY" ...
 $ zip         : chr  "122071099" "122061798" "122031601" "122092102" ...
 $ area        : chr  "518" "518" "518" "518" ...
 $ phone       : chr  "4756010" "4756675" "4756725" "4756750" ...
 $ record      : Factor w/ 5 levels "1","2","3","4",..: 3 1 1 1 1 1 1 1 1 1 ...
 $ grade       : Factor w/ 9 levels "0","1","2","3",..: NA 2 2 2 2 2 2 2 2 2 ...
 $ locations   : chr  "ACADEMY PARK+ALBANY+NY+122071099" "65 TREMONT ST+ALBANY+NY+122061798" "41 N ALLEN ST+ALBANY+NY+122031601" "43 BERTHA ST+ALBANY+NY+122092102" ...
 $ lat         : num  42.7 42.7 42.7 42.6 42.7 ...
 $ lng         : num  -73.7 -73.8 -73.8 -73.8 -73.8 ...
Looks good to me.
 

trinker

ggplot2orBust
#43
I uploaded the geocoded data (post #41) as a zip that extracts to a .RData file. Here is the current script after geocoding, in which I read in the .xls forms from NYS using read.xls from gdata (a first for me). We are on step three, getting demographic data, and will soon be moving to step 4 of this learning project.

Code:
# browseURL("http://www.r-chart.com/2010/07/maps-geocoding-and-r-user-conference.html")
# browseURL("https://github.com/hadley/ggplot2/wiki/Crime-in-Downtown-Houston,-Texas-:-Combining-ggplot2-and-Google-Maps")
# browseURL("http://www.nysed.gov/admin/SEDdir.txt")#The Dta File 1
# browseURL("http://www.nysed.gov/admin/LayOut.txt")#The Key
# browseURL("http://www.nysed.gov/admin/LayOut.txt")#Testing data files

#  save(dat, file = "dat.RData")
#  load("dat.RData")

##########################################################################################
# STEP 1: READ IN THE ADDRESSES, BEDS, AND DEMOGRAPHICS OF ALL SCHOOLS IN NEW YORK STATE #
##########################################################################################
path <- "http://www.nysed.gov/admin/SEDdir.txt"
cnames <- c("adminstrator", "school", "beds", "address", "city", "state", 
              "zip", "area", "phone", "record", "grade")
classes  <- c(rep("character", 9), rep("factor", 2))

dat <- read.csv(file = url(path), header = FALSE, strip.white = TRUE, sep= ",", 
           na.strings= c(" ", ""),stringsAsFactors = FALSE, col.names = cnames, 
           colClasses = classes)

dat$locations <- as.character(with(dat, paste(address, city, state, zip, sep="+")))
NAer <- function(x) which(is.na(x)) #use to locate missing data
lapply(dat, NAer)   #use to locate missing data
dat <- dat[-283, ]   #remove observation 283
###########################################################################################
#                 STEP 2: CONVERT THE ADDRESS TO LAT AND LONGITUDE.                       #
###########################################################################################
# Many geocoding websites limit you to 2500 requests per day.                             #
# We will utilize two geo coding websites to work through the                             #
# problem faster.  The two sites we will be utilizing:                                    #
#                                                                                         #
# Google API:                                                                             #
# browseURL("http://code.google.com/apis/maps/articles/geocodestrat.html")                #
#                                                                                         #
# University of Southern California's geocoding web services:                             #
# browseURL("https://webgis.usc.edu/Services/Geocode/WebService/GeocoderWebService.aspx") #
# NOTE: This site requires you sign up for and use an api.key that you must provide to    #
# the csus_geocode function.                                                              #
###########################################################################################
# GOOGLE'S API GEOCODING FUNCTION #
###################################
google_geocode <- function(ADDRESS){
    require(XML); require(RCurl)
    requestXML <- function(address) {
        URL <- paste("http://maps.google.com/maps/api/geocode/xml?sensor=false&address=", address, sep = "")
        xmlRequest <- getURL(URL)
        xml <- xmlToList(xmlRequest)
        return(xml)
    }

    A2 <- gsub("+", ",+", ADDRESS, fixed = TRUE)
    A3 <- gsub(" ", "+", A2, fixed = TRUE)
    x <- requestXML(A3)
    y <- c(lat = as.numeric(x$result$geometry$location$lat), lng = as.numeric(x$result$geometry$location$lng))
    return(y)
} #end of function
###########################################################################################
# df <- dat[sample(1:nrow(dat), 20), ]  #test it with 20 first
# df <- dat[, ]  #Select specific rows (chunks for geocoding)
addresses <- df$locations
latitude <- vector("numeric", length(addresses))
longitude <- vector("numeric", length(addresses))

for (n in seq(addresses)) {
  x <- google_geocode(addresses[n])
  latitude[n]  <- x[1]
  longitude[n] <- x[2]
  Sys.sleep(0.1)  # Suspend requests for a tenth of a second
}  
 
df <- cbind(df, lat = latitude, lng = longitude)
###########################################################################################
# UNIVERSITY OF SOUTHERN CALIFORNIA'S GEOCODING FUNCTION #
##########################################################
csus_geocode <- function(ADDRESS, api.key = "INSERT YOUR OWN API CODE HERE"){
    require(XML); require(RCurl)
    ADDRESS <- unlist(strsplit(ADDRESS, "+", fixed = TRUE))
    
    geocode_url <- function(address, city, state, zip, api.key){ #function to make urls
        address <- gsub(" ", "+", address, fixed = TRUE)
        city <- gsub(" ", "+", city, fixed = TRUE)
        zip <- substring(zip, 1, 5)
        z <- "http://webgis.usc.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V02_96.aspx?"
        x <- paste(z, "streetAddress=", address, "&city=", city, "&state=", state, "&zip=", 
            zip, "&census=false&format=xml&version=2.96&apiKey=", api.key, sep="") 
        return(x)
    } #end of url creation helper function

    requestXML <- function(URL) { #address to coordinates function
        xmlRequest <- getURL(URL)
        xmlRequest <- substr(xmlRequest, 4, nchar(xmlRequest)) #strip the garbage characters at the beginning of the response
        xml <- xmlToList(xmlRequest)
        return(xml)
    }#end of address to coordinates function

    URL <- geocode_url(address = ADDRESS[1], city = ADDRESS[2], state = ADDRESS[3], zip =ADDRESS[4], 
        api.key = api.key)
    x <- requestXML(URL)
    y <- c(lat = as.numeric(x$OutputGeocode$Latitude), lng = as.numeric(x$OutputGeocode$Longitude))
    return(y)
} #end of function
#####################################################################################################
# df <- dat[sample(1:nrow(dat), 20), ]  #test it with 20 first
# df <- dat[, ]  #Select specific rows (chunks for geocoding)
addresses <- df$locations
latitude <- vector("numeric", length(addresses))
longitude <- vector("numeric", length(addresses))

for (n in seq(addresses)) {
  x <- csus_geocode(addresses[n])
  latitude[n]  <- x[1]
  longitude[n] <- x[2]
  Sys.sleep(0.1)  # Suspend requests for a tenth of a second
}  # end of loop

df <- cbind(df, lat = latitude, lng = longitude)
#####################################################################################################
#              GEOCODING A WAITING GAME :)                             #
########################################################################
df <- dat[1:4500, ]  #Select specific rows (chunks for geocoding)
addresses <- df$locations
latitude <- vector("numeric", length(addresses))
longitude <- vector("numeric", length(addresses))

START <- Sys.time()
for (n in seq(addresses)) {
  x <- csus_geocode(addresses[n])
  latitude[n]  <- x[1]
  longitude[n] <- x[2]
  Sys.sleep(0.1)  # Suspend requests for a tenth of a second
}  # end of loop

df1 <- cbind(df, lat = latitude, lng = longitude)
END <- Sys.time(); END - START #Timing purposes
save(df1, file="PART1.RData")
load("PART1.RData")
##############################################################################################
df <- dat[4501:nrow(dat), ]  #Select specific rows (chunks for geocoding)
addresses <- df$locations
latitude <- vector("numeric", length(addresses))
longitude <- vector("numeric", length(addresses))

START <- Sys.time()
for (n in seq(addresses)) {
  x <- csus_geocode(addresses[n])
  latitude[n]  <- x[1]
  longitude[n] <- x[2]
  Sys.sleep(0.1)  # Suspend requests for a tenth of a second
}  # end of loop

df2 <- cbind(df, lat = latitude, lng = longitude)
END <- Sys.time(); END - START 
save(df2, file="PART2.RData")
################################################################################################
# MERGING THE TWO DATA SETS AND SAVING TO .RDATA  #
###################################################

load("PART1.RData"); load("PART2.RData")
geodat <- as.data.frame(rbind(df1, df2))
qview(geodat)
save(geodat, file="geodat.RData")

################################################################################################
# STEP 3 GATHERING MORE DATA (ACADEMIC ACHIEVEMENT) #
#####################################################
load("geodat.RData")
mathURL <- "http://www.p12.nysed.gov/irs/ela-math/2011/2011Grades3-8MathDistrictandBuildingAggregates_Media.xls"

classes  <- c(rep("character", 10))
cnames <- c("beds", "name", "subject", "changed", "n.math", "lv1m", "lv2m", "lvl3m", "lvl4m", "mean.m")
library(gdata)
math <- read.xls(xls= mathURL, sheet=1, header=FALSE, strip.white=T, sep=",",as.is=FALSE, 
           na.strings=c("999","NA"," "), stringsAsFactors = FALSE, col.names = cnames,
           colClasses = classes, skip = 1, method="csv")

elaURL <- "http://www.p12.nysed.gov/irs/ela-math/2011/2011Grades3-8ELADistrictandBuildingAggregates_Media.xls"
classes  <- c(rep("character", 10))
cnames <- c("beds", "name", "subject", "changed", "n.ela", "lv1e", "lv2e", "lvl3e", "lvl4e", "mean.e")
ela <- read.xls(xls= elaURL, sheet=1, header=FALSE, strip.white=T, sep=",",as.is=FALSE, 
           na.strings=c("999","NA"," "), stringsAsFactors = FALSE, col.names = cnames,
           colClasses = classes, skip = 1, method="csv")

save(ela, math, file="ELA_MATH.RData")
######################################################
# STILL SOME CLEANING TO DO BUT THAT'S FOR TOMORROW. #
######################################################
Note: I listed both geocoding websites, but the flexibility and accuracy of the University of Southern California service proved superior for multiple reasons, so it was the sole API used to geocode.
 

bryangoodrich

Probably A Mammal
#44
What demographics are you interested in? I know a few sources in this area. As for what sort of maps you want to make, concentrate on that once you're comfortable using the tools for this project to simply create a dot (pin) map of the locations you've obtained. From there you can adjust what you're plotting to show the things you're interested in. I'm too tired now, but I'll update with some cool stuff I found, since I now have to start thinking about my own project and what I'm going to do.
 

trinker

ggplot2orBust
#45
bryangoodrich said:
What demographics are you interested in? I know a few sources in this area.
There's some demographic information in the VADIR reports on the NYS website and in the school report cards (race, gender, teacher makeup, teacher experience, per-pupil expenditures, poverty indexes, etc.), but I think I have a lot of demographic data in the original data set, and the academic information I grabbed above also has the potential to be very interesting (plus all the data I'm talking about is in an MS Access file, which isn't impossible to import but can be a pain).

I want this to be interesting and to reveal something cool, but I'm more interested in the process of doing this than in the product.
 

trinker

ggplot2orBust
#47
I had seen some discussion on SO (even from Dirk, who is very talented) that made it seem like getting RODBC to work with Access was a voodoo trick. I will attempt that at the end of this project, as I don't want to get bogged down (I have zero experience with RODBC).
 

bryangoodrich

Probably A Mammal
#48
Most databases support Open Database Connectivity (ODBC). The idea of the standard is that it is platform and system independent. It may not expose the full functionality a given database provides, but it is a standard library for accessing databases. Access has ODBC drivers; RODBC works with Access, and I think there's a "wrapper" for connecting to Access specifically (see odbcConnectAccess). From there, it's just basic SQL, which can simplify your processing if you want to do it more efficiently (use WHERE and GROUP BY clauses to subset and aggregate information before loading it into R, for instance). SQL is one of the easiest languages to learn, really. If you need help with SQL, let me know. I'm not amazing with it, but I know enough.
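The basic pattern is only a few lines. A rough sketch (the file, table, and field names here are made up for illustration, and odbcConnectAccess needs the 32-bit Access driver):

Code:
library(RODBC)
con <- odbcConnectAccess("schools.mdb")   # hypothetical .mdb file
# let the database subset and aggregate before pulling anything into R
demog <- sqlQuery(con, "SELECT beds, COUNT(*) AS n_students
                        FROM enrollment
                        WHERE school_year = 2011
                        GROUP BY beds")
odbcClose(con)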
 

trinker

ggplot2orBust
#49
Ready for Step 4 (Plotting data) :)
I have finished step 3, which was grabbing, cleaning, and merging some academic data to incorporate with the geocoded data. We're ready to start plotting some data. I wanted to learn to use some simpler-looking maps from the maps package before jumping straight to Google Maps. In some visualizations the simpler map can actually be better, so I want to learn how to use it as well.

The script for this project is becoming too large to post, so from now on I will only post the code for the current step I'm working through. I'll attach the completed script in a zipped folder along with the geocoded data set (the assessment one is too large, so you'll have to read it in yourself).

Code:
################################################################################################
# STEP 3 GATHERING MORE DATA (ACADEMIC ACHIEVEMENT) #
#####################################################
#############################################
# IMPORT AND CLEAN THE MATH ASSESSMENT DATA #
#############################################
library(gdata)
mathURL <- "http://www.p12.nysed.gov/irs/ela-math/2011/2011Grades3-8MathDistrictandBuildingAggregates_Media.xls"
classes  <- c(rep("character", 10))
cnames <- c("beds", "name", "subject", "changed", "n.math", "lv1m", "lv2m", "lvl3m", "lvl4m", "mean.m")

math <- read.xls(xls= mathURL, sheet=1, header=FALSE, strip.white=T, sep=",",as.is=FALSE, 
           na.strings=c("999","NA"," "), stringsAsFactors = FALSE, col.names = cnames,
           colClasses = classes, skip = 1, method="csv")

math[ "subject"] <- as.factor(gsub("Grade ", "", gsub(" Math", "", math[, c("subject")])))
names(math)[3] <- "grade"
math[4] <- NULL  #drop the changed column
math[, 4:9] <- lapply(math[, 4:9], as.numeric)

############################################
# IMPORT AND CLEAN THE ELA ASSESSMENT DATA #
############################################
elaURL <- "http://www.p12.nysed.gov/irs/ela-math/2011/2011Grades3-8ELADistrictandBuildingAggregates_Media.xls"
classes  <- c(rep("character", 10))
cnames <- c("beds", "name", "subject", "changed", "n.ela", "lv1e", "lv2e", "lvl3e", "lvl4e", "mean.e")
ela <- read.xls(xls= elaURL, sheet=1, header=FALSE, strip.white=T, sep=",",as.is=FALSE, 
           na.strings=c("999","NA"," "), stringsAsFactors = FALSE, col.names = cnames,
           colClasses = classes, skip = 1, method="csv")

ela[ "subject"] <- as.factor(gsub("Grade ", "", gsub(" ELA", "", ela[, c("subject")])))
names(ela)[3] <- "grade"
ela[2] <- NULL; ela[3] <- NULL #drop the school name and changed columns
ela[, 3:8] <- lapply(ela[, 3:8], as.numeric)


assess.dat <- merge(math, ela, by = c("beds", "grade"))
#  save(assess.dat, ela, math, file="ELA_MATH.RData")
################################################################################################
# STEP 4 PLOT THE DATA ON A SIMPLE MAP USING: ggplot2, maps, and mapproj #
##########################################################################
 

bryangoodrich

Probably A Mammal
#51
LOL Sometimes Oracle can make me angry. (The other reason is that they refused to support what is now called LibreOffice.) It is true, though, that they do different things. I'd opt for a database to handle large data and basic operations just because it would be more efficient at it. Databases also handle indexing and such. While data.table supposedly has benefits, I haven't seen them show up in the case where we were dealing with those hash tables. Thus, databases are your friend when you need to merge, search/subset, or aggregate very large sets of data, not to mention that you can organize your data more appropriately into normalized tables. R isn't made for these tasks, but that is why you interface with the DB. From that interface, you can send the desired requests, let the DB do the heavy work, and then operate on the result sets to do your analyses. For manageable data sets, R is far superior in its ability to manage complex data types (e.g., factors that store codes and labels as one object as opposed to a simple lookup table). But I digress!

Doing basic stuff with maps isn't that hard. I'd avoid the nice interface ggplot2 offers to the maps package at first; that way, you can see later how much ggplot2 simplifies the task! The documentation for the package has some basic examples, though. Hell, example(maps) might suffice!
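For the dot (pin) map itself, something like this should be plenty to start with (a sketch only; I'm assuming your geocoded data frame, geodat, with its lng/lat columns):

Code:
library(maps)
map("state", "new york", fill = TRUE, col = "grey90")              # NYS outline
points(geodat$lng, geodat$lat, pch = 20, cex = 0.4, col = "red")   # one dot per school
title("Geocoded NYS schools")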
 

trinker

ggplot2orBust
#52
I think I'll take your advice and start with the basics in maps, then move to ggplot2 as I progress. Eventually I want to color in some counties on the NYS map with colors representing something (lots of options). I saw this done a while back and fell in love (it was some sort of challenge for mapping unemployment across the entire country). Apparently this is called a choropleth, as in the two approaches seen below. Anyway, I think I'll start small in maps by plotting the schools on the map.

PS I included the code for this map that I got from LINK.


Code:
unemployment = read.table("http://www.stanford.edu/class/stats202/data/unemployment.csv", header=TRUE, sep='|')
names(unemployment)
library(maps)
library(ggplot2)

county_df = map_data('county')
state_df = map_data('state')

county_df$state = county_df$region
county_df$county = county_df$subregion

unemployment_jul10 = unemployment[unemployment$period=='Jul-10',]
map_data = merge(county_df, unemployment_jul10, by=c('state', 'county'))
map_data = map_data[order(map_data$order),]
map_data$rate_d = cut(map_data$rate, breaks=c(seq(0,10,by=2),35))
(ggplot(map_data, aes(long, lat, group=group)) +
   geom_polygon(aes(fill=rate_d), colour=alpha('white', 1/2), size=0.2) +
   geom_polygon(data=state_df, colour='white', fill=NA) +
   scale_fill_brewer(pal='PuRd'))
state_df$avg_unemployment_jul10 = numeric(nrow(state_df))
sorted_states = sort(unique(map_data$state))
for (state in sorted_states) {
    state_subs = unemployment_jul10$state==state
    state_df_subs = state_df
    state_df$avg_unemployment_jul10[state_df$region == state] = mean(unemployment_jul10$rate[state_subs])
}
state_df$rate_d = cut(state_df$avg_unemployment_jul10, breaks=c(seq(0,10,by=2),35))

(ggplot(state_df, aes(long, lat, group=group)) +
   geom_polygon(aes(fill=rate_d), colour=alpha('white', 1/2), size=0.2) +
   scale_fill_brewer(pal='PuRd'))

Code:
#     browseURL("https://gist.github.com/233134")
library(ggplot2)
library(maps)

# First (and most annoying) task - get matching state and county variables 
# for both datasets.  And unfortunately it's not quite right, as you can
# see from the finished product - some counties are missing.
path <- "http://datasets.flowingdata.com/unemployment09.csv"
unemp <- read.csv(file=url(path), header = F, stringsAsFactors = F)
names(unemp) <- c("id", "state_fips", "county_fips", "name", "year", 
  "?", "?", "?", "rate")
unemp$county <- tolower(gsub(" County, [A-Z]{2}", "", unemp$name))
unemp$state <- gsub("^.*([A-Z]{2}).*$", "\\1", unemp$name)

county_df <- map_data("county")
names(county_df) <- c("long", "lat", "group", "order", "state_name", "county")
county_df$state <- state.abb[match(county_df$state_name, tolower(state.name))]
county_df$state_name <- NULL



state_df <- map_data("state")

# Combine together 
choropleth <- merge(county_df, unemp, by = c("state", "county"))
choropleth <- choropleth[order(choropleth$order), ]
# Discretise rate to use with Brewer colour scheme - many options here
# choropleth$rate_d <- cut_number(choropleth$rate, 5)
# choropleth$rate_d <- cut_interval(choropleth$rate, 5)
# Nathan's choice is a little odd:
choropleth$rate_d <- cut(choropleth$rate, breaks = c(seq(0, 10, by = 2), 35))

# Once you have the data in the right format, recreating the plot is straight
# forward.

ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = rate_d), colour = alpha("white", 1/2), size = 0.2) + 
  geom_polygon(data = state_df, colour = "white", fill = NA) +
  scale_fill_brewer(pal = "PuRd")

# Takes a while to draw because ggplot2 not very efficient with large numbers
# of polygons :(
 

bryangoodrich

Probably A Mammal
#53
Using the help page for map and my old blog about grabbing the actual unemployment data, I made a quick script to create a similar map (the bells and whistles can be added later). It looks like I didn't have any matches from the county.fips data set (from the package) for California and some other states? Odd.

EDIT: It looks like county.fips is missing the leading '0' in its FIPS codes because they're stored as integers. Thus, I convert my factors to numeric representations on the fly; this removes that leading '0' and obtains the rest of the matches (before, some states were entirely uncolored).

NOTE: I did not include the mapproj adjustment that is required to make this a more appropriate choropleth map. In fact, it would have made it easier for me to place the title and legend than what I ended up doing. Just load the package and add "projection = 'polyconic'" to the map statement to see the (big) difference.

Code:
library(maps)
library(RColorBrewer)


# ========== Obtain 2010 Unemployment Statistics ==========
importBLS <- function(year, ...) 
# Accesses county unemployment data from the BLS FTP site
#
# Arguments:
#     year - '2-digit' character representation of the desired year. Valid for 1990 to 2010.
#     ...  - additional arguments passed to read.fwf
#
# Returns:
#     A data.frame containing cleaned up BLS data
{
    infile <- paste("ftp://ftp.bls.gov/pub/special.requests/la/laucnty", year, ".txt", sep = "")
    df <- read.fwf(url(infile), skip = 6, strip.white = TRUE, ..., 
                   col.names  = c("series_id", "sFIPS", "cFIPS", "name", "year", "labor", "emp", "unemp", "unrate"),
                   colClasses = c("factor", rep("character", 3), "factor", rep("character", 3), "numeric"),
                   widths     = c(8, 5, 8, 53, 4, 14, 13, 11, 9))

    df <- transform(df,
      FIPS  = as.numeric(paste(sFIPS, cFIPS, sep = "")),
      labor = as.numeric(gsub(",", "", labor)),     # Remove formatted strings that include ","
      emp   = as.numeric(gsub(",", "", emp)),
      unemp = as.numeric(gsub(",", "", unemp))
    );  # end transform

    return(df)
}  # end function

df <- importBLS("10")  # Takes nearly 10 seconds to complete



# ========== Prepare Map Data ==========
data(county.fips)

df$bins <- as.numeric(cut(df$unrate, c(seq(0, 10, 2), 100)))  # Classification
pal <- brewer.pal(6, "BuPu")                                  # Color palette
leg.txt <- c("< 2", "2-4", "4-6", "6-8", "8-10", "> 10")      # Legend text
matches <- df$bins[match(county.fips$fips, df$FIPS)]          # Matched Classes



# ========== Map Data ==========
map("county", col = pal[matches], fill = TRUE, bg = "grey95", resolution = 0, lty = 1)
mtext("Unemployment Rate by County", line = 3, cex = 1.2)
mtext("United States, 2010", line = 2)
legend("bottomright", leg.txt, fill = pal, bty = 'n', cex = 0.8, inset = 0.03)
 

bryangoodrich

Probably A Mammal
#54
Here's an alternative map statement

Code:
library(mapproj)
map("county", col = pal[matches], fill = TRUE, 
    bg = "grey95", projection = "polyconic")
mtext("Unemployment Rate by County", cex = 1.2)
mtext("United States, 2010", line = -1)
legend("bottom", leg.txt, fill = pal, bty = 'n', cex = 0.8, inset = 0.1, horiz = TRUE)

If you're thinking about how to color your map, you should see what RColorBrewer uses as its basis: Color Brewer. This is an excellent resource for figuring out how to set up your choropleth map colors. Set your classification size and select HEX.
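The same palettes are available in R through RColorBrewer, so you can preview them and pull the HEX values without leaving your session:

Code:
library(RColorBrewer)
display.brewer.all()           # preview every Brewer palette
display.brewer.pal(6, "BuPu")  # preview a 6-class sequential palette
brewer.pal(6, "BuPu")          # the HEX values themselves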
 

bryangoodrich

Probably A Mammal
#55
I'm working on creating a "mapBLS" function with argument list (data, classes, palette, var) and possibly a 5th for 'classification.' The idea here is that I can generically specify the Color Brewer color palette and number of classes by which to make this choropleth map. This mimics the behavior found in ArcMap when you make the same map: you specify your classification method (I often use quantile), number of classes, and set your colors (I always turn to Color Brewer now). The user also specifies which variable to classify. I may also need to specify which field to match to the enumeration unit (in this case county FIPS).

I want it to also work for any arbitrary numeric vector. Thus, I need to create a classification function so the preparation section can just be

Code:
# Legend text and call should probably be separate from this function or passed as additional parameters
mapBLS <- function(data, classes, palette, var, merge) {
    data$bins <- classification(data[, var], classes)             # Classification
    pal <- brewer.pal(classes, palette)                           # Color palette
    matches <- data$bins[match(county.fips$fips, data[, merge])]  # Matched Classes
    ...
}  # end function

mapBLS(df, 6, "BuPu", "unrate", "FIPS")
Note: Using the importBLS function makes it easy to recreate this for any of the years. I'd like to simplify the mapping process so I can loop and create, say, 6 different plots on a 2x3 grid for comparison. This would be good for a process analysis (see the change in unemployment rate by county across the US over time). I think I might do something like this for my database project to meet, at a minimum (I'm doing way more), the additional requirement of "non-ArcGIS interfaces". Effectively, I'm creating a mapping tool through R that interfaces with the "database" (more like a data store) at the BLS FTP site.
 

bryangoodrich

Probably A Mammal
#56
Turned out easier than I thought once I realized how to use the quantile function in R. See below.

Code:
# Prepares classifications for a vector; can be extended to provide other classifications
classification <- function(x, n, method = "quantile") {
    cutoffs <- quantile(x, seq(0, 1, length = n + 1), na.rm = TRUE)  # n classes require n + 1 cutoffs
    cut(x, cutoffs, include.lowest = TRUE)                           # keep the minimum value in the first class
}  # end function


# ========== Obtain 2010 Unemployment Statistics ==========
importBLS <- function(year, ...) 
# Accesses county unemployment data from the BLS FTP site
#
# Arguments:
#     year - '2-digit' character representation of the desired year. Valid for 1990 to 2010.
#     ...  - additional arguments passed to read.fwf
#
# Returns:
#     A data.frame containing cleaned up BLS data
{
    infile <- paste("ftp://ftp.bls.gov/pub/special.requests/la/laucnty", year, ".txt", sep = "")
    df <- read.fwf(url(infile), skip = 6, strip.white = TRUE, ..., 
                   col.names  = c("series_id", "sFIPS", "cFIPS", "name", "year", "labor", "emp", "unemp", "unrate"),
                   colClasses = c("factor", rep("character", 3), "factor", rep("character", 4)),
                   widths     = c(8, 5, 8, 53, 4, 14, 13, 11, 9))

    df <- transform(df,
      FIPS   = as.numeric(paste(sFIPS, cFIPS, sep = "")),
      labor  = as.numeric(gsub(",", "", labor)),     # Remove formatted strings that include ","
      emp    = as.numeric(gsub(",", "", emp)),
      unemp  = as.numeric(gsub(",", "", unemp)),
      unrate = as.numeric(unrate)
    );  # end transform

    return(df)
}  # end function




mapBLS <- function(df, classes, palette, var, link, title = NULL, subtitle = NULL) {
    require(maps)
    require(RColorBrewer)
    require(mapproj)

    # ========== Prepare Map Data ==========
    data(county.fips)

    df$bins <- classification(df[, var], classes)            # Classification
    pal     <- brewer.pal(classes, palette)                  # Color palette
    matches <- df$bins[match(county.fips$fips, df[, link])]  # Matched Classes



    # ========== Make Map ==========
    map("county", col = pal[as.numeric(matches)], fill = TRUE, 
        bg = "grey95", projection = "polyconic")
    if (!is.null(title))    mtext(title, cex = 1.2)
    if (!is.null(subtitle)) mtext(subtitle, line = -1)
    legend("bottom", levels(matches), fill = pal, bty = 'n', cex = 0.8, inset = 0.1, horiz = TRUE)
}

dl <- vector('list', 6)
dl[[1]] <- importBLS("05")
dl[[2]] <- importBLS("06")
dl[[3]] <- importBLS("07")
dl[[4]] <- importBLS("08")
dl[[5]] <- importBLS("09")
dl[[6]] <- importBLS("10")

png("UNRATE.png", 1920, 1200)
par(mfrow = c(2, 3))
lapply(seq(6), function(n) {
  mapBLS(dl[[n]], 6, "BuPu", "unrate", "FIPS", 
         title    = "Unemployment Rate by County",
         subtitle = paste("United States,", switch(n, 2005, 2006, 2007, 2008, 2009, 2010))
  )  # end mapBLS
  return(0)
})  # end lapply
dev.off()
EDIT: I had to add na.rm to the quantile call because some of the data has the character value "N.A." in it. I altered importBLS to account for this, also. I have now made an output image that does a 2x3 grid of these plots for the years 2005 to 2010. Nothing too amazing, but that's because the scale (quantile classification) is adjusted for each year. Thus, the geographic distribution of unemployment does not seem to change much, even though there was an obvious national increase.

I should also note that these graphics are NOT normalized. This is actually misleading, because larger counties will of course have more of pretty much everything. Thus, to get an accurate mapping you need to transform (normalize) the data by dividing it by the area of the enumeration unit. In this case, we need to divide each county unemployment rate (or whatever variable was chosen) by the area of that county. The image you will see will differ greatly! I need to see if the maps package includes that information. How to handle the legend after that, I don't know, because usually normalized values are hard to interpret (unemployment rate per square km?).

 

trinker

ggplot2orBust
#57
I have played with mapping. Sadly, I could not find good documentation on plotting coordinates using just map (I may lack the finding skills), so I went with ggplot2. I have the coordinates plotted and have played with supplying additional symbol adjustments (color by school type and a few basics). It was cool to see that map light up with the points I geocoded. :) Anyway, I attempted to facet the map by a grouping variable but failed, so I sent an email out to the ggplot2 help list.
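Roughly what the plotting looks like (just a sketch; I'm assuming the lng/lat and record columns from my geodat set):

Code:
library(ggplot2)
library(maps)
ny <- map_data("state", region = "new york")
ggplot() +
  geom_polygon(data = ny, aes(long, lat, group = group), fill = "grey90", colour = "white") +
  geom_point(data = geodat, aes(lng, lat, colour = record), size = 1) +
  coord_map("polyconic")   # needs mapproj installed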

I'll probably post an updated script and some images here tomorrow.

I want to create the choropleth as well. I'm going to do maps for ELA and math scores. Right now the scores are by grade level per school (I might be able to use district-level data too), but it will require aggregating by county (I got the county codes to parse from the BEDS code). The problem is that the data is in percentages. Not a problem: I can multiply the percentages by the n for each school (maybe district) to create new variables, n passing and n failing, then aggregate those counts by county and get a passing rate per county. That's what I'll do the choropleth with. Looking forward to it :)
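Roughly what I have in mind (a sketch only; the substring positions for the county code and treating levels 3-4 as passing are assumptions I still need to verify):

Code:
assess.dat$county <- substr(assess.dat$beds, 1, 2)                     # county code parsed from the BEDS code (positions assumed)
assess.dat$n.pass <- with(assess.dat, n.math * (lvl3m + lvl4m) / 100)  # percentages -> counts of passing students
cnty <- aggregate(cbind(n.pass, n.math) ~ county, data = assess.dat, FUN = sum)
cnty$pass.rate <- with(cnty, n.pass / n.math)                          # county passing rate for the choropleth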

bryangoodrich, I am open to plotting the coordinates with map, but I need a resource. A script you have, or a link to one, would be nice. I tried to find out how to plot coordinates, and it is probably simple, but without an example I'm in sad shape. I played with the match.map function a little bit, as this seems to be what I'd need for the choropleth. I liked the looks of your post #54 map better. Nice work. The new function also sounds pretty exciting. Is it able to be tried out by others yet?

And oh yeah, you mentioned Color Brewer. Until today I had never played with it much. It has some very nice features for preset palettes, and I liked it a lot. I did want to be able to control the palette too, so I found out that in ggplot2 you need to use scale_fill_manual.
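For example, continuing from the gist code above (the HEX values here are just a six-class set pulled from Color Brewer; swap in whatever you pick):

Code:
ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = rate_d), colour = alpha("white", 1/2), size = 0.2) +
  scale_fill_manual(values = c("#F1EEF6", "#D4B9DA", "#C994C7",
                               "#DF65B0", "#DD1C77", "#980043"))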

Anyway, I have to work all day tomorrow and get ready for my little girl's 1st birthday party on Saturday :) so I may not get to do much with plotting. Once I get comfortable with ggplot2 and maps, I'll make the choropleth and then move on to using Google Maps. I see there are two approaches to layering with Google Maps.
 

bryangoodrich

Probably A Mammal
#58
Well, I updated the above. I haven't tested it extensively, yet. It seems to work for the unemployment rate. Trying to replicate this in ArcGIS would be time consuming! Granted, it would probably be easier with a Python script using a template layout, which I may do for my Python class final assignment ... hmm.
 

trinker

ggplot2orBust
#59
I don't know if it affects anything but it's giving me errors at this point:

Code:
> dl <- vector('list', 6)
> dl[[1]] <- importBLS("05")
Warning messages:
1: In eval(expr, envir, enclos) : NAs introduced by coercion
2: In eval(expr, envir, enclos) : NAs introduced by coercion
3: In eval(expr, envir, enclos) : NAs introduced by coercion
4: In eval(expr, envir, enclos) : NAs introduced by coercion
> dl[[2]] <- importBLS("06")
Warning messages:
1: In eval(expr, envir, enclos) : NAs introduced by coercion
2: In eval(expr, envir, enclos) : NAs introduced by coercion
3: In eval(expr, envir, enclos) : NAs introduced by coercion
4: In eval(expr, envir, enclos) : NAs introduced by coercion
EDIT: It worked bryan. That is slick!
 

bryangoodrich

Probably A Mammal
#60
I could have suppressed those, but like I said, some of the values in the BLS data are "N.A.". I could have (should have?) added that as the na.strings parameter in read.table. Instead, they just get coerced to NA, as the warning indicates. Bad design on my part, but I was just looking for a usable solution to get it done tonight! I'll probably check whether the na.strings parameter is sufficient to avoid errors and warnings.
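Something like this inside importBLS should do it (untested; I'm assuming the literal string in the files is "N.A."):

Code:
df <- read.fwf(url(infile), skip = 6, strip.white = TRUE, na.strings = c("N.A.", "N.A"),
               col.names  = c("series_id", "sFIPS", "cFIPS", "name", "year",
                              "labor", "emp", "unemp", "unrate"),
               widths     = c(8, 5, 8, 53, 4, 14, 13, 11, 9))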