An R learning project (feel free to learn with me)

trinker

ggplot2orBust
I realized today I seem to learn the most when I have some kind of project that stretches my abilities beyond where they are currently. Like many of you I'm an avid R user and like you I have several things in R I've been meaning to learn and have put off. Well today I came across a Wickham plot (LINK) that inspired me to undertake a project that will require me to stretch and grow.

1. I'm going to need your help
2. I think this could be a great learning experience for everyone.
3. It helps keep me accountable to accomplish this
4. It'll keep everything organized and tidy

The goal(s) is to learn to get data from several websites and use google maps and the ( maps package) (wanted to learn how to do it with both but the google maps one looked pretty professional) and R to plot information from the data set. Specifically I have decided to use schools data from New York State (my home state) and plot the location of the schools and perhaps some different plotting symbols to represent difference in some demographics information.

Here's my plan of attack:
1. Write a function to scrape (I think this is scraping but may not be called that) the school names and their addresses to a nice data frame (LINK). This will require looping through each county and extracting the information. Bryangoodrich shared a similar data retrieval loop a few months back (LINK).
2. Convert the address to lat and longitude. I'm thinking of using the following method:

Code:
require(XML)
doc = xmlTreeParse(url)
root = xmlRoot(doc)
lat = xmlValue(root[['result']][['geometry']][['location']][['lat']])
long = xmlValue(root[['result']][['geometry']][['location']][['lng']])
return(c(lat , long))
}

addr3 = c('1600 Pennsylvania Avenue, Washington, DC' ,
'W Ball Rd & S Disneyland Dr, Anaheim, CA')
3. Read in some other interesting data files such as test info or demographics from the net (The csv should be straight forward but one file is in Microsoft access format so I'd have to figure out how that's done; probably I'll stick with the .csv file). Data found here (LINK) and here (LINK).
4. Use ggplot to plot the data (addresses turned lat and long) on a simple New York state map from the maps package as shown here (LINK). This will require converting the long and lat to a grid using ( mapproj)
5. Repeat # 4 but interfacing with google maps as seen here (LINK).
6. On the maps (from steps 4 and 5) combine the address plots with some interesting demographic data just to add some flare to the map using

If you have any ideas or suggests now or along the way please share. This is just a general plan of attack. As I work through the problem I'll update you and ask for help when I need it.

bugman

Super Moderator
I was trying to do something similar at work on Thursday to plot mine sites in the state.

Thanks fo this great idea trinker. I'll certainly follow this thread, but i think you code is leagues ahead of mine so my relationship is likley to be more parasitic than symbiotic.

Nice job.

P

trinker

ggplot2orBust
alright this is real life learning so....

The link I provided for all the schools in NY and addresses was incorrect. That's just special needs schools.

So I need to find the correct website/data source.

I'll update the progress

UPDATE:
I couldn't find the file in a nice fixed width or csv format. I found it as a pdf found here (LINK). I don't know how to read in a pdf or if it's possible in the format it's in.

I have 4 options the way I see it:
1. Cut and paste the file into a txt file, import it into excel, clean and import into R
2. Copy to txt file and write an R script to address irregularities in the file and clean it (similar to the chat box function Dason wrote here (LINK)
)
3. Figure out how to read in from a pdf (I've heard of this being done)
4. Keep searching and find a better file format with school names and addresses

bryangoodrich

Probably A Mammal
Wow, that looks awesome. I'm definitely down to do this, since I could entirely do part of my database project this semester from R then! I'm automating grabbing local (Sacramento) crime statistics (ZIP archives containing CSVs). I don't know exactly what they contain, but I can automate the geocoding if necessary. I would definitely like to recreate the sort of heat mapping that you showed here.

One note, I'd use JSON instead of XML on that return object. JSON is easier to parse I find, and frankly I want XML to die and be replaced with it lol That's just my preference though

bryangoodrich

Probably A Mammal
The problem with reading PDFs is that they're not formatted. Something may appear as a table, but it really isn't. I would first try to copy-paste it into a text file to get an idea of how it comes out. If there is some structure to it, then you can of course figure out how to parse that. But grabbing data out of a PDF is generally a pain. I'd keep looking for more data, just in case there is a better format of it out there. For instance, I copy-pasted the text and it comes out in spurts. If it were uniform, you could have read it into a matrix with a certain structure, but as is, I'm not sure.

EDIT: Found it. I'll leave it up to you to find it, as you need practice doing this. First tip: Never look for the pdf (they suck!) lol

To facilitate the processing, once you have the data read in you'll want to use the layout file to give appropriate names (know your data; it doesn't have headers) and then grab the address information. If you want to select based on certain parameters, you can of course subset the data. Assuming it is stored in 'x'

Code:
names(x) <- c("officer", "school", "beds", "address", "city", "zip", "area", "phone", "record", "grade")
locations <- transform(x, location = paste(gsub("[[:space:]]", "+", address), city, state, substr(zip, 1, 5), sep = "+"))$location locations <- as.character(locations) Do you see what I did there? Do you recall why I'm putting "+" in there? Are there any records that you should subset 'x' by on principle? (The answer is yes). With locations, you can now geocode them and if you keep the ID's you can tie them back to the original data set for any attribute references you may require. bryangoodrich Probably A Mammal I'll definitely be working with this later as I'm collecting my data this weekend for my project. One thing about your coordinate function, besides using JSON like I suggested, I wouldn't throw away the data. You're making the request and accessing the geometry. Great. But what if you get a status not equal to OK? What if you get a multiple-point return? In any case, there's a lot of good information, and it parses your address string for you (into street name, city, etc.). Therefore, depending on the situation, you may want to just read in the entire request, store it, and then make use of it. So if you know how many addresses you're geocoding, create a list that size. Then fill the list sequentially with each of the calls returning a parsed JSON string. Code: library(RJSONIO) addresses <- scan("addresses.txt", what = "character") # A vector of addresses geocoded <- vector("list", nrow(addresses)) # The final object framework for (address in seq(addresses)) geocoded[[i]] <- fromJSON(requestJSON(addresses[i])) Here we just have to craft requestJSON as a function that calls the Google Geocoding API (possibly with a delay, less we get bad feedback as I discovered) using the address string, possibly replacing whitespaces with "+" marks. There may even be methods in the RCurl library to do that for you. The final object "geocoded" here would be a list, each element another list containing the relevant JSON elements. We would request the geometry in a similar fashion Code: geocoded[[add_index]]$results[[1]]$geometry$location  # Returns a 2-point named vector
The only index that matters is the add_index. The results list element is a list of 1 element, which is why I leave it as '1'. Now which is easier? The XML or JSON?

(To be fair, you can parse the whole XML into a list object pretty easily with a single call, but I think JSON is still the better format.)

Dason

Code:
I think you either want to change your loop index to i (preferable) or change the "i"s in the loop to address (less preferable).

But then again you could get rid of the entire loop and use lapply instead.

bryangoodrich

Probably A Mammal
Opps. I originally was going to do address in addresses (Python-minded), but switched to my R approach and didn't change my index. In any case, the problem with lapply is the list return isn't preallocated (I don't think), but that would be a fine alternative regardless. Then I would do

Code:

trinker

ggplot2orBust
Found it!!!!!!

Thanks for the info bryangoodrich. I got slowed down by having to fix some R issues today. Anyway. It seems I have my pick of data types here. Would you recommend using library(gdata) to read in the xls or read the text file available with what appears to be a comma separator? I think that there's some kind of a key there as well for reading the SEDDIR.TXT file called LAYOUT.TXT. This doesn't entirely make sense to have a key if the data is comma separated. It's nice that this data has a BEDS code because there's a heck of a lot of information stored in that and it can be used as an ID column to merge with demographic data.

I think I'm going to hold off reading in the data until I get a recommendation from someone on the approach that seems optimal (Who am I kidding I'm gonna play with this right now but I still want to learn the preferred approach)

trinker

ggplot2orBust
bryangoodrich said:
Do you see what I did there? Do you recall why I'm putting "+" in there? Are there any records that you should subset 'x' by on principle? (The answer is yes).
No why are you doing this?

bryangoodrich

Probably A Mammal
Found it!!!!!!

Thanks for the info bryangoodrich. I got slowed down by having to fix some R issues today. Anyway. It seems I have my pick of data types here. Would you recommend using library(gdata) to read in the xls or read the text file available with what appears to be a comma separator? I think that there's some kind of a key there as well for reading the SEDDIR.TXT file called LAYOUT.TXT. This doesn't entirely make sense to have a key if the data is comma separated. It's nice that this data has a BEDS code because there's a heck of a lot of information stored in that and it can be used as an ID column to merge with demographic data.
No sir, it IS comma delimited. This is why you look at those supporting files. It says it is CSV. The metadata also tells you what the values and columns should be; though, it did fail to state that the city/state are two different fields! No point in reading in XLS when you have plain text available. XLS is a horrible transport format. Don't get me wrong. I like Excel, but it serves a different purpose as far as I'm concerned (pretty reports and 'applications'). Plain text, xml, and other transport formats are much nicer to deal with.

The BEDS code are unique identifiers. Every table should have one, and a standardized code is always best!

As for why I'm adding "+" for the whitespaces is because when you send a GET request to something like Google's geocoding API you can't send something like "www.url.com/get.php?address=555 Im an address, CA". The string needs to be formatted to something their parser can interpret. Now, maybe they can handle a request like that, but that's nonstandard. Instead, you either fill in whitespaces with "%20" which is a code for " " or with "+" because it's a designator for " ". I prefer "+" to "%20".

Note, I did a quick check by Googling "sacramento ca". The web results show in the URL that they added "+" to my query. Then I switched to maps and it kept a literal whitespace. That is odd, but it shows Google can handle them. It also just depends on how you send your request. As I explain in whatever thread I made talking about this web scraping stuff (which, btw, grabbing a file from a website I'd probably just call importing over a network, but it can be considered a very simple/mild form of web scraping), you could also submit your GET request in a parameterized format (i.e., submit the base URL and then pass named arguments that signify the parameters

Code:
get(url, sensor="false", address="555 some place, ca")
(Web scraping is a bit more complicated than merely importing data because it involves the act of parsing the import in some fashion. Like my hospitals example, the data isn't stored at all, at least not how I'm collecting it. Instead, I'm requesting the HTML text and using grep to find the content based on the regular format they used in presenting that information.)

bryangoodrich

Probably A Mammal
Point in case about web scraping, I just finished running my Linux script to create an attribute table (would be much easier in R! lol) and found there are a few anomalies. Is this one line with only a number part of the line before it with an oddly placed newline or is it a record missing the other data? Does the table have enough records (in my case, 551)? This is part of the data cleaning (or in more technical situations: ETL) process. Now I need to go back to the source data (the website) and check on the few anomalies to make manual corrections.

I also realized in checking now that there are sometimes more additional information (spanning multiple lines which might be why I accidentally grabbed a newline).

trinker

ggplot2orBust
Alright I thought I'd share what I have so far. That's actually the first time I've read in data from the Internet. Still not sure why they give you the code if it's csv. Anyway.

Curiosity: How hard would it be to read in microsoft access info? I wonder if gdata has a method for that as NYS uses acccess files quite a bit.

I'd say I'm somewhere on step two. Against bryangoodrich's strong advisement I used XLM to get the coordinates. I sent the request out. That was 45 minutes ago. I think I may be understanding why he isn't a fan.

The reason I used XLM bryan is because I couldn't quite follow what you were doing with RJSONIO. Sorry. I want to get this right. This one is a bit past my zone of proximal growth and I'm going to ask for more direct mediation if you would be so kind.

Code:
##########
# STEP 1 #
#######################################################################
# Write a function to scrape (I think this is scraping but may not be #
# called that) the school names and their addresses to a nice data    #
# frame (LINK). This will require looping through each county and     #
# extracting the information. Bryangoodrich shared a similar data     #
# retrieval loop a few months back (LINK).                            #
#######################################################################
classes  <- c(rep("character", 9), rep("factor", 2))

dat <- read.csv(file = url(path), header = FALSE, strip.white = TRUE, sep= ",",
na.strings= c(" ", ""),stringsAsFactors = FALSE, col.names = cnames, colClasses = classes)

dat[, 'locations']  <- with(dat, paste(gsub("[[:space:]]", "+", address), city, state, substr(zip, 1, 5), sep = "+"))
dat[, 'locations'] <- as.character(dat[, 'locations'] )
##########
# STEP 2 #
#############################################
# CONVERT THE ADDRESS TO LAT AND LONGITUDE. #
#############################################
##################################################
# MY PLANNED METHOD (BRYANGOODRICH HAS A BETTER  #
# TECHNIQUE AS THIS APPEARS TO BE SLOW)          #
##################################################
require(XML)
doc = xmlTreeParse(url)
root = xmlRoot(doc)
lat = xmlValue(root[['result']][['geometry']][['location']][['lat']])
long = xmlValue(root[['result']][['geometry']][['location']][['lng']])
return(c(lat , long))
}

I think I may have froze R up. We'll see in the morning.

trinker

ggplot2orBust
UPDATE: Got back the results of XLM 1 hour later (should have used the function I learned for emailing myself (and texting me) when it's done. The best part is it returned all NAs and I'm not sure why. Running the script in my original post also now returns NAs. Not sure why this is. IS there a limit of how many I can grab a day?

bryangoodrich

Probably A Mammal

There's a 2,500 daily limit on requests and I've found there's a rate limit, which is why I had to script in a slight delay in a loop when requesting. I read 200 ms should work. Since I'm never in a hurry, I go with 0.1 s. Still, you should get a return object, it should just say the status is over the quota limit. But since you're immediately parsing for a certain output that would only exist when status is OK, you're going to make invalid references (get NA). Thus, do like I said, just return the entire XML object. Store it in a list. Or include a check that the the status is at least OK (see documentation) and return something more appropriate if not.

bryangoodrich

Probably A Mammal
Looking at the line count of your data, there's over 7,000 records. You can't use Google's API. You might want to check out others like I included on my web scraping thread. Luckily, this data is already formatted (broken into street, city, zip, etc.). I know the university's API requires you to enter those separate parameters. You'll still want to make sure the documentation says they can meet the amount you want. Otherwise, you'll want to do this in chunks over the week!

trinker

ggplot2orBust
Update:

Bryan has referenced his web scraping thread a few times so I figured I'd put the (LINK) to it here and an import JSON (LINK) here..

I am going to be going through his thread in detail to wrap my brain around his methods. I'm not a fan of having to do this in chunks over the week (not because I'm in a rush now but I may be when I really want to do this stuff) so I want to avoid the XLM approach or maybe it's the Google API I'm avoiding.

trinker

ggplot2orBust
GETTING RCurl and XLM for Windows

It is likely that if you're following along you'll use XLM and/or RCurl.

Some of you may be a windows user as I am (I do use Ubuntu and eventually will get all the way there) and getting your virtual hands on a copy of XLM and RCurl takes a little more time as they're not compiled for windows. Professor Brian Ripely has compile them at:
http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.14/

Dason

I actually thought getting those things required less work on Ubuntu. I'm not sure if it would be similar on a different distro but working with R is a lot easier on Linux in my experience.

trinker

ggplot2orBust
In another thread LINK bryangoodrich discusses the problem with google API and being limited to a certain volume of requests per day. I think I'm going to go the route he suggests using North American Locator. I still am reading about this before I generate code.

Turns out the reason I get "over quota limit" responses from the API is because the service is intended for website use only, not user. This means they expect requests to come from a website that then puts the return object on a map or something. Thus, I can receive about 12 requests before it hangs for another 12. The only way I can think of to avoid this is to either find another geocoding service like USC geocoding or ESRI--the latter, I believe, requires my software license or something--or I could try and find a way to pause my transfer. I heard waiting 200ms in between requests should work. The problem is, when doing it from a file for batch processing, I can only set an option, as far as I can see, to slow down the transfer for a given request, not in between request. I may need to use something like R or Python to use cURL and use their pausing capabilities. This would step away from my goal of only using basic Linux tools, however. Choices ...

EDIT: On this topic, I recommend the interested reader to look at Google's Gecoding Strategies. While I could get away with this, and it was a good exercise in learning how to process stuff through various web resources, I may turn to the USC link above or to ESRI as I do have access to their software, just not from home (it's not usable on Linux). Since I'm trying to remain open source, I may go the university route.

EDIT: 30 January 2012

I mentioned ESRI. Here is the link to their North American Locator.
EDIT 1: I found out the act of converting addresses to coordinates is called geocoding which may be helpful for searches on the subject.

EDIT 2: This is a learning project and it's definitely making me learn. I'm having to read around google API and geocoding quite a bit. I'm finding two web sites thus far about Google API and geocoding to be helpful and I thought I'd chare: