# Thread: An R learning project (feel free to learn with me)

1. ## An R learning project (feel free to learn with me)

I realized today that I seem to learn the most when I have some kind of project that stretches my abilities beyond where they currently are. Like many of you, I'm an avid R user, and like you I have several things in R I've been meaning to learn and have put off. Well, today I came across a Wickham plot (LINK) that inspired me to undertake a project that will require me to stretch and grow.

1. I'm going to need your help.
2. I think this could be a great learning experience for everyone.
3. It helps keep me accountable to accomplish this.
4. It'll keep everything organized and tidy.

The goal is to learn to get data from several websites and use Google Maps and the maps package (I wanted to learn how to do it with both, but the Google Maps one looked pretty professional) with R to plot information from the data set. Specifically, I have decided to use schools data from New York State (my home state), plot the locations of the schools, and perhaps use different plotting symbols to represent differences in some demographic information.

Here's my plan of attack:
1. Write a function to scrape (I think this is scraping, but it may not be called that) the school names and their addresses into a nice data frame (LINK). This will require looping through each county and extracting the information. Bryangoodrich shared a similar data retrieval loop a few months back (LINK).
2. Convert the address to lat and longitude. I'm thinking of using the following method:

Code:
``````coord <- function(address) {
    require(XML)
    # Build the geocoding request from the address (spaces become "+");
    # the original snippet referenced an undefined 'url', so it's constructed here
    url  <- paste0("http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address=",
                   gsub("[[:space:]]+", "+", address))
    doc  <- xmlTreeParse(url)
    root <- xmlRoot(doc)
    lat  <- xmlValue(root[['result']][['geometry']][['location']][['lat']])
    long <- xmlValue(root[['result']][['geometry']][['location']][['lng']])
    c(lat, long)
}

addr3 <- c('1600 Pennsylvania Avenue, Washington, DC',
           'W Ball Rd & S Disneyland Dr, Anaheim, CA')``````
3. Read in some other interesting data files, such as test info or demographics, from the net. (The csv should be straightforward, but one file is in Microsoft Access format, so I'd have to figure out how that's done; I'll probably stick with the .csv file.) Data found here (LINK) and here (LINK).
4. Use ggplot to plot the data (addresses turned into lat and long) on a simple New York State map from the maps package, as shown here (LINK). This will require converting the long and lat to a grid using mapproj.
5. Repeat #4 but interfacing with Google Maps, as seen here (LINK).
6. On the maps (from steps 4 and 5), combine the address plots with some interesting demographic data just to add some flair to the map.

If you have any ideas or suggestions, now or along the way, please share. This is just a general plan of attack. As I work through the problem I'll update you and ask for help when I need it.
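To give a feel for where I'm hoping step 4 ends up, here's a minimal sketch of the maps/ggplot2 route. The `schools` data frame is made up for now; the real one comes out of steps 1 and 2.

```r
library(ggplot2)
library(maps)
library(mapproj)

# Hypothetical geocoded schools; the real data comes from the geocoding step
schools <- data.frame(long = c(-73.76, -78.85), lat = c(42.65, 42.89))

ny <- map_data("state", region = "new york")  # polygon outline of NY state
ggplot(ny, aes(long, lat, group = group)) +
    geom_polygon(fill = "grey90", colour = "grey40") +
    geom_point(data = schools, aes(long, lat, group = NULL), colour = "red") +
    coord_map()  # mapproj-based projection so lat/long aren't distorted
```

The `coord_map()` call is where mapproj comes in: it projects the raw lat/long onto a proper map grid.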

2. ## The Following 6 Users Say Thank You to trinker For This Useful Post:

bryangoodrich (02-04-2012), bugman (02-04-2012), Lazar (02-05-2012), ledzep (02-08-2012), SmoothJohn (02-08-2012), vinux (02-05-2012)

3. ## Re: An R learning project (feel free to learn with me)

I was trying to do something similar at work on Thursday to plot mine sites in the state.

Thanks for this great idea trinker. I'll certainly follow this thread, but I think your code is leagues ahead of mine, so my relationship is likely to be more parasitic than symbiotic.

Nice job.

P

4. ## Re: An R learning project (feel free to learn with me)

Alright, this is real-life learning, so....

The link I provided for all the schools in NY and their addresses was incorrect. That's just special-needs schools.

So I need to find the correct website/data source.

I'll update you on my progress.

UPDATE:
I couldn't find the file in a nice fixed-width or csv format. I found it as a pdf, found here (LINK). I don't know how to read in a pdf, or whether it's even possible in the format it's in.

I have 4 options the way I see it:
1. Cut and paste the file into a txt file, import it into Excel, clean it, and import into R
2. Copy to a txt file and write an R script to address irregularities in the file and clean it (similar to the chat box function Dason wrote here (LINK))
3. Figure out how to read in from a pdf (I've heard of this being done)
4. Keep searching and find a better file format with school names and addresses

5. ## Re: An R learning project (feel free to learn with me)

Wow, that looks awesome. I'm definitely down to do this, since I could then do part of my database project this semester entirely from R! I'm automating grabbing local (Sacramento) crime statistics (ZIP archives containing CSVs). I don't know exactly what they contain, but I can automate the geocoding if necessary. I would definitely like to recreate the sort of heat mapping that you showed here.

One note: I'd use JSON instead of XML for that return object. I find JSON easier to parse, and frankly I want XML to die and be replaced by it, lol. That's just my preference though. :P
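For comparison, here's a sketch of what the JSON route can look like with RJSONIO. The URL scheme is the geocoding API's JSON endpoint as I understand it; treat the details as assumptions and check the API docs.

```r
library(RJSONIO)

address <- "1600 Pennsylvania Avenue, Washington, DC"
u <- paste0("http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=",
            gsub("[[:space:]]+", "+", address))

# Pull the response text and parse the whole thing into a nested list in one go
res <- fromJSON(paste(readLines(u), collapse = ""))
if (identical(res$status, "OK"))
    res$results[[1]]$geometry$location  # named vector with lat and lng
```

One `fromJSON` call and you have the entire response as plain R lists, status and all.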

6. ## The Following User Says Thank You to bryangoodrich For This Useful Post:

trinker (02-04-2012)

7. ## Re: An R learning project (feel free to learn with me)

The problem with reading PDFs is that they're not formatted. Something may appear as a table, but it really isn't. I would first try to copy-paste it into a text file to get an idea of how it comes out. If there is some structure to it, then you can of course figure out how to parse that. But grabbing data out of a PDF is generally a pain. I'd keep looking for more data, just in case there is a better format of it out there. For instance, I copy-pasted the text and it comes out in spurts. If it were uniform, you could have read it into a matrix with a certain structure, but as is, I'm not sure.

EDIT: Found it. I'll leave it up to you to find it, as you need practice doing this. First tip: Never look for the pdf (they suck!) lol

To facilitate the processing, once you have the data read in you'll want to use the layout file to give appropriate names (know your data; it doesn't have headers) and then grab the address information. If you want to select based on certain parameters, you can of course subset the data. Assuming it is stored in 'x'

Code:
``````# city and state are separate fields (the layout file doesn't say so)
names(x) <- c("officer", "school", "beds", "address", "city", "state", "zip", "area", "phone", "record", "grade")
locations <- transform(x, location = paste(gsub("[[:space:]]", "+", address), city, state, substr(zip, 1, 5), sep = "+"))$location
locations <- as.character(locations)``````
Do you see what I did there? Do you recall why I'm putting "+" in there? Are there any records that you should subset 'x' by on principle? (The answer is yes).
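Since trinker will ask anyway, here's my own guess at the subsetting question, purely an assumption on my part: records without a street address can't be geocoded, so drop them up front. A toy version:

```r
# Toy stand-in for the directory table
x <- data.frame(school  = c("A", "B", "C"),
                address = c("1 Main St", NA, "3 Oak Ave"),
                stringsAsFactors = FALSE)

# Drop rows that can't be geocoded (no street address)
x <- subset(x, !is.na(address) & address != "")
nrow(x)  # 2
```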

With locations, you can now geocode them and if you keep the ID's you can tie them back to the original data set for any attribute references you may require.

8. ## The Following User Says Thank You to bryangoodrich For This Useful Post:

trinker (02-04-2012)

9. ## Re: An R learning project (feel free to learn with me)

I'll definitely be working with this later as I'm collecting my data this weekend for my project. One thing about your coordinate function, besides using JSON like I suggested, I wouldn't throw away the data. You're making the request and accessing the geometry. Great. But what if you get a status not equal to OK? What if you get a multiple-point return? In any case, there's a lot of good information, and it parses your address string for you (into street name, city, etc.). Therefore, depending on the situation, you may want to just read in the entire request, store it, and then make use of it. So if you know how many addresses you're geocoding, create a list that size. Then fill the list sequentially with each of the calls returning a parsed JSON string.

Code:
``````library(RJSONIO)
geocoded <- vector("list", nrow(addresses))  # The final object framework
for (address in seq_len(nrow(addresses)))    # fill the list with each parsed call
    geocoded[[i]] <- fromJSON(requestJSON(addresses[i]))``````
Here we just have to craft requestJSON as a function that calls the Google Geocoding API (possibly with a delay, lest we get bad feedback, as I discovered) using the address string, possibly replacing whitespaces with "+" marks. There may even be methods in the RCurl library to do that for you. The final object "geocoded" here would be a list, each element another list containing the relevant JSON elements. We would request the geometry in a similar fashion

Code:
``geocoded[[add_index]]$results[[1]]$geometry$location  # Returns a 2-point named vector``
The only index that matters is the add_index. The results list element is a list of 1 element, which is why I leave it as '1'. Now which is easier? The XML or JSON?

(To be fair, you can parse the whole XML into a list object pretty easily with a single call, but I think JSON is still the better format.)
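The single call I mean is xmlToList(), which flattens a parsed document into nested lists. Here's a self-contained sketch on a fake response; the structure mimics, but isn't necessarily identical to, the real one.

```r
library(XML)

# A fake geocoding response, just to show the mechanics
txt <- "<GeocodeResponse><status>OK</status><result><geometry><location><lat>38.8977</lat><lng>-77.0365</lng></location></geometry></result></GeocodeResponse>"

lst <- xmlToList(xmlParse(txt, asText = TRUE))
lst$result$geometry$location$lat  # the latitude, as a string
```

After that one call you navigate it exactly like the JSON version, just without the `results[[1]]` layer.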

10. ## Re: An R learning project (feel free to learn with me)

Originally Posted by bryangoodrich
Code:
``````geocoded <- vector("list", nrow(addresses))
for (address in seq_len(nrow(addresses)))
    geocoded[[i]] <- fromJSON(requestJSON(addresses[i]))``````
I think you either want to change your loop index to i (preferable) or change the "i"s in the loop to address (less preferable).

But then again you could get rid of the entire loop and use lapply instead.

11. ## Re: An R learning project (feel free to learn with me)

Oops. I originally was going to do address in addresses (Python-minded), but switched to my R approach and didn't change my index. In any case, the problem with lapply is that the list return isn't preallocated (I don't think), but that would be a fine alternative regardless. Then I would do

Code:
``geocoded <- lapply(addresses, function(address) fromJSON(requestJSON(address)))``

12. ## Re: An R learning project (feel free to learn with me)

Found it!!!!!!

Thanks for the info bryangoodrich. I got slowed down by having to fix some R issues today. Anyway. It seems I have my pick of data types here. Would you recommend using library(gdata) to read in the xls, or reading the text file, which appears to be comma separated? I think there's some kind of key there as well for reading the SEDDIR.TXT file, called LAYOUT.TXT. It doesn't entirely make sense to have a key if the data is comma separated. It's nice that this data has a BEDS code, because there's a heck of a lot of information stored in that, and it can be used as an ID column to merge with demographic data.

I think I'm going to hold off reading in the data until I get a recommendation from someone on the approach that seems optimal. (Who am I kidding, I'm gonna play with this right now, but I still want to learn the preferred approach.)

13. ## Re: An R learning project (feel free to learn with me)

Originally Posted by bryangoodrich
Do you see what I did there? Do you recall why I'm putting "+" in there? Are there any records that you should subset 'x' by on principle? (The answer is yes).
No, why are you doing this?

14. ## Re: An R learning project (feel free to learn with me)

Originally Posted by trinker
Found it!!!!!!

Thanks for the info bryangoodrich. I got slowed down by having to fix some R issues today. Anyway. It seems I have my pick of data types here. Would you recommend using library(gdata) to read in the xls, or reading the text file, which appears to be comma separated? I think there's some kind of key there as well for reading the SEDDIR.TXT file, called LAYOUT.TXT. It doesn't entirely make sense to have a key if the data is comma separated. It's nice that this data has a BEDS code, because there's a heck of a lot of information stored in that, and it can be used as an ID column to merge with demographic data.
No sir, it IS comma delimited. This is why you look at those supporting files. It says it is CSV. The metadata also tells you what the values and columns should be; though it did fail to state that the city/state are two different fields! There's no point in reading in XLS when you have plain text available. XLS is a horrible transport format. Don't get me wrong, I like Excel, but it serves a different purpose as far as I'm concerned (pretty reports and 'applications'). Plain text, XML, and other transport formats are much nicer to deal with.

The BEDS codes are unique identifiers. Every table should have one, and a standardized code is always best!
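Since the BEDS code is the key, the later merge with demographics can be as simple as this toy sketch (the column names are invented):

```r
# Toy tables keyed on the BEDS code
schools <- data.frame(beds   = c("0101", "0102"),
                      school = c("School A", "School B"))
demo    <- data.frame(beds       = c("0101", "0102"),
                      enrollment = c(500, 320))

merged <- merge(schools, demo, by = "beds")  # one row per school, both attribute sets
```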

As for why I'm adding "+" for the whitespaces: when you send a GET request to something like Google's geocoding API, you can't send something like "www.url.com/get.php?address=555 Im an address, CA". The string needs to be formatted into something their parser can interpret. Now, maybe they can handle a request like that, but it's nonstandard. Instead, you either fill in whitespaces with "%20", which is a code for " ", or with "+", because it's a designator for " ". I prefer "+" to "%20".
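Base R can do the encoding for you, by the way; a quick sketch of both styles:

```r
addr <- "555 Im an address, CA"

# "+" style, the one I use in the requests
gsub("[[:space:]]+", "+", addr)  # "555+Im+an+address,+CA"

# "%20" style, via base R's own encoder
URLencode(addr)
```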

Note, I did a quick check by Googling "sacramento ca". The web results show in the URL that they added "+" to my query. Then I switched to maps and it kept a literal whitespace. That is odd, but it shows Google can handle them. It also just depends on how you send your request. As I explained in whatever thread I made about this web scraping stuff (which, btw, grabbing a file from a website I'd probably just call importing over a network, though it can be considered a very simple/mild form of web scraping), you could also submit your GET request in a parameterized format (i.e., submit the base URL and then pass named arguments that signify the parameters)

Code:
``get(url, sensor="false", address="555 some place, ca")``
(Web scraping is a bit more complicated than merely importing data because it involves the act of parsing the import in some fashion. Like my hospitals example, the data isn't stored at all, at least not how I'm collecting it. Instead, I'm requesting the HTML text and using grep to find the content based on the regular format they used in presenting that information.)
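To make the grep-style scraping concrete, here's a toy version of that hospitals idea; the HTML is fake, and the real pages obviously differ:

```r
# Pretend this came from readLines("http://...") on the hospitals page
html <- c("<tr><td class='name'>Mercy General</td></tr>",
          "<tr><td class='beds'>342</td></tr>",
          "<tr><td class='name'>Sutter Memorial</td></tr>")

rows     <- grep("class='name'", html, value = TRUE)  # lines holding the names
hospital <- gsub("<[^>]+>", "", rows)                 # strip the tags
hospital  # "Mercy General" "Sutter Memorial"
```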

15. ## The Following User Says Thank You to bryangoodrich For This Useful Post:

trinker (02-05-2012)

16. ## Re: An R learning project (feel free to learn with me)

Case in point about web scraping: I just finished running my Linux script to create an attribute table (would be much easier in R! lol) and found there are a few anomalies. Is this one line with only a number part of the line before it with an oddly placed newline, or is it a record missing the other data? Does the table have enough records (in my case, 551)? This is part of the data cleaning (or in more technical situations: ETL) process. Now I need to go back to the source data (the website) and check on the few anomalies to make manual corrections.

I also realized in checking now that there is sometimes additional information (spanning multiple lines, which might be why I accidentally grabbed a newline).
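In R those sanity checks are one-liners; a toy sketch (the 551 record count is specific to my hospitals data):

```r
# Toy attribute table with one suspect record
tbl <- data.frame(id = 1:3, name = c("Mercy", NA, "Sutter"))

nrow(tbl)                    # does this match the count from the source page?
which(!complete.cases(tbl))  # rows with missing fields to check by hand (row 2 here)
```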

17. ## Re: An R learning project (feel free to learn with me)

Alright, I thought I'd share what I have so far. This is actually the first time I've read in data from the Internet. Still not sure why they give you the layout key if it's csv. Anyway.

Curiosity: how hard would it be to read in Microsoft Access data? I wonder if gdata has a method for that, as NYS uses Access files quite a bit.
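One route I've seen mentioned for Access is RODBC's Access driver (Windows only, as far as I know); a sketch, with the file and table names invented:

```r
library(RODBC)

con <- odbcConnectAccess("schools.mdb")  # hypothetical .mdb file
dat <- sqlFetch(con, "SchoolTable")      # pull one table as a data frame
close(con)
```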

I'd say I'm somewhere on step two. Against bryangoodrich's strong advisement, I used XML to get the coordinates. I sent the request out. That was 45 minutes ago. I think I may be understanding why he isn't a fan.

The reason I used XML, bryan, is because I couldn't quite follow what you were doing with RJSONIO. Sorry. I want to get this right. This one is a bit past my zone of proximal growth, and I'm going to ask for more direct mediation if you would be so kind.

Code:
``````##########
# STEP 1 #
#######################################################################
# Write a function to scrape (I think this is scraping but may not be #
# called that) the school names and their addresses to a nice data    #
# frame (LINK). This will require looping through each county and     #
# extracting the information. Bryangoodrich shared a similar data     #
# retrieval loop a few months back (LINK).                            #
#######################################################################
# path holds the URL of the SEDDIR.TXT file (defined earlier); cnames
# are the column names from the layout file (city/state are separate)
cnames  <- c("officer", "school", "beds", "address", "city", "state",
             "zip", "area", "phone", "record", "grade")
classes <- c(rep("character", 9), rep("factor", 2))

dat <- read.csv(file = url(path), header = FALSE, strip.white = TRUE, sep = ",",
    na.strings = c(" ", ""), stringsAsFactors = FALSE, col.names = cnames,
    colClasses = classes)

dat[, 'locations'] <- with(dat, paste(gsub("[[:space:]]", "+", address),
    city, state, substr(zip, 1, 5), sep = "+"))
dat[, 'locations'] <- as.character(dat[, 'locations'])

##########
# STEP 2 #
#############################################
# CONVERT THE ADDRESS TO LAT AND LONGITUDE. #
#############################################
##################################################
# MY PLANNED METHOD (BRYANGOODRICH HAS A BETTER  #
# TECHNIQUE AS THIS APPEARS TO BE SLOW)          #
##################################################
coord <- function(address) {
    require(XML)
    # the locations column is already "+"-separated, so it goes straight in
    url  <- paste0("http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address=", address)
    doc  <- xmlTreeParse(url)
    root <- xmlRoot(doc)
    lat  <- xmlValue(root[['result']][['geometry']][['location']][['lat']])
    long <- xmlValue(root[['result']][['geometry']][['location']][['lng']])
    c(lat, long)
}``````

I think I may have frozen R up. We'll see in the morning.

18. ## Re: An R learning project (feel free to learn with me)

UPDATE: I got back the results of the XML approach 1 hour later (I should have used the function I learned for emailing myself (and texting me) when it's done). The best part is it returned all NAs, and I'm not sure why. Running the script in my original post also now returns NAs. Not sure why this is. Is there a limit on how many I can grab per day?

19. ## Re: An R learning project (feel free to learn with me)

There's a 2,500 daily limit on requests and I've found there's a rate limit, which is why I had to script in a slight delay in a loop when requesting. I read 200 ms should work. Since I'm never in a hurry, I go with 0.1 s. Still, you should get a return object, it should just say the status is over the quota limit. But since you're immediately parsing for a certain output that would only exist when status is OK, you're going to make invalid references (get NA). Thus, do like I said, just return the entire XML object. Store it in a list. Or include a check that the the status is at least OK (see documentation) and return something more appropriate if not.