An R learning project (feel free to learn with me)

trinker

ggplot2orBust
#21
I am getting hung up on posts 6-8 of this thread. I have an approach that uses XML and the Google API, but it has the drawbacks of a limited number of requests per day and of being hated by bryangoodrich.

Bryan, can you tell me why RJSONIO is a better approach than XML? (I think both would work, but I want the better choice.)

Because I've taxed my Google API limit for the day, I'm at a standstill. I plan on approaching it with the API, but I also want a method without the daily limits on geocoding requests that I get with the Google API. I'd like to pursue the North American Locator method but am unsure of how to approach it. With the RJSONIO method I'm not really sure what's going on in the code. I was attempting something but got hung up on what addresses vs. address is, and loops, and ahhhh :eek:

Anyway, here's my latest attempt. The problem lies in me not understanding the whole RJSONIO interface. Alright, time for wings and football. By the way, I'm from Buffalo, so what the rest of you refer to as Buffalo wings we simply call wings :)

Code:
geocoded <- lapply(dat$locations, function(address) fromJSON(requestJSON(address)))
Error in fromJSON(requestJSON(address)) : 
  error in evaluating the argument 'content' in selecting a method for function 'fromJSON': Error: could not find function "requestJSON"
EDIT: I have found out more about geocoding with RCurl & RJSONIO from good ol' Stack Overflow and wanted to share LINK. Still a waiting game, though, because I've exceeded the Google API's 2500-request limit. I definitely want an alternative to the Google API that works.
 

trinker

ggplot2orBust
#22
I actually thought getting those things required less work on Ubuntu. I'm not sure if it would be similar on a different distro but working with R is a lot easier on Linux in my experience.
Dason, I agree wholeheartedly. R's not the problem; it's all the other tools I've become acquainted with in Windows that I'm not sure how, or if, they work in a Linux-based OS. For instance, I don't know if Microsoft documents (docx) are viewable in a Linux OS (well, maybe a bad example because of Open Office). But the point is I need to spend some time in the system when I don't have so many other things vying for my time and it's safer to experiment.

When I said I use Ubuntu, I meant I have the hard drive quasi-partitioned to run Ubuntu, but generally I boot up in Windows. The pain was in getting these packages for Windows. Getting them for Linux is easy, as they can be grabbed straight from CRAN.
 

bryangoodrich

Probably A Mammal
#23
First, you can get XML and RCurl and other Omegahat packages in Windows by simply adjusting the repositories you download packages from. A selection menu pops up in which you can include Omegahat among the options. Then do install.packages as if you were grabbing stuff from the regular repositories.
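
A minimal sketch of that, assuming an interactive session and that the menu meant here is the one setRepositories() pops up:

Code:
setRepositories()  # tick "Omegahat" in the selection menu that appears
install.packages(c("XML", "RCurl", "RJSONIO"))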

Second, the "requestJSON" I used in that example doesn't exist. If you read what I said, I was saying you need to define that function. I was giving an example of the sort of approach you will take. The fact is, neither XML or RJSONIO have capabilities to really access web content, per se. This is why you should use RCurl. It just so happens that the requests we're using are GET requests which come as fully qualified URLs. Thus, you could literally put the URL in your web browser and it will return the content you're trying to parse in R. This is why you can specify the XML to access the URL. It is like grabbing a CSV file over the internet with read.csv. It isn't that the read has web capabilities, per se. It is that the connection to the URL happens to be the required document type that the import method can interpret (CSV in that case, and XML in the other). Does that make sense?


Note: I suggest working through my JSON example (using my website) to get familiar with GET requests. That is at the heart of what we're doing. I say that because whether you're using GET or not, you need to be able to send web requests. For instance, an API may require a POST request, which sends more than just a URL and header: it can take a message body with more complex request statements (content). Nevertheless, the return object could be the same. I've been using GET examples, and Google (as well as many other APIs) uses GET requests because they're simple and easy to set up (you can literally put the URL in your browser and see your results, depending on what is returned).
 

bryangoodrich

Probably A Mammal
#24
The University of Southern California's geocoding web service specifies a ton of parameters and return content. I'm kind of impressed. Don't bulk process anything until you understand what you're doing and how to handle ONE request. The table specifies the parameters; notice which are required. Using CSU Sacramento's address (as I did in the JSON example), we can specify the URL as

Code:
https://webgis.usc.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V02_96.aspx?streetAddress=6000+J+Street&city=Sacramento&state=CA&zip=95819&census=false&format=xml&version=2.96
Seriously, I'm impressed. You could request the return object to be CSV or even KML, among others! (KML is an immediate spatial transport format because it can be read directly into Google Earth or converted to a shapefile in ArcGIS, among other things. It is Google's XML-based standard for spatial data. It's a subset of a larger spatial standard, however.)

I also left off the API key you need to request to use their services (&apiKey=....). I recommend checking out the examples they provide in the various formats. You should also look into their bulk geocoding and see what their policies are, so as not to go over any more limits!
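
For illustration, here's one way to assemble that URL in R; the apiKey value is a placeholder for your own key:

Code:
base <- "https://webgis.usc.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V02_96.aspx"
params <- c(streetAddress = "6000+J+Street", city = "Sacramento", state = "CA",
            zip = "95819", census = "false", format = "xml", version = "2.96",
            apiKey = "YOUR_KEY_HERE")
url <- paste(base, paste(names(params), params, sep = "=", collapse = "&"), sep = "?")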
 

bryangoodrich

Probably A Mammal
#25
To expand on my example above (using the USC geocoder), I simply use RCurl's getURL again with the fully qualified URL I specified above (including my API key, which is required). This stores an XML string into a variable. I can then use xmlToList to convert it to a list object, akin to fromJSON. However, the request document began with 4 unprintable characters I didn't need. Thus, I did

Code:
xmlRequest <- getURL("... url here ...")                 # fully qualified URL, apiKey included
x <- xmlToList(substr(xmlRequest, 4, nchar(xmlRequest))) # drop the leading unprintable characters
It worked beautifully and resulted in

Code:
str(xmlToList(substr(xmlRequest, 4, nchar(xmlRequest))))
List of 5
 $ QueryMetadata:List of 5
  ..$ TransactionId       : chr "not letting you see my key!"
  ..$ Version             : chr "2.96"
  ..$ QueryStatusCodeValue: chr "200"
  ..$ ErrorMessage        : NULL
  ..$ TimeTaken           : chr "0.03125"
 $ InputAddress :List of 4
  ..$ StreetAddress: chr "6000 J Street"
  ..$ City         : chr "Sacramento"
  ..$ State        : chr "CA"
  ..$ Zip          : chr "95819"
 $ OutputGeocode:List of 15
  ..$ Latitude                                 : chr "38.5662993270105"
  ..$ Longitude                                : chr "-121.428366188347"
  ..$ MatchScore                               : chr "100"
  ..$ MatchType                                : chr "Exact"
  ..$ FeatureMatchingGeographyType             : chr "StreetSegment"
  ..$ InterpolationType                        : chr "LinearInterpolation"
  ..$ InterpolationSubType                     : chr "LinearInterpolationAddressRange"
  ..$ MatchedLocationType                      : chr "LOCATION_TYPE_STREET_ADDRESS"
  ..$ FeatureMatchingResultType                : chr "Success"
  ..$ FeatureMatchingResultCount               : chr "1"
  ..$ FeatureMatchingResultTypeNotes           : NULL
  ..$ TieHandlingStrategyType                  : chr "RevertToHierarchy"
  ..$ FeatureMatchingResultTypeTieBreakingNotes: NULL
  ..$ FeatureMatchingSelectionMethod           : chr "FeatureClassBased"
  ..$ FeatureMatchingSelectionMethodNotes      : NULL
 $ CensusValues :List of 13
  ..$ CensusTimeTaken : chr "0"
  ..$ CensusYear      : chr "Unknown"
  ..$ CensusBlock     : NULL
  ..$ CensusBlockGroup: NULL
  ..$ CensusTract     : NULL
  ..$ CensusCountyFips: NULL
  ..$ CensusStateFips : NULL
  ..$ CensusCbsaFips  : NULL
  ..$ CensusCbsaMicro : NULL
  ..$ CensusMcdFips   : NULL
  ..$ CensusMetDivFips: NULL
  ..$ CensusMsaFips   : NULL
  ..$ CensusPlaceFips : NULL
 $ .attrs       : Named chr "2.96"
  ..- attr(*, "names")= chr "version"
As you can see, I can get an equivalent result to the Google JSON with

Code:
c(Lat = x$OutputGeocode$Lat, Long = x$OutputGeocode$Long)  # '$' partial matching resolves Lat/Long to Latitude/Longitude
# USC    Returned: (38.56630, -121.42837)
# Google Returned: (38.56567, -121.42564)
 

bryangoodrich

Probably A Mammal
#26
I did a quick look at the North American Locator. I'll leave it to you to read through their documentation (see the 'rest' and 'soap' links to go to a sample service). The documentation isn't clear, but there are some built-in features in how it works. There's also the fact it returns guesses, not exact matches (though I'm sure there's something you could pass to say "return the most relevant/first locator").

As you'll see from the link below to CSU Sacramento, there are a number of return locations based on the information I passed. Also note the addition I had to make to the base locator URL: "findAddressCandidates?". There may be other things I could use instead, but as I'll need to learn more about this, and whether it's even supposed to be used in this way, I'll have to get back to you.

Code:
http://tasks.arcgisonline.com/ArcGIS/rest/services/Locators/TA_Address_NA_10/GeocodeServer/findAddressCandidates?f=json&Address=6000+J+Street&City=Sacramento&State=CA&Zip=95819&pretty=true
Returns this pretty JSON string.
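
A sketch of pulling the top match out of that return; the 'candidates' field name comes from the JSON the URL above produces, so adjust if the service's schema differs:

Code:
library(RCurl); library(RJSONIO)
u <- "http://tasks.arcgisonline.com/ArcGIS/rest/services/Locators/TA_Address_NA_10/GeocodeServer/findAddressCandidates?f=json&Address=6000+J+Street&City=Sacramento&State=CA&Zip=95819"
res  <- fromJSON(getURL(u), simplify = FALSE)
best <- res$candidates[[1]]  # candidates come back ordered by score
best$location                # a list with x (longitude) and y (latitude)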

EDIT: Looked into it; you have to subscribe to a service to do batch geocoding (see the pages for this section). Otherwise, you're allowed 1,000 requests per year on this open task server. I'd recommend the Google API (though batch geocoding isn't its intended use) or the Yahoo Maps API; even Bing might have one. Otherwise, sign up for a key and use the USC API. They're pretty good, actually. I'll look into their batch geocoding services later this week.
 

trinker

ggplot2orBust
#27
bryangoodrich said:
Don't bulk process anything until you understand what you're doing and how to handle ONE request.
Curse you, bryangoodrich! Where was this good advice when I foolishly sent off my entire request? Shoot first and ask questions later isn't always a good approach :(

bryangoodrich said:
Second, the "requestJSON" I used in that example doesn't exist. If you read what I said, I was saying you need to define that function. I was giving an example of the sort of approach you will take. The fact is, neither XML or RJSONIO have capabilities to really access web content, per se.
Gotcha. I didn't write this myself but took it from SO (see below). I think it will work, as it looks similar to what you've done, but it will have to wait until I'm allowed to request from the API again. It does not use GET like you said, so I have to learn more about that by looking at your example. Where is GET? I went here and don't see it. Sorry to tax your patience; the more I look over all of your threads, the more I see you have most of this information there, I just wasn't connecting the dots.

Code:
###########################################################
library(RCurl); library(RJSONIO)

construct.geocode.url <- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

gGeoCode <- function(address) {
  u <- construct.geocode.url(address)
  doc <- getURL(u)
  x <- fromJSON(doc,simplify = FALSE)
  lat <- x$results[[1]]$geometry$location$lat
  lng <- x$results[[1]]$geometry$location$lng
  return(c(lat, lng))
}

coordinates <- sapply(tolower(dat$locations[1:2500]), gGeoCode)  # dat$locations, not dat['locations'], to get a character vector
#====================================================================
I really need my API requestability to be returned so I can play with all the stuff in your threads.
 

bryangoodrich

Probably A Mammal
#28
The GET request IS the URL. When you look at the JSON or XML in these examples in your web browser, the URL is enough to return the requested object (the browser knows how to send the required header). That is all there is to a GET request. They're simple. When you use the RCurl wrapper getURL that IS the GET request. It is making use of the libcurl library. You can do the same thing in Linux using curl (sudo apt-get install curl) like I did in my web scraping examples (you could also get away in the simple case by using wget).

You can always make use of my website for a simple example: http://www.bryangoodrich.com/api/get.php?format=json&limit=2

What do you see in your browser when you view that? Exactly what you'd return with getURL.

Code:
(x <- fromJSON(getURL("http://www.bryangoodrich.com/api/get.php?format=json")))
# $posts
# $posts[[1]]
# $posts[[1]]$post
#           name    awesomeness     profession         status 
#      "Trinker"      "lacking" "velociraptor"        "taken" 
# 
# $posts[[2]]
# $posts[[2]]$post
#               name        awesomeness         profession             status 
#            "Dason" "Does Not Compute"            "Robot"            "D-Bot" 
# 
# $posts[[3]]
# $posts[[3]]$post
#            name     awesomeness      profession          status 
# "bryangoodrich" "mind blowing!"   "Data Master"        "Single"
or

Code:
(x <- xmlToList(getURL("http://www.bryangoodrich.com/api/get.php?format=xml")))
#             person         person             person         
# name        "Trinker"      "Dason"            "bryangoodrich"
# awesomeness "lacking"      "Does Not Compute" "mind blowing!"
# profession  "velociraptor" "Robot"            "Data Master"  
# status      "taken"        "D-Bot"            "Single"
Note, don't be deceived: that's not a table up there. It's a list of singular entries. I could also have done xmlToDataFrame, which would return a numbered table (frame). It just depends on what the return object is, and the stuff from my website is both tabular and simple.
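
A minimal sketch of that xmlToDataFrame route, assuming the return stays this simple and tabular:

Code:
library(XML); library(RCurl)
doc <- xmlParse(getURL("http://www.bryangoodrich.com/api/get.php?format=xml"), asText = TRUE)
xmlToDataFrame(doc)  # one row per record node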
 

bryangoodrich

Probably A Mammal
#29
All that example from SO is doing is setting up the URL. The root path of the URL takes us to the geocoding API, but when you want it in JSON you follow it with "json?" and then pass it the parameters. As I've explained to you before, the parameters are nothing but a named list separated by "&" with whitespace filled in by "+". Look at what the script you provided does when you pass it the address to CSU Sacramento I showed earlier: it sets up the same URL. That is all it is doing, providing a wrapper for setting up the URL. You then parse the result and grab the information you want from the list object it gets converted to. The fact is, you may want more information than just lat-long, but maybe you don't. In any case, it is no more difficult in theory than what I showed from my own website in the examples above.
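
A tiny illustration of that rule, purely as a sketch (build_query is a made-up helper, not part of any package):

Code:
# Named parameters joined by "&", whitespace filled with "+", appended after "?"
build_query <- function(root, params) {
    params <- gsub(" ", "+", params, fixed = TRUE)
    paste(root, paste(names(params), params, sep = "=", collapse = "&"), sep = "?")
}
build_query("http://maps.google.com/maps/api/geocode/json",
            c(address = "6000 J Street, Sacramento, CA", sensor = "false"))
# [1] "http://maps.google.com/maps/api/geocode/json?address=6000+J+Street,+Sacramento,+CA&sensor=false"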
 

trinker

ggplot2orBust
#30
Let me start by thanking bryangoodrich for being patient with me and explaining all this :tup:. I'm OK with R coding, but outside of that (HTML, Python, simple things many of you consider old hat) I'm not schooled. Remember last summer when Dason had to teach me you could manipulate URLs to use R to query things?

Alright, today was a lot of reading bryangoodrich's thread(s), rereading them, trying to figure out what APIs and geocoding are, figuring out RJSONIO, and the difference between GET and getURL. I'm running the script as is with the RCurl and RJSONIO functions, but with 2400 requests instead so I can still play tomorrow.

I'm on step 2 and have one approach (kinda two, but really the same approach using different packages) that would work for geocoding but doesn't return the information bryangoodrich has discussed. I'm also limited to 2500 requests per day. Because of these problems I see two sub-goals of this step I need to accomplish before I move on:

STEP 2 Substeps
  1. Tear apart bryangoodrich's code from his scraping and Import JSON example threads. Use what he's said above to create code/functions (basically I'm going to steal what he's done already :)) to return the additional information.
  2. Figure out how to use the University of Southern California's geocoding web services in my script. I've signed up for a key and briefly tinkered but haven't had success yet. I'll reread what Bryan posted on that tomorrow.

Here's the working code up to this point:

Code:
##########
# STEP 1 #
#######################################################################
# Write a function to scrape (I think this is scraping but may not be #
# called that) the school names and their addresses to a nice data    #
# frame (LINK). This will require looping through each county and     #
# extracting the information. Bryangoodrich shared a similar data     #
# retrieval loop a few months back (LINK).                            #
#######################################################################
path <- "http://www.nysed.gov/admin/SEDdir.txt"
cnames <- c("adminstrator", "school", "beds", "address", "city", "state", 
              "zip", "area", "phone", "record", "grade")
classes  <- c(rep("character", 9), rep("factor", 2))

dat <- read.csv(file = url(path), header = FALSE, strip.white = TRUE, sep= ",", 
           na.strings= c(" ", ""),stringsAsFactors = FALSE, col.names = cnames, colClasses = classes)

dat[, 'locations']  <- with(dat, paste(gsub("[[:space:]]", "+", address), city, state, substr(zip, 1, 5), sep = "+"))
dat[, 'locations'] <- as.character(dat[, 'locations'] )
#######################################################################################################################
# CONVERT THE ADDRESS TO LAT AND LONGITUDE. #
#######################################################################################################################
#########################################################################
# NOTE: FOR METHODS 1 AND 2 I AM USING GOOGLE API THAT HAS A LIMIT OF   #
# 2500 REQUESTS PER DAY. I AM LOOKING INTO USING OTHER API WEB SERVICES #
# MENTIONED ABOVE AND WILL ADD THEM IF/WHEN I HAVE SUCCESSFULLY         #
# IMPLEMENTED THEM.                                                     #
#########################################################################
# METHOD 1 (Using XML) #
########################
coord <- function(address){
    require(XML) 
    url = paste('http://maps.google.com/maps/api/geocode/xml?address=', 
             address,'&sensor=false',sep='') 
    doc = xmlTreeParse(url) 
    root = xmlRoot(doc) 
    lat = xmlValue(root[['result']][['geometry']][['location']][['lat']]) 
    long = xmlValue(root[['result']][['geometry']][['location']][['lng']]) 
    return(c(lat , long))
}

addresses  = with(dat, paste(address, city, state, sep = ", "))
coordinates <- sapply(tolower(addresses[1:2500]), coord)
###########################################################
# METHOD 2 (Using RCurl & RJSONIO)   #
# This is the method I will utilize  #
######################################
library(RCurl); library(RJSONIO)

construct.geocode.url <- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

gGeoCode <- function(address) {
  u <- construct.geocode.url(address)
  doc <- getURL(u)
  x <- fromJSON(doc,simplify = FALSE)
  lat <- x$results[[1]]$geometry$location$lat
  lng <- x$results[[1]]$geometry$location$lng
  return(c(lat, lng))
}

coord.1.2400 <- sapply(dat[, 'locations'][1:2400], gGeoCode)
PS: I am using my email function (first time using it on my own project) to text and email me when the API geocoding function is complete and send me a text file of the first 100 cases. I'm probably the only one who cares about that, but...
 

bryangoodrich

Probably A Mammal
#31
Like I said, you should first understand what you're doing with ONE request before you try doing a bunch (or at least try only 2 or 3). Once you have success with your test cases, it's just a simple matter of automating it. Also, gGeoCode won't work if you use any other return.call value, because the XML object parsed into R has a slightly different structure: just compare the namespace used to access latitude from the JSON object versus the namespace used to access it from the XML object.

Code:
library(XML); library(RJSONIO); library(RCurl)
xmlRequest <- getURL("http://maps.google.com/maps/api/geocode/xml?sensor=false&address=6000+J+Street,+Sacramento,+CA")
jsonRequest <- getURL("http://maps.google.com/maps/api/geocode/json?sensor=false&address=6000+j+Street,+Sacramento,+CA")
x <- fromJSON(jsonRequest)
y <- xmlToList(xmlRequest)
x$results[[1]]$geometry$location[['lat']]  # Namespace to access Latitude from parsed JSON
y$result$geometry$location$lat  # Namespace to access Latitude from parsed XML
We're doing the same thing in both cases, but JSON and XML get parsed differently, for better or worse. The difference is, XML can't store "vectors", per se. JSON has array objects that get converted to vectors. This is why the lat-long from the parsed JSON is a 2-point named vector (and why your gGeoCode won't work: you can't access named vectors with the '$' namespace, which is why I used "[['lat']]").

Notice also I highlighted the URL. The root portion of the geocode API is just "http://maps.google.com/maps/api/geocode/". From there, we have two different 'functions' to call. This is no different than in my API: "http://www.bryangoodrich.com/api/" is the root path, but I have a function "get.php" that accepts GET requests. It accepts them with two parameters: format and limit. I defined the PHP to give a default value to limit and require a format specification. I could have simply defined two different functions, json and xml, and required the user to send the parameters to the right function--http://www.bryangoodrich.com/api/json?limit=2 or http://www.bryangoodrich.com/api/xml?. But I didn't. Instead, I used the return format to be specified as a parameter--http://www.bryangoodrich.com/api/get.php?format=json&limit=1. Thus, in that construct.geocode.url function, the "return.call" is just saying "which web document will we access?" The web document is programmed in some language, probably PHP (it doesn't require a file extension, but I put get.php to be explicit). Since we know it takes GET requests, we know the parameters begin after a question mark. Just look at these Google search examples for "bevmo".

Code:
# The web document is "search" and accepts a request taking in parameters--client, channel, q, ie, and oe, separated by '&'
https://www.google.com/search?client=ubuntu&channel=fs&q=bevmo&ie=utf-8&oe=utf-8

# The web document is "maps" and accepts a request taking in parameters--client, channel, q, oe, um, ie, hl, sa, and tab, separated by '&'
http://maps.google.com/maps?client=ubuntu&channel=fs&q=bevmo&oe=utf-8&um=1&ie=UTF-8&hl=en&sa=N&tab=wl
The difference is the Google Geocode API takes a GET request and simply returns an XML or JSON string (depending on whichever web document we send the request to), whereas "search" and "maps" do entirely different things (return an HTML document with our search results or a map display). In either case, we're fundamentally doing the same thing, sending a GET request to a web document on their server specified by the URL we're passing. This is why you can enter in those Geocode API URLs in your web browser and view the XML or JSON in your web browser (see the example link earlier).

There is no more difficulty in using XML or JSON, so it is a matter of taste. I prefer JSON because I think it is a better format by design. I also like the fact we can access the named vector of coordinates once parsed. We just have to make sure we're using the right namespace when accessing the object. Your function just needs to supply the address. Thus, create the function

Code:
requestJSON <- function(address) {
  URL <- paste("http://maps.google.com/maps/api/geocode/[COLOR="sandybrown"]json?[/COLOR]sensor=false&address=", address, sep = "")
  jsonRequest <- getURL(URL)
  json <- fromJSON(jsonRequest)
  return(json)
}
x <- requestJSON("6000+J+Street,+Sacramento,+CA")
x$results[[1]]$geometry$location
#        lat        lng 
#   38.56567 -121.42564
Alternatively, we could do

Code:
requestXML <- function(address) {
  URL <- paste("http://maps.google.com/maps/api/geocode/[COLOR="sandybrown"]xml?[/COLOR]sensor=false&address=", address, sep = "")
  xmlRequest <- getURL(URL)
  xml <- xmlToList(xmlRequest)
  return(xml)
}
x <- requestXML("6000+J+Street,+Sacramento,+CA")
c(lat = as.numeric(x$result$geometry$location$lat), lng = as.numeric(x$result$geometry$location$lng))
#       lat        lng 
#   38.56567 -121.42564
As you can see, requesting an address geocode is not difficult. You should manually be able to do this with a single instance and look at the JSON or XML (use your web browser; it has proper spacing and such for 'pretty' viewing). You should be able to parse the XML or JSON to an R list (possibly a table, depending on what the API returns and whether it is tabular). Look at that list and understand it (like how I know the namespaces in each case). Looking above, it is easy to parse them, but JSON is parsed much more nicely: things that are vectors stay vectors and keep the proper data type! (XML returns numeric strings, not numbers.) You can certainly make a wrapper like the above that only returns a numeric two-point vector like I printed in these two examples, and even parameterize it with an "xml" or "json" argument to alter the URL based on what format you want to return (and parse). That just depends on how much of your work flow you want within a function; keep your functions focused on the task they are designed for. For instance, the wrappers I defined above do what they say: request the data transport format from the Google API and return it parsed into an R list. I can then in a single call store that list and extract the geometry. The real question is, what are you going to do with it? If you're going to expand a table (frame) of already existing data, then you'll want to append a column for each of these (and maybe not return a vector but a point).

Code:
df <- read.csv(url("http://www.nysed.gov/admin/SEDdir.txt"), header = FALSE)   # I'm not bothering with classes right now
names(df) <- c("officer", "school", "beds", "address", "city", "state", "zip", "area", "phone", "record", "grade")
df <- transform(df, location = paste(gsub("[[:space:]]", "+", address),
                                     gsub("[[:space:]]", "+", city), state, substr(zip, 1, 5), sep = "+"))
addresses <- as.character(df$location)
latitude <- vector("numeric", length(addresses))
longitude <- vector("numeric", length(addresses))
for (n in seq(addresses)) {
  x <- requestJSON(addresses[n])
  latitude[n]  <- x$results[[1]]$geometry$location[[1]]
  longitude[n] <- x$results[[1]]$geometry$location[[2]]
  Sys.sleep(0.1)  # Suspend requests for a tenth of a second; 200 ms may suffice
}  # end for n
df <- cbind(df, lat = latitude, lng = longitude)
Now what did I say? Don't bother trying this for 2,000 requests! Try it for, say, 20 and see if the first 20 worked. If you try it without the Sys.sleep, you'll notice that half of your requests will come back with a non-OK status due to the rate at which you're accessing the API (it's rate restricted as well as quota restricted, and it'll give the same OVER_QUERY_LIMIT status in each case). You also want to make sure you're getting out of these functions what you want. Otherwise, it's pointless to run a bulk process and get NAs! Make sure it works in a pilot version and then use it for the full thing. Prototype your process to confirm it does what you want, and then put it into production. Get it? Good! Now go back to the first page and see what I mentioned about this process. You'll see a lot of overlap ;)
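
One hedged way to harden the loop along these lines: the Google JSON reply carries a top-level "status" field, so you can back off and retry instead of silently storing NAs. A sketch (requestJSON is the wrapper defined earlier in this post; the retry count and sleep are arbitrary):

Code:
safeGeocode <- function(address, tries = 3) {
    for (i in seq_len(tries)) {
        x <- requestJSON(address)
        if (!is.null(x$status) && x$status == "OK")
            return(x$results[[1]]$geometry$location)  # named lat/lng vector
        Sys.sleep(2)  # wait longer before retrying a rate-limited request
    }
    c(lat = NA, lng = NA)  # give up after 'tries' attempts
}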
 

bryangoodrich

Probably A Mammal
#32
Note, I tested the above, swapping df for df[sample(1:nrow(df), 20), ] (a random sample of 20 records). It all worked out fine.

Code:
                                          officer
720                    PRINCIPAL - MR. BRETT KING
3200          PRINCIPAL - MR. CHRISTOPHER WARNOCK
6870                 PRINCIPAL - DR. RYAN PACATTE
6486              PRINCIPAL - MR. STEPHEN DONOHUE
6933 ACTING SUPERINTENDENT - DR. EDWARD J. REILLY
698             PRINCIPAL - MR. DANIEL SHORNSTEIN
943         SUPERINTENDENT - DR. PAUL M. CONNELLY
7034                PRINCIPAL - MR. DAREN CERRONE
4664               DIRECTOR - MR. CHARLES HOUSTON
2808                 PRINCIPAL - MR. BRUCE SEGALL
5984              PRINCIPAL - MS. MELISSA AUSFELD
3127                PRINCIPAL - MS. VALERIE REIDY
1872       PRINCIPAL - MS. ALISON GLICKMAN-ROGERS
4524                  PRINCIPAL - MS. KAREN ZUVIC
6802           PRINCIPAL - MR. DOUGLAS SILVERNELL
64           SUPERINTENDENT - MR. ROBERT K. LIBBY
4333             PRINCIPAL - MR. VINCENT RANDAZZO
5681         ACTING PRINCIPAL - MS. M DIANA JABIS
6351         SUPERINTENDENT - DR. DONALD A. JAMES
2705             DIRECTOR - MS. COURTNEY KNOWLTON
                                   school         beds               address
720  CHANCELLOR LIVINGSTON ELEMENTARY SCH 131801040002            PO BOX 351
3200                  IS 181 PABLO CASALS 321100010181    800 BAYCHESTER AVE
6870   PALMYRA-MACEDON SENIOR HIGH SCHOOL 650901060001          151 HYDE PKY
6486                  WADING RIVER SCHOOL 580601040003 1900 WADNG RVR MNR RD
6933                        TUCKAHOE UFSD 660302030000       65 SIWANOY BLVD
698               TITUSVILLE INTERMEDIATE 131601060013         128 MEADOW LN
943         SPRINGVILLE-GRIFFITH INST CSD 141101060000         307 NEWMAN ST
7034         HAWTHORNE COUNTRY DAY SCHOOL 660802999880       5 BRADHURST AVE
4664          QUEENS CENTERS FOR PROGRESS 342900997801        82-25 164TH ST
2808           MOTHER CABRINI HIGH SCHOOL 310600145240 701 FT WASHINGTON AVE
5984 COBLESKILL-RICHMONDVILLE HIGH SCHOOL 541102060002            PO BOX 269
3127         BRONX HIGH SCHOOL OF SCIENCE 321000011445         75 W 205TH ST
1872    SCHOOL 9M-OCEANSIDE MIDDLE SCHOOL 280211030009         186 ALICE AVE
4524                                PS 86 342800010086    87-41 PARSONS BLVD
6802             QUEENSBURY MIDDLE SCHOOL 630902030003       455 AVIATION RD
64                         COHOES CITY SD  10500010000            7 BEVAN ST
4333  IS 250 THE ROBERT F KENNEDY COMM MS 342500010250        158-40 76TH RD
5681              VIOLA ELEMENTARY SCHOOL 500401060012            557 RT 202
6351                         COMMACK UFSD 580410030000     480 CLAY PITTS RD
2705   EAST HARLEM SCHOOL AT EXODUS HOUSE 310400999536        309 E 103RD ST
               city state       zip area   phone record grade
720       RHINEBECK    NY 125720351  845 8715570      1     1
3200          BRONX    NY 104751702  718 9045600      1     2
6870        PALMYRA    NY 145221297  315 5973420      1     5
6486   WADING RIVER    NY 117922137  631 8218253      1     1
6933    EASTCHESTER    NY     10709  914 3376600      3    NA
698    POUGHKEEPSIE    NY     12603  845 4864470      1     1
943     SPRINGVILLE    NY 141411599  716 5923230      3    NA
7034      HAWTHORNE    NY 105322154  914 5928526      2     7
4664        JAMAICA    NY 114321120  718 3740002      2     7
2808       NEW YORK    NY 100403702  212 9233540      2     5
5984  RICHMONDVILLE    NY     12149  518 2343565      1     5
3127          BRONX    NY     10468  718 8177700      1     5
1872      OCEANSIDE    NY 115722206  516 6788518      1     3
4524        JAMAICA    NY 114323315  718 2916264      1     1
6802     QUEENSBURY    NY 128042914  518 8243610      1     2
64           COHOES    NY 120473299  518 2370100      3    NA
4333       FLUSHING    NY     11366  718 5919000      1     2
5681        SUFFERN    NY 109012999  845 3578315      1     1
6351 EAST NORTHPORT    NY 117313828  631 9122010      3    NA
2705       NEW YORK    NY 100295502  212 8768775      2     2
                                        location      lat       lng
720                PO+BOX+351+RHINEBECK+NY+12572 41.93183 -73.90744
3200           800+BAYCHESTER+AVE+BRONX+NY+10475 40.87559 -73.83379
6870               151+HYDE+PKY+PALMYRA+NY+14522 43.05766 -77.24426
6486 1900+WADNG+RVR+MNR+RD+WADING+RIVER+NY+11792 40.94442 -72.84216
6933        65+SIWANOY+BLVD+EASTCHESTER+NY+10709 40.94208 -73.81340
698          128+MEADOW+LN+POUGHKEEPSIE+NY+12603 41.66832 -73.87058
943           307+NEWMAN+ST+SPRINGVILLE+NY+14141 42.51741 -78.65592
7034          5+BRADHURST+AVE+HAWTHORNE+NY+10532 41.08292 -73.79658
4664             82-25+164TH+ST+JAMAICA+NY+11432 40.71878 -73.80335
2808     701+FT+WASHINGTON+AVE+NEW+YORK+NY+10040 40.85816 -73.93545
5984           PO+BOX+269+RICHMONDVILLE+NY+12149 42.63424 -74.56403
3127                75+W+205TH+ST+BRONX+NY+10468 40.87762 -73.89109
1872            186+ALICE+AVE+OCEANSIDE+NY+11572 40.62402 -73.63037
4524         87-41+PARSONS+BLVD+JAMAICA+NY+11432 40.70921 -73.80242
6802         455+AVIATION+RD+QUEENSBURY+NY+12804 43.32979 -73.68217
64                    7+BEVAN+ST+COHOES+NY+12047 42.77073 -73.71021
4333            158-40+76TH+RD+FLUSHING+NY+11366 40.72419 -73.80937
5681                 557+RT+202+SUFFERN+NY+10901 41.14364 -74.11263
6351   480+CLAY+PITTS+RD+EAST+NORTHPORT+NY+11731 40.86895 -73.29197
2705            309+E+103RD+ST+NEW+YORK+NY+10029 40.78837 -73.94306

or for short (which makes a good primary geography table that can be linked to the attribute table)

Code:
             beds      lat       lng
720  131801040002 41.93183 -73.90744
3200 321100010181 40.87559 -73.83379
6870 650901060001 43.05766 -77.24426
6486 580601040003 40.94442 -72.84216
6933 660302030000 40.94208 -73.81340
698  131601060013 41.66832 -73.87058
943  141101060000 42.51741 -78.65592
7034 660802999880 41.08292 -73.79658
4664 342900997801 40.71878 -73.80335
2808 310600145240 40.85816 -73.93545
5984 541102060002 42.63424 -74.56403
3127 321000011445 40.87762 -73.89109
1872 280211030009 40.62402 -73.63037
4524 342800010086 40.70921 -73.80242
6802 630902030003 43.32979 -73.68217
64    10500010000 42.77073 -73.71021
4333 342500010250 40.72419 -73.80937
5681 500401060012 41.14364 -74.11263
6351 580410030000 40.86895 -73.29197
2705 310400999536 40.78837 -73.94306
 

trinker

ggplot2orBust
#33
Oh my gosh, that was extremely helpful, Bryan. :) I'm not at bryangoodrich's level of understanding, but I am miles ahead of where I was yesterday and light years ahead of where I was a week ago.

bryangoodrich said:
Now what did I say? Don't bother trying this for 2,000 requests! Try it for, say, 20 and see if the first 20 worked. If you try it without the Sys.sleep, you'll notice that half of your requests will come back with a non-OK status due to the rate at which you're accessing the API (it's rate restricted as well as quota restricted, and it'll give the same OVER_QUERY_LIMIT status in each case).
I didn't quite understand the access rate problem you and Dason discussed before. I tried this with 10 requests and everything checked out, so I sent off 2,000. As you said, many came back NA. That's why. This is actually my first time using Sys.sleep for something useful. Spot on with the explanation.
 

bryangoodrich

Probably A Mammal
#34
Yeah, ten isn't enough. I only found out this problem when I reviewed my requests (551 of them) and noticed I had some missing at the end. Then I noticed I had a regular sequence of them missing throughout! The goal is to do a sample (my 500 weren't overburdening anyway), but the sample has to be big enough to catch possible problems. Ten would not suffice.
 

trinker

ggplot2orBust
#35
Sorry it's been a couple of days if you're following along. I've been wrapping my head around geocoding and how URLs work. Let me start by saying a big thank you to bryangoodrich. He's provided me with assistance and direction in the geocoding portion. When I first proposed this project I didn't realize how involved this step of the process was going to be for me.

Below I provide you with the script thus far, which includes two functions for geocoding: one to grab from Google and another to grab from the University of Southern California. The underlying services are capable of returning more than just latitude and longitude, but I narrowed the focus of the geocoding functions to represent the scope of our needs.

It will take a few days to grab all the geocodes, as we're limited to 2500 requests per day per website. You'll need to sign up for an API key from the University of Southern California. This is quick and pretty easy.

I assume many/most people won't want to go through the trouble of actually getting the geocodes over several days as I am doing. Therefore I will provide a link to the finished data set (including geocodes) when I have completed the task, and then I think we're ready to grab some demographic data, merge it with the geocodes, and do some plotting.

Code:
##########
# STEP 1 #
##########
path <- "http://www.nysed.gov/admin/SEDdir.txt"
cnames <- c("adminstrator", "school", "beds", "address", "city", "state", 
              "zip", "area", "phone", "record", "grade")
classes  <- c(rep("character", 9), rep("factor", 2))

dat <- read.csv(file = url(path), header = FALSE, strip.white = TRUE, sep= ",", 
           na.strings= c(" ", ""),stringsAsFactors = FALSE, col.names = cnames, colClasses = classes)

dat$locations <- as.character(with(dat, paste(address, city, state, zip, sep="+")))
NAer <- function(x) which(is.na(x)) #use to locate missing data
lapply(dat, NAer)   #use to locate missing data
dat <- dat[-283, ]   #remove observation 283
head(dat)
###########################################################################################
#                 STEP 2: CONVERT THE ADDRESS TO LAT AND LONGITUDE.                       #
###########################################################################################
# Many geocoding websites limit you to 2500 requests per day.                             #
# We will utilize two geocoding websites to work through the                              #
# problem faster.  The two sites we will be utilizing:                                    #
#                                                                                         #
# Google API:                                                                             #
# browseURL("http://code.google.com/apis/maps/articles/geocodestrat.html")                #
#                                                                                         #
# University of Southern California's geocoding web services:                             #
# browseURL("https://webgis.usc.edu/Services/Geocode/WebService/GeocoderWebService.aspx") #
# NOTE: This site requires you sign up for and use an api.key that you must provide to   #
# the csus_geocode function.                                                              #
###########################################################################################
# GOOGLE'S API GEOCODING FUNCTION #
###################################
google_geocode <- function(ADDRESS){
    require(XML); require(RCurl)  #getURL below comes from RCurl
    requestXML <- function(address) {
        URL <- paste("http://maps.google.com/maps/api/geocode/xml?sensor=false&address=", address, sep = "")
        xmlRequest <- getURL(URL)
        xml <- xmlToList(xmlRequest)
        return(xml)
    }

    A2 <- gsub("+", ",+", ADDRESS, fixed = TRUE)
    A3 <- gsub(" ", "+", A2, fixed = TRUE)
    x <- requestXML(A3)
    y <- c(lat = as.numeric(x$result$geometry$location$lat), lng = as.numeric(x$result$geometry$location$lng))
    return(y)
} #end of function
###########################################################################################
# df <- dat[sample(1:nrow(dat), 20), ]  #test it with 20 first
# df <- dat[, ]  #Select specific rows (chunks for geocoding)
addresses <- df$locations
latitude <- vector("numeric", length(addresses))
longitude <- vector("numeric", length(addresses))

for (n in seq(addresses)) {
  x <- google_geocode(addresses[n])
  latitude[n]  <- x[1]
  longitude[n] <- x[2]
  Sys.sleep(0.1)  # Suspend requests for a tenth of a second
}  # end of loop
 
df <- cbind(df, lat = latitude, lng = longitude)
###########################################################################################
# UNIVERSITY OF SOUTHERN CALIFORNIA'S GEOCODING FUNCTION #
##########################################################
csus_geocode <- function(ADDRESS, api.key = "PUT YOUR API KEY HERE"){
    require(XML); require(RCurl)  #getURL below comes from RCurl
    ADDRESS <- unlist(strsplit(ADDRESS, "+", fixed = TRUE))
    
    geocode_url <- function(address, city, state, zip, api.key){ #function to make urls
        address <- gsub(" ", "+", address, fixed = TRUE)
        city <- gsub(" ", "+", city, fixed = TRUE)
        zip <- substring(zip, 1, 5)
        z <- "http://webgis.usc.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V02_96.aspx?"
        x <- paste(z, "streetAddress=", address, "&city=", city, "&state=", state, "&zip=", 
            zip, "&census=false&format=xml&version=2.96&apiKey=", api.key, sep="") 
        return(x)
    } #end of url creation helper function

    requestXML <- function(URL) { #address to coordinates function
        xmlRequest <- getURL(URL)
        xmlRequest <- substr(xmlRequest, 4, nchar(xmlRequest)) #removes 4 garbage characters at the beginning
        xml <- xmlToList(xmlRequest)
        return(xml)
    }#end of address to coordinates function

    URL <- geocode_url(address = ADDRESS[1], city = ADDRESS[2], state = ADDRESS[3], zip =ADDRESS[4], 
        api.key = api.key)
    x <- requestXML(URL)
    y <- c(lat = as.numeric(x$OutputGeocode$Latitude), lng = as.numeric(x$OutputGeocode$Longitude))
    return(y)
} #end of function
#####################################################################################################
# df <- dat[sample(1:nrow(dat), 20), ]  #test it with 20 first
# df <- dat[, ]  #Select specific rows (chunks for geocoding)
addresses <- df$locations
latitude <- vector("numeric", length(addresses))
longitude <- vector("numeric", length(addresses))

for (n in seq(addresses)) {
  x <- csus_geocode(addresses[n])
  latitude[n]  <- x[1]
  longitude[n] <- x[2]
  Sys.sleep(0.1)  # Suspend requests for a tenth of a second
}  # end of loop

df <- cbind(df, lat = latitude, lng = longitude)
#####################################################################################################
#              GEOCODING A WAITING GAME :)                             #
########################################################################
# USE THE CSUS_GEOCODE AND GOOGLE_GEOCODE FUNCTIONS ABOVE TO TAKE      #
# CHUNKS OF AROUND 2000 ROWS FOR GEOCODING OVER A SEVERAL-DAY          #
# PERIOD. I WOULD SUGGEST SAVING THE CHUNKS TO AN .RDATA FILE TO PIECE #
# BACK TOGETHER WHEN YOU'VE FINISHED GEOCODING.                        #
########################################################################
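
For example, the chunk-and-reassemble workflow that banner describes might look like this (file names and row ranges are illustrative):

Code:
# Day 1: geocode the first chunk, then save it
chunk1 <- df  # df built from dat[1:2000, ] and run through the loop above
save(chunk1, file = "geo_chunk1.Rdata")
# Day 2: same for rows 2001:4000, saved as geo_chunk2.Rdata, and so on.
# When all chunks are done, piece them back together:
load("geo_chunk1.Rdata"); load("geo_chunk2.Rdata")
geocoded <- rbind(chunk1, chunk2)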
 

trinker

ggplot2orBust
#36
UPDATE:
4107 out of 7345 (~56%) complete with geocoding the data.

A few observations/notes
-It appears that the University of Southern California service takes 2 credits per geocode. Not sure why.
-I abandoned the JSON approach because something in my data set caused a "subscript out of bounds" error.
-At times Google's API will not return lat and long for some addresses, whereas the University of Southern California's always seems to.
-The Southern California site allows unlimited credits (even per day) once you become a partner (you just have to go on the website and add more on).
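
If the JSON route is ever worth revisiting, a guarded version of gGeoCode would likely avoid that error. A sketch, assuming the failure came from addresses where the API returned an empty results list (construct.geocode.url is the builder from the earlier script, with RCurl and RJSONIO loaded):

Code:
gGeoCodeSafe <- function(address) {
    u <- construct.geocode.url(address)
    x <- fromJSON(getURL(u), simplify = FALSE)
    if (length(x$results) == 0) return(c(NA, NA))  # no match: don't index into nothing
    loc <- x$results[[1]]$geometry$location
    c(loc$lat, loc$lng)
}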
 

bryangoodrich

Probably A Mammal
#37
What does it mean to "become a partner"? If I can reset my credits, I'll do it! lol. It might be something worth my agency investing in: 'donating' to USC to get unlimited geocoding services over the web (maybe securely) would be a great help for any company. Last I checked, USC doesn't have a JSON format. Check their documentation (I believe they have CSV, XML, KML, and a few others).

EDIT: When you store your geocoded data set, I'd recommend doing it in a compressed format like ".bz2". It's easily accessible on Linux and Windows (I believe WinZip can open it). For R users, I heard you can make a file connection to it to uncompress and grab across the web in one move. I've mentioned somewhere that I read on SO that ".zip" files are actually file structured, so you have to download the file to a temp document, decompress it, read it in, then delete your temp document. Not difficult, but a few steps more involved. Note, bzip2 gets higher compression rates, I think.
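
A sketch of both routes (file names and URL are placeholders):

Code:
# bzip2: R can decompress through a file connection in one move
dat <- read.csv(bzfile("geocoded.csv.bz2"))

# zip: archives are file-structured, so download, read a member, clean up
tmp <- tempfile(fileext = ".zip")
download.file("http://example.com/geocoded.zip", tmp)
dat <- read.csv(unz(tmp, "geocoded.csv"))
unlink(tmp)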
 

trinker

ggplot2orBust
#38
Becoming a partner is free; you just have to agree to the following:

The USC website said:
A partner is a user who utilizes the services on this site by being granted special account privileges. Becoming a partner means that you can use our services as many times as you like, for free. All that is required is:

i) you agree that your account information is complete, correct, and up-to-date,
ii) you agree that you have completely and truthfully filled out the form below,
iii) you agree to provide attribution of our services on your website, products, and/or services that are derived from the use of our services,
iv) you agree that you will not disrupt, interfere with, or otherwise abuse, our site, or any services, system resources, accounts, servers, or networks connected to or accessible through the our site,
v) you agree that you will not abuse your partner status in a manner that results in over-use of our site, or any services, system resources, or servers, and
vi) you agree that your organization's name and/or logo may be listed on our site.
After you fill out the form you will automatically be allocated 1250 transaction credits while our staff reviews your request.

Please note that our system monitors for user accounts that appear to be abusing their partner status. If a situation arises, the account in question will have its partner status automatically revoked while the matter is investigated.
bryangoodrich said:
Last I checked USC doesn't have a JSON format. Check their documentation (I believe they have CSV, XML, KML, and a few others).
Yeah, I knew that, so I had two different functions: one that used the Google API with RJSONIO and one that used the USC API with XML. It worked on a random 20, and I was sick of blowing through credits trying to figure out why the full run failed (that was before I knew I could get more credits, but I have to use XML with USC anyway).
 

bryangoodrich

Probably A Mammal
#39
You could keep your function generic and just adjust it with a parameter for which geocoding source you use (which could then include Bing, Yahoo, and others), if you wanted to extend the program! Something like

Code:
geocode <- function(address, format = "xml", api = "usc", ...) 
#  The API key for USC can be where the "..." gets placed, or it can be a designated parameter, but checked when api is Google to be ignored.
{
    if (api == "usc")
        path <- paste("http://webgis.usc.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V02_96.aspx?format", format, sep = "=")
    if (api == "google")  # This should be an if-else block with a final option returning an error about non-matched format options
        path <- if (format == "xml") "http://maps.google.com/maps/api/geocode/xml?" else "http://maps.google.com/maps/api/geocode/json?"
    # .... do geocoding requests now ...
}
I'll post my script for geocoding my crime data once I create the R scripts to grab their files and organize the data. I should get working on that this weekend at the latest.
 

trinker

ggplot2orBust
#40
I found out that once you dip below 2500 credits on USC's API, you can request another 2500 if you sign up to be a partner. That means in theory you can have 4999 credits at a time. I decided to run everything again through just USC so I can do it in two passes and rbind/merge the two files. I checked the time it takes to run 4500 records with the function above on Windows 7:

~45 minutes.

EDIT:
Finished geocoding the whole data set in two swipes (n = 7345).
It took ~30 minutes for the remaining part of the data, for a total of 1 hour and 15 minutes for the whole data set. Not bad for 7345 cases.

Next step(s): merge the files. I'll upload the file in some format tomorrow. I also plan on grabbing some interesting demographic data (probably ELA/math test scores) to augment the data already contained in the data file we have (may or may not use it).