Import JSON example

bryangoodrich

Probably A Mammal
#1
As some of you may have noticed, I've been playing around with JSON and R lately. I'm also trying to learn jQuery, a powerful JavaScript library (with its own syntax) that makes building dynamic, interactive web pages much easier. It also makes handling JSON data transports much simpler.

In that respect, I was looking at the examples on the getJSON help page.

You can literally copy-paste the following into Notepad and save it as "whatever.html". Open that document in a web browser, and it should show you a series of four images drawn from the Flickr API.

Code:
<!DOCTYPE html>
<html>
<head>
  <style>img{ height: 100px; float: left; }</style>
  <script src="http://code.jquery.com/jquery-latest.js"></script>
</head>
<body>
  <div id="images">

</div>
<script>
$.getJSON("http://api.flickr.com/services/feeds/photos_public.gne?jsoncallback=?",
  {
    tags: "cat",
    tagmode: "any",
    format: "json"
  },
  function(data) {
    $.each(data.items, function(i,item){
      $("<img/>").attr("src", item.media.m).appendTo("#images");
      if ( i == 3 ) return false;
    });
  });</script>

</body>
</html>
Now, if you want to see the data set the JSON request is working on, you need only look at the GET request of which it consists.

http://api.flickr.com/services/feeds/photos_public.gne?tags=cat&tagmode=any&format=json

As you can see, all we did was take the three API parameters (tags, tagmode, and format) as name=value pairs and append them to the URL, separated by ampersands. The request returns a big JSON string. Nicely, it comes with tab and newline formatting.

That is the JSON string that jQuery reads and uses to construct the HTML document. Anyone with some HTML experience should recognize what is going on, though a little documentation on the jQuery operations might help. Basically, you're appending image tags to the named div, using the URLs found in the specified JSON elements. It uses a "for each" loop, which is similar to a Python loop over a list or an R loop over a vector of values.

Anyway, this is where we come to "how can I get this JSON into R?"

I've been trying to make this happen. With the RCurl library, one can easily read it in directly, as above:

Code:
getURL("http://api.flickr.com/services/feeds/photos_public.gne?jsoncallback=?tags=cat&tagmode=any&format=json")
You'll see (say, by writing it to the clipboard and pasting it into Notepad or something) that we get the same result as the link above. Here is a snippet.

Code:
({
		"title": "Uploads from everyone",
		"link": "http://www.flickr.com/photos/",
		"description": "",
		"modified": "2012-01-24T23:20:03Z",
		"generator": "http://www.flickr.com/",
		"items": [
	   {
			"title": "IMG_4045",
			"link": "http://www.flickr.com/photos/27674175@N05/6757372941/",
			"media": {"m":"http://farm8.staticflickr.com/7006/6757372941_ab7c61426e_m.jpg"},
			"date_taken": "2012-01-14T15:25:31-08:00",
			"description": " <p><a href=\"http://www.flickr.com/people/27674175@N05/\">Filipe Vieira Moniz<\/a> posted a photo:<\/p> <p><a href=\"http://www.flickr.com/photos/27674175@N05/6757372941/\" title=\"IMG_4045\"><img src=\"http://farm8.staticflickr.com/7006/6757372941_ab7c61426e_m.jpg\" width=\"160\" height=\"240\" alt=\"IMG_4045\" /><\/a><\/p> ",
			"published": "2012-01-24T23:20:03Z",
			"author": "nobody@flickr.com (Filipe Vieira Moniz)",
			"author_id": "27674175@N05",
			"tags": ""
	   },
	   {
			"title": "_MG_9870",
			"link": "http://www.flickr.com/photos/talesofvesper/6757373121/",
			"media": {"m":"http://farm8.staticflickr.com/7015/6757373121_f9769d6f40_m.jpg"},
			"date_taken": "2012-01-24T01:02:03-08:00",
			"description": " <p><a href=\"http://www.flickr.com/people/talesofvesper/\">Mist and Lilies<\/a> posted a photo:<\/p> <p><a href=\"http://www.flickr.com/photos/talesofvesper/6757373121/\" title=\"_MG_9870\"><img src=\"http://farm8.staticflickr.com/7015/6757373121_f9769d6f40_m.jpg\" width=\"240\" height=\"160\" alt=\"_MG_9870\" /><\/a><\/p> ",
			"published": "2012-01-24T23:20:06Z",
			"author": "nobody@flickr.com (Mist and Lilies)",
			"author_id": "47130918@N05",
			"tags": ""
	   },
	   {
			"title": "IMG_1575",
			"link": "http://www.flickr.com/photos/jaimecuartas/6757373177/",
			"media": {"m":"http://farm8.staticflickr.com/7014/6757373177_33b5e59a21_m.jpg"},
			"date_taken": "2012-01-21T14:41:16-08:00",
			"description": " <p><a href=\"http://www.flickr.com/people/jaimecuartas/\">Jaime Cuartas<\/a> posted a photo:<\/p> <p><a href=\"http://www.flickr.com/photos/jaimecuartas/6757373177/\" title=\"IMG_1575\"><img src=\"http://farm8.staticflickr.com/7014/6757373177_33b5e59a21_m.jpg\" width=\"240\" height=\"160\" alt=\"IMG_1575\" /><\/a><\/p> ",
			"published": "2012-01-24T23:20:07Z",
			"author": "nobody@flickr.com (Jaime Cuartas)",
			"author_id": "58788327@N05",
			"tags": ""
	   },
The problem now is getting this string parsed, and I can't seem to make that work. I've tried other approaches from the RCurl library, but I either get a connection problem or an empty list (if I wrap the return string in textConnection or set the asText parameter to TRUE). My aim here is merely to get it into an R list that I can then drill down into to grab the URLs to the images or something. Importing this with fromJSON from the RJSONIO package has failed.

Any ideas? Anyone interested?
 

bryangoodrich

Probably A Mammal
#2
Well ****. I used the fromJSON command (much slower; entirely written in R) from the rjson package, and I got an error about the start and end brackets (parentheses). So I substringed those out, and it worked! I got a very convoluted list object. In particular, it is a list of 6 things, the 6th being the items element (see the jQuery use of data.items). The jQuery then does a "for each" loop on this. It works by defining two things: an object for the current item and an index for it (something R is lacking when doing apply operations!). I assume the convention is to use the singular form of your list name; in this case, for the list object items, the jQuery loop variable is item.
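
For reference, here is a minimal sketch of that cleanup (using sub with a regular expression rather than substring), assuming the return of the getURL call above is stored in 'json':

Code:
library(rjson)                     # rjson's fromJSON; RJSONIO works the same once the wrapper is gone
json <- sub("^\\s*\\(", "", json)  # strip the leading parenthesis of the JSONP wrapper
json <- sub("\\)\\s*$", "", json)  # strip the trailing parenthesis
x <- fromJSON(json)                # parse the cleaned JSON into an R list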

Now, the items list element is itself a list of 20 items, and each item is a list of 9 elements, as shown in the link or snippet above. In other words, it returns 20 images, each with a 9-part list structure describing its content. As in the jQuery example, we want the "m" value stored in the "media" object. We can access it for any element by

Code:
x$items[[2]]$media['m'] # the parsed JSON was stored in 'x'; 'media' holds a single named element, 'm'
Neat? I think so! I went back to the RJSONIO package, and it worked on the cleaned-up JSON, too. There has got to be a way to parse this more appropriately by having the parser remove the parentheses itself. Wrapping in parentheses is a default behavior for managing a JSON object in JavaScript--i.e., you do not just hand it the string, as we are doing in R here. You wrap it in parentheses and use a JSON parser ('eval' in JS works, but it is dangerous because it can execute arbitrary code; a proper parser validates that the string contains nothing but JSON).
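
To pull out all of the image URLs at once (the R analogue of the jQuery "for each" loop), something like this should work on the parsed list 'x':

Code:
urls <- sapply(x$items, function(item) item$media[["m"]])  # one URL per item
head(urls, 4)  # the first four, as used in the jQuery example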
 

bryangoodrich

Probably A Mammal
#3
I guess I could add that there are other methods that can be used, I am assuming, with a little bit of cleanup. For instance, getForm and similar RCurl functions allow you to send a request by supplying named parameters (or a named vector/list of parameters) to be sent to the web server. The example above sent the 3 parameters I mentioned: tags, tagmode, and format. We could do something like

Code:
getForm("http://api.flickr.com/services/feeds/photos_public.gne?jsoncallback=?", tags="cat", tagmode="any", format="json", binary = FALSE)
The return object isn't the same, though, and I haven't reworked my fix from above to cover those cases (it is the same JSON character vector, just with some attributes attached; fix it exactly the same way). Nevertheless, this is a more intuitive and easier-to-manage way to form the request. For instance, you could take user input in R for those parameters, store them in a named vector or list, and submit that directly as above. You could just as well paste them together yourself and form the getURL string above. For instance, suppose the parameters are in a named vector 'x'. Then something like the following should work

Code:
paste(names(x), x, sep = "=", collapse = "&")  # e.g., "tags=cat&tagmode=any&format=json"
Now append that to the URL (after a "?") and you have the original string I used above in the getURL call.
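
Putting the pieces together (with RCurl loaded; the parameter values are just the ones from this example):

Code:
x   <- c(tags = "cat", tagmode = "any", format = "json")
qry <- paste(names(x), x, sep = "=", collapse = "&")
url <- paste("http://api.flickr.com/services/feeds/photos_public.gne", qry, sep = "?")
json <- getURL(url)  # the same request as the getURL call above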

For those interested, I suggest reading more about POST vs. GET methods (and when to use each), and checking out the curl library of functions (libcurl). It is C ("C URL", get it?) functionality for operating over the internet, and it provides much of what RCurl exposes. I read in my investigations that Python already has more intuitive and easier ways of interfacing with the curl library, so it may just be easier, at this point, to use Python for more complex and systematic queries. Since I've also been on a kick regarding Google APIs, I'll also throw out that you can use cURL for those, too!
 

bryangoodrich

Probably A Mammal
#4
You thought I was done with my monologue? Ha!

You may be wondering why it is important to know these connections between R and web technologies. Take, for instance, the Port of Long Beach's Clean Air Action Plan. I had to access their historical data, which required me to manually copy-paste (into R) the presented tables, about 3 months at a time, for an average of 4 years across 6 different sites. It took me the better part of 2 days! There was no other accessible way to interface with their database that I could see at the time. Now, suppose they offered the ability to query the database like the API above. Then I could have literally requested all of the available data in one large JSON string that could easily be pulled into R. I can't imagine that taking more than 20 minutes at the most.

Another example: I had to interface with the Bureau of Labor Statistics FTP site. I created an R script to grab their somewhat tabular text files, since they were conveniently named by year. But imagine if they stored all their data in a database that was much easier to interface with and that accepted a web request like the one above. Then I could submit a query and receive the desired data as JSON, getting at precisely the information I wanted almost immediately. Moreover, they wouldn't have to give you the data in some esoteric form that requires parsing a "series id" that packs together several pieces of information. The data could already be split into appropriately normalized tables and made accessible through XML or JSON transport formats. I'm not claiming this is feasible for them, because they would have to restructure their entire website around the new database, but I could see it being better both for their own data management and for extending access to the public and researchers.

So if you ever have the chance to interact with database administrators (DBAs), you might want to help them see the sort of perspective I've tried to outline in this additional commentary. It may take a bit more work to design data interfaces, but it is ultimately a benefit to the data managers and especially useful for the users. In today's Internet-driven world, it is simply most effective to take advantage of these technologies. I say contribute!
 

bryangoodrich

Probably A Mammal
#5
In case you didn't notice my excitement in the chatbox, I've added the functionality discussed here to my website to test it out for myself. In the future, I may offer data this way for anyone who wants to follow my research. It may also prove to be a useful way to store and manage information for my website. In any case, it turns out to be really easy to use.

Code:
library(RCurl)
library(RJSONIO)
json <- getForm("http://www.bryangoodrich.com/api/get.php", format = "json")
fromJSON(json)
# $posts
# $posts[[1]]
# $posts[[1]]$post
#           name    awesomeness     profession         status 
#      "Trinker"      "lacking" "velociraptor"        "taken" 
# 
# 
# $posts[[2]]
# $posts[[2]]$post
#               name        awesomeness         profession             status 
#            "Dason" "Does Not Compute"            "Robot"            "D-Bot" 
# 
# 
# $posts[[3]]
# $posts[[3]]$post
#            name     awesomeness      profession          status 
# "bryangoodrich" "mind blowing!"   "Data Master"        "Single"
I also have an XML format

Code:
library(RCurl)
library(XML)
xmltree <- getForm("http://www.bryangoodrich.com/api/get.php", format="xml", limit=2)
xmlTreeParse(xmltree)
# $doc
# $file
# [1] "<buffer>"
# 
# $version
# [1] "1.0"
# 
# $children
# $children$persons
# <persons>
#  <person>
#   <name>Trinker</name>
#   <awesomeness>lacking</awesomeness>
#   <profession>velociraptor</profession>
#   <status>taken</status>
#  </person>
#  <person>
#   <name>Dason</name>
#   <awesomeness>Does Not Compute</awesomeness>
#   <profession>Robot</profession>
#   <status>D-Bot</status>
#  </person>
# </persons>
# 
# 
# attr(,"class")
# [1] "XMLDocumentContent"
# 
# $dtd
# $external
# NULL
# 
# $internal
# NULL
# 
# attr(,"class")
# [1] "DTDList"
# 
# attr(,"class")
# [1] "XMLDocument"         "XMLAbstractDocument" "oldClass"
Later I will add a "format=html" option that will print an HTML table of the underlying SELECT query's results to the web browser. The fact is, I can use these parameters to control pretty much whatever behavior I want my script to have. Given that I can easily transport information from a database in this fashion, and that it only took me one night and several lines of code to make it accessible, I can't believe data providers don't use this approach more often!

Edit: I also noticed that the XML package has a useful parameter for the case where the raw XML comes from a URL, as it does with my script.

Code:
xmlParseTree("http://www.bryangoodrich.com/api/get.php?format=xml&limit=2", isURL = TRUE)
This produces the same results as above, except that $file contains the URL string instead of "<buffer>".

Code:
xmlParseTree("http://www.bryangoodrich.com/api/get.php?format=xml&limit=2", isURL = TRUE, useInternalNodes = TRUE)
This returns the document as internal nodes, which print as the raw XML content. You can use a mix of getNodeSet and xmlValue to read in specific pieces of information (see their help files).
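
For example, a quick sketch pulling the names out of that document with an XPath query (the node names come from the output above):

Code:
library(XML)
doc <- xmlTreeParse("http://www.bryangoodrich.com/api/get.php?format=xml&limit=2",
                    isURL = TRUE, useInternalNodes = TRUE)
sapply(getNodeSet(doc, "//person/name"), xmlValue)
# [1] "Trinker" "Dason"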

An easy way to get this into tabular form would be

Code:
xmlToList(xmltree)
#             person         person            
# name        "Trinker"      "Dason"           
# awesomeness "lacking"      "Does Not Compute"
# profession  "velociraptor" "Robot"           
# status      "taken"        "D-Bot"
 

trinker

ggplot2orBust
#6
Bryan, I read through this and followed along with your examples.

My question is: how can I use this to get latitude and longitude from the Google API?
 

bryangoodrich

Probably A Mammal
#7
Since RCurl isn't working on my Linux box right now, I can't reproduce a specific example, but you should first become familiar with the simple case: requesting one address. The fact is, you already know how to do this. You set up your GET request (the appropriate URL), catch the returned JSON or XML string, and parse that string into an R object. I used a simple example above that you can test and replicate very easily. The only difference now is that you need to become familiar with what the Google Geocoding API returns. I cover that in my other thread, and you did it yourself already. I haven't tested this yet, but try something like

Code:
library(RJSONIO)  # To parse JSON strings
library(RCurl)  # To send GET requests, among many other things
json <- getURL("http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=6000+J+Street,+Sacramento,+CA")  # Full GET URL specifying an address
x <- fromJSON(json)  # Converts it to an R list
x$results[[1]]$geometry$location  # If this is correct, it should grab the 2-point named vector of lat and long for the given address.
Using the above procedure, you can gather your list of addresses and do this repeatedly. If you do it from Google, you'll need to tell the system to sleep for a fraction of a second, otherwise Google won't let you do more than ten requests in a row.
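
For example, an untested sketch of such a loop ('geocode' is just a hypothetical helper name, and the sleep interval is a guess; adjust it to the rate limit you actually hit):

Code:
library(RCurl)
library(RJSONIO)
geocode <- function(address) {
  url <- paste("http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=",
               URLencode(address, reserved = TRUE), sep = "")
  x <- fromJSON(getURL(url))        # parse the JSON response into an R list
  Sys.sleep(0.5)                    # brief pause so Google doesn't refuse repeated requests
  x$results[[1]]$geometry$location  # named vector with "lat" and "lng"
}
addresses <- c("6000 J Street, Sacramento, CA")  # add the rest of your addresses here
t(sapply(addresses, geocode))       # one row of lat/lng per address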

EDIT: I tested out the above after looking at the RCurl FAQ (needed to install an additional curl library component, I guess). I got (38.56567, -121.42564) for the lat-long of CSU Sacramento.

The full content of the JSON that Google returns can be viewed simply in your browser (LINK).

The fromJSON parse returns this

Code:
 str(x)
List of 2
 $ results:List of 1
  ..$ :List of 4
  .. ..$ address_components:List of 9
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "6000"
  .. .. .. ..$ short_name: chr "6000"
  .. .. .. ..$ types     : chr "street_number"
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "California State University - Sacramento"
  .. .. .. ..$ short_name: chr "California State University - Sacramento"
  .. .. .. ..$ types     : chr "establishment"
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "J St"
  .. .. .. ..$ short_name: chr "J St"
  .. .. .. ..$ types     : chr "route"
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "Central Sacramento"
  .. .. .. ..$ short_name: chr "Central Sacramento"
  .. .. .. ..$ types     : chr [1:2] "neighborhood" "political"
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "Sacramento"
  .. .. .. ..$ short_name: chr "Sacramento"
  .. .. .. ..$ types     : chr [1:2] "locality" "political"
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "Sacramento"
  .. .. .. ..$ short_name: chr "Sacramento"
  .. .. .. ..$ types     : chr [1:2] "administrative_area_level_2" "political"
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "California"
  .. .. .. ..$ short_name: chr "CA"
  .. .. .. ..$ types     : chr [1:2] "administrative_area_level_1" "political"
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "United States"
  .. .. .. ..$ short_name: chr "US"
  .. .. .. ..$ types     : chr [1:2] "country" "political"
  .. .. ..$ :List of 3
  .. .. .. ..$ long_name : chr "95819"
  .. .. .. ..$ short_name: chr "95819"
  .. .. .. ..$ types     : chr "postal_code"
  .. ..$ formatted_address : chr "6000 J St, California State University - Sacramento, Sacramento, CA 95819, USA"
  .. ..$ geometry          :List of 3
  .. .. ..$ location     : Named num [1:2] 38.6 -121.4
  .. .. .. ..- attr(*, "names")= chr [1:2] "lat" "lng"
  .. .. ..$ location_type: chr "ROOFTOP"
  .. .. ..$ viewport     :List of 2
  .. .. .. ..$ northeast: Named num [1:2] 38.6 -121.4
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "lat" "lng"
  .. .. .. ..$ southwest: Named num [1:2] 38.6 -121.4
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "lat" "lng"
  .. ..$ types             : chr "street_address"
 $ status : chr "OK"
It is a list of 2: results and status. The status is just a character string giving the status of the request. The results element is a list of one unnamed element, which is why I had to do x$results[[1]] to get the full body of content. That content holds a lot of things, like the geocoded information. To understand more about this, see the JSON website, which explains what a JSON object is. In its simplest form, it is just a name:value pair

{"name":value}

Where here, "value" can be a number of things, like a string

{"name":"OK"}

A number

{"Latitude":-121}

An array

{"Results":[{"geometry":"stuff"},{"address":"stuff"}]}

The above is an array of objects, but it could also be an array of arrays, and we could have multiple name:value pairs alongside "Results" (as in the practical example here). Play around with this and the request I provided from my website. You can also ask the API to return XML and parse that all the same. The idea is to understand how R translates the provided format, whatever it is, into an R object.
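
For instance, here is a sketch of the XML route for the same geocoding request, assuming Google's XML response mirrors the JSON structure (a result node containing geometry/location with lat and lng children):

Code:
library(XML)
doc <- xmlTreeParse("http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address=6000+J+Street,+Sacramento,+CA",
                    isURL = TRUE, useInternalNodes = TRUE)
as.numeric(sapply(getNodeSet(doc, "//result/geometry/location/*"), xmlValue))  # lat, lng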
 

trinker

ggplot2orBust
#8
Code:
> json
[1] "{\n   \"results\" : [],\n   \"status\" : \"OVER_QUERY_LIMIT\"\n}\n"
Ughhhh! the wait!!!!!!!!!!!!!!!!!!!!!!
 

bryangoodrich

Probably A Mammal
#9
You can still parse that! The only difference is that, once your request goes through, that empty array "[]" will have stuff in it: the stuff the documentation specifies will be returned. You already know how to access it in XML, and it is no different here: use fromJSON and access the result as a list in R.
 
#10
Hi all,

I am new to this forum (and new to importing JSON!); it has helped me several times, but this time I can't get around my problem without asking.
My problem is quite simple: I want to build a distance matrix between several US universities, using data from Google Maps. With the Google Distance Matrix API, I can get a JSON response from a request of the following form (I am starting with a simple example, i.e. the distance from Harvard to Berkeley): http://maps.googleapis.com/maps/api/distancematrix/json?origins=HARVARD&destinations=Berkeley&mode=driving&language=en-EN&sensor=false

Thanks to your posts, I have managed to extract the information I need, namely the distance. It works with the following code:
Code:
json <- getURL("http://maps.googleapis.com/maps/api/distancematrix/json?origins=HARVARD&destinations=Berkeley&mode=driving&language=en-EN&sensor=false")
x <- fromJSON(json)
dist<-x$rows[[1]]$elements[[1]]$distance$value
Now, my problem comes when I want to start building a matrix, let's say a 2x2 matrix for starters. The JSON request would be the following: http://maps.googleapis.com/maps/api/distancematrix/json?origins=Harvard|Berkeley&destinations=Berkeley|Harvard&mode=driving&language=en-EN&sensor=false

From there, I don't know how to build a symmetric distance matrix. I guess it would be something of the form:
Code:
dist<-x$rows[[?]]$elements[[?]]$distance$value
But I don't know what to put between the '[[ ]]' to get the 4 distance values.

Any advice would be great; since I'm quite new to R, I apologize if the answer seems obvious!

Thanks,

Fred