Web Scraping in Linux

bryangoodrich

Probably A Mammal
#1
Okay, I originally did not want to post this on the public forum because the website might not approve, given that you're taxing their server and bandwidth. Nevertheless, I modified this example to keep the load down.

The website http://www.hospitalsworldwide.com provides information about hospitals it has received data on. It's not necessarily a complete or current account, but it can prove useful for a data analyst. In my case, I'm interested in their geographic data because I may do some database formulation and basic analysis with it (health related). To collect that data, I don't need anything beyond the tools available in Linux, plus the cURL utility curl. That's an easy apt-get install curl away, however.

Below are the 3 statements required to get an output file of addresses that could then be uploaded into a GIS program and geocoded. I may see what I can do with the open source Quantum-GIS (QGIS) program or even Python and Google API (they have a geocoding service).

Code:
curl http://www.hospitalsworldwide.com/usa_states/california.php | grep -o '/listings/[[:digit:]]*\.php' | awk '{printf "url = \"http://www.hospitalsworldwide.com%s\"\n", $0}' | head -n 3 > urls.txt
curl -K urls.txt > html.txt
grep "var address = \".*" html.txt | cut -d"\"" -f2 > addresses.txt
It turns out that California has 551 hospitals listed in their data. To keep this respectful to their servers, I added an extra pipe to the end of line 1: head -n 3. This utility takes the output and limits it to 3 lines (the -n parameter). Thus, only 3 web pages will be requested (line 2) from their server, and only 3 addresses will end up in the final text file.

Explanation

The curl utility is a very powerful command-line tool built on the C library libcurl. Basically, everything your web browser does, you can do from the command line to automate requests of all sorts (FTP, HTTPS, and more!). In its simplest form, you hand it a URL; written out explicitly as a url = "... url here ..." entry, that is exactly the form the awk output produces.
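
To make that concrete, here is a minimal sketch of the two ways to hand curl a URL: directly on the command line, or as a url = "..." entry in a config file read with -K. The file name example.txt is just illustrative.

Code:
# Fetch a page by giving curl the URL directly
curl http://www.hospitalsworldwide.com/usa_states/california.php

# Or put the same request in a config file and hand it to curl with -K
echo 'url = "http://www.hospitalsworldwide.com/usa_states/california.php"' > example.txt
curl -K example.txt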

Thus, I feed it the URL for the listing of California hospitals. The web page is basically a bunch of links to the individual hospital pages of interest. All those pages fall under the subdirectory "/listings/###.php", where "###" is some number. Thus, I pipe the HTML content returned by curl into grep. Using grep, I look for the pattern specified above and return only the matching content (the -o parameter).

If you stopped here, this would print to your screen a list of 551 relative paths. We have merely parsed the HTML document returned by curl. To make this usable, we use awk to build the full (absolute) URLs by prefixing those relative paths with url = "http://www.hospitalsworldwide.com". That url = "..." form is an explicit parameter for curl, which gives rise to line 2.
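
As a quick sketch of just that step (the listing number 12345 is made up for illustration), the awk program turns a relative path into a full curl config entry:

Code:
# Feed a sample relative path through the same awk program used in line 1
echo "/listings/12345.php" | awk '{printf "url = \"http://www.hospitalsworldwide.com%s\"\n", $0}'
# Prints: url = "http://www.hospitalsworldwide.com/listings/12345.php"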

Before we look at that, however, I should explain what awk is doing. We define a program: the stuff between the curly brackets. Since we're doing it at the command line, the program needs to be enclosed in quotes. We use single quotes so there's no conflict with the double quotes inside the program. (Note that we escape the quotes inside the print statement so we get literal quotes in the output.) The "%s" just indicates "put a string here" in our print string. To define what string goes there, we use "$0", which means "use the entire line." If we had split the incoming line on some field delimiter, we could use "$2" to grab the second field, and so on.
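
To illustrate the difference between $0 and a numbered field (the sample lines below are made up), compare:

Code:
# $0 is the whole line; $2 is the second whitespace-delimited field
echo "alpha beta gamma" | awk '{print $0}'   # alpha beta gamma
echo "alpha beta gamma" | awk '{print $2}'   # beta

# With an explicit field delimiter (-F), fields split on that character instead
echo "one,two,three" | awk -F',' '{print $2}'   # two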

That awk program should be pretty clear now. We pipe its output to head, as explained earlier, to limit the output: we only want to feed line 2 a small amount of content.

Line 2 is short. We use the -K parameter, which means "read a list of parameters from a file." Since it requires a file, we could not pipe the first line straight into this command. Thus, line 1 finishes by redirecting its output (the url = "..." absolute URLs) into a urls.txt file. This is the input file to curl. If you did the full 551, this would take several minutes to complete. It taxes their server. Don't do that!
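
For reference, the urls.txt produced by line 1 is just a plain text file with one url = "..." entry per line, something like this (the listing numbers here are invented for illustration):

Code:
url = "http://www.hospitalsworldwide.com/listings/10001.php"
url = "http://www.hospitalsworldwide.com/listings/10002.php"
url = "http://www.hospitalsworldwide.com/listings/10003.php"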

Line 2 finishes by redirecting the output to a file. It could probably be piped into line 3, but since the full operation could take a while, I prefer to get it all done at once. The output is literally just one big file with all the HTML content from the 3 (or 551) pages requested by curl.

Since we now have all the data we require, we merely need a way to pull it out. This is where grep returns. Web page content isn't always uniform, but here it is, because every page uses the same JavaScript functions with the address hard-coded into them in a var address = "... address here ..." statement. Line 3 should be obvious now. Since the address is a quoted string, we pipe the results of this grep command into cut, telling it the data is double-quote-delimited. That gives us two fields: "var address = " and the raw address string. With cut we specify that we want the latter (the -f parameter). By redirecting this result into a file, we have the final product: a text file full of addresses.
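
Here is that extraction step run on a single made-up line of the page's JavaScript, just to show what cut is doing:

Code:
# The address is the second double-quote-delimited field on the line
echo 'var address = "1411 E 31st St, Oakland, CA 94602";' | cut -d'"' -f2
# Prints: 1411 E 31st St, Oakland, CA 94602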

If I use Google API to geocode this, I could probably get away with using curl again, feeding a modified form of this file (using awk) into another curl -K statement. Since the API is nothing but a web request, curl can handle this operation. By doing that, I can easily get the geocoded information I desire. That, however, will have to wait for another day, as I don't know the API, nor what the return content will be.

I could also modify the above commands to collect more information--specifically the last line, since I don't need to download the content all over again as in line 2. I could grep out other useful items listed on the web page, such as the number of beds each hospital has. That just requires a more sophisticated regular expression or multiple calls, each isolating the specific content desired. These could all be pasted together or joined in some other way, as shown below. (Yes, paste and join are other Linux utilities I could make use of; see their manual pages for more details.)
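
For example, if I had also grepped out a beds.txt alongside addresses.txt (the beds.txt file is hypothetical here), paste would stitch them together column by column:

Code:
# Combine the two extracts line-by-line into a tab-delimited file
paste addresses.txt beds.txt > hospitals.txt

# Or make it comma-delimited instead
paste -d',' addresses.txt beds.txt > hospitals.csv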

I hope this proves useful to anyone interested in web scraping (web mining). It also shows the power of curl and how awesome Linux is. Without porting these sorts of utilities, Windows simply cannot keep pace!

EDIT: I listed a quick summary of the data utilities Linux provides that are important here, though it's probably not complete. There are further explanations and examples of awk and cut I've posted around TalkStats before, but it would belabor the point to hunt them down and add them right now.
 

bryangoodrich

Probably A Mammal
#2
Looking at the Google Geocoding API just now, this could seriously be easier than I thought! As the website explains, you do something like

Code:
http://maps.googleapis.com/maps/api/geocode/json?address="address goes here"
This would return JSON strings containing all the information I requested. For details on processing JSON using R, check out my other thread on that.

If you understand what I did above with curl, you should see that I could use awk and the output from line 3 of my Linux commands to produce a file of curl statements. They would basically prepend all the above content to the addresses returned by the current version of line 3. In fact, if I adjust line 3 to skip cut entirely and instead use grep or awk to drop everything before the quotes, I could return the quoted addresses as-is. This has the benefit that they are ready to be appended to the curl statement. Moreover, I could construct the cURL request differently--e.g., using the -d parameter, which passes named parameters like address="... address here ...".

I would then have a file containing GET requests to send to the Google Geocoding API. Their limit is well beyond the 551 I require. I could then do as I did with line 2.

Code:
curl -K apirequests.txt > geocode.json
This should put all the JSON into one file. This file can be parsed using JavaScript very easily or, as I demonstrated, with R for further processing.
 

bryangoodrich

Probably A Mammal
#3
Awesome. Simply awesome!

So here is how I replaced line 3

Code:
grep "var address =" html.txt | grep -o "\".*\"" | awk '{gsub(" ", "+"); len = length($0); printf "url = \"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=%s\"\n", substr($0, 2, len-2);}'  > apis.txt
curl -K apis.txt > geocode.json
I make two calls to grep. The first finds the address lines, and the second filters out only the quoted portion of each string. I then use awk as the workhorse to create my curl parameters. It is a multi-step program put on one line, so let me break it down.

Code:
awk '{
  gsub(" ", "+");
  len = length($0);
  printf "url = \"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=%s\"\n", substr($0, 2, len-2);
}'
The first line uses the awk built-in function gsub. It returns the number of replacements that occurred, but behind the scenes it performs the replacement in place. Since I didn't specify its third parameter (the string to operate on), it defaults to "$0", which, if you recall, is the entire record (line). At this point the whitespace in the quoted address has been converted to "+" signs (required when sending URLs). I then want the length of this record, since it varies line by line, and I store it in a variable 'len'. Then I print a formatted string as done earlier. In this case, I'm putting together the full API URL with the GET query string "json?parameters". The first parameter is "sensor", which is required; I set it to false. The second is the address. This is the string we're dealing with, so I mark its place with "%s". We're not sending the entire line, though; we're sending a substring of it: from just past the opening quote to just before the closing quote. I tested it and starting at "1" didn't work (it kept the quotes), which is why I used substr($0, 2, len-2).
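
Here is that same awk program run on a single quoted address (one of the ones scraped earlier, used as an example), so you can see each piece in action:

Code:
# gsub swaps spaces for "+", then substr strips the surrounding quotes
echo '"1411 E 31st St, Oakland, CA 94602"' | awk '{
  gsub(" ", "+");
  len = length($0);
  printf "url = \"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=%s\"\n", substr($0, 2, len-2);
}'
# Prints: url = "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=1411+E+31st+St,+Oakland,+CA+94602"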

I redirect these parameters to a file apis.txt. I then use that, as before, with curl and redirect the return JSON strings to the geocode.json file. It took maybe a little over a minute to do my 551 requests. I now have all the data in a structured format. Two important pieces remain: to check which, if any, have a status not equal to OK, and then to extract all the latitude and longitude values stored under geometry. These are what I will use when mapping in a GIS. I can easily use R or Python to handle this file and do both of these tasks. That, however, will have to wait until tomorrow!

Note, I'm not sure, but I may have been able to simplify this with different curl parameters, such as the '-d' flag. It takes named variables that specify each parameter. I may have been able to skip the awk portion and use the full quoted string (to be converted by curl itself). Then I would have done something like

Code:
grep "var address =" html.txt | grep -o "\".*\"" | awk '{printf "url=\"http://maps.googleapis.com/maps/api/geocode/json?\" -d sensor=\"false\" -d address=%s -G", $0}' > apis.txt
The above isn't correct; I'm just speculating. The "-G" flag makes curl send the data as a GET request, which is what this is, since the parameters end up in a fully formed URL (as I've been constructing them).

EDIT: I couldn't go away without having it right. The "-d" is a short flag for "--data", and there are multiple ways to submit it. For one, there is "--data-urlencode", which performs the conversion that happens when you enter a raw string into a form on a website and submit it: it encodes the string the way I did above with awk. Thus, the correct way is

Code:
grep "var address =" html.txt | grep -o "\".*\"" | awk '{printf "-G --data-encode sensor=\"false\" --data-encode address=%s url=\"http://maps.googleapis.com/maps/api/geocode/json?\", $0}' > apis.txt
Doing it this way, I'm specifying the URL to the API and passing the parameters as parameters to curl. It is curl that effectively puts together the URL string I built before. The difference is that an API may not use GET; it might use POST, which isn't just a URL. In that case, you need to pass the parameters like this. Honestly, it's questionable which approach is "better." The nice thing about GET is the simplicity of constructing a URL, but as we can see, either approach gets complicated. In any case, it's done! But these are things to keep in mind for our next challenge.
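
As a sanity check of those flags outside the config-file setup, here is roughly what a single hand-typed request would look like (the address is just one of the scraped examples); this is a sketch, not something I've batch-tested:

Code:
# -G forces a GET, and --data-urlencode converts the raw address for the URL
curl -G "http://maps.googleapis.com/maps/api/geocode/json" \
     --data-urlencode "sensor=false" \
     --data-urlencode "address=1411 E 31st St, Oakland, CA 94602"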
 

bryangoodrich

Probably A Mammal
#4
Hit a snag here, in case anyone is interested in solving it.

When I output the JSON for the multiple addresses to a file, I get pretty-printed output (i.e., with whitespace and newlines) like

Code:
{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "1411",
               "short_name" : "1411",
               "types" : [ "street_number" ]
            },
            {
               "long_name" : "E 31st St",
               "short_name" : "E 31st St",
               "types" : [ "route" ]
            },
            {
               "long_name" : "Lynn",
               "short_name" : "Lynn",
               "types" : [ "neighborhood", "political" ]
            },
            {
               "long_name" : "Oakland",
               "short_name" : "Oakland",
               "types" : [ "locality", "political" ]
            },
            {
               "long_name" : "Alameda",
               "short_name" : "Alameda",
               "types" : [ "administrative_area_level_2", "political" ]
            },
            {
               "long_name" : "California",
               "short_name" : "CA",
               "types" : [ "administrative_area_level_1", "political" ]
            },
            {
               "long_name" : "United States",
               "short_name" : "US",
               "types" : [ "country", "political" ]
            },
            {
               "long_name" : "94602",
               "short_name" : "94602",
               "types" : [ "postal_code" ]
            }
         ],
         "formatted_address" : "1411 E 31st St, Oakland, CA 94602, USA",
         "geometry" : {
            "bounds" : {
               "northeast" : {
                  "lat" : 37.79957890,
                  "lng" : -122.23294220
               },
               "southwest" : {
                  "lat" : 37.79956450,
                  "lng" : -122.23294640
               }
            },
            "location" : {
               "lat" : 37.79956450,
               "lng" : -122.23294640
            },
            "location_type" : "RANGE_INTERPOLATED",
            "viewport" : {
               "northeast" : {
                  "lat" : 37.80092068029150,
                  "lng" : -122.2315953197085
               },
               "southwest" : {
                  "lat" : 37.79822271970850,
                  "lng" : -122.2342932802915
               }
            }
         },
         "types" : [ "street_address" ]
      }
   ],
   "status" : "OK"
}
{ ... begin next address ... }
What I want to do is put a comma after that last closing bracket of each response. I can then wrap the entire file content in an opening and closing bracket with a key name. This effectively makes it one huge JSON object, {"geocoded": [{address}, {address}, {address}]}, as opposed to the current bare sequence {address}{address}{address}. My first thought was to incorporate this into the geocoding process by piping the results of the curl call to the Google Geocoding API straight into awk. However, awk operates on a line-by-line basis. I also have all the content in a file right now, so there's no need to run through the API all over again (granted, it only takes a minute, and Google allows a ton of requests anyway).

Any ideas? While writing this, it occurred to me I could try something grep-like. If the closing bracket of the parent object is always alone on its own line, I could have awk replace the line with "}," whenever "^}" matches (i.e., a closing bracket at the start of the line). It's a bit more complicated of an awk program, but it should get the job done (i.e., if (grep("^}", $0) > 0) sub("^}", "},"); print $0).

EDIT: On second thought, this should actually be easier (and awk doesn't have a grep function; it has sub and gsub functions that return the number of replacements that occurred). I simply do the gsub and print the line. Either it gets replaced or it doesn't, based on the pattern match. It looks like it worked:

Code:
cat geocode.json | awk '{gsub("^}", "},"); print $0}' > somefilename.json
And the final output I create will instead be of the form {"geocoded":[{address},{address},...]}. This can easily be parsed and just requires putting a little something on the first and last lines of the file. I might make use of paste with the '-s' flag (serial--i.e., combine files end-to-end instead of side-by-side).
 

Dason

Ambassador to the humans
#5
The thing you're describing at the end can be done a lot easier with sed and is almost a perfect example of when to use sed. It's pretty much the only reason I ever use sed but oh well.

sed 's/INPUTPATTERN/OUTPUTPATTERN/g' inputfile.txt

That does a search (s) for the input pattern and replaces it with the output pattern; the g tells it to replace every occurrence on a line, not just the first, and sed runs this over the entire file. You can also edit the file in place (-i) instead of redirecting to a new file, but I would only do that if you know what you're doing.
Code:
[07:08:36][dasonk@Snedecor:~](307): cat test.txt 
#/Stuff blah
#}
Ok
{
	{
	hey
	no
	}
}
{
	{
	blah
	si
}

[07:08:48][dasonk@Snedecor:~](308): sed 's/^}/},/g' test.txt > out.txt
[07:08:53][dasonk@Snedecor:~](309): cat out.txt 
#/Stuff blah
#}
Ok
{
	{
	hey
	no
	}
},
{
	{
	blah
	si
},
 

bryangoodrich

Probably A Mammal
#6
Turns out I could make use of awk BEGIN and END blocks. I found out, however, that scattered through the middle of my JSON content are a bunch of "over query limit" responses. Maybe it's an issue with the number of requests within a time frame. I do believe curl has timing options to wait between requests. I'll have to redo it and use those, because I know Google allows far more than 551 requests! I'll just spread them out over an appropriate time frame. Redoing these steps is nearly instantaneous.

Code:
cat geocode.json | awk 'BEGIN {print "{\"geocoded\":["} {gsub("^}", "},"); print $0} END {print "]}"}' > somefilename.json
 

bryangoodrich

Probably A Mammal
#7
Turns out the reason I get "over query limit" responses from the API is that the service is intended for website use, not individual users. They expect requests to come from a website that then puts the returned object on a map or something. As a result, I can get about 12 requests through before it chokes on the next dozen. The only ways I can think of to avoid this are to find another geocoding service, like USC geocoding or ESRI--the latter, I believe, requires my software license or something--or to find a way to pause between my requests. I've heard waiting 200ms between requests should work. The problem is that when doing it from a file for batch processing, the only option I can see slows down the transfer for a given request, not the gap between requests. I may need to use something like R or Python to drive cURL and use their pausing capabilities. That would step away from my goal of only using basic Linux tools, however. Choices ...

EDIT: On this topic, I recommend the interested reader look at Google's Geocoding Strategies. While I could get away with this, and it was a good exercise in learning how to process data through various web resources, I may turn to the USC link above or to ESRI, as I do have access to their software, just not from home (it doesn't run on Linux). Since I'm trying to remain open source, I may go the university route.

EDIT: 30 January 2012

I mentioned ESRI. Here is the link to their North American Locator.
 

Dason

Ambassador to the humans
#8
Why not just do it in a bash script that sleeps between calls?

Code:
#!/bin/bash
for i in $( yoururls ); do
   #your curl command on i
   sleep .02
done
I don't really use curl ever but does the --retry option help out?
 

bryangoodrich

Probably A Mammal
#9
Yeah, I was thinking I could do a bash script, and I could use the practice. It would give me more control over what I'm doing. As for the retry parameter, that re-issues the request when the response is some sort of error code (e.g., a 404). I'm not getting error codes; I get JSON with a status I don't like. To be legit, and to learn more, I think I'll just use the university geocoding service. I could technically pull the file of addresses into ArcGIS and geocode it from within (or through Python, I'm sure), but I'd rather use the service through some API. I want my project to focus on web-based data connectivity.