Okay, I originally did not want to post this on the public forum because the website might not approve, given that you're taxing their server and bandwidth. Nevertheless, I modified this example to keep the load down.
The website http://www.hospitalsworldwide.com provides information about the hospitals it has received data on. It's not necessarily a complete or current account, but it is one that can prove useful for a data analyst. In my case, I'm interested in their geographic data because I may do some database formulation and basic analysis using this data (health related). To collect it, I don't need anything beyond the tools available in Linux, plus the cURL utility curl. That's an easy apt-get install curl away, however.
Below are the three statements required to produce an output file of addresses that could then be uploaded into a GIS program and geocoded. I may see what I can do with the open source Quantum GIS (QGIS) program, or even Python and the Google API (they have a geocoding service).
Code:
curl www.hospitalsworldwide.com/usa_states/california.php | grep -o /listings/[[:digit:]]*.php | awk '{printf "url = \"www.hospitalsworldwide.com%s\"\n", $0}' | head -n 3 > urls.txt
curl -K urls.txt > html.txt
grep "var address = \".*" html.txt | cut -d"\"" -f2 > addresses.txt
It turns out that California has 551 hospitals listed in their data. To keep this respectful of their servers, I added an extra pipe to the end of line 1: head -n 3. This utility limits its input to the first 3 lines (the -n parameter). Thus, only 3 web pages will be requested from their server (line 2), and only 3 addresses will end up in the final text file.
Explanation
The curl utility is a very powerful command-line tool built on the C library libcurl. Basically, all that stuff your web browser does, you can do from the command line to automate requests of all sorts (FTP, HTTPS, and more!). In its simplest form, you just hand it a URL. (The explicit url = "... url here ..." lines that awk produces in line 1 are the same thing spelled out in the syntax curl expects in a config file, which is where line 2 comes in.)
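To make that simplest form concrete, here's a minimal sketch of curl on its own; example.com is just a stand-in URL, and -s and -o are standard flags for silencing the progress meter and writing the response to a file rather than the screen.
Code:
# Fetch a page and print its HTML to the screen (stdout)
curl http://www.example.com/

# The same request, run quietly (-s) and saved to a file (-o)
curl -s -o page.html http://www.example.com/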
Back to line 1: I feed curl the URL for the listing of California hospitals. That web page is basically a bunch of links to the individual hospital pages of interest, all of which fall under paths of the form "/listings/###.php", where "###" is some number. So I pipe the HTML content returned by curl into grep, look for that pattern, and return only the matching text (the -o parameter).
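To see what -o buys you in isolation, you can feed grep a fake line of HTML (the listing number 12345 is made up) and watch it return only the matching portion rather than the whole line:
Code:
# -o prints only the part of each line that matches the pattern
echo '<a href="/listings/12345.php">Some Hospital</a>' | grep -o '/listings/[[:digit:]]*.php'
# output: /listings/12345.php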
If you stopped here, this would print to your screen a list of 551 relative paths; we have merely parsed the HTML document returned by curl. To make these usable, we use awk to turn them into absolute URLs, prefixing each relative path with the domain and wrapping the result in the url = "..." form. That is the explicit config syntax for curl, and it gives rise to line 2.
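Concretely, one relative path goes into that awk statement and one curl config line comes out (again, the listing number is made up):
Code:
# One relative path in, one url = "..." config line out
echo '/listings/12345.php' | awk '{printf "url = \"www.hospitalsworldwide.com%s\"\n", $0}'
# output: url = "www.hospitalsworldwide.com/listings/12345.php"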
Before we look at that, however, I should explain what awk is doing. We define a program: it is the stuff between the curly brackets. Since we're doing it at the command line, the program needs to be passed as a string, and we use single quotes so there's no conflict with the double quotes inside the program. (Note that we escape the quotes inside the printf statement so we get literal quotes in the output.) The "%s" just indicates "put a string here" in our format string. To define what string goes there, we use "$0", which means "use the entire line." If we had split the incoming line on some field delimiter, we could use "$2" to grab the second field, and so on.
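As a quick illustration of $0 versus a numbered field (the line of data here is invented), compare grabbing the whole line with grabbing the second comma-delimited field:
Code:
# $0 is the entire line; $2 is the second field once a delimiter is set with -F
echo 'General Hospital,123 Main St,Anytown' | awk -F, '{print $0}'
# output: General Hospital,123 Main St,Anytown
echo 'General Hospital,123 Main St,Anytown' | awk -F, '{print $2}'
# output: 123 Main St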
That awk program should be pretty clear now. We then pipe its output to head, as explained earlier, to limit the output. We only want to move on to line 2 with a limited set of URLs.
Line 2 is short. We use the -K parameter, which means "read a list of curl options from a file." Since it requires a file, we could not simply pipe line 1 into this command; that is why line 1 finishes by redirecting its output (the url = "..." lines) into urls.txt. That file is the input to curl here. If you do the full 551, this takes several minutes to accomplish and taxes their server. Don't do that!
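For reference, the urls.txt handed to -K is just a plain text file of curl options, one per line, along these lines (the listing numbers are made up):
Code:
url = "www.hospitalsworldwide.com/listings/12345.php"
url = "www.hospitalsworldwide.com/listings/12346.php"
url = "www.hospitalsworldwide.com/listings/12347.php"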
Line 2 finishes by redirecting its output to a file. It could probably be piped straight into line 3, but since the full operation can take a while, I prefer to get the downloading done all at once. The output is literally just one big file with all the HTML content from the 3 (or 551) pages requested by curl.
Since we now have all the data we require, we merely need a way to pull it out, which is where grep returns. Web page content isn't always uniform, but these pages all use the same JavaScript functions, and the address on each page is hard-coded into them with a var address = "... address here ..." statement. So line 3 should be obvious now. Since the address is a quoted string, we pipe the results of this grep command into cut and treat the data as double-quote-delimited. That gives us a field containing "var address = " and a field containing the raw address string; with cut we ask for the latter, the second field (the -f parameter). Redirecting this result into a file gives us the final product: a text file filled with addresses.
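To see the field splitting on its own, here is the cut step run over a single fabricated line; with the double quote as the delimiter, field 1 is everything before the first quote and field 2 is the address itself:
Code:
# With -d'"', field 1 is 'var address = ' and field 2 is the address string
echo 'var address = "123 Main St, Anytown, CA 90000";' | cut -d'"' -f2
# output: 123 Main St, Anytown, CA 90000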
If I use the Google API to geocode this, I could probably get away with using curl again, feeding a modified form of this file (built with awk) into another curl -K statement. Since the API is nothing but a web request, curl can handle that operation, and I could easily get the geocoded information I desire. That, however, will have to wait for another day, as I don't know the API, nor what the returned content will look like.
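Just as a rough sketch of what that might look like for a single address (I haven't checked the endpoint, parameters, or key requirements against Google's documentation, so treat all of that as assumptions to verify), the request itself would be nothing more exotic than:
Code:
# Hypothetical: URL-encode one address (-G, --data-urlencode) and send it to a
# geocoding endpoint. The endpoint, parameters, and any required API key are
# assumptions to check against Google's current documentation.
curl -G --data-urlencode "address=123 Main St, Anytown, CA" \
    "https://maps.googleapis.com/maps/api/geocode/json" > geocode_result.json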
I could also modify the above commands to collect more information. Only the last line needs to change, since there's no need to download the content all over again; line 2 already saved it. I could grep out other useful items listed on each page, such as the number of beds a hospital has. That just requires a more sophisticated regular expression, or multiple calls isolating each specific piece of content. The results could then be pasted together or joined in some other way (yes, paste and join are other Linux utilities I could make use of; see their manual pages for more details).
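For instance, if a second grep/cut pass produced a beds.txt (a hypothetical file) with one bed count per hospital, in the same order as addresses.txt, the two could be stitched together line by line:
Code:
# Combine the two files column-wise, one tab-separated record per hospital
paste addresses.txt beds.txt > hospitals.txt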
I hope this proves useful to anyone interested in web scraping (web mining). It's also an example of the power of curl and of how awesome Linux is. Without porting these sorts of utilities, Windows simply cannot keep pace!
EDIT: I added a quick summary of the data utilities Linux provides that I consider important, though it's probably not complete. There are further explanations and examples of awk and cut I've used around TalkStats before, but it would belabor the point to hunt them down and add them right now.