Scrape text from a website

trinker

ggplot2orBust
#1
I have used R (the XML package) to grab tables from websites, but they were always in nice, neat formats. Let's take this page:

http://stat.ethz.ch/R-manual/R-patched/library/

How could I grab the table here?

I tried:

Code:
library(XML)
URL <- "http://stat.ethz.ch/R-manual/R-patched/library/"
Table <- readHTMLTable(URL,
    colClasses = rep("character", 4),
    which=1
)

Table
 

Dason

Ambassador to the humans
#2
That's not actually a table on the website. What information do you want to grab exactly?

If you want to see the source code:

Code:
URL <- "http://stat.ethz.ch/R-manual/R-patched/library/"
j <- readLines(URL)
 

trinker

ggplot2orBust
#3
Yeah and then do some cleaning. That works. Thanks Dason.

BG may have a fancier way as he's pretty talented in this area.
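For what it's worth, here's a rough sketch of the readLines-plus-cleaning route. I'm hardcoding two sample lines in place of the live page so it runs offline, and the markup is my guess at what the listing's HTML looks like; the tag-stripping regex would apply the same way to the real lines:

```r
# Two sample lines in the style of the directory listing's HTML
# (hypothetical markup -- the real page may differ slightly)
raw <- c(
  '<img src="/icons/folder.gif" alt="[DIR]"> <a href="MASS/">MASS/</a>                   22-Apr-2003 09:47    -',
  '<img src="/icons/text.gif" alt="[TXT]"> <a href="R.css">R.css</a>                    08-Nov-2010 02:47  1.2K'
)

# Strip the HTML tags, then drop leading whitespace
clean <- gsub("<[^>]+>", "", raw)
clean <- sub("^\\s+", "", clean)
clean
```

With readLines(URL) you'd do the same on the real lines, after dropping the header and footer rows.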
 

Dason

Ambassador to the humans
#4
Well you never mentioned exactly what you want. It wouldn't be too bad to parse that and grab what you want. And depending on what you want it would probably be pretty simple to just use the XML package to extract the info you want.
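One XML-package route might look like the following, sketched on a hardcoded fragment rather than the live page (the assumed markup is a pre element full of anchors, which is how these Apache-style listings are usually rendered):

```r
library(XML)

# A minimal stand-in for the listing's markup (assumed structure)
html <- '<html><body><pre>
<a href="MASS/">MASS/</a>                   22-Apr-2003 09:47    -
<a href="R.css">R.css</a>                   08-Nov-2010 02:47  1.2K
</pre></body></html>'

doc  <- htmlParse(html, asText = TRUE)
pkgs <- xpathSApply(doc, "//pre/a", xmlValue)
pkgs  # the link names, e.g. "MASS/" and "R.css"
```

Swapping the hardcoded string for the URL should work the same way, since htmlParse accepts URLs directly.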
 

trinker

ggplot2orBust
#5
I'd like to get this:

Code:
 Name                    Last modified      Size  Description
 MASS/                   22-Apr-2003 09:47    -   
 Matrix/                 21-Apr-2009 12:52    -   
 R.css                   08-Nov-2010 02:47  1.2K  
 base/                   22-Apr-2003 09:47    -   
 boot/                   22-Apr-2003 09:47    -   
 class/                  22-Apr-2003 09:47    -   
 cluster/                22-Apr-2003 09:47    -   
 codetools/              25-Apr-2007 01:07    -   
 compiler/               06-Aug-2011 22:08    -   
 datasets/               12-Oct-2004 10:08    -   
 foreign/                22-Apr-2003 09:47    -   
 grDevices/              12-Oct-2004 10:08    -   
 graphics/               13-Apr-2004 01:07    -   
 grid/                   26-May-2004 18:00    -   
 lattice/                22-Apr-2003 09:47    -   
 methods/                22-Apr-2003 09:47    -   
 mgcv/                   22-Apr-2003 09:47    -   
 nlme/                   22-Apr-2003 09:47    -   
 nnet/                   22-Apr-2003 09:47    -   
 parallel/               31-Mar-2012 03:07    -   
 rcompgen/               25-Apr-2007 01:08    -   
 rpart/                  22-Apr-2003 09:47    -   
 spatial/                22-Apr-2003 09:47    -   
 splines/                22-Apr-2003 09:47    -   
 stats/                  13-Apr-2004 01:08    -   
 stats4/                 13-Apr-2004 01:08    -   
 survival/               04-Nov-2010 03:07    -   
 tcltk/                  22-Apr-2003 09:47    -   
 tools/                  22-Apr-2003 09:47    -   
 utils/                  06-Aug-2011 22:08    -
 

bryangoodrich

Probably A Mammal
#6
Here's something I've worked on to get it into R data types that can be manipulated. The problem is that it's very unstructured and I still can't make complete sense of it, but the data is at least in a list at this point. There may be a better way to go about this, but the trouble is that the actual content isn't enclosed in any tags! So I can't say "get me the node named img", because that doesn't get me the line; it just returns the img tag itself. I should be able to use that get-node approach to access the anchors, though.

Code:
library(RCurl)
library(XML)

url = "http://stat.ethz.ch/R-manual/R-patched/library/"
doc = htmlTreeParse(url, useInternalNodes = TRUE)
content = getNodeSet(doc, "//pre")[[1]]
x = xmlToList(content)

getNodeSet(doc, "//pre//a//text()")  # returns the link entities
getNodeSet(doc, "//pre//text()")  # returns everything parsed into strings
Apparently the XPath function text() will return as strings the content between the anchor tags, but there's more there than anticipated. You'll want to skip the first few, which are used in the titles.

I now expand that use of text() to the entire pre document. This will get everything, even the unnamed strings to the right of the anchors. Thus, you can manipulate that to get the data.

Code:
library(RCurl)
library(XML)

url     = "http://stat.ethz.ch/R-manual/R-patched/library/"
doc     = htmlTreeParse(url, useInternalNodes = TRUE)
content = getNodeSet(doc, "//pre//text()")
content[44:45]

x = content[10:length(content)]  # first 9 aren't required
y = x[c(FALSE, TRUE)]  # grab strings
x = x[c(TRUE, FALSE)]  # grab names
Now you can parse out the name "mgcv/" and the string " 22-Apr-2003 09:47 - " to get the contents. You'll have to figure out how to do this for all the contents as I don't know exactly how it's structured. I'd probably first manipulate it into a data frame with a column of names and a column of the strings. Then parse the strings due to their fixed widths.

Edit: Okay, I've now got vectors of names and strings. The problem is, these aren't really in R; they're external pointers. If there's a way (and there should be) to make these into R objects (vectors, namely), then we can convert x and y, put them into a data frame, transform the data frame with some helper functions designed to operate on them, and we're done!
 

trinker

ggplot2orBust
#7
Thanks bryan!

Alright now I really have to step back and evaluate every step of what you did and add it to my tool box.

Thanks a bunch :)
 

bryangoodrich

Probably A Mammal
#8
The problem I'm facing with my approach is that the variables x and y always point to external things, so you can't really DO anything to them. I'll have to rethink this approach.
 

bryangoodrich

Probably A Mammal
#9
This would be much easier in Python, which approaches XML with a standard DOM-style API (e.g., you usually use functions like getElementsByTagName(...)) that DTL apparently chose not to follow. Here, instead, you use getNodeSet with an XPath statement to get nodes, but then there are no apparent functions to DO something with those nodes that I can see.

Correction! I can use xmlValue on individual nodes. Let me work on this, update in a second.

Code:
library(RCurl)
library(XML)

url     = "http://stat.ethz.ch/R-manual/R-patched/library/"
doc     = htmlTreeParse(url, useInternalNodes = TRUE)
content = getNodeSet(doc, "//pre//text()")
content = sapply(content, xmlValue)  # Return values as R internals
content = content[10:length(content)]  # subset for relevant data
content = data.frame(x = content[c(T, F)], y = content[c(F, T)], stringsAsFactors = FALSE)
produces

Code:
             x                                              y
1  KernSmooth/                    22-Apr-2003 09:47    -   \n
2        MASS/                    22-Apr-2003 09:47    -   \n
3      Matrix/                    21-Apr-2009 12:52    -   \n
4        R.css                    08-Nov-2010 02:47  1.2K  \n
5        base/                    22-Apr-2003 09:47    -   \n
6        boot/                    22-Apr-2003 09:47    -   \n
7       class/                    22-Apr-2003 09:47    -   \n
8     cluster/                    22-Apr-2003 09:47    -   \n
9   codetools/                    25-Apr-2007 01:07    -   \n
10   compiler/                    06-Aug-2011 22:08    -   \n
11   datasets/                    12-Oct-2004 10:08    -   \n
12    foreign/                    22-Apr-2003 09:47    -   \n
13  grDevices/                    12-Oct-2004 10:08    -   \n
14   graphics/                    13-Apr-2004 01:07    -   \n
15       grid/                    26-May-2004 18:00    -   \n
16    lattice/                    22-Apr-2003 09:47    -   \n
17    methods/                    22-Apr-2003 09:47    -   \n
18       mgcv/                    22-Apr-2003 09:47    -   \n
19       nlme/                    22-Apr-2003 09:47    -   \n
20       nnet/                    22-Apr-2003 09:47    -   \n
21   parallel/                    31-Mar-2012 03:07    -   \n
22   rcompgen/                    25-Apr-2007 01:08    -   \n
23      rpart/                    22-Apr-2003 09:47    -   \n
24    spatial/                    22-Apr-2003 09:47    -   \n
25    splines/                    22-Apr-2003 09:47    -   \n
26      stats/                    13-Apr-2004 01:08    -   \n
27     stats4/                    13-Apr-2004 01:08    -   \n
28   survival/                    04-Nov-2010 03:07    -   \n
29      tcltk/                    22-Apr-2003 09:47    -   \n
30      tools/                    22-Apr-2003 09:47    -   \n
31      utils/                    06-Aug-2011 22:08    -   \n
 

bryangoodrich

Probably A Mammal
#11
We can now create functions to parse the internals x and y, respectively. The first is rather easy:

Code:
substring(x, 1, nchar(x)-1)
This will remove the trailing "/" from those strings. The 'y' strings will be more difficult, as they contain a number of fields we want to extract. That will take a little creativity I'm not going to deal with here. It isn't particularly difficult; it's just string manipulation.
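To sketch what that string manipulation might look like, here's one way to split a couple of the 'y' values shown above into date, time, and size fields (assuming the whitespace-separated layout holds for every row):

```r
# Two sample 'y' values copied from the output above
y <- c("                    22-Apr-2003 09:47    -   \n",
       "                    08-Nov-2010 02:47  1.2K  \n")

# Trim surrounding whitespace, then split on runs of whitespace
parts <- strsplit(gsub("^\\s+|\\s+$", "", y), "\\s+")
info  <- do.call(rbind, parts)
colnames(info) <- c("date", "time", "size")
info
```

From there, as.Date(info[, "date"], format = "%d-%b-%Y") would give proper dates, at least in an English locale.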
 

bryangoodrich

Probably A Mammal
#12
Code:
x <- sapply(x, as, "character")
Does that help?
Actually, it does seem to work. It makes me wonder how exactly it's doing that, though, since direct conversions didn't seem to be happening.

Code:
as.character(content[[44]])
Error in as.vector(x, "character") : 
  cannot coerce type 'externalptr' to vector of type 'character'
Even when calling sapply on "content", I'm pretty sure it's running the XML package's version behind the scenes.
 

trinker

ggplot2orBust
#13
I didn't know you could do
Code:
x <- sapply(x, as, "character")
. Why does this work? Is as a function? ......a short time later...... Just checked it. Yep it is.
 

Dason

Ambassador to the humans
#14
Yeah I don't know why as.character(x[[1]]) doesn't work but as(x[[1]], "character") does. I just remembered that there are a few edge cases (this apparently being one of them) where using as works and as.*** doesn't.
 

bryangoodrich

Probably A Mammal
#16
Trinker, changing the class won't do anything, because the object you're actually dealing with in R is just a pointer: a piece of information that says "that spot in memory is what you want." R needs to take the value of that stuff as we see it in the terminal and return it to the R environment. This is why I used sapply with the xmlValue function: it operates on the C pointers pointing at that external object in memory and retrieves its value as an XML object (i.e., as one of those class entities). So merely changing the class wouldn't do anything, and you can't change its class anyway (I tried).

My guess is that 'as' is being overloaded by some other function in the package. Either that, or as is just a really cool low-level function capable of doing something right somehow (unlikely).
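The overloading guess can be illustrated with a toy S4 class (the Wrapped class below is made up): as() dispatches on coerce methods registered with setAs(), which any package can define for its classes, while as.character() knows nothing about them:

```r
library(methods)

# A made-up S4 class wrapping a string, to mimic the situation
setClass("Wrapped", representation(value = "character"))

# Register a coerce method: this is what makes as(w, "character") work
setAs("Wrapped", "character", function(from) from@value)

w <- new("Wrapped", value = "hello")

as(w, "character")    # "hello" -- found via the setAs() method
try(as.character(w))  # errors: as.character() has no method for this class
```

Presumably the XML package registers a similar coercion for its node classes, which would explain why as works where as.character doesn't.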