Still trying to learn to scrape

trinker

ggplot2orBust
#1
Alright, my scraping skills are getting better, but they're still babyish. If something's not in a nice table, I have a hard time getting the info out of the page. I'm trying to follow Bryan's example here (LINK) but not having much luck.

Here's the web page: http://www.statistics.com/resources/glossary/

I want to scrape the list of stats terms as a vector. As an end product, I'll be able to get rid of the letter subheadings and the "back to top" lines.

I don't care so much about the product here as the process. This is a learning thing. I've already simply cut and pasted the text in and used readLines to get it into R, but I want to learn this RCurl way. Bryan makes me jealous, and I know it's a handy skill.

Anyway, the point of impasse is in the RCurl attempt with getNodeSet. Bryan uses
Code:
"//pre//text()"
It doesn't work for me. Worse than that, I don't know what it's doing, and reading the help file isn't that helpful for me. Please help me understand how to do this so I can do it on my own in future situations.

Code:
library(RCurl)
library(XML)

URL <- "http://www.statistics.com/resources/glossary/"

#don't think this will work at all (not a table)
readHTMLTable(URL, which=1)

#produces something but I don't know what
readLines(URL)

# The RCurl way
doc     <- htmlTreeParse(URL, useInternalNodes = TRUE)
content <- getNodeSet(doc, "//pre//text()")  # comes up empty here: this page has no pre tags (Bryan's example did)
x <- content[10:length(content)]  # first 9 aren't required
y <- x[c(FALSE, TRUE)]  # grab strings
x <- x[c(TRUE, FALSE)]  # grab names


doc     <- htmlTreeParse(URL, useInternalNodes = TRUE)
content <- getNodeSet(doc, "//pre")
x <- xmlToList(content)  # fails: xmlToList() expects a single node, not a node set
 

bryangoodrich

Probably A Mammal
#2
I'll look at this later, but the "//pre//text()" is XPath.

http://www.w3schools.com/xpath/default.asp

The first part, "//pre", says "grab all 'pre' nodes in this document." In this case, we're talking about the HTML pre tags that contained the stuff I wanted in that example (look at the source code and see where the pre tags are). Just recognize that HTML can be treated as an XML document: one with a specific namespace specification that existed long before XML itself (XML being a more generic abstraction, giving structure to any type of document, not just web documents). Note that "//pre" grabs pre tags wherever they appear; to take only those under some other node, you'd write something like "//someNode//pre".

The second part, "//text()", executes a function on the node. As the XPath tutorial linked above details, there are ways to process your document elements. You can subset based on certain properties, look at attributes (e.g., a tag [noparse]<img src='some image url' height='400' width='400'>[/noparse] has 3 attributes specified on an img tag), or, as I do, access the text contained within the tag. Usually you have a tag pair [noparse]<someTag> ... blah blah ... </someTag>[/noparse]. That stuff between the opening and closing tags is what the XPath text() function returns, and that was all the content we needed to parse in that example.

So when you say "get me an HTML table," you're basically just looking for table nodes in the XML document and parsing the table row (tr) and table data (td) tag text for its information. You can do this manually, but tables are pretty standard (tabular), so XML comes with a function for it.
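To make that concrete, here's a toy sketch (mine, not from the original example) showing all three ideas on a tiny hand-made document: text() for a tag's content, attribute access, and the built-in table reader for the standard tabular case.

Code:
library(XML)

# A tiny document invented purely for illustration
html <- "<html><body>
  <pre>raw text content</pre>
  <img src='pic.png' height='400' width='400'/>
  <table><tr><td>a</td><td>b</td></tr></table>
</body></html>"
doc <- htmlParse(html, asText = TRUE)

getNodeSet(doc, "//pre//text()")                 # the text inside the pre tag
xpathSApply(doc, "//img", xmlGetAttr, "height")  # an attribute value: "400"
readHTMLTable(doc)                               # the standard (tabular) case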

Get it now?
 

bryangoodrich

Probably A Mammal
#3
What you need to do to figure out how to scrape this data is to look at the data you'll be parsing. View the document. What does the source code look like? Crap. That's what it looks like (line 182 is all one line, and huge!!). So, I'll add some newlines and highlight some important things.

Code:
[COLOR="red"]<h4>0-9</h4>[/COLOR]
<ul [COLOR="green"]class='glossaryList'[/COLOR]>
  <li><a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]&term_id=605'>2-Tailed vs. 1-Tailed Tests</a></li>
</ul>
<a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]#top' [COLOR="purple"]class='backLink'[/COLOR]>Back to top</a>

<a name='A'></a>[COLOR="red"]<h4>A</h4>[/COLOR]
<ul [COLOR="green"]class='glossaryList'[/COLOR]>
  <li><a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]&term_id=209'>A Priori Probability</a></li>
  <li><a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]&term_id=700'>Acceptance Region</a></li>
  <li><a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]&term_id=701'>Acceptance Sampling</a></li>
  <li><a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]&term_id=702'>Acceptance Sampling Plans</a></li>
  ... more of the same ...
  <li><a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]&term_id=713'>Average Group Linkage</a></li>
  <li><a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]&term_id=714'>Average Linkage Clustering</a></li>
</ul>
<a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]#top' [COLOR="purple"]class='backLink'[/COLOR]>Back to top</a>

<a name='B'></a>[COLOR="red"]<h4>B</h4>[/COLOR]
<ul [COLOR="green"]class='glossaryList'[/COLOR]>
  <li><a href='index.php?[COLOR="#a0522d"]page=glossary[/COLOR]&term_id=493'>Backward Elimination</a></li>
  ... and so on ...
These are some of the things that stood out to me as I considered how to parse this document. Why are they important? Given what I said above, we have specific nodes (tags) that represent the structure of this list. For instance, the unordered list (ul) nodes are what contain all the stuff we want. Notice also that all the other stuff, like the backLink class links (purple), is not within these ul nodes, even though those links share the same href attribute string page=glossary (gold). The things I've emphasized give us a very good basis for parsing this data quite easily.

A simple approach may be: "find all anchor (a) nodes within the glossaryList unordered list nodes and grab their text." That would probably be sufficient. Headers like A, B, C, etc., are header 4 (h4) tags, so if you wanted to access those, you could. They're not significant here, but you can imagine other scenarios where the headers would matter more than the content.
 

trinker

ggplot2orBust
#4
Thanks, BG, for taking the time to give a thorough explanation. I'll play more tomorrow and let you know how I make out and whether I need further direction.

Again, I appreciate the time you give.
 

bryangoodrich

Probably A Mammal
#5
Yeah, this isn't that hard at all. It requires just a few steps:

(1) Get the HTML document ready to be parsed for its XML content.
(2) Extract the desired nodes and the content we want from them.
(3) Create our vector from the above information.

Step (3) required a little bit of work, as you'll recall from the other thread, but it wasn't that hard once I figured it out (or you can use Dason's workaround). Step (1) is stupid easy, as it's just one command with the right parameter. The real work is in step (2), which requires you to understand the XPath needed to get exactly what you want. That is the logic I outlined above. I had to review the syntax to get my statement right, but between that and the previous thread's example, it wasn't hard at all. Below I will hide the solution I made.

Code:
[COLOR="green"]# The XPath string is highlighted in RED for your convenience. Translated to English, it's saying nothing more than
# Grab the 'ul' nodes in the document that are of class 'glossaryList', and within those grab the 
# 'a' nodes. From within those, return a pointer to their node text content. 
# From there, one needs only to convert that content (pointed to in memory) into an R usable data type. [/COLOR]
library(RCurl)
library(XML)

url   <- "http://www.statistics.com/resources/glossary/"  [COLOR="#708090"]# The scraping target[/COLOR]
doc   <- htmlTreeParse(url, useInternalNodes = TRUE)  [COLOR="#708090"]# Store the HTML document as parsed XML[/COLOR]
nodes <- getNodeSet(doc, "[COLOR="red"]//ul[@class='glossaryList']//a//text()[/COLOR]")  [COLOR="#708090"]# Extract the path to the node content we want[/COLOR]
x     <- sapply(nodes, xmlValue)  [COLOR="#708090"]# Convert that XML content into an R character vector[/COLOR]
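As a quick sanity check (my sketch, not part of the original spoiler), the letter headers (h4) and "Back to top" (backLink) anchors should be excluded automatically, since they sit outside the glossaryList ul nodes:

Code:
head(x)                  # should begin "2-Tailed vs. 1-Tailed Tests", "A Priori Probability", ...
any(x == "Back to top")  # expect FALSE: the backLink anchors live outside the ul nodes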
 

trinker

ggplot2orBust
#6
I can't mark it solved until I understand it. I haven't looked at the spoiler yet and don't plan to until I've either become really stuck or solved it.

I have class this morning so I'll look later today. Thanks BG.
 

trinker

ggplot2orBust
#7
I cheated. :( On the upside, I needed to cheat and learned by cheating. One question remains: what does the a stand for? Why a? Is this the anchor link node? If so, can you explain a bit more about this? Will it change from situation to situation?

The color syntax above was a nice touch for understanding. It seems like the only real thinking part of using RCurl is the getNodeSet part and what you supply to it.
 

bryangoodrich

Probably A Mammal
#8
Go back to the HTML

Code:
<h4>0-9</h4>
<ul class='glossaryList'>
  <li><[COLOR="purple"]a[/COLOR] href='index.php?page=glossary&term_id=605'>[COLOR="darkorange"]2-Tailed vs. 1-Tailed Tests[/COLOR]</[COLOR="purple"]a[/COLOR]></li>
</ul>
<a href='index.php?page=glossary#top' class='backLink'>Back to top</a>

<a name='A'></a><h4>A</h4>
<ul class='glossaryList'>
  <li><[COLOR="purple"]a[/COLOR] href='index.php?page=glossary&term_id=209'>[COLOR="darkorange"]A Priori Probability[/COLOR]</[COLOR="purple"]a[/COLOR]></li>
  <li><[COLOR="purple"]a[/COLOR] href='index.php?page=glossary&term_id=700'>[COLOR="darkorange"]Acceptance Region[/COLOR]</[COLOR="purple"]a[/COLOR]></li>
  <li><[COLOR="purple"]a[/COLOR] href='index.php?page=glossary&term_id=701'>[COLOR="darkorange"]Acceptance Sampling[/COLOR]</[COLOR="purple"]a[/COLOR]></li>
  <li><[COLOR="purple"]a[/COLOR] href='index.php?page=glossary&term_id=702'>[COLOR="darkorange"]Acceptance Sampling Plans[/COLOR]</[COLOR="purple"]a[/COLOR]></li>
  ... more of the same ...
  <li><[COLOR="purple"]a[/COLOR] href='index.php?page=glossary&term_id=713'>[COLOR="darkorange"]Average Group Linkage[/COLOR]</[COLOR="purple"]a[/COLOR]></li>
  <li><[COLOR="purple"]a[/COLOR] href='index.php?page=glossary&term_id=714'>[COLOR="darkorange"]Average Linkage Clustering[/COLOR]</[COLOR="purple"]a[/COLOR]></li>
</ul>
<a href='index.php?page=glossary#top' class='backLink'>Back to top</a>

<a name='B'></a><h4>B</h4>
<ul class='glossaryList'>
  <li><[COLOR="purple"]a[/COLOR] href='index.php?page=glossary&term_id=493'>[COLOR="darkorange"]Backward Elimination[/COLOR]</[COLOR="purple"]a[/COLOR]></li>
  ... and so on ...
Notice the anchor (a) tags. Their textual content is the set of names we want to catalog, correct? So to access them, we use the XPath statement that drills down the nodes to those anchor tags and grabs their content: //ul[@class='glossaryList']//a//text(). Notice how I only highlighted (purple) the anchor tags (nodes) that fit this definition, along with their content (orange).

That translates to "find any unordered list (ul) nodes with the attribute class='glossaryList' and subset only the anchor (a) nodes within those glossary list nodes. Finish by returning their textual content (i.e., the text between the anchor tags [noparse]<a ...> ... some text here ... </a>[/noparse])."

Every scraping situation is going to depend on the context of that situation. That context will define your method. Here our data was nicely embedded within the HTML, and the HTML nodes were attributed in a way that let a simple XPath statement grab what we want. In the prior case you linked to, everything we wanted was plain text within a 'pre' tag (node), and the approach required that we (1) grab that pre node section (the HTML within it is considered text, since a pre tag is similar to the 'noparse' BB code we use on this forum), then (2) parse that text. We used the formatted structure of the text to extract the information using grep and the like. A different situation requires a different approach.

Thus, to scrape successfully, you need to understand how you access the information, how that information is structured, and how you can use that structure to identify the elements you want. That is why I began this thread by focusing on how we could get at the information we wanted. I provided a basic logic that should work. It turned out I was right, and that logic was entirely encapsulated in that XPath expression.

PS: The first HTML coloring I did shows the ul classes (green) that fit the XPath definition. So think of it in those terms: we need to drill down within the nested structure of this HTML (XML) document to grab the text of the nodes that meet our specification. As I alluded to with the prior example, you may not always be able to directly grab that content as some node value. But you should be familiar with HTML and XML document structures and common HTML tags, be able to look at HTML source code to identify the structure (like I did here), and then use that to get "as close to" the data you want as you can. From there it may take additional processing, but the goal is to parse the data any way you can until you get it the way you want. The other problem that arises is access, and that will probably be something to do with RCurl (e.g., getting to the chatbox through authentication first).
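On that access point, a minimal sketch of the RCurl route (assuming a plain GET suffices; the userpwd mention is purely illustrative): fetch the raw HTML first, then parse it as text instead of handing htmlTreeParse the URL.

Code:
library(RCurl)
library(XML)

# Fetch the HTML with RCurl (add, e.g., userpwd = "user:pass" if a site demands it),
# then parse the text rather than the URL
txt <- getURL("http://www.statistics.com/resources/glossary/")
doc <- htmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)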
 

bryangoodrich

Probably A Mammal
#9
For fun, you may try the alternative logic my initial HTML coloring alluded to. I always think first of a brute-force method, and I was thinking:

I want those list elements that belong to the glossary, and I only want the anchor tags that link to (reference) a page specified by 'page=glossary'. Then I want to extract the xmlValue of those anchors. You can still use XPath to do this, I believe; it's just another approach. How would you use XPath to at least get close to this solution? Or could you simply grab the correct list elements and then grep the anchors to find those that meet the condition I specified? Then use grep to grab the contents between the anchor tags (i.e., manually do what the text() function does).

This just shows there's more than one way to skin a cat. But it begs the question: wtf are people doing skinning cats?!

PS: Is it the case that the only anchors within the glossaryList ul nodes are the ones of interest? Then the above logic for specifying "the correct anchor tags" is superfluous, no? These are all things you should consider when you reason about your approach. It may have been the case that there were other anchors within the ul nodes (e.g., maybe those "back to top" anchors sat inside the unordered list instead of outside it; not every HTML designer codes in a way that makes it easy for us!). In that case, you would have to do some additional parsing. You should first ask: can XPath help me narrow down that specification? E.g., can it let me search the anchor node attributes in a way that grabs only the anchors I want? If not, can I find a way to do it manually (e.g., with grep)? That is what I'm getting at. If you're really brave, mess up the HTML you're given: stick those class='backLink' "back to top" anchor tags inside the ul nodes, save it as a text document, pull that text document into R (instead of using RCurl, you're just parsing a text document with XML), and then attempt this new, more challenging problem.
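If you want a head start on that spec, here's one way it might be written in XPath (untested; contains() is XPath's string-matching function, applied here to the href attribute):

Code:
# Only anchors inside glossaryList uls whose href contains 'page=glossary'
nodes <- getNodeSet(doc, "//ul[@class='glossaryList']//a[contains(@href, 'page=glossary')]//text()")
x     <- sapply(nodes, xmlValue)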
 

trinker

ggplot2orBust
#10
Getting better and closer to understanding this. I have a new problem where I want to scrape what appears (from the HTML) to be a table. I can pull the emoticons but not the column with the meaning that corresponds to each emoticon. Here's what I have so far:

Code:
library(RCurl)
library(XML)

URL <- "http://pc.net/emoticons/"  # The scraping target
doc3   <- htmlTreeParse(URL, useInternalNodes = TRUE)  # Store the HTML document as parsed XML
nodes <- getNodeSet(doc3, "//td[@class='smiley']//a//text()")  # Extract the path to the node content we want
x     <- sapply(nodes, xmlValue)  # Convert that XML content into an R character vector
I'm getting better but this is a new situation.

Sample of the HTML:

Code:
<table>
<tr>
<td class="smiley"><a href="smiley/alien">(.V.)</a></td>
<td class="def">Alien</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/angel">O:-)</a></td>
<td class="def">Angel</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/angry">X-(</a></td>
<td class="def">Angry</td>
</tr>
</table>
<h3>B</h3>
<table>
<tr>
<td class="smiley"><a href="smiley/baby">~:0</a></td>
<td class="def">Baby</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/big_grin">:-D</a></td>
<td class="def">Big Grin</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/bird">(*v*)</a></td>
<td class="def">Bird</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/braces">:-#</a></td>
<td class="def">Braces</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/broken_heart">&lt;/3</a></td>
<td class="def">Broken Heart</td>
</tr>
</table>
<h3>C</h3>
<table>
<tr>
<td class="smiley"><a href="smiley/cat">=^.^=</a></td>
<td class="def">Cat</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/clown">*&lt;:o)</a></td>
<td class="def">Clown</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/confused">O.o</a></td>
<td class="def">Confused</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/confused">:-S</a></td>
<td class="def">Confused</td>
</tr>
<tr>
<td class="smiley"><a href="smiley/cool">B-)</a></td>
I'm going back through BG's explanations.
 

bryangoodrich

Probably A Mammal
#11
uh, what are you trying to get? It worked exactly as you described: "get the textual content of the anchors." It returned

Code:
 [1] "(.V.)"   "O:-)"    "X-("     "~:0"     ":-D"     "(*v*)"   ":-#"     "</3"    
 [9] "=^.^="   "*<:o)"   "O.o"     ":-S"     "B-)"     ":_("     ":'("     "QQ"     
[17] "\\:D/"   "*-*"     ":o3"     "#-o"     ":*)"     "//_^"    ">:)"     "<><"    
[25] ":-("     ":("      ":-("     "=P"      ":-P"     "8-)"     "$_$"     ":->"    
[33] "=)"      ":-)"     ":)"      "#"       "<3"      "{}"      ":-|"     "X-p"    
[41] ":-)*"    ":-*"     "(-}{-)"  "=D"      ")-:"     "(-:"     "<3"      "=/"     
[49] ":-)(-:"  "@"       "<:3)~"   "~,~"     ":-B"     "^_^"     "<l:0"    ":-/"    
[57] "=8)"     "@~)~~~~" "=("      ":-("     ":("      ":-7"     ":-@"     "=O"     
[65] ":-o"     ":-)"     ":)"      ":-Q"     ":>"      ":P"      ":o"      ":-J"    
[73] ":-&"     "=-O"     ":-\\"    ":-E"     "=D"      ";-)"     ";)"      "|-O"    
[81] "8-#"
If you want the name, then what node does it belong to? It belongs to the table data (td) of class "def". That's it. The first one was an example of "get the text of an anchor node inside a table data of class 'smiley.'" Instead, you want "get the text of a table data of class 'def.'"

Get it? Change your XPath and you're done.

In other words, drop the "//a", since the names aren't inside links (anchor tags/nodes), and change the td class to 'def', and you're done. Remember, your goal is to find a path through the XML document to the data you want. The data you want is going to be either an attribute of a node, the content within a node, or something inside that content. The last case requires secondary processing, but the last two are accessed the same way: you grab the text of the nodes you want. There's also an XPath function to grab attributes you specify (which may or may not require secondary processing; I've never done it yet).

Therefore, when you look at the table HTML, you need to think "what node contains what I want?" Then make your XPath expression get to it. You need to view the document like a network in some respects (or a tree). One branch was table :: tr :: td@smiley :: a. Another was table :: tr :: td@def. In either case, you only wanted the text as-is within those nodes, so the XPath is quite obvious, no?

If you want to be creative, find a way to create the data frame of smiley and name in one go. I'll give it a try after I write this paper, but I would assume there's a creative way to do it instead of extracting the vectors individually and then pasting them together. Maybe not. Will find out!
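Here's one untested sketch of that idea (it assumes every row pairs exactly one 'smiley' td with one 'def' td, so the two vectors line up):

Code:
smiley <- xpathSApply(doc3, "//td[@class='smiley']//a", xmlValue)  # the emoticon strings
def    <- xpathSApply(doc3, "//td[@class='def']", xmlValue)        # their meanings
emoticons <- data.frame(smiley, def, stringsAsFactors = FALSE)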
 

trinker

ggplot2orBust
#12
Thanks BG! On my own (no spoiler):

Code:
nodes2 <- getNodeSet(doc3, "//td[@class='def']//text()")  # Extract the path to the node content we want
x2     <- sapply(nodes2, xmlValue)  # Convert that XML content into an R character vector
The problem was where I was trying to extract the information from:
Code:
<td class="smiley"><a href="smiley/[COLOR="red"]angry[/COLOR]">X-(</a></td>
<td class="def">[COLOR="blue"]Angry[/COLOR]</td>
I tried to pull it from the red rather than the blue.

The extra credit Professor BG gave is definitely above my skill set but would certainly be of interest.

EDIT: I looked at your spoiler. Still above me. I always learn by tearing apart what you're doing and trying to apply it to similar and new situations.
 

bryangoodrich

Probably A Mammal
#13
lol I didn't even notice the links had the names in them, but that's because it was unimportant. If the [noparse]<td class="def">Angry</td>[/noparse] didn't exist, you could extract the "smiley/angry" href attributes, convert them to a vector, split each character string on "/", and keep only the 2nd part of each split. Make sense?
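A sketch of that fallback (untested, reusing doc3 from the emoticon example):

Code:
# Pull the href attributes, e.g. "smiley/angry", then keep the part after the "/"
hrefs        <- xpathSApply(doc3, "//td[@class='smiley']/a", xmlGetAttr, "href")
smiley_names <- sapply(strsplit(hrefs, "/"), `[`, 2)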

Like I said, the document is a tree structure

Code:
// Locate smiley name definition
html
    --> body
        --> table
            --> tr
                --> td@class="def"
                    --> text()

// Locate smiley text
html
    --> body
        --> table
            --> tr
                --> td@class="smiley"
                    --> a
                        --> text()
Make sense? There's more crap in the document, but given the way the nodes (tags) are nested, this is the "route" or "path" we traverse to get to it. I posted the full path to the objects of interest. Using XPath, we can shorten that up. In particular, we're only after the table data (td) nodes of the given classes, so we don't need to find a path across the whole document; we start right at those nodes: //td[@class="def"]. Recognize that if there were another table with a td node of the same class, we'd be pulling from that other table, too, and we may not want that table. Then we might have to do something like //table[1] to grab only the first table (note that XPath positions count from 1, not 0).
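(A hypothetical aside, since we don't actually need it here: strictly, //table[1] means "every table that is the first table child of its parent," so to take the first table in the whole document you'd parenthesize first.)

Code:
# Restrict the search to the document's first table
getNodeSet(doc3, "(//table)[1]//td[@class='def']//text()")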

The thing to recognize is that XML stores information, but it's information stored in a certain way. That way is a "document" that follows a sort of tree structure. I use the tree representation when we think of "paths" to the nodes of interest. In this respect, the only leaf is the node of interest, and every other node is a branch we traverse to get to that end (the 'leaf'). Make sense? If not, I can start a Google presentation and draw it out for you :p

EDIT: Actually, looking at the original HTML, there are multiple tables. So the XPath we're using is best, because we want all the table data (td) nodes of the given class across all tables. The web designer did a good job giving each td type a different class ('def' and 'smiley'); to be really rigorous, they would also have given each table a class, probably the letter group it holds, e.g., class="A", class="B", etc.
 

bryangoodrich

Probably A Mammal
#15
Good. Then notice that the nodes (leaves) we're interested in are spread across many branches, viz., the various tables that each contain a subset of the list of smileys. So the XPath we're using to grab all the td nodes of the given class cuts across ('prunes') those multiple branches, giving us access to the content we want. This is clearly a different approach than if we had done something like

Code:
//table[1:last()]//td[@class="def"]//text()  # That subset notation is NOT correct!
The above would say "go through each table from first to last, grab their td elements of class 'def', and return their textual content." See the difference between that and the pruning we did? They would return the same stuff, but there might be a difference both in performance (if only slight) and in how it is done conceptually. Concepts matter! We don't even need to specify the table; it's superfluous. We might have had to go that route if the web designer had NOT given class names to the td nodes. For instance, suppose they had no class, but we only wanted the second td in each row. Then something like the above might be what we MUST do, except it would be "//table[*]//td[2]//text()".

We might do a pruning across branches if we only want the 2nd td node (each table row is expected to have 2 td children). How? Think about R. If we had all the td nodes as a vector, how would we grab every 2nd one? Probably with something like x[c(F, T)], right? Well, XPath lets you do math, so you should be able to say something like "is this position divisible by 2?" Then it should work. Alternatively, and probably easier: if the only table row (tr) nodes that exist are smiley records, then we could say "grab all table rows and return the textual content of their 2nd table data node." In XPath, what does that mean?

//tr//td[2]//text() -- Easy, right?

It'd probably be smart if I actually checked that the above DOES work, because I could be full of **** lol
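Here's a quick way to check both routes (my untested sketch, reusing doc3 from the emoticon example; the position() math is the "divisible by 2" idea):

Code:
a <- sapply(getNodeSet(doc3, "//td[@class='def']//text()"), xmlValue)          # class-based route
b <- sapply(getNodeSet(doc3, "//tr//td[2]//text()"), xmlValue)                 # positional route
d <- sapply(getNodeSet(doc3, "//td[position() mod 2 = 0]//text()"), xmlValue)  # the math route
identical(a, b)  # TRUE if every row's 2nd td is a def cell
identical(a, d)  # TRUE if def cells are always the even-positioned td children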
 

trinker

ggplot2orBust
#16
While this is a new problem, I like the idea of keeping all my scraping questions in one location. I decided that to get a transcript of dialogue for my qdap package I'll use a play (I can't really use real classroom dialogue because it's not going to get IRB approval). I decided, why not use Shakespeare's Romeo and Juliet (found here LINK)? I found it freely available on a website. Now for the challenge of web scraping it.

I want to wind up with 5 data frames (one for each act)

Each data frame will have 2 columns:
  1. person
  2. dialogue

Now for my attempt to begin the process. I understood things when there was a class; there is no class anymore, so the getNodeSet part gets me. Help, please.

Code:
library(RCurl)
library(XML)

URL <- "http://shakespeare.mit.edu/romeo_juliet/full.html"

doc3   <- htmlTreeParse(URL, useInternalNodes = TRUE)  # Store the HTML document as parsed XML
nodes <- getNodeSet(doc3, "//a[@name='speech']//b//text()")  # Extract the path to the node content we want
x     <- sapply(nodes, xmlValue)  # Convert that XML content into an R character vector
 

trinker

ggplot2orBust
#17
I know the problem is that I need to include the numbers, but I don't know how because they're all different. In other words:

Code:
nodes <- getNodeSet(doc3, "//a[@name='speech64']//b//text()")
extracts some of the names. How do I do this, since every speech number is different?
 

trinker

ggplot2orBust
#18
Code:
nodes <- lapply(1:65, function(i) {
    getNodeSet(doc3, paste0("//a[@name='speech", i, "']//b//text()"))
})
This is closer but still a ways off, in that each speech number is used at different times. I'm not sure how to separate it. I don't think it's broken up by act in the HTML code.
 

bryangoodrich

Probably A Mammal
#20
Did you go back and read that SHORT xpath document?!

http://www.w3schools.com/xpath/xpath_syntax.asp

What's toward the bottom there? That's right, wildcards!

With that said, I can't seem to apply it effectively myself yet. I'm just trying to make sure the structure of this document is being parsed correctly:

Code:
getNodeSet(doc, "//a[@name"])  [COLOR="green"]# grab all anchor (a) nodes with name attributes (includes both 'speech' and 'x.x.x' named nodes)[/COLOR]
getNodeSet(doc, "//a")  [COLOR="green"]# grab all anchor (a) nodes, regardless of attributes. There are 3935 of of these, 2 more than the above.[/COLOR]
getNodeSet(doc, "//a/@name")  [COLOR="green"]# returns a node set of the attribute values themselves. Can run logical tests against these.[/COLOR]
What I understand here is that you use wildcards to basically say "give me everything." I couldn't get XPath to do a partial text match here the way SQL's "LIKE" operator would (though XPath 1.0 does have string functions like starts-with() and contains(), so it may be possible). Instead, you may need to run grep on a node set you get above, like the last one I list. You can do the text processing yourself, get the positions, and then pull those nodes out manually or grab the associated texts.

I used grepl to create a vector of logical matches for "speech" (good enough). I then tried these:

Code:
getNodeSet(doc, "//a[@name]/text()")  [COLOR="green"]# Only grabs nodes that have direct text, not further nodes--no "speech" nodes grabbed[/COLOR]
getNodeSet(doc, "//a[@name]/b/text()")  [COLOR="green"]# Returns the other 840 nodes not captured above[/COLOR]
getNodeSet(doc, "//a[@name]/text() | //a[@name]/b/text()")  [COLOR="green"]# Grab them both at the same time, but are they in order?[/COLOR]
As the XPath tutorial explains, you can use that "or" operator (|) to grab two node sets at the same time. Since I also checked the length of each of these node sets and they appear disjoint (3093 and 840, respectively), I'm basically combining them to get the full node set I checked earlier (3933 nodes). So, combining the grepl check for "speech" on the "//a/@name" node set and keeping that logical vector, I can index the last (combined) node set above with it. What do I find? It's also of length 840. The real question is, does it accurately extract only the speech names I wanted? It appears so.

Code:
library(RCurl)
library(XML)

url   <- "http://shakespeare.mit.edu/romeo_juliet/full.html"
doc   <- htmlTreeParse(url, useInternalNodes=TRUE)

x     <- getNodeSet(doc, "//a/@name")  [COLOR="green"]# length(x) = 3933[/COLOR]
x     <- grepl("speech", x)  [COLOR="green"]# length(which(x)) = 840[/COLOR]
nodes <- getNodeSet(doc, "//a[@name]/text() | //a[@name]/b/text()")  [COLOR="green"]# length(nodes) = 3933[/COLOR]
speakers <- nodes[x]  [COLOR="green"]# length(speakers) = 840[/COLOR]
speeches <- nodes[!x]  [COLOR="green"]# length(speeches) = 3093[/COLOR]
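From there (my addition, not in the original post), the pointers convert to character vectors as in the earlier examples; XPath 1.0's starts-with() might even replace the grepl step entirely, though I haven't tested it here:

Code:
speaker_names <- sapply(speakers, xmlValue)  # the 840 speaker labels
speech_lines  <- sapply(speeches, xmlValue)  # the spoken text

# Untested alternative: let XPath do the partial match itself
# nodes <- getNodeSet(doc, "//a[starts-with(@name, 'speech')]/b/text()")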