
Thread: Still trying to learn to scrape

  #1 trinker

    Still trying to learn to scrape




    Alright, my scraping skills are getting better, but they're still babyish. If something's not in a nice table, I have a hard time getting the info out of the page. I'm trying to follow Bryan's example here (LINK) but not having much luck.

    Here's the web page: http://www.statistics.com/resources/glossary/

    I want to scrape the list of stats terms as a vector. As an end product, I'll be able to get rid of the letter subheadings and the "Back to top" lines.

    I don't care so much about the product here as the process. This is a learning thing. I've already simply cut and pasted the text in and used readLines to get it into R. But I want to learn this RCurl approach. Bryan makes me jealous, and I know it's a handy skill.

    Anyway, the point of impasse is the RCurl attempt with getNodeSet. Bryan uses
    Code: 
    "//pre//text()"
    It doesn't work for me. Worse than that, I don't know what it's doing, and reading the help file isn't helping me much. Please help me understand how to do this so I can do it on my own in future situations.

    Code: 
    library(RCurl)
    library(XML)
    
    URL <- "http://www.statistics.com/resources/glossary/"
    
    # Don't think this will work at all (not a table)
    readHTMLTable(URL, which = 1)
    
    # Produces something, but I don't know what
    readLines(URL)
    
    # The RCurl way
    doc     <- htmlTreeParse(URL, useInternalNodes = TRUE)
    content <- getNodeSet(doc, "//pre//text()")
    x <- content[10:length(content)]  # first 9 aren't required
    y <- x[c(FALSE, TRUE)]            # grab strings
    x <- x[c(TRUE, FALSE)]            # grab names
    
    doc     <- htmlTreeParse(URL, useInternalNodes = TRUE)
    content <- getNodeSet(doc, "//pre")
    x <- xmlToList(content[[1]])      # xmlToList() wants a single node, not the whole node set
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  #2 bryangoodrich

    Re: Still trying to learn to scrape

    I'll look at this later, but the "//pre//text()" is XPath.

    http://www.w3schools.com/xpath/default.asp

    The first part "//pre" says "grab all 'pre' nodes in this document." In this case, we're talking about the HTML pre tags that contained the stuff I wanted in that example (look at the source code and see where the pre tags are). Just recognize that HTML is an XML document: an XML document with a specific namespace specification, one that existed long before XML (XML being a more generic abstraction of HTML meant to give structure to any type of document, not just web documents). I'm also saying "grab all pre tags," not just those that sit under some other node (e.g., "//someNode//pre").

    The second part "//text()" executes a function on the node. As the XPath tutorial link will detail somewhere, there are ways to process your document elements. You can subset based on certain properties, look at attributes (e.g., a tag <img src='some image url' height='400' width='400'> has 3 attributes specified on an img tag), or, as I do, access the text contained within the tag. Usually you have a tag <someTag> ... blah blah ... </someTag>. That stuff between the opening and closing tags is what the XPath text() function will return, and that was all the content we needed to parse in that example.

    So when you say "get me an HTML table," you're basically just looking for table nodes in the XML document and parsing the table row (tr) and table data (td) tag text for its information. You can do this manually, but it's pretty standard (tabular), so the XML package comes with a function for it (readHTMLTable).
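    To make that concrete, here is a minimal sketch of those two ideas in R (the URL is just this thread's glossary page; the href line is only there to illustrate attribute access and is not part of the original example):
    Code: 
    library(XML)
    doc <- htmlTreeParse("http://www.statistics.com/resources/glossary/", useInternalNodes = TRUE)
    getNodeSet(doc, "//pre//text()")             # all text nodes under any pre tag
    xpathSApply(doc, "//a", xmlGetAttr, "href")  # attribute access: every anchor's href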

    Get it now?


  #3 bryangoodrich

    Re: Still trying to learn to scrape

    To figure out how to scrape this data, you need to look at the data you'll be parsing. View the document. What does the source code look like? Crap. That's what it looks like (line 182 is all one line, and huge!!). So I'll add some newlines and highlight some important things.

    Code: 
    <h4>0-9</h4>
    <ul class='glossaryList'>
      <li><a href='index.php?page=glossary&term_id=605'>2-Tailed vs. 1-Tailed Tests</a></li>
    </ul>
    <a href='index.php?page=glossary#top' class='backLink'>Back to top</a>
    
    <a name='A'></a><h4>A</h4>
    <ul class='glossaryList'>
      <li><a href='index.php?page=glossary&term_id=209'>A Priori Probability</a></li>
      <li><a href='index.php?page=glossary&term_id=700'>Acceptance Region</a></li>
      <li><a href='index.php?page=glossary&term_id=701'>Acceptance Sampling</a></li>
      <li><a href='index.php?page=glossary&term_id=702'>Acceptance Sampling Plans</a></li>
      ... more of the same ...
      <li><a href='index.php?page=glossary&term_id=713'>Average Group Linkage</a></li>
      <li><a href='index.php?page=glossary&term_id=714'>Average Linkage Clustering</a></li>
    </ul>
    <a href='index.php?page=glossary#top' class='backLink'>Back to top</a>
    
    <a name='B'></a><h4>B</h4>
    <ul class='glossaryList'>
      <li><a href='index.php?page=glossary&term_id=493'>Backward Elimination</a></li>
      ... and so on ...
    These are some of the things that appealed to me as I was trying to parse this document. Why are these important? Given what I said above, we have specific nodes (tags) that represent the structure of this list. For instance, the unordered list (ul) nodes are what contain all the stuff we want. Notice also that all the other stuff, like the backLink class links (purple), is not within these ul nodes, even though those links share the link (href) attribute string page=glossary (gold). The things I've emphasized give us a very good basis to parse this data quite easily. A simple approach would be "find all anchor link (a) nodes within the glossaryList unordered list nodes and grab their text." That would probably be sufficient. Headers like A, B, C, etc., are header 4 (h4) tags, so if you wanted to access those, you could. Not significant here, but you can imagine other scenarios where the headers would be more important than the content.


  #4 trinker

    Re: Still trying to learn to scrape

    Thanks BG for taking the time to give a thorough explanation. I'll play more tomorrow and let you know how I make out and if I need further direction.

    Again, I appreciate the time you give.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  #5 bryangoodrich

    Re: Still trying to learn to scrape

    Yeah, this isn't that hard at all. It requires just a few steps:

    (1) Get the HTML document ready to be parsed for its XML content.
    (2) Extract the desired nodes and the content we want from them.
    (3) Create our vector from the above information.

    Step (3) required a little bit of work, as you'll recall from the other thread, but it's not that hard once I figured it out (or use Dason's workaround). Step (1) is stupid easy, as it's just one command with the right parameter. The real work is in step (2), which requires you to understand the XPath needed to get exactly what you want. That is the logic I outlined above. I had to review the syntax to get my statement right, but from there and the previous thread example, it wasn't hard at all. Below I've hidden the solution I made.
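    For reference, step (1) really is the one command from the first post; useInternalNodes is the parameter that matters:
    Code: 
    library(XML)
    # Parse the HTML into an internal XML tree we can run XPath queries against
    doc <- htmlTreeParse(URL, useInternalNodes = TRUE)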

    Spoiler:

  #6 trinker

    Re: Still trying to learn to scrape

    I can't mark it solved until I understand it. I haven't looked at the spoiler yet and plan not to until I've really become stuck or I've solved it.

    I have class this morning so I'll look later today. Thanks BG.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  #7 trinker

    Re: Still trying to learn to scrape

    I cheated. On the upside, I needed to cheat and learned by cheating. One question remains: what does the a stand for? Why a? Is this the anchor link node? If so, can you explain a bit more about this? Will this change from situation to situation?

    The color syntax above was a nice touch for understanding. It seems like the only thinking part of using RCurl is that getNodeSet part and what you supply to it.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  #8 bryangoodrich

    Re: Still trying to learn to scrape

    Go back to the HTML:

    Code: 
    <h4>0-9</h4>
    <ul class='glossaryList'>
      <li><a href='index.php?page=glossary&term_id=605'>2-Tailed vs. 1-Tailed Tests</a></li>
    </ul>
    <a href='index.php?page=glossary#top' class='backLink'>Back to top</a>
    
    <a name='A'></a><h4>A</h4>
    <ul class='glossaryList'>
      <li><a href='index.php?page=glossary&term_id=209'>A Priori Probability</a></li>
      <li><a href='index.php?page=glossary&term_id=700'>Acceptance Region</a></li>
      <li><a href='index.php?page=glossary&term_id=701'>Acceptance Sampling</a></li>
      <li><a href='index.php?page=glossary&term_id=702'>Acceptance Sampling Plans</a></li>
      ... more of the same ...
      <li><a href='index.php?page=glossary&term_id=713'>Average Group Linkage</a></li>
      <li><a href='index.php?page=glossary&term_id=714'>Average Linkage Clustering</a></li>
    </ul>
    <a href='index.php?page=glossary#top' class='backLink'>Back to top</a>
    
    <a name='B'></a><h4>B</h4>
    <ul class='glossaryList'>
      <li><a href='index.php?page=glossary&term_id=493'>Backward Elimination</a></li>
      ... and so on ...
    Notice the anchor (a) tags. It is their textual content that holds the names we want to catalog, correct? So to access them, we're using the XPath statement that drills down the nodes to those anchor tags and grabs their content: //ul[@class='glossaryList']//a//text(). Notice how I only highlighted (purple) those anchor tags (nodes) that fit this definition, and their content (orange).

    That translates to "find any unordered list (ul) nodes with the attribute class='glossaryList' and subset only those anchor (a) nodes within those glossary list nodes. Finish by returning their textual content (i.e., the text between the anchor tags <a ...> ... some text here ... </a>)."
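    Put together in R, that is only a few lines (a sketch; xpathSApply here is just a shortcut for getNodeSet followed by sapply over xmlValue):
    Code: 
    library(XML)
    URL   <- "http://www.statistics.com/resources/glossary/"
    doc   <- htmlTreeParse(URL, useInternalNodes = TRUE)
    terms <- xpathSApply(doc, "//ul[@class='glossaryList']//a", xmlValue)  # one term per anchor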

    Every scraping situation is going to depend on the context of that situation. That context will define your method. Here our data was nicely embedded within the HTML, and the HTML nodes were nicely attributed in a way that let us use a simple XPath statement to grab what we want. In the prior case you link to, everything we wanted was considered plain text within a 'pre' tag (node), and the approach required that we (1) grab that pre node section (the HTML within is considered text, since a pre tag is similar to the BB code 'noparse' we use here), then (2) parse the text of it. We used the formatted structure of that text to extract the information using grep or something. A different situation requires a different approach. Thus, to successfully scrape, you need to understand how you access the information, how that information is structured, and how you can use that structure to identify the elements of it that you want. That is why I began this thread by focusing on how we could get the information we wanted. I provided a basic logic that should work. Turns out I was right, and that logic was entirely encapsulated in that XPath expression.

    PS: The first HTML coloring I did shows the ul classes (green) that fit the XPath definition. So think of it in those terms. We need to drill down within the nested structure of this HTML (XML) document to grab the text of the nodes that meet our specification. As I alluded to with the prior example, you may not always be able to directly grab that content as some node value. But you should be familiar with HTML and XML document structures, common HTML tags, be able to look at HTML source code to identify the structure (like I did here), and then use that to get "as close to" the data you want as you can. From there it may take additional processing, but the goal is to parse the data any way you can until you get it the way you want. The other problem that arises is access, and that'll probably be something to do with RCurl (e.g., getting to the chatbox through authentication first).


  #9 bryangoodrich

    Re: Still trying to learn to scrape

    For fun, you may try the alternative logic my initial HTML coloring alluded to. I always think first of a brute force method, and I was thinking:

    I want those list elements that are of the glossary, and I only want those anchor tags that link (reference) a page specified by 'page=glossary'. Then I want to extract the xmlValue of those anchors. You can still use XPath to do this, I believe, but it's just another approach. How would you use XPath to at least get close to this solution? Or could you simply grab the correct list elements and then grep the anchors to grab those that meet the condition I specified? Then use grep to grab the contents between the anchor tags (i.e., manually do the "text()" function).
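    A hypothetical sketch of that brute-force route (the regex is mine, not from the thread, and assumes the term links look exactly like the source snippet above):
    Code: 
    html  <- paste(readLines(URL), collapse = "\n")
    # Each term sits in ...term_id=NNN'>Term Name</a>; grab those spans
    hits  <- regmatches(html, gregexpr("term_id=[0-9]+'>[^<]+</a>", html))[[1]]
    terms <- sub("</a>", "", sub(".*'>", "", hits))  # strip the tag scaffolding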

    This just shows, there's more than one way to skin a cat. But it begs the question, wtf are people doing skinning cats?!

    PS: Is it the case that the only anchors within the glossaryList ul nodes are the ones of interest? Then the above logic for specifying "the correct anchor tags" is superfluous, no? These are all things you should be considering when you reason about your approach. It may be the case that within the ul nodes there were other anchors (e.g., maybe those "back to top" anchors were within the unordered list instead of outside it; not every HTML designer codes so cleanly, making it easy for us!). In that case, you would have to do some additional parsing. You should first consider: can XPath help me narrow down that specification? E.g., can it let me search the anchor node attributes in a way that lets me grab those anchors that I want? If not, can I manually find a way to do it myself (e.g., with grep)? That is what I'm getting after. If you're really brave, screw up the HTML you're provided. Stick those class='backLink' "back to top" anchor tags within the ul nodes. Save it as a text document. Pull that text document into R (instead of using RCurl, you're just parsing a text document in XML). Then attempt this new, more challenging, problem.
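    For what it's worth, XPath can do that attribute narrowing directly. A sketch (contains() is standard XPath, though this exact expression is my assumption and untested against the page):
    Code: 
    # Keep only anchors whose href carries a term_id, even if stray links sneak into the ul
    getNodeSet(doc, "//ul[@class='glossaryList']//a[contains(@href, 'term_id')]//text()")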


  #10 trinker

    Re: Still trying to learn to scrape

    Getting better and closer to understanding this. I have a new problem where I want to scrape what appears (from the HTML) to be a table. I can pull the emoticons but not the column with the meaning that corresponds to each emoticon. Here's what I have so far:

    Code: 
    library(RCurl)
    library(XML)
    
    URL   <- "http://pc.net/emoticons/"  # The scraping target
    doc3  <- htmlTreeParse(URL, useInternalNodes = TRUE)  # Store the HTML document as parsed XML
    nodes <- getNodeSet(doc3, "//td[@class='smiley']//a//text()")  # Grab the text nodes the XPath matches
    x     <- sapply(nodes, xmlValue)  # Convert that XML content into an R character vector
    I'm getting better but this is a new situation.

    Sample of the HTML:

    Code: 
    <table>
    <tr>
    <td class="smiley"><a href="smiley/alien">(.V.)</a></td>
    <td class="def">Alien</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/angel">O:-)</a></td>
    <td class="def">Angel</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/angry">X-(</a></td>
    <td class="def">Angry</td>
    </tr>
    </table>
    <h3>B</h3>
    <table>
    <tr>
    <td class="smiley"><a href="smiley/baby">~:0</a></td>
    <td class="def">Baby</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/big_grin">:-D</a></td>
    <td class="def">Big Grin</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/bird">(*v*)</a></td>
    <td class="def">Bird</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/braces">:-#</a></td>
    <td class="def">Braces</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/broken_heart">&lt;/3</a></td>
    <td class="def">Broken Heart</td>
    </tr>
    </table>
    <h3>C</h3>
    <table>
    <tr>
    <td class="smiley"><a href="smiley/cat">=^.^=</a></td>
    <td class="def">Cat</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/clown">*&lt;:o)</a></td>
    <td class="def">Clown</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/confused">O.o</a></td>
    <td class="def">Confused</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/confused">:-S</a></td>
    <td class="def">Confused</td>
    </tr>
    <tr>
    <td class="smiley"><a href="smiley/cool">B-)</a></td>
    I'm going back through BG's explanations.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  #11 bryangoodrich

    Re: Still trying to learn to scrape

    Uh, what are you trying to get? It worked exactly as you described, "get the textual content of the anchors." It returned

    Code: 
     [1] "(.V.)"   "O:-)"    "X-("     "~:0"     ":-D"     "(*v*)"   ":-#"     "</3"    
     [9] "=^.^="   "*<:o)"   "O.o"     ":-S"     "B-)"     ":_("     ":'("     "QQ"     
    [17] "\\:D/"   "*-*"     ":o3"     "#-o"     ":*)"     "//_^"    ">:)"     "<><"    
    [25] ":-("     ":("      ":-("     "=P"      ":-P"     "8-)"     "$_$"     ":->"    
    [33] "=)"      ":-)"     ":)"      "#"       "<3"      "{}"      ":-|"     "X-p"    
    [41] ":-)*"    ":-*"     "(-}{-)"  "=D"      ")-:"     "(-:"     "<3"      "=/"     
    [49] ":-)(-:"  "@"       "<:3)~"   "~,~"     ":-B"     "^_^"     "<l:0"    ":-/"    
    [57] "=8)"     "@~)~~~~" "=("      ":-("     ":("      ":-7"     ":-@"     "=O"     
    [65] ":-o"     ":-)"     ":)"      ":-Q"     ":>"      ":P"      ":o"      ":-J"    
    [73] ":-&"     "=-O"     ":-\\"    ":-E"     "=D"      ";-)"     ";)"      "|-O"    
    [81] "8-#"
    If you want the name, then what node does it belong to? It belongs to the table data (td) of class "def". That is it. The other one was an example of "get the text on an anchor node inside a table data of class 'smiley.'" Instead, you want "get the text on a table data of class 'def.'"

    Get it? Change your XPath and you're done.

    Spoiler:


    If you want to be creative, find a way to create the data frame of smiley and name in one instance. I'll give it a try after I write this paper, but I would assume there's a creative way to do it instead of extracting the vectors individually and then pasting them together. Maybe not. Will find out!
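    For comparison, the straightforward two-pull version would be something like this sketch (uses doc3 from the code above and assumes every row has exactly one 'smiley' and one 'def' cell; the creative one-pass version is left open, as above):
    Code: 
    smiley    <- xpathSApply(doc3, "//td[@class='smiley']//a", xmlValue)
    def       <- xpathSApply(doc3, "//td[@class='def']", xmlValue)
    emoticons <- data.frame(smiley, def, stringsAsFactors = FALSE)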


  #12 trinker

    Re: Still trying to learn to scrape

    Thanks BG! On my own (no spoiler):

    Code: 
    nodes2 <- getNodeSet(doc3, "//td[@class='def']//text()")  # Grab the text nodes of the 'def' cells
    x2     <- sapply(nodes2, xmlValue)  # Convert that XML content into an R character vector
    The problem was where I was trying to extract the information:
    Code: 
    <td class="smiley"><a href="smiley/angry">X-(</a></td>
    <td class="def">Angry</td>
    I tried to pull it from the red rather than the blue.

    The extra credit professor BG gave is definitely above my skill set but would certainly be of interest.

    EDIT: I looked at your spoiler. Still above me. I always learn by tearing apart what you're doing and trying to apply it to similar and new situations.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  #13 bryangoodrich

    Re: Still trying to learn to scrape

    lol I didn't even notice the links had the names in them, but that's because it was unimportant. If the <td class="def">Angry</td> didn't exist, you could extract the "smiley/angry" href attributes, convert those attribute values to a vector, split each character string on "/", and keep only the 2nd part of the split. Make sense?
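    A sketch of that fallback (xmlGetAttr pulls the href attribute; the rest is the split-and-keep step just described):
    Code: 
    hrefs <- xpathSApply(doc3, "//td[@class='smiley']//a", xmlGetAttr, "href")
    nms   <- sapply(strsplit(hrefs, "/"), `[`, 2)  # "smiley/angry" -> "angry"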

    Like I said, the document is a tree structure

    Code: 
    // Locate smiley name definition
    html
        --> body
            --> table
                --> tr
                    --> td@class="def"
                        --> text()
    
    // Locate smiley text
    html
        --> body
            --> table
                --> tr
                    --> td@class="smiley"
                        --> a
                            --> text()
    Make sense? There's more crap in the document, but the way the nodes (tags) are nested, this is the "route" or "path" we traverse to get to it. I posted the full path to the objects of interest. Using XPath, we can shorten that up. In particular, we're only after those table data (td) nodes of the given classes. Thus, we don't need to find a path across the whole document. We start off at those nodes: //td[@class="def"]. Recognize that if there were another table with a td node of the same class, we'd have been pulling from that other table, too. We may not want that table. Then we might have to do something like //table[1] to grab only the first table (XPath positions index from 1, not 0).

    The thing to recognize is that XML stores information, but it is information stored in a certain way. That way is a "document" that follows a sort of tree structure. I use the tree representation when we think of "paths" to the nodes of interest. In this respect, the only leaf is the node of interest and every other node is the branch we're traversing to get to the end ('leaf'). Make sense? If not, I can start a google presentation and draw it out for you :P

    EDIT: Actually, looking at the original HTML, there are multiple tables. So the XPath we're using is best, because we want all table data (td) nodes of the given class across all tables. If the web designer, who did a good job giving each td type a different class ('def' and 'smiley'), wanted to be really rigorous, they would have given each table a class, probably for the group of smileys it holds--e.g., class="A", class="B", etc.


  #14 trinker

    Re: Still trying to learn to scrape

    Now I get what you mean by tree structure.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  #15 bryangoodrich

    Re: Still trying to learn to scrape


    Good, then notice that the nodes (leaves) we're interested in are spread across many branches--viz., the various tables that contain subsets of the list of smileys. So the XPath we're using to grab all the td nodes of the given class is cutting across ('pruning') those multiple branches, giving us access to the content we want. This is clearly a different approach than had we done something like

    Code: 
    //table[1:last()]//td[@class="def"]//text()  # That subset notation is NOT correct!
    The above would say "go through each table from the first to the last, grab their td elements of class 'def', and return their textual content." See the difference between that and the pruning we did? They would return the same stuff, but there might be both a difference in performance (if only slight) and in how it is done conceptually. Concepts matter! We don't even need to specify the table; it is superfluous. We might have had to go that route if the web designer did NOT give class names to the td nodes. For instance, suppose they had no class, but we only wanted the second one. Then the above might be what we MUST do, except with "//table[*]//td[2]//text()" instead.

    We might do a pruning across branches if we only want the 2nd td node (each table row is expected to have 2 td children). How? Think about R. If we had all the td nodes as a vector, how would we grab every 2nd one? Probably by something like x[c(F, T)], right? Well, XPath lets you do math, so you should be able to say something like "is this position divisible by 2?" Then it should work. Alternatively, and probably easier: if the only table row (tr) nodes that exist are smiley records, then we could say "grab all table rows and return the textual content of their 2nd table data node." In XPath, what does that mean?
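    A guess at what those two routes might look like (untested sketches, in the spirit of the caveat below):
    Code: 
    getNodeSet(doc3, "//td[position() mod 2 = 0]//text()")  # the math route: every 2nd td in its row
    getNodeSet(doc3, "//tr/td[2]//text()")                  # or simply the 2nd td of each table row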

    Spoiler:


    Probably smart if I actually check the above DOES work, because I could be full of **** lol

