R and reading data from an XML/HTML file

#1
Hi, I have some HTML files, and I want to read each word into an array. The format is:

<blockquote>
<A NAME=1.1.2>And I in going, madam, weep o'er my father's death[/COLOR]</A><br>
<A NAME=1.1.3>anew: but I must attend his majesty's command, to</A><br>
<A NAME=1.1.4>whom I am now in ward, evermore in subjection.</A><br>
</blockquote>

Basically, I'd like to read everything between <A NAME=#.#.#> and </A> into an array (delimiting words with a space). Is there a function (or functions) that will help me do this? See image attached.
 

bryangoodrich

Probably A Mammal
#3
Don't hold your breath ...

But I will say that if all you want is that text, it is always within that pattern, and you don't care what is in between those two tags, then regex is an easy solution. You just need a look back and a look forward (a lookbehind and a lookahead, in regex terms), i.e., a pattern that is required before or after what is being matched but is not part of the matched result. You know the look back and look forward; they're those two tags. I have code in the TIL thread that demonstrates this sort of regex, I believe. I'm on my phone, so I'm not going to find it. But it would be good practice for you, trinker. Do a little Perl regex ;)
 
#5
From the examples in the image given, it's possible that OP can rely solely on an XPath expression ("//blockquote/a[@name]") and not need regular expressions at all.
 

trinker

ggplot2orBust
#6
@helicon is correct, but that requires figuring out how to do it, which they avoided as well. So here's an incorrect way to get the correct answer:

Code:
dat <- readLines(n=5)  # interactive: paste the five lines below at the console
<blockquote>
<A NAME=1.1.2>And I in going, madam, weep o'er my father's death[/COLOR]</A><br>
<A NAME=1.1.3>anew: but I must attend his majesty's command, to</A><br>
<A NAME=1.1.4>whom I am now in ward, evermore in subjection.</A><br>
</blockquote>

library(qdap)
dat2 <- dat[grepl("<A NAME", dat)]
bracketX(dat2, "angle")
Yielding:

Code:
> bracketX(dat2, "angle")
[1] "And I in going, madam, weep o'er my father's death [/COLOR]"
[2] "anew: but I must attend his majesty's command, to"          
[3] "whom I am now in ward, evermore in subjection."
EDIT: If you want a vector of words or a list of word vectors then use:

Code:
bag_o_words(bracketX(dat2, "angle"))
word_split(bracketX(dat2, "angle"))

## > bag_o_words(bracketX(dat2, "angle"))
##  [1] "and"        "i"          "in"         "going"      "madam"     
##  [6] "weep"       "o'er"       "my"         "father's"   "death"     
## [11] "color"      "anew"       "but"        "i"          "must"      
## [16] "attend"     "his"        "majesty's"  "command"    "to"        
## [21] "whom"       "i"          "am"         "now"        "in"        
## [26] "ward"       "evermore"   "in"         "subjection"


## > word_split(bracketX(dat2, "angle"))
## $`And I in going, madam, weep o'er my father's death [/COLOR]`
##  [1] "And"      "I"        "in"       "going,"   "madam,"   "weep"    
##  [7] "o'er"     "my"       "father's" "death"    "[/COLOR]"
##
## $`anew: but I must attend his majesty's command, to`
## [1] "anew:"     "but"       "I"         "must"      "attend"    "his"      
## [7] "majesty's" "command,"  "to"       
## 
## $`whom I am now in ward, evermore in subjection.`
##  [1] "whom"       "I"          "am"         "now"        "in"        
##  [6] "ward,"      "evermore"   "in"         "subjection" "."
Note this uses the dev version of these functions. Either download that HERE or use bag.o.words and word.split (periods instead of underscores).
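
If you don't have qdap installed, something similar can be roughed out in base R. This is only a sketch of the "vector of words" vs. "list of word vectors" distinction, not how qdap does it internally; the names sentences, word_list and flat are made up for the example, and I've left the stray [/COLOR] out of the input.

```r
# Hypothetical input: the three lines with the tags already removed
sentences <- c("And I in going, madam, weep o'er my father's death",
               "anew: but I must attend his majesty's command, to",
               "whom I am now in ward, evermore in subjection.")

# List of word vectors, one element per sentence (punctuation stays attached)
word_list <- strsplit(sentences, " ", fixed = TRUE)

# One flat vector of words: lower-case, keep only letters and apostrophes
flat <- tolower(unlist(word_list))
flat <- gsub("[^a-z']", "", flat)
flat <- flat[nzchar(flat)]  # drop anything that became empty
```

Unlike word_split, the list version here keeps the final period glued to "subjection." rather than splitting it off as its own element.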
 
#7
Sorry, I didn't have time to sit down and work out the full code earlier, but now I do. Here is the approach I was getting at. It will need some more fine-tuning depending on OP's needs.

Code:
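library(XML)
myfile <- "/path/to/file"
# Parse the file as HTML and keep internal nodes so XPath queries work
myhtmldoc <- htmlTreeParse(myfile, useInternalNodes = TRUE)
# Take the text of every <a name=...> inside a <blockquote>, then split on spaces
myarray <- as.array(strsplit(unlist(xpathApply(myhtmldoc, "//blockquote/a[@name]", xmlValue)), " "))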
Which gives:

Code:
[[1]]
 [1] "And"           "I"             "in"            "going,"        "madam,"        "weep"          "o'er"         
 [8] "my"            "father's"      "death[/COLOR]"

[[2]]
[1] "anew:"     "but"       "I"         "must"      "attend"    "his"       "majesty's" "command,"  "to"       

[[3]]
[1] "whom"        "I"           "am"          "now"         "in"          "ward,"       "evermore"    "in"          "subjection."
 

bryangoodrich

Probably A Mammal
#8
Since nobody wanted to do the regex option ...

Code:
# The data, each to its own line in this case.
x <- c("<blockquote>", "<A NAME=1.1.2>And I in going, madam, weep o'er my father's death[/COLOR]</A><br>", 
"<A NAME=1.1.3>anew: but I must attend his majesty's command, to</A><br>", 
"<A NAME=1.1.4>whom I am now in ward, evermore in subjection.</A><br>", 
"</blockquote>")

regex <- "(?<=<A NAME=[0-9]\\.[0-9]\\.[0-9]>).*(?=</A>)"  # No, not the best, but it works
regmatch <- gregexpr(regex, x, perl = TRUE)
regmatches(x, regmatch)  # Returns list with character(0) for unmatched lines

# Better results
unlist(regmatches(x, regmatch))
# [1] "And I in going, madam, weep o'er my father's death[/COLOR]"
# [2] "anew: but I must attend his majesty's command, to"         
# [3] "whom I am now in ward, evermore in subjection."
Honestly, I would probably create wrappers or something to make this easier. These are all low-level functions, but obviously you want to go from something like a data structure of sentences to a similar data structure of matches. Each of these steps will help you get there, and each can be used to do different things along the way (like validation). But if you know that what you want fits a specific pattern of "this is behind the sentence" and "this is in front of the sentence," then this approach can easily be used to construct what you want, as long as you're willing to learn a little regex.

Code:
match_between <- function(x, lookback, lookahead, pattern = ".*") {
    # fixed = TRUE inserts each piece literally; without it, gsub() treats
    # backslashes in the replacement as escapes and "\\." would become "."
    regex <- "(?<=__LOOKBACK__)__PATTERN__(?=__LOOKAHEAD__)"
    regex <- gsub("__LOOKBACK__", lookback, regex, fixed = TRUE)
    regex <- gsub("__LOOKAHEAD__", lookahead, regex, fixed = TRUE)
    regex <- gsub("__PATTERN__", pattern, regex, fixed = TRUE)
    regmatch <- gregexpr(regex, x, perl = TRUE)
    regmatches(x, regmatch)
}

unlist(match_between(x, "<A NAME=[0-9]\\.[0-9]\\.[0-9]>", "</A>"))  # Same results
But when it comes to parsing HTML or XML, you're generally better off using the properties of HTML or XML (XPath) to get what you want, which does allow you to do more than the general approach I've outlined above. The point is whether or not the HTML or XML you are parsing has the data in a format that you want. Even if you use standard parsing techniques, you might simply be pulling out the information contained therein, and it will still need cleaning up. At that point, you're probably still going to need to do some text manipulation, and some Perl-based regex will always be of value!
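
One caveat on the regex side: as far as I know, a Perl-style look back can't contain an unbounded quantifier like +, so the [0-9]\\.[0-9]\\.[0-9] pattern only covers single-digit numbers in NAME. If the numbers can run to several digits, a capture group with sub() is one way around it. This is just a sketch: the NAME=12.3.104 line is made up to show the multi-digit case, and pattern, hits and text are names I picked for the example.

```r
# Sample lines; the second <A> tag uses a hypothetical multi-digit NAME
x <- c("<blockquote>",
       "<A NAME=1.1.2>And I in going, madam, weep o'er my father's death</A><br>",
       "<A NAME=12.3.104>anew: but I must attend his majesty's command, to</A><br>",
       "</blockquote>")

# Capture everything between the opening <A NAME=...> tag and the closing </A>
pattern <- "^<A NAME=[0-9]+\\.[0-9]+\\.[0-9]+>(.*)</A>.*$"

hits <- grepl(pattern, x)            # which lines have the pattern at all
text <- sub(pattern, "\\1", x[hits]) # replace the whole line with the captured group
```

The grepl() step matters because sub() returns non-matching lines unchanged, so without it the blockquote lines would come through as-is.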

Learn it. Love it.

PS: I think I'll add something like this to my bmisc package lol