Accessing a structure field (XML package)

Question

I get this structure from htmlTreeParse, and I need to extract the text contained in the page:

doc <- htmlTreeParse(url, useInternalNodes = FALSE)
doc
$file
[1] "http://www.google.com/trends/fetchComponent?q=asdf,qwerty&cid=TIMESERIES_GRAPH_0&export=3"

$version
[1] ""

$children
$children$html
<html>
<body>
<p>// Data table response google.visualization.Query.setResponse([INSERT LOT OF JSON HERE])</p>
</body>
</html>
attr(,"class")
[1] "XMLDocumentContent"

I'm looking for what's in the "p" block. I have not found anything that could help me so far.
So, how can I get that data?

Yes, a few times, but my problem isn't really in the htmlTreeParse function, it's more how to manipulate the data it returns. — whisust, Mar 06 '14 at 13:57
Sorry for not being clearer earlier. There's a gold mine of examples at the bottom. I'm sorry I can't give you any concrete `xpath` pointers, but I think that those examples are a good start. — Roman Luštrik, Mar 06 '14 at 15:50

score 0 · Accepted Answer · answered Mar 07 '14 at 06:18

If you want to run XPath on the document, you need to set useInternalNodes = TRUE (see the documentation on this argument). The following code should get you started with XPath.

[Note: When I run your code I get an error page, not the document you get.]

library(XML)
url <- "http://www.google.com/trends/fetchComponent?q=asdf,qwerty&cid=TIMESERIES_GRAPH_0&export=3"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
# XPath examples
p        <- doc["//p"]        # nodelist of all the <p> elements (there aren't any...)
div      <- doc["//div"]      # nodelist of all the <div> elements
scripts  <- doc["//script"]   # nodelist of all the <script> elements
b.script <- doc["//body/script"]    # nodelist of all <script> elements within the <body>

# title of the page
xmlValue(doc["//head/title"][[1]])
# [1] "Google Trends - An error has been detected"

Basically, you can use an XPath string as if it were an index into the document. So in your case,

xmlValue(doc["//p"][[1]])

should return the text contained in the (first) <p> element in doc.
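
As an offline illustration of the same idea, here is a minimal sketch that parses an inline HTML string instead of the live URL, so it does not depend on what Google returns. The HTML string and its `{"status":"ok"}` payload are made-up placeholders, and the `sub()` pattern assumes the text has the `setResponse(...)` shape shown in the question. It uses `xpathSApply`, which is another way to run the XPath query than indexing with `doc["//p"]`.

```r
library(XML)

# Stand-in for the page in the question; the payload is a made-up placeholder.
html <- '<html><body><p>// Data table response google.visualization.Query.setResponse({"status":"ok"})</p></body></html>'

# asText = TRUE parses the string itself instead of treating it as a file or URL.
# htmlParse() returns an internal document, like htmlTreeParse(useInternalNodes = TRUE).
doc <- htmlParse(html, asText = TRUE)

# Text of every <p> element, as a character vector.
p_text <- xpathSApply(doc, "//p", xmlValue)

# Keep only what sits between the outermost parentheses of setResponse(...).
payload <- sub("^[^(]*\\((.*)\\)[^)]*$", "\\1", p_text[1])
payload
```

From there, `payload` could be handed to a JSON parser, provided the real response is valid JSON.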

Thanks for your help, it worked fine. As for the error page, maybe you tried to access the URL several times in a row and Google blocked you afterwards (if you paste it manually into a browser, you'll get the data). — whisust, Mar 07 '14 at 08:58
