HTML scraping - R scrapeR

Question

I am trying to parse data encoded in HTML format. An example of the string I am trying to parse is:

Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />

I want to get the text before <img and the text in the alt= attribute.

Desired output:

Simplify the polynomial by combining like terms. 3x+12-11x+14

I tried scrapeR:

y1 = scrape(str1)  # the above string is in str1 (as a character vector)

I get the following error message:

Error in which(value == defs) : 
  argument "code" is missing, with no default

Has anyone played with scrapeR? I am not sure what "code" refers to, as it is an option that is not described in the manual. I am just trying to see which default value is causing this.

The `scrape` function normally takes a URL as its first unnamed parameter, according to the documentation. What about `y1 = scrape(object="str1")`? — MrFlick, Jun 28 '14 at 19:08
It accepts str1. y1=scrape(str1) produces the error. y1=scrape(object=str1) produces a different kind of error: unable to locate object str1. I think object=xxx is for objects with URLs etc. — user3763914, Jun 28 '14 at 19:15
It should be `y = scrape(object="str1")` not `y1 = scrape(object=str1)`. See the documentation at: http://www.rdocumentation.org/packages/scrapeR/functions/scrape — MrFlick, Jun 28 '14 at 19:22
y = scrape(object="str1") puts the entire HTML wrapper in y. It now has etc. It is now a complete HTML file. I was expecting the opposite result. (Appreciate your help.) — user3763914, Jun 28 '14 at 19:59
@MrFlick, thanks for editing my original post and formatting it properly. (I will learn to do this). — user3763914, Jun 28 '14 at 20:00
But now what's the `class()` and `str()` of `y`? It should be some sort of iterable list now, I think. — MrFlick, Jun 28 '14 at 20:00

score 1 · Answer 1 · answered Jun 28 '14 at 20:12


Here's one way to extract that information:

str1 <- "Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />"

library(scrapeR)
y <- scrape(object="str1")[[1]]  # just get the first result

# text nodes that come before the <img> element
pretext <- sapply(xpathSApply(y, "//img/preceding::text()"), xmlValue)
# value of the alt attribute on the <img> element
alttext <- xpathSApply(y, "//img/@alt")

paste(pretext, alttext)
#[1] "Simplify the polynomial by combining like terms.  3x+12-11x+14"

scrape() returns an HTML/XML-like document that you can work with using functions like xpathSApply to find nodes and extract their values. Note that it is given the name of the object as a string (object="str1"), not the object itself.
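As a variation on the same idea, here is a minimal sketch that skips scrape() and parses the string with htmlParse() from the XML package (the package that provides xpathSApply and xmlValue). This is an assumption about what is convenient, not something the question requires. The names doc and result are just illustrative, and trimws() assumes R 3.2.0 or later.

```r
# Sketch only: parse the HTML fragment directly with the XML package.
library(XML)

str1 <- "Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />"

# asText = TRUE says that str1 holds markup, not a file name or URL
doc <- htmlParse(str1, asText = TRUE)

# text nodes that precede the <img> element
pretext <- xpathSApply(doc, "//img/preceding::text()", xmlValue)

# alt attribute of the <img> element (unname() drops the "alt" name)
alttext <- unname(xpathSApply(doc, "//img/@alt"))

# trimws() removes the trailing space of the text node, so paste()
# joins the two pieces with a single space
result <- paste(trimws(pretext), alttext)
result
```

Trimming the text node before pasting avoids the double space seen in the output above, which comes from the space that already sits before <img in the source string.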

answered Jun 28 '14 at 20:12

MrFlick


Thank you very much. It works. Now I am going to figure out why it works! (Thanks for pointing me in the right direction.) – user3763914 Jun 28 '14 at 21:24
Found a good tutorial on web scraping using R. Just posting it for future use: http://files.meetup.com/1503964/2010-06-24_WebScrapeIntro.pdf – user3763914 Jun 28 '14 at 21:30
