
I'm trying to scrape a website encoded in UTF-8 using the httr package, but apparently the content function of that package only allows for specifying the encoding if you parse the website as text. Unfortunately, I cannot parse it as text, since I would like to use xpath queries on it afterwards. Here's an example:

library(XML)
library(httr)

page <- GET("http://ec.europa.eu/archives/commission_2004-2009/index_en.htm")
test <- content(page, as = "parsed")
# Get a list of names, many of which contain non-standard characters
xpathSApply(test, "//img", xmlGetAttr, "alt") 

# This gives the correct encoding, but outputs a character vector, 
# on which I cannot use xpath queries
test <- content(page, as = "text", encoding = "utf-8")
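A round-trip workaround I could try (just a sketch, not yet tested through my proxy) is to pull the body as text with the encoding set, then re-parse that string with htmlParse so XPath queries work again:

# Sketch: fetch the body as UTF-8 text, then parse the string itself
# (asText = TRUE) so XPath queries are available again.
txt  <- content(page, as = "text", encoding = "UTF-8")
test <- htmlParse(txt, asText = TRUE, encoding = "UTF-8")
xpathSApply(test, "//img", xmlGetAttr, "alt")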

Update:

# htmlParse returns a parsed document, but the non-standard characters are 
# not properly encoded, i.e. the result is the same whether or not I specify the 
# "encoding" argument
test <- htmlParse(page, encoding = "UTF-8")

# Non-standard characters in names still not properly encoded
xpathSApply(test, "//img", xmlGetAttr, "alt")
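As a diagnostic (a sketch; httr normalises header names to lower case), I can at least check which charset the server declares, since the parser has to guess when none is given:

# Inspect the declared content type; a missing charset would explain
# why the parser mis-guesses the encoding.
headers(page)[["content-type"]]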
user2987808
  • What do you mean by "doesn't work"? Because I get a result, albeit an *access denied* message, but it's still a parsed XML document. – Rich Scriven Aug 27 '14 at 07:54
  • I, too, get a parsed XML document with `content(page, as="parsed")` [httr v0.4.0.99, R3.1.1, OS X] and the `xpathSApply` gives me a 113 element vector with the names from the `img` `alt` tag. NOTE: When you **do** get it working you should probably change the XPath to `//img[@class='comm_img']` if you just want the names of the commissioners. – hrbrmstr Aug 27 '14 at 11:05
  • Thanks, I appreciate the help. I'm using the same httr and R versions, but I'm on a Windows machine. Could it have something to do with that, you think? – user2987808 Aug 27 '14 at 13:29
  • Why do you think you can't supply encoding for `as = "parsed"`? This works for me: `content(page, as = "parsed", encoding = "utf-8")` (maybe I fixed it in the dev version?) – hadley Aug 27 '14 at 17:11
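Spelling out the suggestion from hadley's comment as a sketch (whether the `encoding` argument is honoured for `as = "parsed"` may depend on the httr version):

# Sketch of the comment's suggestion: pass the encoding to content()
# even when asking for a parsed document, then query as before.
test <- content(page, as = "parsed", encoding = "UTF-8")
xpathSApply(test, "//img", xmlGetAttr, "alt")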

1 Answer


Try:

 test <- htmlParse("http://ec.europa.eu/archives/commission_2004-2009/index_en.htm")
 res <- xpathSApply(test, "//img", xmlGetAttr, "alt")
 tail(res)
 #[1] "Slovakian"   "PDF"         "PDF"         "PDF"         "PDF - 66 KB"
#[6] "français" 

Running your code as well, where `res1` comes from the first snippet (`content(page, as = "parsed")`) and `res2` from the update (`htmlParse(page, encoding = "UTF-8")`), gives the same, correctly encoded output:

 tail(res1)
 #[1] "Slovakian"   "PDF"         "PDF"         "PDF"         "PDF - 66 KB"
 #[6] "français"  

 tail(res2)
 #[1] "Slovakian"   "PDF"         "PDF"         "PDF"         "PDF - 66 KB"
 #[6] "français"  
akrun
  • Thanks for this. So I guess the problem lies in the fact that I use the GET function first (which I have to do to use my work proxy server). Any ideas on why that is? – user2987808 Aug 27 '14 at 08:45
  • @user2987808 Sorry I am not sure about that. – akrun Aug 27 '14 at 08:46
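For anyone behind a similar proxy, one hedged sketch (the proxy host and port below are placeholders) keeps GET in the pipeline but does the decoding and parsing explicitly:

 # Hypothetical proxy settings -- replace with your own host and port.
 page <- GET("http://ec.europa.eu/archives/commission_2004-2009/index_en.htm",
             use_proxy("proxy.example.com", 8080))
 # Decode explicitly, then re-parse the string so XPath still works.
 doc <- htmlParse(content(page, as = "text", encoding = "UTF-8"),
                  asText = TRUE, encoding = "UTF-8")
 xpathSApply(doc, "//img", xmlGetAttr, "alt")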