
Does anyone know whether I can scrape this site (sahibinden.com) or this one (hurriyetemlak.com) with httr and rvest, or should I use Selenium or PhantomJS?

Both sites seem to be using AJAX, and I can't seem to get past it.

Essentially what I am after is the following:

library(rvest)  # provides read_html(), html_nodes(), html_text(), and the %>% pipe

# I want this to return the titles of the listings, but I get character(0)
"https://www.sahibinden.com/satilik" %>% 
  read_html() %>% 
  html_nodes(".searchResultsItem .classifiedTitle") %>% 
  html_text() 

# I want this to return the prices of the listings, but I get 503
"https://www.hurriyetemlak.com/konut" %>% 
  read_html() %>% 
  html_nodes(".listing-item .list-view-price") %>% 
  html_text()

Any ideas involving V8 or artificial sessions are welcome.

Purely curl-based solutions are also welcome; I'll try to translate them into httr later :)
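
For reference, a minimal sketch of the kind of header-only httr attempt I have been making (the User-Agent string here is just a placeholder, not something the site necessarily requires):

library(httr)

# plain GET with a browser-like User-Agent, just to inspect the status code returned
resp <- GET(
  "https://www.sahibinden.com/satilik",
  add_headers(`User-Agent` = "Mozilla/5.0 (X11; Linux x86_64)")
)
status_code(resp)  # e.g. 200 vs. 403/503 when the request is blocked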

Thanks

deann

1 Answer


You will have to set cookies to make a successful request.

First, one should check whether the site (sahibinden) allows scraping at all:

  • robotstxt::paths_allowed(paths = "https://www.sahibinden.com/satilik", warn = FALSE) --> robotstxt does not seem to forbid it (a runnable sketch of this check follows the list)
  • if you reload the site after deleting your cookies in the browser, it no longer allows access and reports unusual behaviour --> an indication of countermeasures against scraping
  • to be sure, one should read the terms of use.
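
A minimal sketch of that robots.txt check (assuming the robotstxt package is installed; the two URLs are the ones from the question):

library(robotstxt)

# check whether the crawl paths are permitted by each site's robots.txt
paths_allowed(paths = "https://www.sahibinden.com/satilik", warn = FALSE)
paths_allowed(paths = "https://www.hurriyetemlak.com/konut", warn = FALSE)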

Therefore, I will share the "theoretical" code, but not the required cookie data, which is user-dependent anyway.

Full code would read:

library(xml2)
library(httr)
library(rvest)     # html_nodes(), html_text()
library(magrittr)
library(DT)

url <- "https://www.sahibinden.com/satilik"

YOUR_COOKIE_DATA <- NULL
if (is.null(YOUR_COOKIE_DATA)) {
  stop("You did not set your cookie data.
        Also please check if the terms of use allow the scraping.")
}

# request the page with the cookie header and read the body as text
response <- url %>%
  GET(add_headers(.headers = c(Cookie = YOUR_COOKIE_DATA))) %>%
  content(type = "text", encoding = "UTF-8")

# relative XPaths for the columns of interest within each result row
xpathes <- data.frame(
  XPath0 = 'td[2]',
  XPath1 = 'td[3]/a[1]',
  XPath2 = 'td/span[1]',
  XPath3 = 'td/span[2]',
  XPath4 = 'td[4]',
  XPath5 = 'td[5]',
  XPath6 = 'td[6]',
  XPath7 = 'td[7]',
  XPath8 = 'td[8]',
  stringsAsFactors = FALSE
)

# one node per listing row in the results table
nodes <- response %>%
  read_html() %>%
  html_nodes(xpath = "/html/body/div/div/form/div/div/table/tbody/tr")

# for each XPath, extract the text from every row (NA where the cell is missing)
output <- lapply(xpathes, function(xpath) {
  lapply(nodes, function(node) {
    html_nodes(x = node, xpath = xpath) %>%
      {ifelse(length(.), yes = html_text(.), no = NA)}
  }) %>% unlist()
})

output %>% data.frame() %>% DT::datatable()

Concerning the right to scrape the website data, I try to follow: Should questions that violate API Terms of Service be flagged?, although in this case it is only a "potential violation".

Reading cookies programmatically:

I am not sure it is possible to skip the browser entirely; a rough sketch of a browser-assisted approach is below:
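
This is an untested sketch only, assuming RSelenium with a local Firefox driver; the exact flow and cookie handling are not verified against the site:

library(RSelenium)
library(httr)

# start a local browser session (this still requires a real or headless browser)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr  <- driver$client

# let the browser visit the page so the site can set its cookies
remDr$navigate("https://www.sahibinden.com/satilik")
Sys.sleep(5)  # give any JavaScript time to run

# collect the cookies and fold them into a single Cookie header string
cookies <- remDr$getAllCookies()
cookie_header <- paste(
  vapply(cookies, function(ck) paste0(ck$name, "=", ck$value), character(1)),
  collapse = "; "
)

# reuse the cookie string in a plain httr request
resp <- GET(
  "https://www.sahibinden.com/satilik",
  add_headers(Cookie = cookie_header)
)
status_code(resp)

remDr$close()
driver$server$stop()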

Tonio Liebrand
  • thanks for your answer.. however, the essence of the question is how I can get the data while staying entirely in R, i.e. how I can get the needed cookie data with or without hitting the ajax endpoint. `GET("https://www.sahibinden.com/satilik") %>% cookies()` returns no cookies to be used, and copy-pasting cookies from a browser is not an option; better to use a headless one. Also, you are perfectly correct to raise concerns about scraping the site; however, these sites are just examples of sites I am not able to scrape with rvest, so I was wondering whether I am missing something. I am not planning on scraping them – deann Mar 30 '20 at 13:29
  • regarding the cookies: if you want a generic answer, I am not sure it is possible to fully skip using the browser, see my edit above. – Tonio Liebrand Mar 30 '20 at 13:53