How to fix 'cannot open URL' error when scraping pictures using rvest

Question

I'm trying to scrape a picture using rvest, with this code:

library(rvest)

url <- "https://fr.wikipedia.org/wiki/Robert_Jardillier"
webpage <- html_session(url)
link.titles <- webpage %>% html_nodes(".noarchive .image img")

img.url <- link.titles %>% html_attr("src")

download.file(img.url, "test.png", mode = "wb")

But when I try to download the image, I get the following message:

trying URL '//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg'
Error in download.file(img.url, "test.png", mode = "wb") : 
  cannot open URL '//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg'
In addition: Warning message:
In download.file(img.url, "test.png", mode = "wb") :
  URL '//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg': status was 'URL using bad/illegal format or missing URL'

From the `download.file` help: "The url must start with a scheme such as ‘http://’, ‘https://’, ‘ftp://’ or ‘file://’. Which methods support which schemes varies by R version, but method = "auto" will try to find a method which supports the scheme." (comment by Ric, Oct 18 '22 at 19:45)
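
One way to see the problem this comment describes is to test whether each scraped URL begins with a scheme. The sketch below uses only base R; `has_scheme` is a hypothetical helper name (not an rvest or base R function), and the second sample URL is made up purely for contrast.

```r
# Hypothetical helper: TRUE when a URL starts with a scheme such as "https://".
has_scheme <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9+.-]*://", x)
}

urls <- c(
  # the value scraped in the question: protocol-relative, no scheme
  "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg",
  # made-up URL, for comparison only
  "https://example.org/picture.jpg"
)

# FALSE for the scraped URL, TRUE for the one that carries a scheme
has_scheme(urls)
```

Any element that comes back `FALSE` is one that `download.file` would likely reject in the same way as in the question.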

score 0 · Accepted Answer · answered Oct 18 '22 at 19:46

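The `src` attribute holds a protocol-relative URL: it starts with `//` and has no scheme, which `download.file` requires. The page itself is served over HTTPS, so prepend that scheme:

download.file(paste0("https:", img.url), "test.png", mode = "wb")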
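
If the selector ever returns several URLs, or a mix of absolute and protocol-relative ones, it may be safer to add the scheme only where it is missing. This is a minimal sketch in base R; `add_scheme` is a hypothetical helper, and defaulting to `https` is an assumption based on the page being served over HTTPS.

```r
# Hypothetical helper: prepend a scheme to protocol-relative URLs ("//host/path")
# and leave every other value (including NA) untouched.
add_scheme <- function(x, scheme = "https") {
  needs_scheme <- !is.na(x) & startsWith(x, "//")
  x[needs_scheme] <- paste0(scheme, ":", x[needs_scheme])
  x
}

# The value scraped in the question
img.url <- "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Robert_Jardillier_1932.jpg/220px-Robert_Jardillier_1932.jpg"
full.url <- add_scheme(img.url)

# The image is a JPEG, so a .jpg extension matches the content.
# Uncomment to actually fetch the file:
# download.file(full.url, "test.jpg", mode = "wb")
```

Note that `download.file` with a single `destfile` expects a single URL, so a vector of results would need a loop or `Map()` over matching file names.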

Ric

score 0 · Answer 2 · answered Oct 18 '22 at 20:12

This worked for me.

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
})

url <- "https://fr.wikipedia.org/wiki/Robert_Jardillier"
page <- read_html(url)

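page %>%
  html_elements("a") %>%                                  # all links on the page
  html_attr("href") %>%
  grep("Robert_Jardillier.*\\.jpg", ., value = TRUE) %>%  # keep links that mention the .jpg file
  unique() %>%
  basename() %>%
  # This builds the media-viewer page URL rather than a direct image URL,
  # so check that the saved file is really an image.
  paste0(url, "#/media/", .) %>%
  download.file(destfile = "test.jpg", mode = "wb")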
