I am trying to download a file from a URL using the polite package in R. Here is the code I am using:

    library(polite)

    # URL of the file to download
    eprice_xml_products_1 <- "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz"

    # Create a polite session
    session <- bow(eprice_xml_products_1)

    # Download the file using the rip function
    file_path <- rip(session, destfile = "xml_1.gz")

    print(file_path)

I have also tried this pipeline:


    bow(eprice_xml_products_1) %>%
      nod("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz") %>%
      rip()

But I get this error:


    trying URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz'
    Error in fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  : 
      cannot open URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz'
    In addition: Warning messages:
    1: In fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  :
      downloaded length 0 != reported length 334
    2: In fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  :
      cannot open URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz': HTTP status was '403 Forbidden'

If I open the link in my browser, the file download starts immediately.

What am I missing?

Andrea
  • I'm getting a message in Italian, `"Il sito è in aggiornamento, saremo di nuovo online alle ore 03.00."`, which translates to `"The site is being updated, we will be back online at 03.00."`. – Rui Barradas Aug 01 '23 at 20:43

1 Answer

The server rejects requests for this URL when the User-Agent request header does not look like a regular browser (Firefox, Chrome, ...). To make the download work, set your user agent to that of a browser. Below is an example that works with utils::download.file(). A similar strategy might be available for polite.

    # Set User Agent to current Firefox
    options(HTTPUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0")
    download.file("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz", "Sitemap_Elettrodomestici_1.xml.gz")

    # Load XML from file
    library(xml2)
    read_xml("Sitemap_Elettrodomestici_1.xml.gz")
    #> {xml_document}
    #> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    #>  [1] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
    #>  [2] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
    #>  [3] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
    #>  [4] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DHaier%2D ...
    #>  [5] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DMIDEA%2DFrigorif ...
    #>  [6] <url>\n  <loc>https://www.eprice.it/Accessori%2DFrigoriferi%2DELECTROLUX ...
    #>  [7] <url>\n  <loc>https://www.eprice.it/accessori%2DIMPERIA/d%2D1597166</loc ...
    #>  [8] <url>\n  <loc>https://www.eprice.it/accessori%2DIMPERIA/d%2D2489361</loc ...
    #>  [9] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DDe%20Longhi/d% ...
    #> [10] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELECTROLUX/d%2 ...
    #> [11] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELECTROLUX/d%2 ...
    #> [12] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELUX%20INC/d%2 ...
    #> [13] <url>\n  <loc>https://www.eprice.it/accessori%2DKENWOOD/d%2D5551714</loc ...
    #> [14] <url>\n  <loc>https://www.eprice.it/accessori%2DKENWOOD/d%2D7625838</loc ...
    #> [15] <url>\n  <loc>https://www.eprice.it/accessori%2DKitchenAid/d%2D50118434< ...
    #> [16] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
    #> [17] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
    #> [18] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
    #> [19] <url>\n  <loc>https://www.eprice.it/accessori%2Dmacchine%2Dcaffe%2DDE%20 ...
    #> [20] <url>\n  <loc>https://www.eprice.it/accessori%2Dmacchine%2Dcaffe%2DDE%20 ...
    #> ...
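
For polite itself, a minimal sketch of the same idea might look like the following. It relies on the `user_agent` argument of `bow()`, which `rip()` is expected to reuse for the download. The helper name `polite_download` is made up for illustration, and I have not verified that the site's robots.txt permits this path, so treat it as untested against this server.

```r
library(polite)

# Assumption: a browser-like user agent string (same one as in the example above).
browser_ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0"

# Hypothetical helper (not part of polite): introduce the session with a
# custom user agent, then download the file with rip().
polite_download <- function(url, destfile, user_agent = browser_ua) {
  session <- bow(url, user_agent = user_agent)
  # rip() saves into tempdir() by default and returns the full local path
  rip(session, destfile = destfile, overwrite = TRUE)
}

# Only run the network request in an interactive session
if (interactive()) {
  file_path <- polite_download(
    "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",
    "Sitemap_Elettrodomestici_1.xml.gz"
  )
  print(file_path)
}
```

Unlike `options(HTTPUserAgent = ...)`, this keeps the user agent attached to the session object instead of changing a global option by hand.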
Till
  • Thank you @Till. I tried a different user agent with the bow function, but I think I used an outdated one, since I still got the error. – Andrea Aug 01 '23 at 21:36
  • How do you check that the page blocks non-browser requests? @Till – Andrea Aug 02 '23 at 20:01
  • Getting HTTP status 403 from R while the same URL works in the browser usually means that non-browser user agents are being blocked. I don't think there is a way to determine this from the 403 response itself. – Till Aug 03 '23 at 14:34
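
One empirical way to test that guess is to request the same URL twice, once with the client's default user agent and once with a browser-like one, and compare the status codes. The sketch below uses the httr package for this. The helper name `status_with_ua` is made up for illustration, and a 403 versus 200 split only suggests user-agent filtering, since a server can also filter on other headers or on the IP address.

```r
library(httr)

url <- "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz"

# Assumption: a browser-like user agent string, as used in the answer.
browser_ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0"

# Hypothetical helper: fetch the URL and return the HTTP status code.
# With ua = NULL, httr sends its own default user agent.
status_with_ua <- function(url, ua = NULL) {
  resp <- if (is.null(ua)) GET(url) else GET(url, user_agent(ua))
  status_code(resp)
}

# Only run the network requests in an interactive session
if (interactive()) {
  print(c(
    default = status_with_ua(url),
    browser = status_with_ua(url, browser_ua)
  ))
}
```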