I am trying to get a text file from a URL. From the browser, it's fairly simple: I just have to "save as" from the URL and I get the file I want. At first, I had some trouble logging in using rvest (see [https://stackoverflow.com/questions/66352322/how-to-get-txt-file-from-password-protected-website-jsp-in-r][1]; I uploaded a couple of probably useful pictures there). When I use the following code:

fileurl <- "http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T"
session(fileurl)

I get the following (note how I am redirected to a different URL, as happens in the browser when you try to get to the fileurl without first logging in):

<session> http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T
  Status: 200
  Type:   text/html; charset=ISO-8859-1
  Size:   84

I managed to log in using the following code:

#Define URLs
loginurl <- "http://www1.bolsadecaracas.com/esp/usuarios/customize.jsp"
fileurl <- "http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T"

#Create session
pgsession <- session(loginurl)
pgform<-html_form(pgsession)[[1]]   #Get form

#Create a fake submit button as form does not have one
fake_submit_button <- list(name = NULL,
                           type = "submit",
                           value = NULL,
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"    
pgform[["fields"]][["submit"]] <- fake_submit_button

#Create and submit filled form
filled_form<-html_form_set(pgform, login="******", passwd="******")
session_submit(pgsession, filled_form)

#Jump to new url
loggedsession <- session_jump_to(pgsession, url = fileurl)

#Output
loggedsession
  

It seems to me that the login was successful, as the session output is exactly the same size as the .txt file when I download it, and I am no longer redirected. See the output:

<session> http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T
  Status: 200
  Type:   text/plain; charset=ISO-8859-1
  Size:   32193

However, whenever I try to extract the content of the session with read_html() or the like, I get the following error: "Error: Page doesn't appear to be html." I don't know if it has anything to do with the "Type: text/plain" of the session.

When I run

loggedsession[["response"]][["content"]]

I get

  [1] 0d 0a 0d 0a 0d 0a 0d 0a 0d 0a 7c 30 32 2f 30 31 2f 32 30 31 39 7c 52 7c 31 34 2c 39 30 7c 31 35 2c
  [34] 30 30 7c 31 37 2c 38 33 7c 31 33 2c 35 30 7c 39 7c 31 33 2e 35 33 33 7c 32 30 33 2e 30 36 30 2c 31
  [67] 39 7c 0a 7c 30 33 2f 30 31 2f 32 30 31 39 7c 52 7c 31 35 2c 30 30 7c 31 37 2c 39 38 7c 31 37 2c 39

Any help on how to extract the text file would be greatly appreciated.

PS: At one point, just playing with functions, I managed to get something that would have worked with httr::GET(fileurl). That was after playing with rvest functions and managing to log in. However, after closing and reopening RStudio, I was not able to get the same output with that function.

  • so you got back what looks like hexadecimal. Anything here: https://stackoverflow.com/questions/37404921/getting-binary-data-when-using-post-request-in-httr-package `rawToChar(as.raw(n))` - though that looks like a patch – QHarr Mar 12 '21 at 00:59
  • I believe you can use the httr content function to convert the data. Here is an example to look at: https://stackoverflow.com/questions/43459356/httr-csv-content-reading-as-integer-instead-of-double. There are better questions/answers out there if you look. – Dave2e Mar 12 '21 at 02:35
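
The `rawToChar()` idea from the comments can be sketched in base R as follows. The bytes are the first 30 values of the dump in the question, and the `iconv()` step assumes the body really is ISO-8859-1, as the session's `Type:` line reports. This only illustrates the decoding step; it is not a claim about how httr does it internally.

```r
# First 30 bytes of the response body, copied from the dump in the question:
# five CRLF pairs followed by the start of the first record.
bytes <- as.raw(c(
  0x0d, 0x0a, 0x0d, 0x0a, 0x0d, 0x0a, 0x0d, 0x0a, 0x0d, 0x0a,
  0x7c, 0x30, 0x32, 0x2f, 0x30, 0x31, 0x2f, 0x32, 0x30, 0x31, 0x39,
  0x7c, 0x52, 0x7c, 0x31, 0x34, 0x2c, 0x39, 0x30, 0x7c
))

# Turn the raw vector into a single character string.
txt <- rawToChar(bytes)

# Assumption: the body is ISO-8859-1, as declared by the server.
txt <- iconv(txt, from = "ISO-8859-1", to = "UTF-8")

# Drop the leading blank lines and show the decoded record fragment.
cat(trimws(txt), "\n")
```

With a live session, the same steps would presumably apply to `loggedsession[["response"]][["content"]]` in place of `bytes`.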

1 Answer

Because rvest uses the httr package internally, you can use httr and base R functions to save your file. The key to the solution is that your response (in httr terms) is stored in the session object:

library(rvest)
library(httr)

httr::content(loggedsession$response, as = "text") %>%
   cat(file = "your_file.txt")
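
Once you have the text, you may want to turn it into numbers. Here is a rough sketch using the one complete record that can be decoded from the byte dump in the question. It assumes the fields are pipe-delimited with a leading and trailing pipe, that dates are day/month/year, and that "." groups thousands while "," marks decimals. The helper `to_num()` and the variable names are mine, not part of any package.

```r
# One record, decoded from the byte dump in the question,
# preceded by blank lines as in the real file.
txt <- "\r\n\r\n|02/01/2019|R|14,90|15,00|17,83|13,50|9|13.533|203.060,19|\n"

# Split into lines and drop the empty ones at the top of the file.
lines <- strsplit(txt, "\r?\n")[[1]]
lines <- lines[nzchar(lines)]

# Split each line on "|"; the leading pipe yields an empty first field.
fields <- lapply(strsplit(lines, "|", fixed = TRUE), function(x) x[-1])

# Assumption: "." groups thousands and "," marks decimals.
to_num <- function(x) {
  as.numeric(gsub(",", ".", gsub(".", "", x, fixed = TRUE), fixed = TRUE))
}

# Character matrix with one row per record.
records <- do.call(rbind, fields)

# Assumption: the first column is a day/month/year date.
dates   <- as.Date(records[, 1], format = "%d/%m/%Y")
amounts <- to_num(records[, 9])
```

In practice `txt` would come from `httr::content(loggedsession$response, as = "text")`. What each column means is not stated anywhere in the question, so check it against the site before naming the columns.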

More generally, if your file were binary (e.g. a zip archive), you would have to request the raw bytes and write them with writeBin():

library(rvest)
library(httr)

httr::content(loggedsession$response, as = "raw") %>%
    writeBin(con = 'your_file.zip')
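
To see why `writeBin()` is the right tool for raw content, here is a small offline round trip in base R. The `payload` is a stand-in for what `httr::content(..., as = "raw")` would return (just the four signature bytes that start a zip local file header), and the temporary path is hypothetical.

```r
# Stand-in for a raw response body: the zip local file header signature "PK\003\004".
payload <- as.raw(c(0x50, 0x4b, 0x03, 0x04))

# Hypothetical destination; a temp file keeps the example side-effect free.
path <- tempfile(fileext = ".zip")

# Write the bytes exactly as they are, with no text encoding involved.
writeBin(payload, con = path)

# Read them back to check that nothing was altered on the way to disk.
restored <- readBin(path, what = "raw", n = file.size(path))
identical(payload, restored)
```

Writing the same bytes through `cat()` would go through character conversion first, which is fine for the .txt file here but not safe for binary data.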
ekotov