I am trying to get a text file from a URL. From the browser, it's fairly simple: I just have to "save as" from the URL and I get the file I want. At first, I had some trouble logging in using rvest (see [https://stackoverflow.com/questions/66352322/how-to-get-txt-file-from-password-protected-website-jsp-in-r][1]; I uploaded a couple of probably useful pictures there). When I use the following code:
fileurl <- "http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T"
session(fileurl)
I get the following (note how I am redirected to a different URL, as happens in the browser when you try to get to the fileurl without first logging in):
<session> http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T
Status: 200
Type: text/html; charset=ISO-8859-1
Size: 84
I managed to log in using the following code:
library(rvest)
# Define URLs
loginurl <- "http://www1.bolsadecaracas.com/esp/usuarios/customize.jsp"
fileurl <- "http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T"
# Create session and get the first form on the login page
pgsession <- session(loginurl)
pgform <- html_form(pgsession)[[1]]
# Create a fake submit button, as the form does not have one
fake_submit_button <- list(name = NULL,
                           type = "submit",
                           value = NULL,
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"
pgform[["fields"]][["submit"]] <- fake_submit_button
# Fill in the credentials and submit the form
filled_form <- html_form_set(pgform, login = "******", passwd = "******")
session_submit(pgsession, filled_form)
# Jump to the file URL within the logged-in session
loggedsession <- session_jump_to(pgsession, url = fileurl)
# Output
loggedsession
It seems to me that the login was successful, as the session output is exactly the same size as the .txt file when I download it from the browser, and I am no longer redirected. See the output:
<session> http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T
Status: 200
Type: text/plain; charset=ISO-8859-1
Size: 32193
However, whenever I try to extract the content of the session with read_html() or the like, I get the following error: "Error: Page doesn't appear to be html.". I don't know if it has anything to do with the "Type: text/plain" of the session.
When I run
loggedsession[["response"]][["content"]]
I get
[1] 0d 0a 0d 0a 0d 0a 0d 0a 0d 0a 7c 30 32 2f 30 31 2f 32 30 31 39 7c 52 7c 31 34 2c 39 30 7c 31 35 2c
[34] 30 30 7c 31 37 2c 38 33 7c 31 33 2c 35 30 7c 39 7c 31 33 2e 35 33 33 7c 32 30 33 2e 30 36 30 2c 31
[67] 39 7c 0a 7c 30 33 2f 30 31 2f 32 30 31 39 7c 52 7c 31 35 2c 30 30 7c 31 37 2c 39 38 7c 31 37 2c 39
Any help on how to extract the text file would be greatly appreciated.
PS: At one point, just playing with functions, I managed to get something that would have worked with httr::GET(fileurl). That was after playing with rvest functions and managing to log in. However, after closing and reopening RStudio, I was not able to get the same output with that function.
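For reference, the bytes printed above look like plain pipe-delimited text (0d 0a is a carriage return plus line feed, 7c is "|"), so decoding the raw vector rather than parsing it as HTML may be all that is needed. Below is a minimal sketch under that assumption. `sample_text` and `raw_bytes` are hypothetical stand-ins that rebuild only the first record shown above so the snippet runs without a login; with the live session, `raw_bytes` would presumably be `loggedsession[["response"]][["content"]]`, and something like `httr::content(loggedsession[["response"]], as = "text", encoding = "ISO-8859-1")` might give the same string directly. The output file name is also made up.

```r
# Hypothetical stand-in for loggedsession[["response"]][["content"]]:
# the leading blank lines and first record from the output above.
sample_text <- paste0(
  strrep("\r\n", 5),
  "|02/01/2019|R|14,90|15,00|17,83|13,50|9|13.533|203.060,19|\n"
)
raw_bytes <- charToRaw(sample_text)

# Decode the raw vector into a single string. The session reports
# ISO-8859-1, so convert from that encoding to UTF-8.
txt <- rawToChar(raw_bytes)
txt <- iconv(txt, from = "ISO-8859-1", to = "UTF-8")

# Split into lines, drop the blank ones, then split on the pipe delimiter.
lines <- strsplit(txt, "\r?\n")[[1]]
lines <- lines[nzchar(lines)]
fields <- strsplit(lines, "|", fixed = TRUE)

# Each record starts with "|", so its first field is empty; drop it.
records <- lapply(fields, function(x) x[-1])
print(records[[1]])

# To save the text as a file instead (hypothetical file name):
# writeLines(txt, "BNC.txt")
```

The numbers are left as character strings here because they appear to use a decimal comma and a dot as thousands separator, which would need explicit handling before converting to numeric.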