I am using the rvest R package to scrape a PDF file from this webpage, but the final link (a "bitstream" URL, whatever that is) only appears after I click the link named AC1-96-21-01-2011.pdf. The PDF itself is tucked away behind that link, which defeats my attempts with rvest's read_html(), since the file opens only after clicking the previous link (its href). Here is the XML node that does not let me reach the PDF:

<a href="/judgments/handle/123456789/701">Arbitration Case - AC</a>

The final file is at this URL, which is not exposed in the href of that node: http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf

In summary: how do I use rvest to get the PDF link when it is not in the href attribute, as explained above?

I tried searching for "bitstream", but it leads me to something else.

Lazarus Thurston

1 Answer

You're looking at the wrong node I think:

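library(rvest)

# Page that lists the file (handle 123456789/563560).
pdf_url <- "http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560" %>%
  read_html() %>%
  # The file links are the <a target="_blank"> anchors inside table cells.
  html_nodes(xpath = "//td/a[@target='_blank']") %>%
  html_attr("href") %>%
  unique() %>%
  # Keep only the hrefs that contain ".pdf".
  {grep("[.]pdf", ., value = TRUE)} %>%
  # The hrefs are site-relative, so prepend the host.
  paste0("http://judgmenthck.kar.nic.in", .)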

print(pdf_url)
# [1] "http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf"
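
If you would rather not depend on the `target` attribute, a variant is to match on the href itself and let xml2 resolve the relative path. This is only a minimal offline sketch: `page_html` is a made-up fragment standing in for the real page (whose markup may differ), and `base_url` and `pdf_links` are illustrative names.

```r
library(rvest)  # read_html(), html_nodes(), html_attr(), and the %>% pipe
library(xml2)   # url_absolute()

# Hypothetical stand-in for the item page; the real page's markup may differ.
page_html <- '<html><body><table>
  <tr><td><a href="/judgments/handle/123456789/701">Arbitration Case - AC</a></td></tr>
  <tr><td><a target="_blank" href="/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf">AC1-96-21-01-2011.pdf</a></td></tr>
  <tr><td><a target="_blank" href="/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf">View/Open</a></td></tr>
</table></body></html>'

# Address the fragment is assumed to have been fetched from.
base_url <- "http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560"

pdf_links <- read_html(page_html) %>%
  # Select every anchor whose href contains ".pdf", whatever its target is.
  html_nodes(xpath = "//a[contains(@href, '.pdf')]") %>%
  html_attr("href") %>%
  unique() %>%
  # Resolve the site-relative hrefs against the page URL.
  url_absolute(base = base_url)

print(pdf_links)
```

On this fragment the result should be the same bitstream URL as above. `url_absolute()` saves hard-coding the host in `paste0()`, which helps if some hrefs on a page are already absolute.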
Allan Cameron
  • Aha, I learnt about the `target="_blank"` attribute value today. Was always filtering nodes on `href` attribute and hence missing the actual link. Thanks a lot, I will accept your answer @Allan. – Lazarus Thurston Jan 15 '20 at 11:49
  • Could you recommend a suitable tutorial on `xpath`? I tried a few but they stop after giving elementary examples. – Lazarus Thurston Jan 15 '20 at 12:05
  • Hi Lazarus. I think I started with the simple tutorial on W3Schools at `https://www.w3schools.com/xml/xpath_intro.asp` and built up knowledge as needed from Stack Overflow. If you type `[xpath]` into the search bar at the top of the Stack Overflow page, you'll find some great questions and answers. – Allan Cameron Jan 15 '20 at 13:23