
I am learning text mining using R. I am trying to find all the links in an HTML document.

I tried getHTMLLinks(), but it returns an empty vector with the following warning:

url = "https://elections.maryland.gov/elections/2012/election_data/index.html"
getHTMLLinks(url)

character(0)
Warning message:
XML content does not seem to be XML: 'https://elections.maryland.gov/elections/2012/election_data/index.html' 
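That warning typically means the XML package was handed the URL string itself rather than the page content: getHTMLLinks() cannot fetch https pages on its own. A common workaround (a sketch, assuming the RCurl package is available) is to download the raw HTML first and pass the text to the parser:

```r
library(XML)
library(RCurl)

url <- "https://elections.maryland.gov/elections/2012/election_data/index.html"

# Download the page over https ourselves, then hand the HTML text
# (not the URL) to getHTMLLinks()
html <- getURL(url)
links <- getHTMLLinks(html)
head(links)
```

Note that this still returns the hrefs exactly as they appear in the page, so relative links come back as bare file names, the same as with rvest below.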

So I tried the "rvest" package to find the links. The code is as follows:

library(rvest)

links = xml2::read_html(url) %>% # read the html page
  html_nodes("a") %>%            # select all <a> nodes
  html_attr("href") %>%          # from each node, extract the href attribute
  .[grep("general.csv", ., ignore.case = T)] # keep only the *_General.csv links

It gives all the links as a character vector.

head(links)

"State_Congressional_Districts_2012_General.csv" "State_Legislative_Districts_2012_General.csv"  
[3] "All_By_Precinct_2012_General.csv"               "Allegany_County_2012_General.csv"              
[5] "Allegany_By_Precinct_2012_General.csv"          "Anne_Arundel_County_2012_General.csv" 

All of these are just the file names listed in the href attribute, but on the page each one is actually a hyperlink to a table.

It would be really great if anyone could help me with how to extract the full links instead of just the names of these hyperlinks.

  • If you want the full url leading to each of those files, affix the original url to the elements of that vector: `paste("https://elections.maryland.gov/elections/2012/election_data", links, sep = "/")` – paqmo Apr 14 '20 at 12:15
  • thanks for the workaround. But it might be possible that a table in web page is linked to different web site. – Rohit parihar Apr 14 '20 at 12:21
  • if that is the case, it will show the entire url--try this, for example: `read_html("https://en.wikipedia.org/wiki/Statistics") %>% html_nodes("a") %>% html_attr("href")` – paqmo Apr 14 '20 at 12:24
  • I am new to HTML so I don't know this concept. Thanks for the help. – Rohit parihar Apr 14 '20 at 12:34
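Rather than pasting the base URL on manually, the paqmo's approach can be generalized with xml2::url_absolute(), which resolves relative hrefs against the page's URL and leaves already-absolute links untouched. A self-contained sketch (the inline page below is a made-up stand-in for the election_data index, which uses relative hrefs):

```r
library(xml2)
library(rvest)

# A tiny inline page standing in for the real index (assumption: one
# relative href, as on the Maryland page, plus one absolute external link)
page <- read_html('<html><body>
  <a href="Allegany_County_2012_General.csv">Allegany</a>
  <a href="https://example.com/external.csv">External</a>
</body></html>')

hrefs <- page %>% html_nodes("a") %>% html_attr("href")

# url_absolute() resolves each href against the base URL; absolute
# links pass through unchanged, so cross-site links are handled too
base <- "https://elections.maryland.gov/elections/2012/election_data/index.html"
full <- xml2::url_absolute(hrefs, base)
full
#> [1] "https://elections.maryland.gov/elections/2012/election_data/Allegany_County_2012_General.csv"
#> [2] "https://example.com/external.csv"
```

With the real page you would replace the inline HTML with `read_html(url)` and get a fully qualified download link for every CSV.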

0 Answers