I am learning text mining with R and am trying to find all the links in an HTML document.
I tried getHTMLLinks() from the XML package, but it returns an empty result with the following warning:
library(XML) # provides getHTMLLinks()
url = "https://elections.maryland.gov/elections/2012/election_data/index.html"
getHTMLLinks(url)
character(0)
Warning message:
XML content does not seem to be XML: 'https://elections.maryland.gov/elections/2012/election_data/index.html'
So I tried the rvest package to find the links. The code is as follows:
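library(rvest) # provides html_nodes(), html_attr() and the %>% pipe
links = xml2::read_html(url) %>% # read and parse the HTML page
  html_nodes("a") %>% # select all <a> nodes
  html_attr("href") %>% # extract the href attribute of each <a> node
  .[grep("general.csv", ., ignore.case = TRUE)] # keep only hrefs matching "general.csv"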
It returns all the matching links as a character vector.
head(links)
[1] "State_Congressional_Districts_2012_General.csv" "State_Legislative_Districts_2012_General.csv"
[3] "All_By_Precinct_2012_General.csv" "Allegany_County_2012_General.csv"
[5] "Allegany_By_Precinct_2012_General.csv" "Anne_Arundel_County_2012_General.csv"
These are just the file names listed in the href attributes, but each one is actually a hyperlink to a table.
How can I extract the full URLs instead of just the names of these hyperlinks?
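
For reference, here is a minimal sketch of one possible direction, assuming the href values are relative to the page URL (which the bare file names in the output suggest, but which I have not confirmed for every link). It uses xml2::url_absolute() to resolve relative hrefs against a base URL. The names base_url, relative_links and full_links are only illustrative, and the two hrefs are copied from the output above so the snippet runs without fetching the page.

```r
library(xml2)

# Page the hrefs were scraped from (same URL as in the question).
base_url <- "https://elections.maryland.gov/elections/2012/election_data/index.html"

# Two of the relative hrefs from the head(links) output (illustrative subset).
relative_links <- c(
  "State_Congressional_Districts_2012_General.csv",
  "Allegany_County_2012_General.csv"
)

# Resolve each relative href against the page URL.
# Assumption: the hrefs are relative to base_url.
full_links <- url_absolute(relative_links, base_url)
print(full_links)
```

If that assumption holds, the same call could presumably be applied to the whole vector, e.g. url_absolute(links, url).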