I am trying to write a web crawler in Java that collects the links to every language version of a website. Most multilingual websites expose their other languages as plain links matching li > a[href],
and those are easy to crawl. But how can I get the links from sites where the language switcher is a drop-down menu that calls a JavaScript function to list the items?
Examples: https://www.alphagalileo.org, https://www.flickr.com/
EDIT: Code to get the links that are of the form li > a[href]:
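import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// REGEX contains the codes and names of all the available languages
String REGEX = "^(EN|en|ENGLISH|English|english|...|AR|ar|Arabic)$";
// URL (the parent site) and langLinks (the result list) are declared elsewhere
Document document = Jsoup.connect(URL).get();
Elements linksOnPage = document.select("li > a[href]");
for (Element page : linksOnPage) {
    // Keep only links whose visible text is a language code or name
    if (page.text().matches(REGEX)) {
        ArrayList<String> temporary = new ArrayList<>();
        temporary.add(URL);                   // URL of the parent site
        temporary.add(page.text());           // Language
        temporary.add(page.attr("abs:href")); // URL of the language site
        // Skip entries that were already collected
        if (!langLinks.contains(temporary)) langLinks.add(temporary);
    }
}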
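
To make the filtering step easier to test without hitting a live site, here is a minimal sketch of the same match-and-deduplicate logic using only the standard library. It does not solve the JavaScript drop-down case. The class name LanguageLinkFilter, the helper names, the example.com URLs, and the shortened label list are all hypothetical; the real REGEX covers many more languages. It swaps the explicit upper/lower-case alternatives for Pattern.CASE_INSENSITIVE and the contains() check for a LinkedHashSet, which is a design choice of the sketch rather than something the code above does.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class LanguageLinkFilter {
    // Shortened, hypothetical label list; the real pattern would list every language.
    private static final Pattern LANGUAGE_LABEL =
            Pattern.compile("^(en|english|ar|arabic)$", Pattern.CASE_INSENSITIVE);

    // True when the visible link text is a known language code or name.
    static boolean isLanguageLabel(String linkText) {
        return LANGUAGE_LABEL.matcher(linkText.trim()).matches();
    }

    // Builds {parentUrl, label, href} entries for matching links.
    // Each input array is {link text, absolute href}; the set drops exact duplicates
    // while preserving the order in which links were seen.
    static Set<List<String>> collect(String parentUrl, List<String[]> labelAndHref) {
        Set<List<String>> langLinks = new LinkedHashSet<>();
        for (String[] link : labelAndHref) {
            if (isLanguageLabel(link[0])) {
                langLinks.add(List.of(parentUrl, link[0], link[1]));
            }
        }
        return langLinks;
    }

    public static void main(String[] args) {
        List<String[]> links = new ArrayList<>();
        links.add(new String[] {"English", "https://example.com/en"});
        links.add(new String[] {"Contact", "https://example.com/contact"});
        links.add(new String[] {"ar", "https://example.com/ar"});
        links.add(new String[] {"English", "https://example.com/en"}); // duplicate

        for (List<String> entry : collect("https://example.com", links)) {
            System.out.println(entry);
        }
    }
}
```

With this input the Contact link is filtered out and the repeated English link is stored only once. In the crawler, the {link text, href} pairs would come from page.text() and page.attr("abs:href") as in the code above.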