I am trying to write a web crawler in Java that collects the links to every language version of a website. Most multilingual websites expose those links as plain li > a[href] elements, which are easy to crawl. But how can I get the links from sites whose language drop-down menu is populated by a JavaScript function?
Examples: https://www.alphagalileo.org, https://www.flickr.com/

EDIT: Code I use to get the links that are in the form li > a[href]:

// REGEX lists the codes and names of all the supported languages
String REGEX = "^(EN|en|ENGLISH|English|english|...|AR|ar|Arabic)$";

// Fetch the page and select every anchor that is a direct child of a list item
Document document = Jsoup.connect(URL).get();
Elements linksOnPage = document.select("li > a[href]");

for (Element link : linksOnPage) {
    // Keep only anchors whose visible text is exactly a language code or name
    if (link.text().matches(REGEX)) {
        ArrayList<String> temporary = new ArrayList<>();
        temporary.add(URL);                   // URL of the parent site
        temporary.add(link.text());           // Language
        temporary.add(link.attr("abs:href")); // Absolute URL of the language site
        // langLinks is a collection declared elsewhere; skip entries already stored
        if (!langLinks.contains(temporary)) langLinks.add(temporary);
    }
}
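For reference, the matching and de-duplication step above can be sketched without Jsoup, so it runs on its own. This is only a minimal sketch under assumptions: the class name LanguageLinkFilter, the method filter, the shortened label list, and the example.com URLs are all hypothetical and not part of my crawler. It uses a case-insensitive pattern instead of listing every capitalization, and a LinkedHashSet instead of the contains() check.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class LanguageLinkFilter {
    // Hypothetical, shortened label list; the real pattern would cover every language.
    static final Pattern LANGUAGE =
        Pattern.compile("^(en|english|fr|french|ar|arabic)$", Pattern.CASE_INSENSITIVE);

    // Each input row is {linkText, absoluteHref}.
    // Returns unique {parentUrl, language, href} rows in first-seen order.
    static List<List<String>> filter(String parentUrl, List<String[]> links) {
        Set<List<String>> unique = new LinkedHashSet<>();
        for (String[] link : links) {
            String text = link[0].trim();
            if (LANGUAGE.matcher(text).matches()) {
                // List.of(...) compares by value, so the set drops duplicate rows
                unique.add(List.of(parentUrl, text, link[1]));
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Stand-in data for the text and abs:href of each anchor on a page
        List<String[]> links = List.of(
            new String[] {"English", "https://example.com/en"},
            new String[] {"About us", "https://example.com/about"},
            new String[] {"English", "https://example.com/en"},
            new String[] {"ar", "https://example.com/ar"});
        for (List<String> row : filter("https://example.com", links)) {
            System.out.println(row);
        }
    }
}
```

With Jsoup, each row would be built from the text() and attr("abs:href") of an anchor. This still only covers links that are present in the static HTML, not the JavaScript-generated menus I am asking about.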
  • Java or JavaScript? – VLAZ Jun 12 '19 at 10:05
  • @VLAZ I am trying to code it in Java. – Arihant Jain Jun 12 '19 at 10:06
  • So, is this related to JavaScript, then? – VLAZ Jun 12 '19 at 10:07
  • Yes. On some websites, the items in the drop-down menu are generated by JavaScript functions. – Arihant Jain Jun 12 '19 at 10:08
  • What do you have so far? How are you crawling these pages? Do you use any libraries to do that? How are you executing JS? Have you tried *not* executing JS but using something like Selenium to drive all interactions with the page? – VLAZ Jun 12 '19 at 10:10
  • I edited the question to show how I get the links from other websites. I am looking for ways to execute JS from Java code. I also tried Selenium, but it's too slow and doesn't cover all the possible cases. – Arihant Jain Jun 12 '19 at 10:19
