Java scraping website after async scripts are loaded

Question

Some background: I'm trying to give customers an option to add HTML directly and publish a single-page website (like Blogspot). This brought a scammer problem, so I created a microservice that blocks publishing a website based on its HTML content.

Initially I used Jsoup to get the HTML from the website. Now the scammers have adapted: they load a script from an external website, and it is loaded asynchronously: <script src="https://yolologroyopuedo.us/?api=1&lan=fbcacaroto" type="text/javascript" async="true"></script>

So the initially rendered HTML does not contain any scam content, and it evades the website blocking. I'm trying to scrape the website content after the script has loaded completely, or after some fixed time.

I tried the following, but I always get the HTML as it was before the malicious script loaded.

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

I also tried HtmlUnit:

        WebClient webClient = new WebClient();
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        HtmlPage page = webClient.getPage("http://example.com");

Is there an elegant way in Java to scrape a website after all scripts have loaded?

score 1 · Answer 1 · answered Feb 23 '22 at 06:08

The script you are talking about is executed in your browser. If you want to get the page as it is after the script has run:

- You can't use jsoup, because jsoup has no JavaScript support at all and therefore can't process the script.
- With HtmlUnit you have to enable JavaScript support and then possibly wait for the script to execute (e.g. webClient.waitForBackgroundJavaScript(timeoutMillis)). After that, the DOM tree of the page is updated and you can use the usual selectors to get what you want to know.
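
The HtmlUnit approach could be sketched roughly as below. This is a minimal sketch, not a definitive implementation: the class name RenderedPageFetcher, the method fetchRenderedHtml, and the 10-second wait are made up for illustration. It assumes HtmlUnit 3.x or later (package org.htmlunit); on 2.x the same classes live under com.gargoylesoftware.htmlunit. Whether a given third-party script runs correctly in HtmlUnit's JavaScript engine has to be checked case by case.

```java
import java.io.IOException;

import org.htmlunit.BrowserVersion;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

// Hypothetical helper: the class and method names are illustrative only.
public class RenderedPageFetcher {

    /**
     * Loads the page with JavaScript enabled, gives background scripts up to
     * waitMillis to finish, and returns the updated DOM serialized as XML.
     */
    public static String fetchRenderedHtml(String url, long waitMillis) throws IOException {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // JavaScript must be on, otherwise only the initial HTML is seen.
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            // Untrusted pages often contain broken scripts; keep going anyway.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);

            // Blocks until background JavaScript jobs are done or the timeout
            // expires; the return value is the number of jobs still pending.
            int pendingJobs = webClient.waitForBackgroundJavaScript(waitMillis);
            if (pendingJobs > 0) {
                System.err.println(pendingJobs + " JavaScript job(s) still pending after wait");
            }

            // Serializes the DOM as it is now, including script-made changes.
            return page.asXml();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "http://example.com";
        System.out.println(fetchRenderedHtml(url, 10_000));
    }
}
```

The returned string can then be passed to the same content checks that were applied to the Jsoup output. Instead of asXml(), selectors such as page.querySelectorAll(...) can also be used directly on the page object.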

If you still have problems, please open an HtmlUnit issue on GitHub and include the URL you are working with to give us a chance to reproduce your case.
