My aim is to load a document from a web server and then parse its DOM for specific content. Loading the DOM is my problem.

I am trying to use a javafx.scene.web.WebEngine, as it seems able to handle all the necessary mechanics, including JavaScript execution, which may affect the final DOM.

When loading a document, it appears to get stuck in the RUNNING state and never reaches the SUCCEEDED state, which I believe is required before accessing the DOM from WebEngine.getDocument().

This occurs whether loading from a URL or literal content (as used in this minimal example).

Can anyone see what I’m doing wrong, or misunderstanding?

Thanks in advance for any help.

import java.util.concurrent.ExecutionException;
import org.w3c.dom.Document;
import javafx.application.Platform;
import javafx.concurrent.Task;
import javafx.concurrent.Worker;
import javafx.concurrent.Worker.State;
import javafx.embed.swing.JFXPanel;
import javafx.scene.web.WebEngine;

public class WebEngineProblem {
    private static Task<WebEngine> getEngineTask() {
        Task<WebEngine> task = new Task<>() {
            @Override
            protected WebEngine call() throws Exception {
                WebEngine webEngine = new WebEngine();
                final Worker<Void> loadWorker = webEngine.getLoadWorker();
                loadWorker.stateProperty().addListener((obs, oldValue, newValue) -> {
                    System.out.println("state:" + newValue);
                    if (newValue == State.SUCCEEDED) {
                        System.out.println("finished loading");
                    }    
                });
                webEngine.loadContent("<!DOCTYPE html>\r\n" + "<html>\r\n" + "<head>\r\n" + "<meta charset=\"UTF-8\">\r\n"
                    + "<title>Content Title</title>\r\n" + "</head>\r\n" + "<body>\r\n" + "<p>Body</p>\r\n" + "</body>\r\n"
                    + "</html>\r\n");
                State priorState = State.CANCELLED; //should never be CANCELLED
                double priorWork = Double.NaN;
                while (loadWorker.isRunning()) {
                    final double workDone = loadWorker.getWorkDone();
                    if (loadWorker.getState() != priorState || priorWork != workDone) {
                        priorState = loadWorker.stateProperty().getValue();
                        priorWork = workDone;
                        System.out.println(priorState + " " + priorWork + "/" + loadWorker.getTotalWork());
                    }
                    Thread.sleep(1000);
                }
                return webEngine;
            }
        };
        return task;
    }

    public static void main(String[] args) {
        new JFXPanel(); // Initialise the JavaFx Platform
        WebEngine engine = null;
        Task<WebEngine> task = getEngineTask();
        try {
            Platform.runLater(task);
            Thread.sleep(1000); 
            engine = task.get(); // Never completes as always RUNNING
        }
        catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        }
        // This code is never reached as the content never completes loading
        // It would fail as it's not on the FX thread.
        Document doc = engine.getDocument();
        String content = doc.getTextContent();
        System.out.println(content);
    }

}
Dragonthoughts
  • Tasks are designed to be run on a background thread: you are running this task on the FX Application thread. The state change to `SUCCEEDED` also has to happen on the FX Application thread, so it can't change state until your task completes. Since your while loop won't complete until after the `loadWorker` moves out of the `RUNNING` state, you effectively have a weird form of deadlock. – James_D Dec 28 '17 at 17:54
  • Surely, the `Platform.runLater()` call is the way of forcing the task to run on the FX Application thread, but the worker thread is separate and distinct? webEngine.loadContent() returns immediately, so the loading must be occurring on a separate worker thread. "Loading always happens on a background thread. Methods that initiate loading return immediately after scheduling a background job. To track progress and/or cancel a job, use the Worker instance available from the getLoadWorker() method." from https://docs.oracle.com/javase/8/javafx/api/javafx/scene/web/WebEngine.html – Dragonthoughts Dec 28 '17 at 19:08
  • The worker thread is separate and distinct, but the actual change to the worker's `stateProperty()` has to happen on the FX Application Thread. (Basically these properties are all single-threaded.) So in the worker thread's implementation, somewhere there is a call to `Platform.runLater(...)` that updates the state. That call can't actually happen if you have blocked the FX Application Thread. (Basically, you should never block the FX Application Thread, even if you are running in a "headless" mode.) – James_D Dec 28 '17 at 19:11

1 Answer

The change to a Worker's state property will occur on the FX Application Thread, even though that worker is running on a background thread. (JavaFX properties are essentially single-threaded.) Somewhere in the implementation of the thread that loads the web engine's content, there is a call to Platform.runLater(...) that changes the state of the worker.

Since your task blocks until the state of the worker has changed, and since you make your task run on the FX Application Thread, you have essentially deadlocked the FX Application Thread: the change to the load worker's state can't occur until your task completes (because it is running on the same thread), and your task can't complete until the state changes (as that's what you programmed the task to do).
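The same pattern can be reproduced without JavaFX, which may make the cause easier to see. The following is only a sketch under an assumption: a single-thread executor stands in for the FX Application Thread, and the class and method names are hypothetical, not part of any JavaFX API. A task running on that one thread queues a "state change" for the same thread and then waits for it. A timeout is used so the demonstration ends instead of hanging as the original code does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class SingleThreadDeadlockSketch {

    // Returns true only if the queued "state change" ran while the task was still waiting.
    static boolean stateChangeSeenWhileBlocking(long timeoutMillis) {
        // Assumption: one thread that runs queued work in order, standing in for the FX Application Thread.
        ExecutorService uiThread = Executors.newSingleThreadExecutor();
        CountDownLatch stateChanged = new CountDownLatch(1);
        try {
            Future<Boolean> blockingTask = uiThread.submit(() -> {
                // Stand-in for the loader posting its state update back to the UI thread.
                uiThread.execute(stateChanged::countDown);
                // The update is queued behind this task, so it cannot run until this task returns.
                return stateChanged.await(timeoutMillis, TimeUnit.MILLISECONDS);
            });
            return blockingTask.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            uiThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // prints "state change seen while blocking: false"
        System.out.println("state change seen while blocking: " + stateChangeSeenWhileBlocking(500));
    }
}
```

With an unbounded wait in place of the timeout, as in the while loop in the question, the task would never return.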

It is basically always an error to block the FX Application Thread. Instead, you should block another thread until the conditions you want are true (web engine is created and loading thread completes), and then execute the next thing you want to do when that occurs (using Platform.runLater(...) again if it needs to be executed on the FX Application Thread).
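The shape of that fix can be sketched in plain Java before looking at the JavaFX version. Again this is an illustration under assumptions: the single-thread executor is a stand-in for the FX Application Thread, the extra thread is a stand-in for the web engine's loader, and all names are hypothetical. The calling thread does the waiting, and the single "UI" thread is only ever given short, non-blocking jobs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class WaitOffThreadSketch {

    static String loadAndRead(long timeoutMillis) {
        // Assumption: stand-in for the FX Application Thread.
        ExecutorService uiThread = Executors.newSingleThreadExecutor();
        CountDownLatch loaded = new CountDownLatch(1);
        AtomicReference<String> document = new AtomicReference<>();
        try {
            // Step 1: start the "load" from the UI thread and return immediately.
            uiThread.submit(() -> {
                // Stand-in for the loader: a background thread that posts its result to the UI thread.
                Thread loader = new Thread(() -> uiThread.execute(() -> {
                    document.set("Body");
                    loaded.countDown();
                }));
                loader.start();
            }).get();

            // Step 2: block the calling thread, not the UI thread, until loading is done.
            if (!loaded.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("load did not finish in time");
            }

            // Step 3: hop back onto the UI thread to read the result.
            Callable<String> readDocument = document::get;
            return uiThread.submit(readDocument).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            uiThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(loadAndRead(2000)); // prints "Body"
    }
}
```

Waiting with a timeout is a design choice of this sketch; an unbounded wait would hang if the load never reached the expected state.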

Here is an example doing what I think you are trying to do:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

import org.w3c.dom.Document;

import javafx.application.Platform;
import javafx.concurrent.Worker;
import javafx.concurrent.Worker.State;
import javafx.embed.swing.JFXPanel;
import javafx.scene.web.WebEngine;

public class WebEngineProblem {

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        new JFXPanel(); // Initialise the JavaFx Platform

        CountDownLatch loaded = new CountDownLatch(1);

        FutureTask<WebEngine> createEngineTask = new FutureTask<WebEngine>( () -> {
            WebEngine webEngine = new WebEngine();
            final Worker<Void> loadWorker = webEngine.getLoadWorker();
            loadWorker.stateProperty().addListener((obs, oldValue, newValue) -> {
                System.out.println("state:" + newValue);
                if (newValue == State.SUCCEEDED) {
                    System.out.println("finished loading");
                    loaded.countDown();
                }    
            });
            webEngine.loadContent("<!DOCTYPE html>\r\n" + "<html>\r\n" + "<head>\r\n" + "<meta charset=\"UTF-8\">\r\n"
                + "<title>Content Title</title>\r\n" + "</head>\r\n" + "<body>\r\n" + "<p>Body</p>\r\n" + "</body>\r\n"
                + "</html>\r\n");
            return webEngine ;
        });

        Platform.runLater(createEngineTask);
        WebEngine engine = createEngineTask.get();
        loaded.await();

        Platform.runLater(() -> {
            Document doc = engine.getDocument();
            String content = doc.getDocumentElement().getTextContent();
            System.out.println(content);
        });
    }

}
James_D