I'm trying to scrape a series of websites that run a lot of JavaScript on the DOM before the page finishes loading, so I'm using a WebBrowser instead of the friendlier WebClient. I want to wait until the WebBrowser.DocumentCompleted event fires and then return WebBrowser.Document, so that I can post-process the HtmlDocument. So far I cannot get the function to return the document.

The Code I Have

let downloadWebSite (address : string) = 
    let browser = new WebBrowser()
    let browserContext = SynchronizationContext()
    browser.DocumentCompleted.Add (fun _ ->
        printfn "Document Loaded")

    async {
        do browser.Navigate(address)
        let! a = Async.AwaitEvent browser.DocumentCompleted
        do! Async.SwitchToContext(browserContext)
        return browser.Document
    }


[downloadWebSite "https://www.google.com"]
|> Async.Parallel // there will be more addresses when working
|> Async.RunSynchronously

The Error

System.InvalidCastException: Specified cast is not valid.
   at System.Windows.Forms.UnsafeNativeMethods.IHTMLDocument2.GetLocation()
   at System.Windows.Forms.WebBrowser.get_Document()
   at FSI_0058.downloadWebSite@209-41.Invoke(Unit _arg2) in C:\Temp\Untitled-1.fsx:line 209
   at Microsoft.FSharp.Control.AsyncPrimitives.CallThenInvokeNoHijackCheck[a,b](AsyncActivation`1 ctxt, FSharpFunc`2 userCode, b result1)
   at Microsoft.FSharp.Control.Trampoline.Execute(FSharpFunc`2 firstAction)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.FSharp.Control.AsyncResult`1.Commit()
   at Microsoft.FSharp.Control.AsyncPrimitives.RunSynchronouslyInAnotherThread[a](CancellationToken token, FSharpAsync`1 computation, FSharpOption`1 timeout)
   at Microsoft.FSharp.Control.AsyncPrimitives.RunSynchronously[T](CancellationToken cancellationToken, FSharpAsync`1 computation, FSharpOption`1 timeout)
   at Microsoft.FSharp.Control.FSharpAsync.RunSynchronously[T](FSharpAsync`1 computation, FSharpOption`1 timeout, FSharpOption`1 cancellationToken)
   at <StartupCode$FSI_0058>.$FSI_0058.main@()
Stopped due to error

What I think is happening

There are several issues that make me believe that I'm accessing the WebBrowser from the wrong thread.

Help requested

  • Is the use of Async.SwitchToContext(browserContext) correct here?
  • Could the overall approach be simplified?
  • Is there a concept I appear ignorant of?
  • How do I get the WebBrowser.Document?
jks612

1 Answer

The problem is in this line:

let browserContext = SynchronizationContext()

You created a new SynchronizationContext instance by hand, but it is not associated with the UI thread or with any other thread. Switching to it therefore does not move you onto the UI thread, and the program crashes when you access browser.Document, which must be accessed on the UI thread.
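
To see why a hand-made context does not help, here is a minimal sketch that needs no WebBrowser at all. It relies on the documented default behavior of the base SynchronizationContext class, whose Post method queues the callback on the thread pool. The names unboundContext and ranOnThreadPool are only illustrative:

```fsharp
open System.Threading

// A SynchronizationContext built with its constructor is not bound to any
// thread: the base class's Post simply queues work on the thread pool.
let unboundContext = SynchronizationContext()

let ranOnThreadPool =
    async {
        // Same call as in the question, against the hand-made context.
        do! Async.SwitchToContext unboundContext
        // Report where the continuation actually ended up.
        return Thread.CurrentThread.IsThreadPoolThread
    }
    |> Async.RunSynchronously

printfn "Continuation ran on a thread-pool thread: %b" ranOnThreadPool
```

If this reports true, as the base-class behavior suggests it should, the code after Async.SwitchToContext is running on a pool thread rather than on the thread that owns the browser.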

To solve this problem, simply use the existing SynchronizationContext which was already associated with the UI thread:

let browserContext = SynchronizationContext.Current

I assumed that the downloadWebSite function is called on the UI thread. If it is not, you can pass the context from somewhere into the function, or use a global variable.

A better design

Async.SwitchToContext ensures that the next line accesses and returns the document on the UI thread, but the client code that receives the document may still run on a non-UI thread. A better design is to use a continuation function: instead of returning the document directly, downloadWebSite returns the value produced by a continuation function passed in as a parameter. This way, the continuation function is guaranteed to run on the UI thread:

let downloadWebSite (address : string) cont =
    let browser = new WebBrowser()
    let browserContext = SynchronizationContext.Current
    browser.DocumentCompleted.Add (fun _ ->
        printfn "Document Loaded")

    async {
        do browser.Navigate(address)
        let! a = Async.AwaitEvent browser.DocumentCompleted
        do! Async.SwitchToContext(browserContext)
        // the cont function is ensured to be run on UI thread:
        return cont browser.Document }

[downloadWebSite "https://www.google.com" (fun document -> (* safely access document here *) document.Title)]
|> Async.Parallel
|> Async.RunSynchronously
Nghia Bui
  • Did you get this to return? I put a print line below the `let!` statement and the print goes to console but it doesn't return. I think my machine is hanging on the `do! Async.SwitchToContext` line. – jks612 Nov 19 '18 at 18:52
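
On the hang described in the comment above: one plausible cause, assuming the code is started from the UI thread, is that Async.RunSynchronously blocks that thread, so a continuation posted back to it through Async.SwitchToContext never gets a chance to run. A possible way around it is to give the browser its own STA thread with its own message loop and block only the caller. The following is a rough sketch under those assumptions, not a tested drop-in fix. downloadOnStaThread is a hypothetical helper name, System.Windows.Forms is assumed to be referenced, and DocumentCompleted can fire more than once on pages with frames, which this sketch ignores by taking the first event:

```fsharp
open System.Threading
open System.Threading.Tasks
open System.Windows.Forms

// Hypothetical helper: runs the browser on a dedicated STA thread that has
// its own message loop, applies cont to the loaded document on that thread,
// and hands the result back to the (blocked) caller.
let downloadOnStaThread (address : string) (cont : HtmlDocument -> 'T) : 'T =
    let tcs = TaskCompletionSource<'T>()
    let worker =
        Thread(ThreadStart(fun () ->
            use browser = new WebBrowser()
            browser.DocumentCompleted.Add(fun _ ->
                try
                    // cont runs here, on the thread that owns the browser.
                    tcs.TrySetResult(cont browser.Document) |> ignore
                with ex ->
                    tcs.TrySetException(ex) |> ignore
                // Stop the message loop so the thread can finish.
                Application.ExitThread())
            browser.Navigate address
            // Pump messages until ExitThread is called in the handler.
            Application.Run()))
    // WebBrowser wraps an ActiveX control and needs an STA thread.
    worker.SetApartmentState ApartmentState.STA
    worker.Start()
    // Blocks only the calling thread; the browser thread keeps pumping.
    tcs.Task.Result
```

Usage would look like downloadOnStaThread "https://www.google.com" (fun document -> document.Title). The continuation should return plain data such as strings rather than the HtmlDocument itself, because the browser is disposed once its thread ends.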