I'm trying to scrape a series of websites that run a bunch of javascript on the DOM before it's done loading. This means I'm using a WebBrowser
instead of the friendlier WebClient
. The problem I'd like to solve is to wait until the WebBrowser.DocumentCompleted
event fires and then return WebBrowser.Document
. I then do some post processing on the HtmlDocument
but cannot get it to return yet.
The Code I Have
let downloadWebSite (address : string) =
let browser = new WebBrowser()
let browserContext = SynchronizationContext()
browser.DocumentCompleted.Add (fun _ ->
printfn "Document Loaded")
async {
do browser.Navigate(address)
let! a = Async.AwaitEvent browser.DocumentCompleted
do! Async.SwitchToContext(browserContext)
return browser.Document)
}
[downloadWebSite "https://www.google.com"]
|> Async.Parallel // there will be more addresses when working
|> Async.RunSynchronously
The Error
System.InvalidCastException: Specified cast is not valid.
at System.Windows.Forms.UnsafeNativeMethods.IHTMLDocument2.GetLocation()
at System.Windows.Forms.WebBrowser.get_Document()
at FSI_0058.downloadWebSite@209-41.Invoke(Unit _arg2) in C:\Temp\Untitled-1.fsx:line 209
at Microsoft.FSharp.Control.AsyncPrimitives.CallThenInvokeNoHijackCheck[a,b](AsyncActivation`1 ctxt, FSharpFunc`2 userCode, b result1)
at Microsoft.FSharp.Control.Trampoline.Execute(FSharpFunc`2 firstAction)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.FSharp.Control.AsyncResult`1.Commit()
at Microsoft.FSharp.Control.AsyncPrimitives.RunSynchronouslyInAnotherThread[a](CancellationToken token, FSharpAsync`1 computation, FSharpOption`1 timeout)
at Microsoft.FSharp.Control.AsyncPrimitives.RunSynchronously[T](CancellationToken cancellationToken, FSharpAsync`1 computation, FSharpOption`1 timeout)
at Microsoft.FSharp.Control.FSharpAsync.RunSynchronously[T](FSharpAsync`1 computation, FSharpOption`1 timeout, FSharpOption`1 cancellationToken)
at <StartupCode$FSI_0058>.$FSI_0058.main@()
Stopped due to error
What I think is happening
There are several issues that make me believe that I'm accessing the WebBrowser
from the wrong thread.1 2 3
Help requested
- Is the use of
Async.SwitchToContext(browserContext)
correct here? - Could the overall approach be simplified?
- Is there a concept I appear ignorant of?
- How do I get the
WebBrowser.Document
?