
Hi, I was making a crawler for a site. After about 3 hours of crawling, my app stopped on a WebException. Below is my code in C#. client is a predefined WebClient object that is disposed every time gameDoc has been processed. gameDoc is an HtmlDocument object (from HtmlAgilityPack).

while (retrygamedoc)
{
    try
    {
        gameDoc.LoadHtml(client.DownloadString(url)); // this line caused the exception
        retrygamedoc = false;
    }
    catch
    {
        client.Dispose();
        client = new WebClient();

        retrygamedoc = true;
        Thread.Sleep(500);
    }
}

I tried to use code below (to keep the webclient fresh) from this answer

while (retrygamedoc)
{
    try
    {
        using (WebClient client2 = new WebClient())
        {
            gameDoc.LoadHtml(client2.DownloadString(url)); // this line causes the exception
            retrygamedoc = false;
        }
    }
    catch
    {
        retrygamedoc = true;
        Thread.Sleep(500);
    }
}

but the result is still the same. Then I used StreamReader, and the result stayed the same! Below is my code using StreamReader.

while (retrygamedoc)
{
    try
    {
        // using native to check the result
        HttpWebRequest webreq = (HttpWebRequest)WebRequest.Create(url);
        string responsestring = string.Empty;
        HttpWebResponse response = (HttpWebResponse)webreq.GetResponse(); // this causes the exception
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            responsestring = reader.ReadToEnd();
        }
        gameDoc.LoadHtml(responsestring); // parse the response we just read instead of downloading again

        retrygamedoc = false;
    }
    catch
    {
        retrygamedoc = true;
        Thread.Sleep(500);
    }
}

What should I do and check? I am so confused because I am able to crawl some pages on the same site, but after about 1000 results, it causes the exception. The message from the exception is only The request was aborted: The connection was closed unexpectedly. and the status is ConnectionClosed.

PS. the app is a desktop form app.

update :

Now I am skipping the values and setting them to null so that the crawling can go on. But if the data is really needed, I still have to update the crawling results manually, which is tiring because the result contains thousands of records. Please help me.

example :

It is like you have downloaded about 1300 records from the website, then the application stops, saying The request was aborted: The connection was closed unexpectedly. while your internet connection is still up and at a good speed.

didityedi

2 Answers


ConnectionClosed may indicate (and probably does) that the server you're downloading from is closing the connection. Perhaps it is noticing a large number of requests from your client and is denying you additional service.

Since you can't control server-side shenanigans, I'd recommend you have some sort of logic to retry the download a bit later.
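A minimal sketch of such a retry loop with exponential backoff (the delay values, attempt limit, and method name are illustrative choices, not from the original code):

```csharp
using System;
using System.Net;
using System.Threading;

class RetryDownloader
{
    // Downloads a URL, waiting longer after each failure so the server
    // gets a break between attempts. Rethrows after the last attempt.
    static string DownloadWithRetry(string url, int maxAttempts = 5)
    {
        TimeSpan delay = TimeSpan.FromSeconds(1);
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                using (var client = new WebClient())
                {
                    return client.DownloadString(url);
                }
            }
            catch (WebException ex)
            {
                if (attempt == maxAttempts)
                    throw; // out of attempts, let the caller decide

                Console.WriteLine("Attempt " + attempt + " failed (" +
                                  ex.Status + "); retrying in " + delay);
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // back off exponentially
            }
        }
        return null; // unreachable
    }
}
```

The key difference from a fixed `Thread.Sleep(500)` is that the pause grows each time, which is usually enough to stop a rate-limiting server from dropping your connections.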

Jacob
  • I was thinking about that too at first. I have a couple of calls to the `WebClient` and I tried running it in debug mode. The next chunk of the same statements (but with a different `url` value) could be executed. That's what made me curious. Anyway, I will try your solution and test with a longer `Thread.Sleep` duration. – didityedi Feb 05 '14 at 16:52
  • I have tested again and again, and it seems the problem really is the rapid connections, causing the website to block my program's `WebClient`. I will add an interval between every page and whenever the same exception happens. Thank you, marked as answer. – didityedi Feb 06 '14 at 17:19
  • Further investigation shows that antivirus software can cause the same problem. I ran the software again earlier today and it returned the same error while my connection was slow and the `Thread.Sleep` pause was applied on every error. Shutting down the antivirus for a while made the code work just fine, like magic. – didityedi Feb 13 '14 at 16:15

I got this error because the server returned the response as a 404.
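One way to tell these cases apart: when the server actually answered with an HTTP error such as 404, the standard `WebException.Response` property is set; for a transport-level failure like ConnectionClosed it is null and `ex.Status` describes what went wrong. A short sketch (the URL is a placeholder):

```csharp
using System;
using System.Net;

class StatusCheck
{
    static void Main()
    {
        try
        {
            using (var client = new WebClient())
            {
                client.DownloadString("http://example.com/missing-page");
            }
        }
        catch (WebException ex)
        {
            // Response is non-null only when the server sent an HTTP error reply.
            var response = ex.Response as HttpWebResponse;
            if (response != null)
                Console.WriteLine("HTTP status: " + (int)response.StatusCode); // e.g. 404
            else
                Console.WriteLine("Transport error: " + ex.Status); // e.g. ConnectionClosed
        }
    }
}
```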

Cătălin Rădoi