
I've seen lots of questions about how to handle the timeout WebException on the XmlReader.Create method, and found that using HttpWebRequest and setting its Timeout property would be the best answer. But the timeout error is still the main problem!


After reading this link, and with the help of @Icepickle, I closed the response and the reader and wrapped them in using blocks:

    bool GetRssHtmlElement (string rssUrl, out XmlDocument xmlDoc)
    {
        xmlDoc = null;
        try
        {
            #region Set Request
            var request = (HttpWebRequest)WebRequest.Create(rssUrl.Replace("feed://", ""));
            request.Proxy = null;
            request.Timeout = 120000;
            request.AllowAutoRedirect = true;
            request.UseDefaultCredentials = true;
            request.ServicePoint.MaxIdleTime = 120000;
            request.MaximumAutomaticRedirections = 10;
            request.CookieContainer = new CookieContainer ();
            request.ServicePoint.ConnectionLeaseTimeout = 120000;
            request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36";
            #endregion

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader (response.GetResponseStream (), Encoding.UTF8))
            using (var xmlSource = new XmlTextReader (reader))
            {
                xmlDoc = new XmlDocument ();
                xmlDoc.Load (xmlSource);
            }
            return true;
        }
        catch (Exception ex)
        {
            //ErrorLogger.Log;
            return false;
        }
    }

I get fewer "The operation has timed out." errors now, but I still get this error and I can't understand why it's happening.
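As the comments below point out, one gap that remains here is handling WebException on its own and disposing the Response it carries. A minimal sketch of what that could look like (the method name is hypothetical and the request setup is abbreviated):

    bool TryLoadXml (string url, out XmlDocument xmlDoc)
    {
        xmlDoc = null;
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 120000;
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var xmlSource = new XmlTextReader(response.GetResponseStream()))
            {
                xmlDoc = new XmlDocument();
                xmlDoc.Load(xmlSource);
            }
            return true;
        }
        catch (WebException wex)
        {
            // A WebException can carry its own response (e.g. on a 4xx/5xx status);
            // disposing it releases the underlying connection back to the pool.
            using (var errorResponse = wex.Response)
            {
                // inspect errorResponse?.Headers or log the status here if needed
            }
            return false;
        }
    }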


Update: First, I collect all the news sources (like CNN, BBC, ...) from a cache and run a task for each one. There are about 200 sources. The Run method is:

    void Run()
    {
        var tempNewsSources = AllNewsSources.ToList();
        NewsSourceTasks = new List<Task>();
        foreach (var newsSource in tempNewsSources)
        {
            var tempNewsSource = newsSource;
            NewsSourceTasks.Add(RunFlowsNew(tempNewsSource));
        }
        NewsSourceTasks.ForEach(n =>
        {
            n.Start();
            Thread.Sleep(OneSecond);
        });
    }

Each source has a step for reading the RSS and determining the news links, plus some other steps for extracting the news elements. The RunFlowsNew method is:

    Task RunFlowsNew(NewsSource newsSource)
    {
        var result = new Task(() =>
        {
            var PendingNews = new PendingNews(newsSource);
            var ExtractingNews = new ExtractingNews(newsSource);
            while (IsRunPermitted())
            {
                var step1 = new Task(() =>
                {
                    PendingNews.Run();
                });
                var step2 = new Task(() =>
                {
                    ExtractingNews.Run(PendingNews.GetRecords());
                });
                //And other steps...
                List<Task> stepTasks = new List<Task>() { step1, step2 };
                stepTasks.ForEach(n => n.Start());
            }

            PendingNews = null;
            ExtractingNews = null;
            GC.Collect();
        });
        return result;
    }
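One thing worth double-checking here (my observation, not something confirmed in the thread): the step tasks are started but never waited on, so the while loop immediately spins up the next batch and unfinished tasks can pile up. If each cycle is meant to finish before the next one begins, a wait at the end of the iteration keeps the number of in-flight requests bounded:

    List<Task> stepTasks = new List<Task>() { step1, step2 };
    stepTasks.ForEach(n => n.Start());
    // Wait for this cycle's steps before the while loop starts the next one,
    // so each source has at most one batch of requests in flight.
    Task.WaitAll(stepTasks.ToArray());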

PendingNews.Run() is the method that fetches this source's RSS:

    internal void Run ()
    {
        PendingNewsLinkBag = new ConcurrentBag<PendingNewsLink> ();

        //Read RSS
        var newsNewEngineRssList = GetAllRssXmlLinksNews ();
        if (newsNewEngineRssList.Any ())
            AddToPendingNewsLinkToolsXml (newsNewEngineRssList);
    }

And finally, for each RSS feed I load it and collect the news URLs into a list:

    void AddToPendingNewsLinkToolsXml (List<RssLink> newsRssList)
    {
        Parallel.ForEach (newsRssList, rssLinkRecord =>
        {
            XmlDocument xmlDoc;
            var tempRssLink = rssLinkRecord;
            var readXmlSuccess = GetRssElement (tempRssLink, out xmlDoc);

            if (readXmlSuccess && xmlDoc != null)
            {
                try
                {
                    var extractXmlSuccess = GetRssElementData (lastUrlLink, xmlDoc, tempRssLink.ID, out updatedLastUrlLink);
                }
                catch (Exception ex)
                {
                    ErrorLogger.Log (Pending_Xml_201);
                }
            }
        });
    }

And finally, GetRssElement is the place where the requests get stuck, and I changed it as you saw before. I even tested this code there:

    bool GetRssElement (RssLink rssLinkRecord, out XmlDocument xmlDoc)
    {
        try
        {
            var client = new HttpClient();
            var stream = client.GetStreamAsync(rssLinkRecord.Url.Replace("feed://", "")).Result;
            using (var xmlReader = XmlReader.Create (stream))
            {
                xmlDoc = new XmlDocument ();
                xmlDoc.Load (xmlReader);
            }
            stream.Close ();
            return true;
        }
        catch (Exception ex)
        {
            ErrorLogger.Log (Pending_Xml_200);
            xmlDoc = null;
            return false;
        }
    }
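A side note on this version (an assumption on my part, not something the thread confirms): new HttpClient() per call gives every request its own connection pool, and with hundreds of feeds in parallel that can exhaust sockets and show up as timeouts. The usual pattern is one shared instance, e.g.:

    // Shared across all requests; HttpClient is safe for concurrent use.
    static readonly HttpClient SharedClient = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(120)
    };

    bool GetRssElement (RssLink rssLinkRecord, out XmlDocument xmlDoc)
    {
        try
        {
            var url = rssLinkRecord.Url.Replace("feed://", "");
            using (var stream = SharedClient.GetStreamAsync(url).Result)
            using (var xmlReader = XmlReader.Create(stream))
            {
                xmlDoc = new XmlDocument();
                xmlDoc.Load(xmlReader);
            }
            return true;
        }
        catch (Exception)
        {
            ErrorLogger.Log(Pending_Xml_200); // same logging as above
            xmlDoc = null;
            return false;
        }
    }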

Update 3: I found out that some web sites had blocked my IP, which is why I get a lot of timeout exceptions. Is there any best practice for crawling the web for news?
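For what it's worth, the multithreading question @Icepickle links in the comments below boils down to capping how many requests run at once and spacing them out, which is also the basis of polite crawling. A sketch of that idea with SemaphoreSlim (the cap and delay are illustrative, not from this thread):

    // Allow at most 10 downloads at a time across all sources (illustrative cap).
    static readonly SemaphoreSlim Throttle = new SemaphoreSlim(10);

    static async Task<string> DownloadThrottledAsync (HttpClient client, string url)
    {
        await Throttle.WaitAsync();
        try
        {
            return await client.GetStringAsync(url);
        }
        finally
        {
            // A small delay before releasing spaces requests out, which is
            // friendlier to hosts that block aggressive crawlers.
            await Task.Delay(TimeSpan.FromMilliseconds(500));
            Throttle.Release();
        }
    }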

  • It is interesting that an RSS feed takes over 5 minutes to complete; are you sure the exception is a timeout exception? – Icepickle Dec 14 '16 at 11:32
  • After reading this [link](https://blogs.msdn.microsoft.com/adarshk/2005/01/02/understanding-system-net-connection-management-and-servicepointmanager/), I'm thinking about adding these lines to my request. `request.ServicePoint.ConnectionLeaseTimeout = 5000;` `request.ServicePoint.MaxIdleTime = 5000;` – Arman Rasouli Dec 14 '16 at 11:39
  • @Icepickle It's not only the timeout exception, but the main errors are of that kind. I save all the errors in a log document. – Arman Rasouli Dec 14 '16 at 11:42
  • That is not that surprising, seeing that you do not dispose the responses and you don't close the potential WebException responses. You should rather wrap the response in a using block, and if you catch exceptions, catch a WebException and check whether its response is not null (and preferably dispose that one afterwards as well) – Icepickle Dec 14 '16 at 12:37
  • @Icepickle, I forgot to close the reader. And I don't understand whether wrapping with a using block can do the job. I will test it and let you know what happened. – Arman Rasouli Dec 15 '16 at 07:38
  • Well, in the end it will close any open connections, potentially also the ones that stay open when a WebException is thrown, which would free up the resources from your client and might help decrease the resources required for the tasks you envision for it – Icepickle Dec 15 '16 at 13:15
  • @Icepickle, I changed the code but I still have a lot of "The operation has timed out." errors! Is there anything wrong with the code? I have a lot of URLs and RSS XMLs that have to be called! Is it related to the number of URLs? – Arman Rasouli Dec 18 '16 at 13:30
  • You are still not handling the WebException (catching all exceptions won't help you to catch the WebException and close/dispose the corresponding WebException.Response). Your current sample code also doesn't really show why that many operations might be necessary or how you are calling your current code. Any chance you can update your code so we could have a better overview of what you are attempting, including numbers on how many requests would actually be opened in parallel? – Icepickle Dec 19 '16 at 07:58
  • It's all about reading news from different RSS feeds. I updated the code. I would appreciate it if you looked at it. – Arman Rasouli Dec 19 '16 at 12:00
  • And what do you mean by handling WebException? Do you mean just catching WebException in a try-catch block? – Arman Rasouli Dec 19 '16 at 13:05
  • Yeah, don't catch all exceptions; catch WebException specifically, because it also has a Response that theoretically needs to be closed – Icepickle Dec 19 '16 at 14:00
  • Also, another thing you could check: http://stackoverflow.com/questions/4277844/multithreading-a-large-number-of-web-requests-in-c-sharp – Icepickle Dec 19 '16 at 14:05
  • Thank you for the link. That was very helpful. – Arman Rasouli Dec 21 '16 at 07:55
  • Funny thing: the exceptions are not WebException types but System.IO.IOException, "The operation has timed out." And it still happens – Arman Rasouli Dec 25 '16 at 07:40

2 Answers


How about WebClient with its async Task API? Wrapped in a small method here so the snippet is self-contained (the method name is illustrative):

        bool TryDownloadXml (string Url, out XmlTextReader reader)
        {
            reader = null;
            try
            {
                using (var client = new WebClient())
                {
                    var task = client.DownloadStringTaskAsync(Url);
                    // wait up to five minutes for the download to complete
                    if (task.Wait(300000))
                    {
                        var text = new StringReader(task.Result);
                        reader = new XmlTextReader(text);
                        return true;
                    }
                }
                return false;
            }
            catch (Exception)
            {
                return false;
            }
        }
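One caveat with this approach: when task.Wait(300000) times out and returns false, the underlying download is not cancelled and keeps running in the background, so under heavy load it may be worth calling WebClient.CancelAsync (or switching to HttpClient with a CancellationToken) when the wait fails.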
– Bob Dust

First of all, I changed the GetRssElement method to this:

    XmlDocument GetRssElement (string url)
    {
        try
        {
            var httpClient = new HttpClient();
            #region httpClientHeaders
            httpClient.DefaultRequestHeaders.AcceptLanguage.Clear ();
            httpClient.DefaultRequestHeaders.AcceptLanguage.Add (new StringWithQualityHeaderValue ("en-US"));
            httpClient.DefaultRequestHeaders.AcceptLanguage.Add (new StringWithQualityHeaderValue ("en"));
            httpClient.DefaultRequestHeaders.AcceptLanguage.Add (new StringWithQualityHeaderValue ("fa"));

            httpClient.DefaultRequestHeaders.TryAddWithoutValidation ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36");
            httpClient.DefaultRequestHeaders.TryAddWithoutValidation ("Connection", "keep-alive");

            httpClient.Timeout = TimeSpan.FromMinutes (1000);
            #endregion
            var xmlDoc = new XmlDocument();
            try
            {
                var response = httpClient.GetAsync(url.Replace("feed://", ""));
                xmlDoc.LoadXml (response.Result.Content.ReadAsStringAsync ().Result);
            }
            catch (XmlException) //if the xml needs a specific encoding.
            {
                var wc = new WebClient();
                var encoding = Encoding.GetEncoding("utf-8");
                var data = wc.DownloadData(url.Replace("feed://", ""));
                var gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress);
                var decompressed = new MemoryStream();
                gzip.CopyTo (decompressed);
                var str = encoding.GetString(decompressed.GetBuffer(), 0, (int) decompressed.Length);
                xmlDoc = new XmlDocument ();
                xmlDoc.LoadXml (str);
            }
            return xmlDoc;
        }
        catch (TaskCanceledException) { return null; }
        catch (AggregateException)    { return null; }
        catch (XmlException)          { return null; }
        catch (WebException)          { return null; }
        catch (Exception)             { return null; }
    }

I used HttpClient instead of request and response and set proper headers on it. I used a big timeout because if a request fails before the timeout, the other connections in its group fail too. I use an inner try-catch to catch XmlException and retry with a specific encoding when the XML needs it. Finally, in this method I used separate catch blocks so I can tell which kind of error happens.
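As an aside, HttpClient can also decompress gzipped responses transparently when it is constructed with a handler, which might make the manual GZipStream fallback unnecessary (a sketch, assuming the server sets the Content-Encoding header correctly):

    // Let the handler undo gzip/deflate before the body reaches your code.
    var handler = new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
    };
    var httpClient = new HttpClient(handler);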
Second, I used

System.Net.ServicePointManager.DefaultConnectionLimit = 999999999;

at the first line of the calling method, to remove the limit that .NET places on the number of connections.
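One caveat worth knowing: DefaultConnectionLimit only applies to ServicePoint objects created after it is set, so it has to run before the first request goes out; service points that already exist keep their old limit.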
Finally, I changed some registry settings as described in this link, to remove the limits that Windows servers put on connections.