I've seen lots of questions about how to handle timeout (web exception) errors on the XmlReader.Create method, and found that using HttpWebRequest and setting a timeout property on it would be the best answer. But the timeout error is still the main problem!
After reading this link, and with the help of @Icepickle, I closed the response and reader and wrapped them in using blocks:
bool GetRssHtmlElement (string rssUrl, out XmlDocument xmlDoc)
{
    try
    {
        #region Set Request
        var request = (HttpWebRequest)WebRequest.Create(rssUrl.Replace("feed://", ""));
        request.Proxy = null;
        request.Timeout = 120000;
        request.AllowAutoRedirect = true;
        request.UseDefaultCredentials = true;
        request.ServicePoint.MaxIdleTime = 120000;
        request.MaximumAutomaticRedirections = 10;
        request.CookieContainer = new CookieContainer();
        request.ServicePoint.ConnectionLeaseTimeout = 120000;
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36";
        #endregion
        var encoding = Encoding.UTF8;
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), encoding))
        {
            var xmlSource = new XmlTextReader(reader);
            xmlDoc = new XmlDocument();
            xmlDoc.Load(xmlSource);
        }
        return true;
    }
    catch (Exception ex)
    {
        //ErrorLogger.Log;
        xmlDoc = null;
        return false;
    }
}
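One thing the request block above does not set is the per-host connection limit. As far as I know, in a .NET Framework console app or service ServicePointManager.DefaultConnectionLimit defaults to 2, so when many requests go to the same host in parallel the extra ones queue up and may spend a large part of their Timeout just waiting for a free connection. I'm not sure whether this is my problem; a minimal sketch, with 20 as an example value only:

// Raise the per-host connection limit once, before the first request is created.
// 20 is an illustrative value, not a recommendation. (Needs System.Net.)
ServicePointManager.DefaultConnectionLimit = 20;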
I get fewer "The operation has timed out." errors now, but I still get them, and I can't understand why this is happening.
Update: First, I collect all the news sources (like CNN, BBC, ...) from the cache and run a task for each one. There are about 200 sources. The Run method is:
void Run()
{
    var tempNewsSources = AllNewsSources.ToList();
    NewsSourceTasks = new List<Task>();
    foreach (var newsSource in tempNewsSources)
    {
        var tempNewsSource = newsSource;
        NewsSourceTasks.Add(RunFlowsNew(tempNewsSource));
    }
    // Start the per-source tasks one second apart.
    NewsSourceTasks.ForEach(n =>
    {
        n.Start();
        Thread.Sleep(OneSecond);
    });
}
Each source has a step for reading the RSS feed and determining the news links, plus some other steps that extract the news elements. The RunFlowsNew method is:
Task RunFlowsNew(NewsSource newsSource)
{
    var result = new Task(() =>
    {
        var pendingNews = new PendingNews(newsSource);
        var extractingNews = new ExtractingNews(newsSource);
        while (IsRunPermitted())
        {
            var step1 = new Task(() =>
            {
                pendingNews.Run();
            });
            var step2 = new Task(() =>
            {
                extractingNews.Run(pendingNews.GetRecords());
            });
            //And other steps...
            List<Task> stepTasks = new List<Task>() { step1, step2 };
            stepTasks.ForEach(n => n.Start());
        }
        pendingNews = null;
        extractingNews = null;
        GC.Collect();
    });
    return result;
}
PendingNews.Run() is the method that reads this source's RSS feeds:
internal void Run ()
{
    PendingNewsLinkBag = new ConcurrentBag<PendingNewsLink> ();
    //Read RSS
    var newsNewEngineRssList = GetAllRssXmlLinksNews ();
    if (newsNewEngineRssList.Any ())
        AddToPendingNewsLinkToolsXml (newsNewEngineRssList);
}
And for each RSS feed I load it and collect the news URLs into a list:
void AddToPendingNewsLinkToolsXml (List<RssLink> newsRssList)
{
    Parallel.ForEach (newsRssList, rssLinkRecord =>
    {
        XmlDocument xmlDoc;
        var tempRssLink = rssLinkRecord;
        var readXmlSuccess = GetRssElement(tempRssLink, out xmlDoc);
        if (readXmlSuccess && xmlDoc != null)
        {
            try
            {
                var extractXmlSuccess = GetRssElementData(lastUrlLink, xmlDoc, tempRssLink.ID, out updatedLastUrlLink);
            }
            catch (Exception ex)
            {
                ErrorLogger.Log (Pending_Xml_201);
            }
        }
    });
}
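Since each of the ~200 source tasks runs its own Parallel.ForEach, a lot of requests can be in flight at once. I'm not sure whether it is related, but the parallelism here could be capped with ParallelOptions; a minimal sketch (4 is just an example value):

// Same call as above, but with a cap on how many feeds are fetched at once.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // example cap
Parallel.ForEach (newsRssList, options, rssLinkRecord =>
{
    // ...same body as above...
});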
And finally, GetRssElement is where the requests hang, and I changed it as shown above. I even tried this code there:
bool GetRssElement (RssLink rssLinkRecord, out XmlDocument xmlDoc)
{
    try
    {
        using (var client = new HttpClient())
        using (var stream = client.GetStreamAsync(rssLinkRecord.Url.Replace("feed://", "")).Result)
        using (var xmlReader = XmlReader.Create (stream))
        {
            xmlDoc = new XmlDocument ();
            xmlDoc.Load (xmlReader);
        }
        return true;
    }
    catch (Exception ex)
    {
        ErrorLogger.Log (Pending_Xml_200);
        xmlDoc = null;
        return false;
    }
}
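A side note on the snippet above: as far as I know HttpClient is meant to be created once and reused, and its default Timeout is 100 seconds, so creating a new instance per feed may not help. A minimal sketch of a shared client with an explicit timeout (the FeedHttp name and the 30-second value are mine, just for illustration):

// (Needs System and System.Net.Http.)
static class FeedHttp
{
    // One client for all feed downloads; reusing it avoids opening a new
    // connection pool per request. The timeout value is only an example.
    public static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };
}

// Usage inside GetRssElement (sketch):
// var stream = FeedHttp.Client.GetStreamAsync(rssLinkRecord.Url.Replace("feed://", "")).Result;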
Update 3: I found out that some websites had blocked my IP, which is why I get a lot of timeout exceptions. Is there any best practice for crawling the web for news sites?
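For what it's worth, the kind of thing I have in mind is a per-host throttle, so that a single site never sees a burst of requests from my IP. This is only a sketch (the HostThrottle name and the one-request-per-host-with-a-two-second-gap rule are assumptions of mine, not code I am running):

// (Needs System, System.Collections.Concurrent, System.Threading, System.Threading.Tasks.)
static class HostThrottle
{
    // One gate per host: only one request at a time, with a small delay
    // before the gate is released for the next request to the same host.
    static readonly ConcurrentDictionary<string, SemaphoreSlim> Gates =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    public static async Task<T> RunAsync<T>(Uri url, Func<Task<T>> request)
    {
        var gate = Gates.GetOrAdd(url.Host, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync();
        try
        {
            return await request();
        }
        finally
        {
            await Task.Delay(TimeSpan.FromSeconds(2));
            gate.Release();
        }
    }
}

Is something along these lines considered good practice, or is there a more standard approach?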