
I hope there is someone here experienced with both the TPL and the System.Net classes and methods.

What started as a simple idea, using the TPL on a currently sequential set of actions, has brought my project to a halt.

As I am still fresh with .NET, jumping straight into deep water with the TPL ...

I am trying to extract an ASPX page's source/content (HTML) using WebClient.

There are multiple requests per day (around 20-30 pages to go through) to extract specific values out of the source code, and this is only one of a few daily tasks the server has on its list.

That led me to try to implement it with the TPL and so gain some speed.

Although I tried using Task.Factory.StartNew() to iterate over a few WebClient instances, on the first execution the application simply gets no result back from the WebClient.

This is my latest attempt:

    static void Main(string[] args)
    {
        EnumForEach<Act>(Execute);
        Task.WaitAll();
    }

    public static void EnumForEach<Mode>(Action<Mode> Exec)
    {
        foreach (Mode mode in Enum.GetValues(typeof(Mode)))
        {
            Mode Curr = mode;

            Task.Factory.StartNew(() => Exec(Curr));
        }
    }

    string ResultsDirectory = Environment.CurrentDirectory,
        URL = "",
        TempSourceDocExcracted = "",
        ResultFile = "";

    enum Act
    {
        dolar, ValidateTimeOut
    }

    void Execute(Act Exc)
    {
        switch (Exc)
        {
            case Act.dolar:
                URL = "http://www.AnyDomainHere.Com";
                ResultFile =ResultsDirectory + "\\TempHtm.htm";
                TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
                File.WriteAllText(ResultFile, TempSourceDocExcracted);
                break;
            case Act.ValidateTimeOut:
                URL = "http://www.AnotherDomainHere.Com";
                ResultFile += "\\TempHtm.htm";
                TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
                File.WriteAllText(ResultFile, TempSourceDocExcracted);
                break;
        }
    }

    //usage of HtmlAgilityPack to extract Values of elements by their attributes/properties
    public HtmlAgilityPack.HtmlDocument AgilityPacDocExtraction(string URL)
    {
        using (WC = new WebClient())
        {
            WC.Proxy = null;
            WC.Encoding = Encoding.GetEncoding("UTF-8");
            tmpExtractedPageValue = WC.DownloadString(URL);
            retAglPacHtmDoc.LoadHtml(tmpExtractedPageValue);
            return retAglPacHtmDoc;
        }
    }

What am I doing wrong? Is it possible to use WebClient with the TPL at all, or should I use another tool (I am not able to use IIS 7 / .NET 4.5)?


1 Answer


I see several issues:

  1. Naming - FlNm is not a name. Visual Studio is a modern IDE with smart code completion; there is no need to save keystrokes (you may start here, there are alternatives too, the main thing is to keep it consistent: C# Coding Conventions).

  2. If you're using multithreading, you need to take care of resource sharing. For example, FlNm is a static string and it is assigned inside each thread, so its value is not deterministic (and even if the code ran sequentially it would still be faulty - you would be appending the file name to the path on each iteration, so it would end up like c:\TempHtm.htm\TempHtm.htm\TempHtm.htm). See the sketch after this list.

  3. You're writing to the same file from different threads (at least that was your intent, I think) - usually that's a recipe for disaster in multithreading. The question is whether you need to write anything to disk at all, or whether the page can be downloaded as a string and parsed without touching the disk (see the Process sketch after the code sample below) - there's a good example of what it means to touch a disk.

  4. Overall I think you should parallelize only the downloading, and not involve HtmlAgilityPack in the multithreading, as I don't think you know whether it is thread safe. Downloading has a good performance/thread-count ratio; HTML parsing does not - it may scale up to a thread count equal to the number of cores, but not beyond. Moreover, I would separate downloading from parsing, as that would be easier to test, understand and maintain.
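
To illustrate point 2, here is a minimal sketch (the ResultsDirectory field and the TempHtm.htm name come from the question; Path.Combine and the helper name GetResultFilePath are just illustrative, not part of the original code):

    // Minimal sketch, not the question's code.
    static readonly string ResultsDirectory = Environment.CurrentDirectory;

    // Faulty pattern: a shared field appended to on every call, so the path grows:
    //   ResultFile += "\\TempHtm.htm";   // 2nd call -> ...\TempHtm.htm\TempHtm.htm

    // Safer: build the full path into a local variable each time, so nothing is shared between threads.
    static string GetResultFilePath()
    {
        return System.IO.Path.Combine(ResultsDirectory, "TempHtm.htm");
    }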

Update: I don't understand your full intent, but this should help you get started (it's not production code; you should add retry/error handling, etc.). At the end there is an extended WebClient class that lets you get more threads spinning, because by default WebClient allows only two concurrent connections.

    class Program
    {
        static void Main(string[] args)
        {
            var urlList = new List<string>
                              {
                                  "http://google.com",
                                  "http://yahoo.com",
                                  "http://bing.com",
                                  "http://ask.com"
                              };

            var htmlDictionary = new ConcurrentDictionary<string, string>();
            Parallel.ForEach(urlList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, url => Download(url, htmlDictionary));
            foreach (var pair in htmlDictionary)
            {
                Process(pair);
            }
        }

        private static void Process(KeyValuePair<string, string> pair)
        {
            // do the html processing
        }

        private static void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
        {
            using (var webClient = new SmartWebClient())
            {
                htmlDictionary.TryAdd(url, webClient.DownloadString(url));
            }
        }
    }

    public class SmartWebClient : WebClient
    {
        private readonly int maxConcurentConnectionCount;

        public SmartWebClient(int maxConcurentConnectionCount = 20)
        {
            this.maxConcurentConnectionCount = maxConcurentConnectionCount;
        }

        protected override WebRequest GetWebRequest(Uri address)
        {
            var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
            if (httpWebRequest == null)
            {
                return null;
            }

            if (maxConcurentConnectionCount != 0)
            {
                httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
            }

            return httpWebRequest;
        }
    }
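
If the HTML processing is the extraction described in the question, Process could be fleshed out roughly as below. This is only a sketch: HtmlAgilityPack is assumed to be referenced, and the "Dv_Main" id is taken from the question's code, not from this answer.

    private static void Process(KeyValuePair<string, string> pair)
    {
        // pair.Key is the URL, pair.Value is the HTML that Download() stored.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(pair.Value);                  // parse the string in memory, no temp file

        var node = doc.GetElementbyId("Dv_Main");  // id taken from the question
        if (node != null)
        {
            // Extract whatever values are needed; writing to disk is optional here.
            Console.WriteLine("{0}: {1} characters extracted", pair.Key, node.InnerHtml.Length);
        }
    }

Parsing stays on the main thread inside the foreach loop, which matches point 4: only the downloading is parallelized.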
  • As I launch the current application (a console app), the Main method is static, so for the executed method and its variables to be non-static I should use a separate class and instantiate it. (Does that solve number 2 on your list?) – LoneXcoder Nov 22 '12 at 14:52
  • Well, even if it were not static, if it is modified by multiple threads it is a problem. I would suggest using some concurrent collection for saving results from the threads. – Giedrius Nov 22 '12 at 14:57
  • Yes, I will; the last code version will collect into a `dictionary`. But what about instances of `WebClient` (multiple, say around 10, running in parallel) taking some time to start up and get results? This is the **first** place where I **start to have problems**. – LoneXcoder Nov 22 '12 at 15:11
  • I've updated the answer with a simple sample; hope it will get you started, because frankly it is very hard to understand from your code what you're trying to do and what you need to do. – Giedrius Nov 22 '12 at 15:28
  • Wow, that looks great. As I understand almost nothing of what is going on there, I am going to test each part and will get back to you on this. Thanks a million!! The new project is GiedriusParralelSmartWC. – LoneXcoder Nov 22 '12 at 15:32
  • It worked on the first run (usually answers help get the idea across... but don't always run :) ). What I wanted to tell you is that this portion of code you posted (which I have just begun testing) will be a kind of cornerstone of my TPL career! You have something (big) to do with me being able to get into the era of TPL with no hassle. Thanks a lot, Mr Giedrius. – LoneXcoder Nov 22 '12 at 15:58