
I am trying to be as thorough as I can in this post, as it is very important to me, though the issue itself is very simple; just by reading the title of this question you can get the idea...

The question is:

With healthy bandwidth (30 Mb VDSL) available, how is it possible to get multiple HttpWebRequest instances working on a single piece of data / a single file, so that each request downloads only a portion of the data, and when all instances have completed, the parts are joined back into one piece?

Code:

What I have working so far is the same idea, except each task = HttpWebRequest = a different file, so the speedup comes from plain task parallelism rather than from accelerating one download using multiple tasks/threads, as in my question.

See the code below.

The next part is just a more detailed explanation of, and background on, the subject... if you don't mind reading.

I am still on a similar project that differs from this one (in the question) in that it (see the code below) tries to fetch as many different data sources as possible, one per separate task (different downloads/files). The speedup there is gained because each task does not have to wait for the previous one to complete before it gets a chance to execute.

What I am trying to do in this question (having almost everything ready in the code below) is to target the same URL for the same data, so this time the speedup to gain is for the single task, the current download: implementing the same idea as in the code below, only this time letting SmartWebClient target the same URL using multiple instances.

Then (only theory for now) each instance would request partial content of the data, so the download is split across multiple requests, one per instance.

The last issue is that I need to "put the puzzle back into one piece", which is another problem I still need to figure out.
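
To illustrate what I mean by one "puzzle piece", here is an untested sketch (the URL, offsets and file name are placeholders, not part of my working code): one instance would request only its own byte range and write it at the matching offset in the target file.

    // untested sketch: one "puzzle piece" request (offsets are example values only)
    long startOffset = 0;
    long endOffset = 99999;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://same.url/for/all/instances");
    request.AddRange(startOffset, endOffset);            // ask only for this slice
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream part = response.GetResponseStream())
    using (FileStream target = new FileStream("joined.bin", FileMode.OpenOrCreate, FileAccess.Write))
    {
        target.Seek(startOffset, SeekOrigin.Begin);      // put the piece back where it belongs
        part.CopyTo(target);
    }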

As you can see in this code, the part I have not worked on yet is only the data parsing/processing, which I find very easy using HtmlAgilityPack, so that is no problem.

current code

main entry:

        var urlList = new urlsForExtraction().urlsConcrDict();
        var htmlDictionary = new InitConcurentHtmDictExtrct().LoopOnUrlsVia_SmartWC(urlList);
        foreach (var pair in htmlDictionary)
        {
            ///Process(pair);
            MessageBox.Show(pair.Value);
        }

public class urlsForExtraction
{
        const string URL_Dollar= "";
        const string URL_UpdateUsersTimeOut="";


        public ConcurrentDictionary<string, string> urlsConcrDict()
        {
            // TODO: find a way to iterate the url const fields by name instead of specifying each one (see the EnumerateUrlConstants sketch below)
            ConcurrentDictionary<string, string> retDict = new ConcurrentDictionary<string, string>();
            retDict.TryAdd("URL_Dollar", "Any.Url.com");
            retDict.TryAdd("URL_UpdateUserstbl", "http://bing.com");
            return retDict;
        }
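
        // Hedged sketch (not wired into anything yet): one possible way to iterate the url
        // const fields by name via reflection instead of adding each one manually.
        // Needs "using System.Reflection;"; the field names here come from the consts above.
        public static IEnumerable<KeyValuePair<string, string>> EnumerateUrlConstants()
        {
            var fields = typeof(urlsForExtraction).GetFields(
                BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Static);
            foreach (FieldInfo field in fields)
            {
                // IsLiteral && !IsInitOnly picks out compile-time constants
                if (field.IsLiteral && !field.IsInitOnly && field.FieldType == typeof(string))
                    yield return new KeyValuePair<string, string>(field.Name, (string)field.GetValue(null));
            }
        }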


}


/// <summary>
/// Second-stage class: consumes the dictionary of urls for extraction,
/// then downloads each one via Parallel.ForEach using the SmartWebClient (Download())
/// </summary>
public class InitConcurentHtmDictExtrct
{

    private void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
    {

        using (var webClient = new SmartWebClient())
        {
            webClient.Encoding = Encoding.GetEncoding("UTF-8");
            webClient.Proxy = null;
            htmlDictionary.TryAdd(url, webClient.DownloadString(url));
        }
    }

    private ConcurrentDictionary<string, string> htmlDictionary;
    public ConcurrentDictionary<string, string> LoopOnUrlsVia_SmartWC(ConcurrentDictionary<string, string> urlList)
    {

        htmlDictionary = new ConcurrentDictionary<string, string>();
        Parallel.ForEach(
                        urlList.Values,
                        new ParallelOptions { MaxDegreeOfParallelism = 20 },
                        url => Download(url, htmlDictionary)
                        );
        return htmlDictionary;

    }
}
/// <summary>
/// The extraction process, done via HtmlAgilityPack;
/// makes it easy to collect information from a given html document by referencing element attributes
/// </summary>
public class Results
{
    public struct ExtracionParameters
    {
        public string FileNameToSave;
        public string directoryPath;
        public string htmlElementType;

    }
    public enum Extraction
    {
        ById, ByClassName, ByElementName
    }
    public void ExtractHtmlDict(ConcurrentDictionary<string, string> htmlResults, Extraction by)
    {
        // helps with easy element extraction from the page.
        HtmlAttribute htAgPcAttrbs;
        HtmlDocument HtmlAgPCDoc = new HtmlDocument();
        /// will hold the name + content of each document part that was eventually extracted;
        /// from this container, building the result page will then be possible
        Dictionary<string, HtmlDocument> dictResults = new Dictionary<string, HtmlDocument>();

        foreach (KeyValuePair<string, string> htmlPair in htmlResults)
        {
            Process(htmlPair);
        }
    }
    private static void Process(KeyValuePair<string, string> pair)
    {
        // do the html processing
    }

}
public class SmartWebClient : WebClient
{


    private readonly int maxConcurentConnectionCount;

    public SmartWebClient(int maxConcurentConnectionCount = 20)
    {
        this.Proxy = null;
        this.Encoding = Encoding.GetEncoding("UTF-8");
        this.maxConcurentConnectionCount = maxConcurentConnectionCount;
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
        if (httpWebRequest == null)
        {
            return null;
        }

        if (maxConcurentConnectionCount != 0)
        {
            httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
        }

        return httpWebRequest;
    }

}

This allows me to take advantage of good bandwidth, but I am far from the solution in question; I would really appreciate any clue on where to start.

LoneXcoder

1 Answer

If the server supports what Wikipedia calls byte serving, you can multiplex a file download by spawning multiple requests, each with a specific Range header value (using the AddRange method; see also How to download the data from the server discontinuously?). Most serious HTTP servers do support byte ranges.
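
To check up front whether a server advertises byte-range support, one option (a small hedged sketch, not strictly required for the code below) is to send a HEAD request and look at the Accept-Ranges header:

    public static bool SupportsByteRanges(string uri)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "HEAD";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // servers that implement byte serving usually answer with "Accept-Ranges: bytes";
            // some support ranges without advertising them, so treat this as a hint only
            return string.Equals(response.Headers["Accept-Ranges"], "bytes", StringComparison.OrdinalIgnoreCase);
        }
    }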

Here is some sample code that implements a parallel download of a file using byte range:

    public static void ParallelDownloadFile(string uri, string filePath, int chunkSize)
    {
        if (uri == null)
            throw new ArgumentNullException("uri");

        // determine file size first
        long size = GetFileSize(uri);

        using (FileStream file = new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.Write))
        {
            file.SetLength(size); // set the length first

            object syncObject = new object(); // synchronize file writes
            // ceiling division: avoids requesting an extra, out-of-range chunk when size is an exact multiple of chunkSize
            Parallel.ForEach(LongRange(0, (size + chunkSize - 1) / chunkSize), (start) =>
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
                request.AddRange(start * chunkSize, start * chunkSize + chunkSize - 1);
                HttpWebResponse response = (HttpWebResponse)request.GetResponse();

                lock (syncObject)
                {
                    using (Stream stream = response.GetResponseStream())
                    {
                        file.Seek(start * chunkSize, SeekOrigin.Begin);
                        stream.CopyTo(file);
                    }
                }
            });
        }
    }

    public static long GetFileSize(string uri)
    {
        if (uri == null)
            throw new ArgumentNullException("uri");

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "HEAD";
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        return response.ContentLength;
    }

    private static IEnumerable<long> LongRange(long start, long count)
    {
        long i = 0;
        while (true)
        {
            if (i >= count)
            {
                yield break;
            }
            yield return start + i;
            i++;
        }
    }

And sample usage:

    private static void TestParallelDownload()
    {
        string uri = "http://localhost/welcome.png";
        string fileName = Path.GetFileName(uri);

        ParallelDownloadFile(uri, fileName, 10000);
    }

PS: I'd be curious to know if it's really more interesting to do this parallel thing rather than to just use WebClient.DownloadFile... Maybe in slow network scenarios?

Simon Mourier
  • Hi Simon, I was trying to implement the whole concept of `AddRange`, but none of the code samples I found explained how to split the requested ranges of data into sections, as a kind of `request array`, and join them back together when completed. As stated in my bounty description, I was asking for sample code, so if you have a little time, please show a few lines of code using HttpWebRequest to download splits/sections/portions of data (with multithreading or the Task Parallel Library, whichever is suitable), and then show how to put it back together. – LoneXcoder Dec 03 '12 at 09:58
  • I have no clue, but I thought it might have an effect anyway, because some (I think most) of the servers you request a file from will only let you use a portion of their bandwidth, so if, for example, the bandwidth per user is 100 kbps, then multiple clients * 100 kbps should be your speedup. As for threading, I am not really sure either, since I don't know whether multiple tasks/threads, each with its own HttpWebRequest, will produce any speedup; I do think it will for multiple instances using partial data in parallel. – LoneXcoder Dec 03 '12 at 11:29
  • I haven't had time to test it yet, but I'm sure it works (: I just wanted to comment after reading it again: your code looks great, well explained and tidy, and although the task is (for my little experience) a bit complex, it still looks like something easy to learn if I break it into parts, thanks to your coding skills and comments. By the way, regarding "parallel": if you were curious about the use of TPL, or were referring to the use of multiple segments (which I think you were), then I do think it brings benefits (implementing a `resume` is an option); otherwise not(?) – LoneXcoder Dec 04 '12 at 14:45
  • @LoneXcoder - yes, by parallel I meant multiple segments. I guess it's interesting when the network is slow or unreliable, as it's easier to transfer a 10K chunk than, say, a big 1G file (or maybe it's in fact the only solution). Plus, if a segment fails, we could imagine implementing a retry. This is left as an exercise :-) – Simon Mourier Dec 04 '12 at 16:27
  • Anything you say; as this comment list grows, my mail is rbanay@gmail.com, in case you want to tell me about any further investigations you might have time to tackle. I'd be very happy to hear from you if something interesting is going on, or if I have something really important to ask; it is a rare occasion to have a *big developer-brother*. – LoneXcoder Dec 04 '12 at 18:02