6

I am trying to download a large file (>1 GB) from one server to another over HTTP. To do this I am making HTTP range requests in parallel, so each thread downloads a different chunk of the file.

When saving to disk I am taking each response stream, opening the same file as a file stream, seeking to the range I want and then writing.

However, I find that all but one of my response streams time out. It looks like the disk I/O cannot keep up with the network I/O. If I do the same thing but have each thread write to a separate file, it works fine.

For reference, here is my code writing to the same file:

int numberOfStreams = 4;
List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
string fileName = @"C:\MyCoolFile.txt";
//List populated here
Parallel.For(0, numberOfStreams, (index, state) =>
{
    try
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
        using(Stream responseStream = webRequest.GetResponse().GetResponseStream())
        {
            using (FileStream fileStream = File.Open(fileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write))
            {
                fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);
                byte[] buffer = new byte[64 * 1024];
                int bytesRead;
                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    if (state.IsStopped)
                    {
                        return;
                    }
                    fileStream.Write(buffer, 0, bytesRead);
                }
            }
        }
    }
    catch (Exception e)
    {
        exception = e;
        state.Stop();
    }
});

And here is the code writing to multiple files:

int numberOfStreams = 4;
List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
string fileName = @"C:\MyCoolFile.txt";
//List populated here
Parallel.For(0, numberOfStreams, (index, state) =>
{
    try
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
        using(Stream responseStream = webRequest.GetResponse().GetResponseStream())
        {
            using (FileStream fileStream = File.Open(fileName + "." + index + ".tmp", FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write))
            {
                fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);
                byte[] buffer = new byte[64 * 1024];
                int bytesRead;
                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    if (state.IsStopped)
                    {
                        return;
                    }
                    fileStream.Write(buffer, 0, bytesRead);
                }
            }
        }
    }
    catch (Exception e)
    {
        exception = e;
        state.Stop();
    }
});
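
(The URL and the range request itself are omitted from both snippets; presumably each request asks only for its own byte range, along these lines:)

// Assumed, not shown in the snippets above: request only this thread's byte range.
webRequest.AddRange(ranges[index].Item1, ranges[index].Item2);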

My question is this: are there additional checks or actions that C#/Windows performs when writing to a single file from multiple threads that would make the file I/O slower than writing to multiple files? All disk operations should be bound by disk speed, right? Can anyone explain this behavior?

Thanks in advance!

UPDATE: Here is the error the source server is throwing:

"Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." [System.IO.IOException]: "Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." InnerException: "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" Message: "Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." StackTrace: " at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)\r\n at System.Net.Security._SslStream.StartWriting(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)\r\n at System.Net.Security._SslStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)\r\n at System.Net.Security.SslStream.Write(Byte[] buffer, Int32 offset, Int32 count)\r\n

shortspider
  • 1,045
  • 15
  • 34
  • The only thing I can see that might cause writing to a single file to become (or appear) sluggish is that you are not flushing the file after each call to `fileStream.Write(buffer, 0, bytesRead);` – MethodMan Jul 31 '15 at 17:41
  • Don't open the same file in each thread. Open it once and use that single instance (make sure more than one thread doesn't write at the same time - you can use *lock* for it; see the sketch after these comments). – EZI Jul 31 '15 at 17:51
  • This should work. Post the exception ToString. How fast is the network and how big your timeout? (Note, that Parallel.For is unsuitable because it uses an uncontrollable degree of parallelism. You can only specify a maximum.) – usr Jul 31 '15 at 17:53
  • @usr the server I am getting the file from is what throws the exception. It states 'SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'. I'm taking this to mean that the receiver is not reading bytes from the stream fast enough. – shortspider Jul 31 '15 at 18:05
  • @shortspider It generally means you try to connect to a non-existing machine. (If it existed, you would get either a successful connection or a kind of *connection rejected* message in a very short time.) – EZI Jul 31 '15 at 18:09
  • Post the exception ToString which includes the stack. Also test my answer by writing to a presized existing file. – usr Jul 31 '15 at 18:10
  • @EZI the source machine streams the file into the network stream, it goes for a bit and then throws the exception. If I run 4 threads, 3 will throw the exception and one will continue and complete its part of the file. – shortspider Jul 31 '15 at 18:39
  • @shortspider BTW: check the `System.Net.ServicePointManager.DefaultConnectionLimit` http://blogs.msdn.com/b/jpsanders/archive/2009/05/20/understanding-maxservicepointidletime-and-defaultconnectionlimit.aspx – EZI Jul 31 '15 at 18:49
  • @EZI it's set to 64. I'm doing this between two test servers so I don't think that's the issue. – shortspider Jul 31 '15 at 19:03
  • @shortspider 64 is good. (For ex, my machine's default is 2. That would mean my other threads would only wait for a free connection) – EZI Jul 31 '15 at 19:09
  • @usr error has been added. – shortspider Jul 31 '15 at 19:28
  • @shortspider OK, all existing answers do not apply anymore. This is a server-side problem. The *write* to the server timed out. Any idea what could cause the server to not accept a network write? – usr Jul 31 '15 at 19:35
  • @usr that exception was posted to you 1 hr ago. :) See the comments... – EZI Jul 31 '15 at 19:41
  • @EZI but now the stack shows that it is a write. That's why I always collect the full ToString output. – usr Jul 31 '15 at 19:42
  • @usr I didn't need it since I knew it was a generic problem. But since you were insistent on it I am sure you'll post the correct answer with this info... Come on, you were so sure about your answer that you didn't even read that comment carefully. – EZI Jul 31 '15 at 19:43
  • @EZI OK, I actually forgot. – usr Jul 31 '15 at 19:47
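
A minimal sketch of the single-shared-stream suggestion from the comments above (it assumes a known fileSize, that exception is declared in the enclosing scope, and that each request sets a Range header; none of this is taken verbatim from the question's code):

// Sketch only: one FileStream shared by every download thread; Seek + Write are serialized with a lock.
object writeLock = new object();
using (FileStream sharedStream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
{
    sharedStream.SetLength(fileSize);                       // assumed total size of the download

    Parallel.For(0, numberOfStreams, index =>
    {
        try
        {
            HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
            webRequest.AddRange(ranges[index].Item1, ranges[index].Item2);
            using (Stream responseStream = webRequest.GetResponse().GetResponseStream())
            {
                byte[] buffer = new byte[64 * 1024];
                long position = ranges[index].Item1;
                int bytesRead;
                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    lock (writeLock)                        // only one thread touches the file at a time
                    {
                        sharedStream.Seek(position, SeekOrigin.Begin);
                        sharedStream.Write(buffer, 0, bytesRead);
                    }
                    position += bytesRead;
                }
            }
        }
        catch (Exception e)
        {
            exception = e;                                  // same error handling as the question's code
        }
    });
}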

5 Answers

4

Unless you're writing to a striped RAID, you're unlikely to experience performance benefits by writing to the file from multiple threads concurrently. In fact, it's more likely to be the opposite – the concurrent writes would get interleaved and cause random access, incurring disk seek latencies that makes them orders of magnitude slower than large sequential writes.

To get a sense of perspective, look at some latency comparisons. A sequential 1 MB read from disk takes 20 ms; writes take approximately the same time. Each disk seek, on the other hand, takes around 10 ms. If your writes are interleaved at 4 KB chunks, then your 1 MB write will require an additional 2560 ms of seek time, making it over 100 times slower than a sequential write.

I would suggest only allowing one thread to write to the file at any time, and using parallelism just for the network transfer. You can use a producer–consumer pattern in which downloaded chunks are written to a bounded concurrent collection (such as BlockingCollection<T>) and then picked up and written to disk by a dedicated writer thread.
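
A rough sketch of that shape, assuming hypothetical names (fileName, a chunk tuple of offset and bytes, a bound of 16); this is not code from the question:

// Sketch: download threads produce (offset, bytes) chunks; a single writer thread drains them to disk.
// Needs System.Collections.Concurrent and System.Threading.Tasks.
BlockingCollection<Tuple<long, byte[]>> chunks = new BlockingCollection<Tuple<long, byte[]>>(boundedCapacity: 16);

Task writerTask = Task.Run(() =>
{
    using (FileStream fs = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        foreach (Tuple<long, byte[]> chunk in chunks.GetConsumingEnumerable())
        {
            fs.Seek(chunk.Item1, SeekOrigin.Begin);       // only this thread ever seeks or writes
            fs.Write(chunk.Item2, 0, chunk.Item2.Length);
        }
    }
});

// Inside each download loop, instead of writing to the file directly:
//     byte[] copy = new byte[bytesRead];
//     Array.Copy(buffer, copy, bytesRead);
//     chunks.Add(Tuple.Create(position, copy));          // blocks when the bound (16) is reached
//
// When every download has finished:
chunks.CompleteAdding();
writerTask.Wait();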

Douglas
  • 53,759
  • 13
  • 140
  • 188
  • But why does this block? – usr Jul 31 '15 at 17:57
  • @usr: If the granularity of the write-interleaving is fine enough, the slowdown could be orders of magnitude, making the writing appear like it's blocking when it's actually extremely slow. – Douglas Jul 31 '15 at 18:06
  • @Douglas so I modified my code to create the file as all zeros first. When I then run all threads I lock around the file write. I am still getting the same error however. – shortspider Jul 31 '15 at 20:33
  • This answer doesn't consider write caching (at the file system or block device layer), or native command queueing at the disk. – Jonathon Reinhart Aug 14 '15 at 10:08
  • @JonathonReinhart: Valid observation. My calculation presented a worst-case analysis that ignores those mitigating optimizations. However, my point that random writes are orders of magnitude slower over large streams of data still stands. Disk buffers are typically as small as 32MB, so they get filled up quickly, still leading to disk seeks when processing the threads' buffered streams of data. – Douglas Aug 15 '15 at 04:48
2
    fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);

That Seek() call is a problem: you seek to a part of the file that is very far beyond the current end-of-file, so your next fileStream.Write() call forces the file system to extend the file on disk, filling the unwritten parts of it with zeros.

This can take a while; your thread is blocked until the file system is done extending the file, which might well be long enough to trigger a timeout. You'd see this go wrong early, right at the start of the transfer.

A workaround is to create and fill the entire file before you start writing the real data. This is a very common strategy used by downloaders; you may have seen .part files before. Another nice benefit is a decent guarantee that the transfer cannot fail because the disk ran out of space. Beware that filling a file with zeros is only cheap when the machine has enough RAM; 1 GB should not be a problem on modern machines.
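
A minimal sketch of that pre-fill step, with an assumed total size (not taken from the question):

// Sketch: physically write zeros over the whole file length up front, so the later
// ranged writes never force the file system to extend the file.
long totalSize = 1024L * 1024 * 1024;                      // assumed size of the download
using (FileStream fs = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
{
    byte[] zeros = new byte[1024 * 1024];
    long remaining = totalSize;
    while (remaining > 0)
    {
        int count = (int)Math.Min(zeros.Length, remaining);
        fs.Write(zeros, 0, count);                         // SetLength/Seek alone is not enough, see the comments below
        remaining -= count;
    }
}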

Repro code:

using System;
using System.IO;
using System.Diagnostics;

class Program {
    static void Main(string[] args) {
        string path = @"c:\temp\test.bin";
        var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Write);
        fs.Seek(1024L * 1024 * 1024, SeekOrigin.Begin);
        var buf = new byte[4096];
        var sw = Stopwatch.StartNew();
        fs.Write(buf, 0, buf.Length);
        sw.Stop();
        Console.WriteLine("Writing 4096 bytes took {0} milliseconds", sw.ElapsedMilliseconds);
        Console.ReadKey();
        fs.Close();
        File.Delete(path);
    }
}

Output:

Writing 4096 bytes took 1491 milliseconds

That was on a fast SSD; a spindle drive is going to take much longer.

Hans Passant
  • 922,412
  • 146
  • 1,693
  • 2,536
  • Tried this, didn't work unfortunately. I created the file outside the Parallel.For and seeked until the end. Still had the same error. – shortspider Jul 31 '15 at 19:04
  • Don't just seek, that just recreates the original problem. You have to actually call FileStream.Write(). What you write doesn't matter. – Hans Passant Jul 31 '15 at 19:08
  • No, that doesn't work either, same problem. The NTFS file system handles "sparse" files too well, the size of a file doesn't have to match the actual number of bytes on disk. – Hans Passant Jul 31 '15 at 19:14
  • @HansPassant sorry I called seek and then WriteByte, is that OK? – shortspider Jul 31 '15 at 19:30
1

Here's my guess from the information given so far:

On Windows, when you write to a position beyond the current end of the file, Windows needs to zero-initialize everything that comes before it. This prevents old disk data from leaking, which would be a security problem.

Probably, all but your first thread need to zero-init so much data that the download times out. This is not really streaming anymore because the first write takes ages.

If you have the LPIM privilege you can avoid the zero initialization; otherwise you cannot, for security reasons. Free Download Manager shows a message that it starts zero-initing at the start of each download.
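
For what it's worth, a hypothetical sketch of skipping the zero fill with the Win32 SetFileValidData call; it only succeeds when the process token already carries the required privilege, and it is not something the question's code does:

// Sketch: pre-size the file and mark the whole range as valid so Windows skips zero-filling it.
// Needs System.Runtime.InteropServices and Microsoft.Win32.SafeHandles; fails without the privilege.
[DllImport("kernel32.dll", SetLastError = true)]
static extern bool SetFileValidData(SafeFileHandle hFile, long validDataLength);

static void Preallocate(string path, long length)
{
    using (FileStream fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        fs.SetLength(length);                                   // sets the logical size
        if (!SetFileValidData(fs.SafeFileHandle, length))       // extends the valid data length, no zero-init
            throw new System.ComponentModel.Win32Exception();   // typically: privilege not held
    }
}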

usr
  • 168,620
  • 35
  • 240
  • 369
  • Tried this and it didn't work. I followed the suggestion by @Hans Passant and though a zeroed file was created the multi-thread write still failed. – shortspider Jul 31 '15 at 19:06
1

So after trying all the suggestions, I ended up using a MemoryMappedFile and opening a stream to write to the MemoryMappedFile on each thread:

int numberOfStreams = 4;
List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
string fileName = @"C:\MyCoolFile.txt";
//Ranges list populated here
using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.OpenOrCreate, null, fileSize.Value, MemoryMappedFileAccess.ReadWrite))
{
    Parallel.For(0, numberOfStreams, index =>
    {
        try
        {
            HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
            using(Stream responseStream = webRequest.GetResponse().GetResponseStream())
            {
                using (MemoryMappedViewStream fileStream = mmf.CreateViewStream(ranges[index].Item1, ranges[index].Item2 - ranges[index].Item1 + 1, MemoryMappedFileAccess.Write))
                {
                    responseStream.CopyTo(fileStream);
                }
            }
        }
        catch (Exception e)
        {
            exception = e;
        }
    });
}
shortspider
  • 1,045
  • 15
  • 34
0

System.Net.Sockets.NetworkStream.Write

The stack trace shows that the error happens when writing to the server. It is a timeout. This can happen because of:

  1. network failure/overloading
  2. an unresponsive server.

This is not an issue with writing to a file. Analyze the network and the server. Maybe the server is not ready for concurrent usage.

Prove this theory by disabling writing to the file. The error should remain.
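
For example, a trivial variant of the question's loop that drains the response into a null sink instead of the file:

// Sketch: discard the downloaded bytes; if the timeout still occurs, disk I/O is not the cause.
using (Stream responseStream = webRequest.GetResponse().GetResponseStream())
{
    responseStream.CopyTo(Stream.Null);
}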

usr
  • 168,620
  • 35
  • 240
  • 369
  • 3. A non-existing server, like `telnet 1.2.3.4`. With an existing server you most probably wouldn't get a time-out exception; an instant *connection rejected* is the more expected response. – EZI Jul 31 '15 at 20:04
  • @EZI the stack shows (hah!) that this is a write after the connect has already happened (`SslStream.Write`). The connection is there. – usr Jul 31 '15 at 20:07
  • usr, :) You may be right this time. I am not so sure about it. – EZI Jul 31 '15 at 20:19
  • @usr so I have RDPd into both servers and attached Visual Studio to both processes. When I start the download I can see two requests go from the destination server to the source. I can see the source respond. After a few seconds one of the connections (I am using two threads) on the source server will throw that exception. The other continues just fine. On the destination server I am left with half the file + a bit extra. I do not think it is a network issue. – shortspider Jul 31 '15 at 20:30
  • @shortspider OK create one more test case as in my first comment. Just lock the filestream while seeking+writing. – EZI Jul 31 '15 at 20:39