
We're getting some very large .tar.gz files that we want to load into SQL Server (around 3 GB gzipped, 20 GB uncompressed).

I thought I would try to save both time and disk space by writing a C# layer to decompress and stream to SQL Server on the fly using named pipes.
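Roughly the shape of what I have in mind, ignoring the tar layer and the two-pipe quirk mentioned below (the pipe name, file paths, and BULK INSERT options here are just placeholders):

using System.IO;
using System.IO.Compression;
using System.IO.Pipes;

// Serve the decompressed bytes on \\.\pipe\bulkload for BULK INSERT to read
using (var pipeStream = new NamedPipeServerStream("bulkload", PipeDirection.Out, 1,
           PipeTransmissionMode.Byte, PipeOptions.Asynchronous))
using (var fileStream = File.OpenRead(@"C:\data\bigfile.tar.gz"))
using (var inputStream = new GZipStream(fileStream, CompressionMode.Decompress))
{
    pipeStream.WaitForConnection();   // blocks until SQL Server opens the pipe
    inputStream.CopyTo(pipeStream);   // stream straight from the decompressor into the pipe
}

// ...and on the SQL Server side, something like:
// BULK INSERT dbo.TargetTable FROM '\\.\pipe\bulkload'
// WITH (FORMATFILE = 'C:\data\bigfile.fmt', TABLOCK);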

I found this: Can Sql Server BULK INSERT read from a named pipe/fifo?

and both answers had some "we don't know why this is necessary, but it doesn't work otherwise" comments:

  1. the need to open 2 pipes and
  2. the need to do it all in one big write to SQL Server

It's not really practical to do one big 20 GB write, so I was tinkering with the code in the above solutions.

What I found was that when I do this:

byte[] buff = new byte[<pick your size here>];
int readlen;

while ((readlen = inputStream.Read(buff, 0, buff.Length)) > 0)
{
    pipeStream.Write(buff, 0, readlen);
    pipeStream.WaitForPipeDrain();
}

I get two writes in and then a "Pipe is broken" error on the third, pretty much no matter what the buffer size is.

Originally I tried

inputStream.CopyTo(pipeStream);

and by default that uses an 80 KB buffer. Looking at Position, I could see that I'd be on the third write when it blew up.

Any idea what might be happening on the SQL Server side to break off after 2 reads?

EDIT: I did some playing around and found that with larger buffer sizes it would get farther. Once I got above 1 MB buffers, I got almost 2 GB written to the pipe before it broke. Going above 5 MB buffers didn't improve things.
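For what it's worth, varying the buffer size is just a matter of changing the array size in the loop above, or using the CopyTo overload that takes an explicit buffer size:

inputStream.CopyTo(pipeStream, 1024 * 1024);   // e.g. a 1 MB buffer instead of the 80 KB default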

At first I thought the overhead of decompressing the gzip might have had the reads lagging behind the writes, so SQL Server might have closed the pipe on a timed-out read or something. I made the reads and writes async so I could get the next read going while the write was happening, but that didn't help.
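Roughly the shape of the overlapped version (double-buffered so the next read runs while the previous chunk is being written; simplified, and the buffer size is just an example):

using System.IO;
using System.IO.Pipes;
using System.Threading.Tasks;

static class PipePump
{
    public static async Task PumpAsync(Stream inputStream, PipeStream pipeStream)
    {
        byte[][] buffers = { new byte[1 << 20], new byte[1 << 20] };
        int current = 0;

        // Prime the first read
        int readLen = await inputStream.ReadAsync(buffers[current], 0, buffers[current].Length);

        while (readLen > 0)
        {
            // Start writing the chunk we just read...
            Task writeTask = pipeStream.WriteAsync(buffers[current], 0, readLen);

            // ...while the next read fills the other buffer
            int next = 1 - current;
            int nextLen = await inputStream.ReadAsync(buffers[next], 0, buffers[next].Length);

            await writeTask;

            current = next;
            readLen = nextLen;
        }
    }
}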

I added some timing, and it turned out the pipe writes were taking 2.5 times as long as the gzip reads, so I don't know why the pipe ultimately breaks.

Still, almost 2 GB is a lot farther than 2-3 writes of a few hundred KB.

  • Have you considered using the `SqlBulkCopy` class? – Charlieface May 31 '21 at 23:13
  • Thanks for the reply @Charlieface. In other code spots where the data is being built up in C# I've used SqlBulkCopy, but in this instance we're getting a very large csv with a format file. I figured that if I could get the named pipe to SQL Server working with minimal code intervention, I might not add a lot of overhead (I was hoping for 15-20% overhead to save the unpacking step) while still getting the native functionality in SQL Server. I doubt I could keep the overhead down if I also had to parse the stream and manage the intermediary DataTables for SqlBulkCopy in C#. – user1664043 Jun 01 '21 at 13:16
  • I've actually experimented with this myself but the short version is that this cannot be made to work reliably. The problem is that the engine assumes your named pipe has the characteristics of a file, meaning you cannot *ever* fall behind even a single byte when feeding the engine with data, or the pipe operation will break. And since you cannot control the buffering characteristics of the engine, this is pretty much impossible to guarantee -- it may issue a read faster than you can write, no matter how big your buffer is, unless you write everything in one go, which defeats the purpose. – Jeroen Mostert Jun 03 '21 at 17:36
  • Thanks for taking the time to respond, Jeroen. I was hoping someone would have some other trick I hadn't thought of, but I was coming to the same conclusion, alas. – user1664043 Jun 03 '21 at 21:15
  • Note that one thing I do see you mention that leads to unnecessary inefficiencies: "manage the intermediary datatables for SqlBulkCopy". `DataTable` is the most convenient but least efficient way to drive `SqlBulkCopy`; you can give it either an `IDataReader` (there are plenty of approaches out there to convert objects to `IDataReader`) or an `IEnumerable`, both of which allow streaming and avoid the massive memory use of `DataTable`. It won't be as fast as a server-side `BULK INSERT`, but it's typically pretty darn fast. – Jeroen Mostert Jun 04 '21 at 07:08
  • Just for clarification... I've seen IEnumerable when passing TVPs and I'm not seeing any SqlBulkCopy method that takes IEnumerable (just DataRow[]). Was there an implicit switch to TVPs there? – user1664043 Jun 04 '21 at 14:38
  • Yes, my bad -- an `IDataReader` *is* an `IDataRecord` (representing one row in an iteration), meaning you can trivially implement it by wrapping an enumeration. You can't directly pass such an enumeration to `WriteToServer`, that's only TVPs. The main advice is to not use `DataTable`. – Jeroen Mostert Jun 05 '21 at 14:15
  • I implemented a data reader, and just for testing I first tried DataTable.Load(datareader). But the csv I'd been loading had a bunch of columns I didn't want, so I had been using OPENROWSET() to pick and choose, dropping the others. It turns out DataTable.Load() requires .GetSchemaTable() to be implemented, but there really isn't documentation on what you should fill it with; it was trial and error until I found the required columns. And rather than DataTable doing any mapping between the columns it's got and the datareader, it just uses GetValues() on the reader to fetch all columns. – user1664043 Jun 10 '21 at 22:08
  • So I went ahead and tried plugging it into SqlBulkCopy instead, and bulk copy only tries to fetch the columns it's configured for - *but* it doesn't call the reader method that goes with the column type. For simplicity, since I was reading a text file, I just made all the columns typeof(String) and then implemented all the GetInt()-style methods as int.Parse(substring), since I didn't want to do String.Split for all the columns I wasn't going to use. But GetInt() was never called - only GetValue(). (A rough sketch of that reader/SqlBulkCopy wiring is below, after these comments.) – user1664043 Jun 10 '21 at 22:10
  • After running some timing tests, loading 150 million rows (20 GB of data) with OPENROWSET took about 10 minutes; using a gzip/tar reader/SqlBulkCopy implementation took a little under 18. So that's roughly 75-80% overhead to stream it that way. If you add in the time spent unpacking the zip before OPENROWSET, it's about 50% overhead, so the only real savings is the unpacked disk space. – user1664043 Jun 13 '21 at 15:14
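A stripped-down sketch of the reader/SqlBulkCopy wiring discussed in the comments above. The class name, column names, paths, and connection string are placeholders, and only the members bulk copy actually touched in this scenario (Read, GetValue, FieldCount and friends) do real work:

using System;
using System.Data;
using System.Data.SqlClient;   // or Microsoft.Data.SqlClient
using System.IO;

// Wiring it up; in the real code the TextReader sits on top of the gzip/tar stream
using (var connection = new SqlConnection("Server=...;Database=...;Integrated Security=true"))
using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.TargetTable", BulkCopyTimeout = 0 })
using (var reader = new CsvDataReader(new StreamReader(@"C:\data\bigfile.csv")))
{
    connection.Open();
    bulk.ColumnMappings.Add("Id", "Id");   // map only the columns we want to keep
    bulk.ColumnMappings.Add("Amount", "Amount");
    bulk.WriteToServer(reader);
}

// Minimal line-per-row reader over a CSV-ish text stream. Everything comes back as a
// string via GetValue(), matching the "only GetValue() gets called" behavior noted above.
sealed class CsvDataReader : IDataReader
{
    static readonly string[] Columns = { "Id", "Name", "Amount" };   // illustrative columns
    readonly TextReader _reader;
    string[] _current = Array.Empty<string>();

    public CsvDataReader(TextReader reader) { _reader = reader; }

    public int FieldCount => Columns.Length;

    public bool Read()
    {
        string line = _reader.ReadLine();
        if (line == null) return false;
        _current = line.Split(',');   // naive split; a real CSV needs quoting rules
        return true;
    }

    public object GetValue(int i) => _current[i];
    public int GetValues(object[] values)
    {
        int n = Math.Min(values.Length, FieldCount);
        for (int i = 0; i < n; i++) values[i] = GetValue(i);
        return n;
    }
    public int GetOrdinal(string name) => Array.IndexOf(Columns, name);
    public string GetName(int i) => Columns[i];
    public bool IsDBNull(int i) => string.IsNullOrEmpty(_current[i]);

    // Members bulk copy didn't need here
    public int Depth => 0;
    public bool IsClosed => false;
    public int RecordsAffected => -1;
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public Type GetFieldType(int i) => typeof(string);
    public string GetDataTypeName(int i) => "nvarchar";
    public string GetString(int i) => _current[i];
    public int GetInt32(int i) => int.Parse(_current[i]);
    public long GetInt64(int i) => long.Parse(_current[i]);
    public short GetInt16(int i) => short.Parse(_current[i]);
    public bool GetBoolean(int i) => bool.Parse(_current[i]);
    public byte GetByte(int i) => byte.Parse(_current[i]);
    public char GetChar(int i) => _current[i][0];
    public DateTime GetDateTime(int i) => DateTime.Parse(_current[i]);
    public decimal GetDecimal(int i) => decimal.Parse(_current[i]);
    public double GetDouble(int i) => double.Parse(_current[i]);
    public float GetFloat(int i) => float.Parse(_current[i]);
    public Guid GetGuid(int i) => Guid.Parse(_current[i]);
    public bool NextResult() => false;
    public void Close() { }
    public void Dispose() => _reader.Dispose();
    public DataTable GetSchemaTable() => throw new NotSupportedException();
    public IDataReader GetData(int i) => throw new NotSupportedException();
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) => throw new NotSupportedException();
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) => throw new NotSupportedException();
}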

0 Answers