
I would like to chain multiple stream operations (for example, downloading a file, uncompressing it on the fly, and processing the data) without any temp files. The files are in 7z format. The LZMA SDK is available, but it forces me to supply a separate output stream instead of being a readable stream itself. In other words, the output stream has to be fully written before I can work with it. SevenZipSharp also seems to be missing this functionality.

Has anyone done something like that?

// in pseudo-code - CompressedFileStream derives from Stream
foreach (CompressedFileStream f in SevenZip.UncompressFiles(Web.GetStreamFromWeb(url)))
{
    Console.WriteLine("Processing file {0}", f.Name);
    ProcessStream( f ); // further streaming, like decoding, processing, etc
}

Each file stream would behave like a read-once stream representing one file, and calling MoveNext() on the main compressed stream would automatically invalidate & skip that file.
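The read-once behavior described above can be sketched independently of any 7z library. The class below is a hypothetical helper (the name `ReadOnceSubStream` and its `SkipRemaining` method are assumptions, not part of the LZMA SDK or SevenZipSharp). It works on raw bytes of a known length rather than on compressed data, so it only illustrates the "forward-only view that can be skipped" idea, not decompression itself.

```csharp
using System;
using System.IO;

// Hypothetical helper: exposes the next `length` bytes of `source`
// as a forward-only, read-once stream.
public sealed class ReadOnceSubStream : Stream
{
    private readonly Stream _source;
    private long _remaining;

    public ReadOnceSubStream(Stream source, long length)
    {
        _source = source;
        _remaining = length;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_remaining <= 0) return 0; // end of this entry
        int n = _source.Read(buffer, offset, (int)Math.Min(count, _remaining));
        _remaining -= n;
        return n;
    }

    // Drains whatever the consumer did not read, so that the source
    // ends up positioned at the start of the next entry.
    public void SkipRemaining()
    {
        var scratch = new byte[4096];
        while (Read(scratch, 0, scratch.Length) > 0) { }
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

An enumerator built on this would call `SkipRemaining()` on the current entry inside `MoveNext()` before handing out the next one. Disposing the sub-stream leaves the source open, since `Dispose` is not overridden to close it.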

A similar construct could be built for compression. Example usage: aggregating over a very large quantity of data, that is, for each 7z file in a directory, for each file inside it, for each data line in each file, sum up some value.
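The aggregation could look roughly like the sketch below, assuming some enumerator already yields each inner file as a forward-only `Stream`. The `Aggregator` class, the `SumLines` name, and the "one integer per line" data format are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

public static class Aggregator
{
    // Sums one integer per line across every stream in `files`.
    // Lines that do not parse as an integer are ignored.
    public static long SumLines(IEnumerable<Stream> files)
    {
        long total = 0;
        foreach (Stream file in files)
        {
            // leaveOpen: true, so disposing the reader does not close
            // a stream that the enumerator may still own.
            using (var reader = new StreamReader(file, Encoding.UTF8, true, 4096, leaveOpen: true))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    if (long.TryParse(line, NumberStyles.Integer,
                                      CultureInfo.InvariantCulture, out long value))
                        total += value;
                }
            }
        }
        return total;
    }
}
```

Only sequential reads are used, so this works with unseekable streams. Each stream is fully consumed before the enumerator advances, which matches the read-once contract described earlier.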

UPDATE 2012-01-06

#ziplib (SharpZipLib) already does exactly what I need for zip files with its ZipInputStream class. Here is an example that yields every file inside a given zip file as an unseekable stream. Still looking for a 7z solution.

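using System.Collections.Generic;
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

// Yields the same ZipInputStream once per file entry; read each entry before advancing.
IEnumerable<Stream> UnZipStream(Stream stream)
{
    using (var zipStream = new ZipInputStream(stream))
    {
        ZipEntry entry;
        while ((entry = zipStream.GetNextEntry()) != null)
            if (entry.IsFile)
                yield return zipStream;
    }
}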
Yuri Astrakhan

1 Answer


The underlying algorithm and the parameters specified at compression time determine the chunk size, and there is no way to ensure that decoded chunks end on word or line boundaries. So you will have to decompress a file completely before processing it.

What you are asking for is probably not possible without temp files. It really depends on whether you have enough memory to hold the decompressed file in a MemoryStream, perform all your processing, and then release the memory back to the pool. Doing this repeatedly could also fragment process memory, which complicates things further.

Vijay Varadan
  • I'm not sure I understand what you mean by word/line boundaries. The `CompressedFileStream` object is returned the moment SevenZip receives the file header from the stream, not after it has received the whole file. Reading the decompressed file's data causes the source stream to advance as well. – Yuri Astrakhan Jan 05 '12 at 23:56