
I have an Avro.snz file whose avro.codec is snappy. It can be opened with com.databricks.avro in Spark, but it seems snappy is unsupported by Apache.Avro and Confluent.Avro; they only have deflate and null. Although they can get me the schema, I cannot get at the data.

The next method gets an error. IronSnappy is unable to decompress the file too; it says the input is corrupt.

using (Avro.File.IFileReader<generic> reader = Avro.File.DataFileReader<generic>.OpenReader(avro_path))
{
    schema = reader.GetSchema();
    Console.WriteLine(reader.HasNext()); //true
    var hi = reader.Next(); // error
    Console.WriteLine(hi.ElementAt(0).ToString()); // error
}

I'm starting to wonder if there is anything in the Azure HDInsight library, but I can't seem to find a NuGet package that gives me a way to read Avro with support for Snappy compression.

I'm open to any solution, even if that means downloading the source for Apache.Avro and adding Snappy support manually, but to be honest, I'm sort of a newbie and have no idea how compression even works, let alone how to add support for it to a library.

Can anyone help?

Update: Just adding a snappy codec to Apache.Avro and changing the DeflateStream to an IronSnappy stream failed. It gave "Corrupt input" again. Is there anything anywhere that can open Snappy-compressed Avro files with C#?

Or how do I determine which part of the Avro file is snappy compressed, so I can pass that to IronSnappy?

user4157124
Ranald Fong

2 Answers

2

OK, so not even any comments on this. But I eventually solved my problem. Here is how.

  1. I tried Apache.Avro and the Confluent version as well, but their .NET versions have no snappy support, darn. I can still get the schema, though, as that is apparently uncompressed.
  2. Since Parquet.Net uses IronSnappy, I built a snappy codec into Apache.Avro by basically cloning its deflate code and changing a few names. Failed. "Corrupt input", IronSnappy says.
  3. I researched Avro and saw that a file consists of an uncompressed schema, followed by the name of the compression codec used for the data, then the data itself, which is divided into blocks. I had no idea where a block starts and ends. The binary in the file gives that info somehow, but I couldn't work it out even with a hex editor. I think Apache.Avro reads a long or a varint, and the hex editor I used doesn't decode that.
  4. I found the avro-tools.jar tool inside Apache.Avro. To make it easier to use, I made it an executable with launch4j (a totally superfluous move, but whatever). Then I used it to cat my Avro file down to 1 row, once uncompressed and once with snappy. I used those as my base and followed the flow of Apache.Avro in the C# debugger, while tracking byte indexes with the hex editor.
  5. With 1 row, it is guaranteed to be 1 block. So I ran a loop over the start and end byte indexes, found my Snappy block, and was able to decompress it with IronSnappy. I modified the codec portion of my Apache.Avro snappy codec to make it work with 1 block (which was basically whatever block Apache.Avro took, minus 4 bytes, which I assume is the Snappy CRC check, which I ignored).
  6. It failed with multiple blocks. I found it's because Apache.Avro always hands the deflate codec a 4096-byte array after the first block. I reduced it to the size actually read and did the minus-4 thing again. It worked.
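For reference, the block boundaries puzzled over in step 3 are defined by the Avro specification: after the file header, each block starts with two longs (the number of objects in the block, then the size in bytes of the serialized objects after the codec is applied), followed by the data and a 16-byte sync marker. Avro longs are zig-zag encoded varints, which is why a plain hex editor doesn't show them as readable numbers. The following is a minimal sketch of decoding them, not the Apache.Avro implementation; the class and method names are made up for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Apache.Avro.
public static class AvroBlockHeader
{
    // Reads one Avro "long": a variable-length integer (7 payload bits per
    // byte, least significant group first) that is zig-zag encoded.
    public static long ReadLong(Stream stream)
    {
        ulong raw = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException("Stream ended inside a varint.");
            raw |= ((ulong)(b & 0x7F)) << shift;
            if ((b & 0x80) == 0) break;   // high bit clear: last byte of this number
            shift += 7;
            if (shift > 63) throw new InvalidDataException("Varint is longer than 10 bytes.");
        }
        // Undo zig-zag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
        return (long)(raw >> 1) ^ -(long)(raw & 1);
    }

    // Reads the two longs that open every data block.
    public static (long ObjectCount, long ByteSize) ReadBlockHeader(Stream stream)
    {
        long count = ReadLong(stream);
        long size = ReadLong(stream);
        return (count, size);
    }
}
```

For example, the bytes 02 AC 02 decode to an object count of 1 and a block size of 150 bytes. With the snappy codec, those 150 bytes would include the 4 trailing checksum bytes mentioned in step 5.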

Success! So basically it was: copy deflate over as a template for snappy, reduce the block size by 4 bytes, and make sure to resize the byte array to the block size before getting IronSnappy to decompress it.

public override byte[] Decompress(byte[] compressedData)
{
    // The last 4 bytes of each snappy block are a checksum, not Snappy data,
    // so strip them before decoding.
    int snappySize = compressedData.Length - 4;
    byte[] compressedSnappy_Data = new byte[snappySize];
    System.Array.Copy(compressedData, compressedSnappy_Data, snappySize);

    byte[] result = IronSnappy.Snappy.Decode(compressedSnappy_Data);
    return result;
}

// In the reader, before the block is handed to the codec: trim the buffer
// (4096 bytes after the first block) down to the actual block size.
if (_codec.GetHashCode() == DataFileConstants.SnappyCodecHash)
{
    byte[] snappyBlock = new byte[(int)_currentBlock.BlockSize];
    System.Array.Copy(_currentBlock.Data, snappyBlock, (int)_currentBlock.BlockSize);
    _currentBlock.Data = snappyBlock;
}

I didn't bother with actually using the checksum, as I don't know how, and I'm not sure I need to. At least not right now. And I totally ignored the compress function.
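On the skipped checksum: per the Avro specification, each snappy-compressed block is followed by the 4-byte, big-endian CRC32 of the uncompressed data in that block. The following is a minimal sketch of how it could be verified after decoding. The class and method names are hypothetical, and the CRC-32 is hand-rolled here only so the sketch has no dependencies.

```csharp
using System;

// Hypothetical helper, not part of Apache.Avro or IronSnappy.
public static class SnappyBlockChecksum
{
    // Standard CRC-32 (reflected polynomial 0xEDB88320), computed bit by bit.
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    // Interprets the last 4 bytes of a block as a big-endian unsigned integer.
    public static uint ReadTrailingCrc(byte[] block)
    {
        if (block.Length < 4) throw new ArgumentException("Block is shorter than 4 bytes.");
        int n = block.Length;
        return ((uint)block[n - 4] << 24) | ((uint)block[n - 3] << 16)
             | ((uint)block[n - 2] << 8) | (uint)block[n - 1];
    }

    // True when the checksum stored at the end of the raw block matches
    // the CRC-32 of the data obtained by decompressing it.
    public static bool Matches(byte[] block, byte[] uncompressed)
    {
        return ReadTrailingCrc(block) == Crc32(uncompressed);
    }
}
```

Inside Decompress, this would amount to calling SnappyBlockChecksum.Matches(compressedData, result) before returning and throwing if it comes back false.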

But if you really want my compress function, here it is:

        public override byte[] Compress(byte[] uncompressedData)
        {
            return new byte[0];
        }
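A working version of the stub above would have to mirror what Decompress undoes: Snappy-compress the data, then append the 4-byte, big-endian CRC32 of the uncompressed data. The following is only a sketch under assumptions. The names are hypothetical, and the encoder is passed in as a delegate so the code compiles without any package; whether IronSnappy exposes a Snappy.Encode counterpart to Snappy.Decode is assumed here, not confirmed by this thread.

```csharp
using System;

// Hypothetical helper, not part of Apache.Avro or IronSnappy.
public static class SnappyBlockWriter
{
    // Standard CRC-32 (reflected polynomial 0xEDB88320), computed bit by bit.
    private static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    // Builds one snappy block body: compressed bytes followed by the
    // big-endian CRC-32 of the *uncompressed* input.
    public static byte[] Frame(byte[] uncompressed, Func<byte[], byte[]> snappyEncode)
    {
        byte[] compressed = snappyEncode(uncompressed);
        uint crc = Crc32(uncompressed);

        byte[] block = new byte[compressed.Length + 4];
        Array.Copy(compressed, block, compressed.Length);
        block[compressed.Length]     = (byte)(crc >> 24);
        block[compressed.Length + 1] = (byte)(crc >> 16);
        block[compressed.Length + 2] = (byte)(crc >> 8);
        block[compressed.Length + 3] = (byte)crc;
        return block;
    }
}
```

Compress could then return SnappyBlockWriter.Frame(uncompressedData, d => IronSnappy.Snappy.Encode(d)), assuming that method exists with that shape.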
Ranald Fong
1

The simplest solution would be to use:

// avroObject is a byte[] holding the contents of the Avro file
ResultModel resultObject = AvroConvert.Deserialize<ResultModel>(avroObject);

From https://github.com/AdrianStrugala/AvroConvert

  • null
  • deflate
  • snappy
  • gzip

codecs are supported.

Adrian
  • Even though I answered my own question, it is not really complete and only allows reading of Snappy Codec. If I ever need to compress or try to build something that reads Avro in the future I'll try this Package. And because this probably works for both compress and decompress I'll mark this as the answer. Thanks! – Ranald Fong Jul 31 '20 at 05:24