
The problem is the following.

Steps:

  1. An application converts a custom object to an Avro fragment (byte array);
  2. This Avro fragment is sent to an event hub in an EventData object;
  3. The event hub triggers an Azure Function that receives a Microsoft.ServiceBus.Messaging.EventData from the event hub;
  4. I can extract the body of the EventData, and it contains the Avro fragment (byte array) of point 1.
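For reference, the "fragment" in step 1 is just the raw datum bytes produced by a binary encoder, with no container header and no embedded schema. A minimal sketch of that producer side using the Apache Avro library (the schema and field names here are illustrative, not the real ones):

```csharp
using System.IO;
using Avro;
using Avro.Generic;
using Avro.IO;

// Hypothetical schema for illustration only.
var schema = (RecordSchema)Schema.Parse(
    @"{""type"":""record"",""name"":""Trip"",
       ""fields"":[{""name"":""Id"",""type"":""string""}]}");

var record = new GenericRecord(schema);
record.Add("Id", "trip-42");

byte[] fragment;
using (var ms = new MemoryStream())
{
    var writer = new GenericDatumWriter<GenericRecord>(schema);
    // BinaryEncoder emits only the datum bytes: no header, no schema.
    writer.Write(record, new BinaryEncoder(ms));
    fragment = ms.ToArray();
}
```

Because nothing in those bytes identifies the schema, any reader has to be given the writer's schema out of band.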

I'm using Microsoft.Hadoop.Avro.

I have the schema of the original custom object (point 1), so I tried to create a generic reader that reads from the Avro fragment, but I receive the following error:

Invalid Avro object container in a stream. The header cannot be recognized.

It seems that Microsoft.Hadoop.Avro can only handle complete Avro files (header + schema + body) and not Avro fragments (body only).

With the Java avro-tools I can add a schema to an Avro fragment. Is this also possible in .NET or .NET Core? How can I do it?

For simplicity, in the following code I replaced the EventData that comes from the event hub with the related Avro file.

using (Stream stream = new FileStream(@"...\trip-real-0-2019-03-14-12-14.avro", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    // create a generic reader for the event hub Avro message
    using (var reader = AvroContainer.CreateGenericReader(stream))
    {
        while (reader.MoveNext())
        {
            foreach (dynamic record in reader.Current.Objects)
            {
                // get the body of the event hub message (Avro fragment bytes)
                var avroFragmentByteArray = (byte[])(record.Body);

                // try to create a generic reader with the schema;
                // this line throws the exception
                using (var r = AvroContainer.CreateGenericReader(schema, new MemoryStream(avroFragmentByteArray), true, new CodecFactory()))
                {

                }
            }
        }
    }
}
trincot
C. Fabiani

1 Answer


I found how to do it. There are two ways:

  1. use avro-tools.jar from C#;
  2. use the Apache Avro library (recommended).

1° Solution

First get the bytes from the EventData message and save them locally.

public List<string> SaveAvroBytesOnFile(EventData eventHubMessage, string functionAppDirectory)
{
    string fileName = "avro-bytes.avro";
    List<string> filesToProcess = new List<string>();
    string singleFileNameToSave = fileName;
    filesToProcess.Add(singleFileNameToSave);
    string path = Path.Combine(functionAppDirectory, "AvroBytesFiles");
    System.IO.Directory.CreateDirectory(path);
    // Path.Combine adds the directory separator that string
    // concatenation would miss
    File.WriteAllBytes(Path.Combine(path, singleFileNameToSave), eventHubMessage.GetBytes());
    return filesToProcess;
}

Then call avro-tools.jar from the Azure Function and redirect the output into a variable:

Process myProcess = new Process();
myProcess.StartInfo.UseShellExecute = false;
myProcess.StartInfo.FileName = @"D:\Program Files\Java\jdk1.8.0_73\bin\java.exe";
// execute avro-tools
string avroResourcesPath = Path.Combine(functionAppDirectory, "AvroResources");
// here you must use the byte file saved before and the .avsc schema file
myProcess.StartInfo.Arguments = $"-jar {Path.Combine(avroResourcesPath, "avro-tools-1.8.2.jar")} fragtojson --schema-file {Path.Combine(avroResourcesPath, "schemafile.avsc")} {Path.Combine(functionAppDirectory, "AvroBytesFiles", byteFileNames[i])}";
myProcess.StartInfo.RedirectStandardOutput = true;
myProcess.Start();
// read the output into a string
string output = myProcess.StandardOutput.ReadToEnd();
myProcess.WaitForExit();

Avro-tools may deserialize the bytes into a different shape from what you need, so you have to map the avro-tools model onto your own model. This step can consume many resources as the model's complexity grows.

AvroToolModel avroToolModel = JsonConvert.DeserializeObject<AvroToolModel>(output);
// map the avro-tools model onto my model
MyModel myModel = new MyModel(avroToolModel);
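A minimal sketch of such a mapping (both classes here are hypothetical; the AvroToolModel property names are assumptions and must match whatever fragtojson actually emits for your schema):

```csharp
// Hypothetical: AvroToolModel mirrors the JSON printed by avro-tools
// fragtojson; MyModel is the application's own type.
public class AvroToolModel
{
    public string Id { get; set; }
    public long Timestamp { get; set; }
}

public class MyModel
{
    public string Id { get; set; }
    public long Timestamp { get; set; }

    public MyModel(AvroToolModel source)
    {
        // Copy field by field; nested records and unions need their own
        // mapping logic, which is where the cost grows with complexity.
        Id = source.Id;
        Timestamp = source.Timestamp;
    }
}
```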

2° Solution

This is the recommended solution. The deserialization can be performed in a few lines.

string schema = @"...";
MyModel output;
using (MemoryStream memStream = new MemoryStream(eventHubMessage.GetBytes()))
{
    memStream.Seek(0, SeekOrigin.Begin);
    Schema writerSchema = Schema.Parse(schema);
    var r = new Avro.Specific.SpecificDatumReader<MyModel>(writerSchema, writerSchema);
    output = r.Read(null, new Avro.IO.BinaryDecoder(memStream));
}

The model class must implement the ISpecificRecord interface as follows:

[DataContract]
public class MyModel: ISpecificRecord
{
    [DataMember]
    public string Id;
    [DataMember]
    public enumP Type;
    [DataMember]
    public long Timestamp;
    public Dictionary<string, string> Context;

    public static Schema _SCHEMA = Avro.Schema.Parse(@"...");

    public virtual Schema Schema
    {
        get
        {
            return MyModel._SCHEMA;
        }
    }

    public object Get(int fieldPos)
    {
        switch (fieldPos)
        {
            case 0: return this.Id;
            case 1: return this.Timestamp;
            case 2: return this.Type;                
            case 3: return this.Context;
            default: throw new AvroRuntimeException("Bad index " + fieldPos + " in Get()");
        }
    }

    public void Put(int fieldPos, object fieldValue)
    {
        switch (fieldPos)
        {
            case 0: this.Id = (System.String)fieldValue; break;
            case 1: this.Timestamp = (System.Int64)fieldValue; break;
            case 2: this.Type = (enumP)fieldValue; break;                
            case 3: this.Context = (Dictionary<string,string>)fieldValue; break;
            default: throw new AvroRuntimeException("Bad index " + fieldPos + " in Put()");
        }
    }
}

[DataContract]
public enum enumP
{
    ONE, TWO, THREE
}

The names of the properties in the MyModel class must match the field names in the schema used.
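If you'd rather not hand-write an ISpecificRecord implementation, the same fragment can also be read into a GenericRecord. A sketch under the same assumptions (schemaJson holds the writer's schema; the bytes come from the EventData body):

```csharp
using System.IO;
using Avro;
using Avro.Generic;
using Avro.IO;

// GenericDatumReader needs no model class: it returns a GenericRecord
// whose fields are accessed by name.
Schema writerSchema = Schema.Parse(schemaJson);
using (var memStream = new MemoryStream(eventHubMessage.GetBytes()))
{
    var reader = new GenericDatumReader<GenericRecord>(writerSchema, writerSchema);
    GenericRecord record = reader.Read(null, new BinaryDecoder(memStream));
    var id = record["Id"];   // field name taken from the schema
}
```

This trades compile-time typing for less boilerplate; the specific-record version above remains preferable when the model is stable.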
