
I'm attempting to write an Avro file from Python, for the most part following the official tutorial.

I have what appears to be a valid schema:

{"namespace": "example.avro",
 "type": "record",
 "name": "Stock",
 "fields": [
     {"name": "ticker_symbol", "type": "string"},
     {"name": "sector",  "type": "string"},
     {"name": "change", "type": "float"},
     {"name": "price",  "type": "float"}
 ]
}

Here is the relevant code:

from io import BytesIO

from avro import schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

avro_schema = schema.parse(open("stock.avsc", "rb").read())
output = BytesIO()
writer = DataFileWriter(output, DatumWriter(), avro_schema)

for i in range(1000):
    writer.append(_generate_fake_data())
writer.flush()

with open('record.avro', 'wb') as f:
    f.write(output.getvalue())
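
(_generate_fake_data just builds a plain dict matching the schema fields; an illustrative stand-in would be something like:)

import random

# Illustrative stand-in: any dict whose keys and types match the "Stock" schema works.
def _generate_fake_data():
    return {
        "ticker_symbol": random.choice(["AAPL", "GOOG", "MSFT"]),
        "sector": "TECHNOLOGY",
        "change": round(random.uniform(-5.0, 5.0), 2),
        "price": round(random.uniform(10.0, 500.0), 2),
    }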

However, when I try to read the resulting file with the avro-tools CLI:

avro-tools fragtojson --schema-file stock.avsc ./record.avro --no-pretty

I get the following error:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/Cellar/avro-tools/1.8.2/libexec/avro-tools-1.8.2.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -40
    at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
    at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
    at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
    at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
    at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:414)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:181)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
    at org.apache.avro.tool.BinaryFragmentToJsonTool.run(BinaryFragmentToJsonTool.java:82)
    at org.apache.avro.tool.Main.run(Main.java:87)
    at org.apache.avro.tool.Main.main(Main.java:76)

I'm pretty sure the relevant error is

 Malformed data. Length is negative: -40

But I can't tell what I'm doing wrong. My suspicion is that I'm writing the avro file incorrectly.

I want to write to a bytes array (instead of directly to a file like in the example) because ultimately I'm going to ship this avro buffer off to AWS Kinesis Firehose using boto3.
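
For reference, the eventual send is roughly a Firehose put_record call (the delivery stream name below is only a placeholder):

import boto3

firehose = boto3.client("firehose")

# "my-delivery-stream" is just a placeholder name.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": output.getvalue()},
)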

  • The [tag:python] tag is enough to convey you are using python. Don't add it unnecessarily to the title. See [The Bad](https://meta.stackexchange.com/a/112966/186397) section if you are curious why. – Drise Mar 14 '18 at 20:27
  • sounds good, thanks for the edits. – Erty Seidohl Mar 14 '18 at 20:27

2 Answers


I was using the wrong tool to read the file. I should have used

avro-tools tojson ./record.avro

instead of fragtojson as in the question. The difference is that fragtojson expects a single binary-encoded Avro datum (no container-file header), whereas tojson reads a whole Avro container file.
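
As a sanity check, the container file can also be read back from Python with DataFileReader (the file carries its schema in the header, so no .avsc is needed); a minimal sketch:

from avro.datafile import DataFileReader
from avro.io import DatumReader

# The container file embeds the writer's schema, so DatumReader needs no explicit schema.
reader = DataFileReader(open("record.avro", "rb"), DatumReader())
for record in reader:
    print(record)
reader.close()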


> I want to write to a bytes array (instead of directly to a file like in the example) because ultimately I'm going to ship this avro buffer off to AWS Kinesis Firehose using boto3.

In that case you don't need DataFileWriter; you can write each datum directly with a DatumWriter and a BinaryEncoder:

from io import BytesIO

import avro.io

datum_writer = avro.io.DatumWriter(avro_schema)

output = BytesIO()
encoder = avro.io.BinaryEncoder(output)
for i in range(1000):
    datum_writer.write(_generate_fake_data(), encoder)

data_bytes = output.getvalue()

If you want to inspect the content of data_bytes, you just have to decode it with a BinaryDecoder and a DatumReader.
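
A minimal sketch, assuming avro_schema and data_bytes from above:

from io import BytesIO

import avro.io

decoder = avro.io.BinaryDecoder(BytesIO(data_bytes))
datum_reader = avro.io.DatumReader(avro_schema)

# The datums were written back to back, so read the same number that was written.
for i in range(1000):
    print(datum_reader.read(decoder))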

  • If I do that, the schema isn't included in the output avro file, is that correct? – Erty Seidohl Mar 14 '18 at 22:03
  • My understanding is that the schema should be included in the payload that I send to aws – Erty Seidohl Mar 14 '18 at 22:04
  • @ErtySeidohl yes, the schema isn't included, only the binary data, but nothing prevents you from transmitting the schema in another way, wrapping your data_bytes with other metadata in the same kinesis flow for example... it's worth a try :) – Tuki Mar 14 '18 at 22:46