What is the use of Apache Avro in file processing? Can anybody explain to me whether it is useful if I need to process TBs of data in .LZO format?

I have a choice between C++ and Java; which will fit better with Avro?

My real purpose is to read compressed files and categorize the records into new files according to some criteria (a rough sketch of that split step follows the write example below).

Thank you in advance... :)

import java.io.File;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
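For illustration, here is a minimal sketch of the read-and-split step, assuming the User schema from the Avro getting-started guide (its favorite_color field stands in for the real criteria; LZO decompression is a separate concern and not shown here):

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

public class UserSplitter {
    public static void main(String[] args) throws IOException {
        DatumReader<User> datumReader = new SpecificDatumReader<User>(User.class);
        DataFileReader<User> fileReader =
                new DataFileReader<User>(new File("users.avro"), datumReader);

        // One output file per category; favorite_color is an assumed field
        // standing in for whatever the real criteria are.
        Map<String, DataFileWriter<User>> writers =
                new HashMap<String, DataFileWriter<User>>();

        for (User user : fileReader) {
            String key = String.valueOf(user.getFavoriteColor());
            DataFileWriter<User> writer = writers.get(key);
            if (writer == null) {
                DatumWriter<User> datumWriter = new SpecificDatumWriter<User>(User.class);
                writer = new DataFileWriter<User>(datumWriter);
                writer.create(user.getSchema(), new File("users-" + key + ".avro"));
                writers.put(key, writer);
            }
            writer.append(user);
        }

        for (DataFileWriter<User> writer : writers.values()) {
            writer.close();
        }
        fileReader.close();
    }
}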

1 Answer


In MapReduce and data analysis it can help you avoid bottlenecks. In a typical ETL flow there will be times when everything depends on some big chunk of data getting from point A to point B; if the data is compressed, it gets transported quicker.
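For instance, Avro's container files support per-block compression through built-in codecs. A minimal sketch, reusing the hypothetical User class from the question:

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public class CompressedWrite {
    // Writes users to a deflate-compressed Avro container file.
    static void writeCompressed(User user1, User user2) throws IOException {
        DatumWriter<User> datumWriter = new SpecificDatumWriter<User>(User.class);
        DataFileWriter<User> writer = new DataFileWriter<User>(datumWriter);
        // The codec must be set before create(); deflate level 6 here,
        // CodecFactory.snappyCodec() is another built-in choice.
        writer.setCodec(CodecFactory.deflateCodec(6));
        writer.create(user1.getSchema(), new File("users-deflate.avro"));
        writer.append(user1);
        writer.append(user2);
        writer.close();
    }
}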

Also, the file structure is optimized for Hadoop; it's similar to a 'hadoop sequence file'. LZO lacks the Hadoop-specific optimization structures; however, progress is being made:

http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

Avro files are language-agnostic, and both LZO and Avro have a C interface. From that post, they are working on some Pig UDFs, so I would expect to see some Pig LZO bridge to HDFS at some point in the near future.

Avro files are schema-based: http://avro.apache.org/docs/current/spec.html#schemas

This is useful because you can discover the format/structure of a file at run time from its schema.
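A minimal sketch of that run-time discovery, using Avro's generic API so no generated classes are needed (users.avro is the file written in the question's snippet):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SchemaDiscovery {
    public static void main(String[] args) throws IOException {
        // The writer's schema is stored in the file header,
        // so the reader can discover it without any prior knowledge.
        GenericDatumReader<GenericRecord> datumReader =
                new GenericDatumReader<GenericRecord>();
        DataFileReader<GenericRecord> fileReader =
                new DataFileReader<GenericRecord>(new File("users.avro"), datumReader);

        Schema schema = fileReader.getSchema();
        System.out.println("Discovered schema: " + schema.toString(true));

        for (GenericRecord record : fileReader) {
            System.out.println(record);
        }
        fileReader.close();
    }
}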

The documentation is a good place to start: http://avro.apache.org/docs/current/

Nigel Savage