
I want to convert XML files to Avro. The data arrives in XML and will hit a Kafka topic first. From there I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.

When the Avro files land in HDFS, I want the ability to read them into Hive tables later.

What is the best method to do this? I have tried automated schema conversion with spark-avro (without Spark Streaming), but the problem is that while spark-avro converts the data, Hive cannot read the result. spark-avro converts the XML to a DataFrame and then writes the DataFrame out as Avro, but the resulting Avro files can only be read back by my Spark application. I am not sure if I am using it correctly.
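
Roughly what I am trying is sketched below (Spark 1.6 / CDH-era APIs; the paths, the row tag, the table name, and the column names are placeholders, not my real ones):

    // Read XML with spark-xml (schema inferred), write Avro with spark-avro
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("xml-to-avro"))
    val hiveContext = new HiveContext(sc)

    // spark-xml infers a schema from the XML and returns a DataFrame
    val df = hiveContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")             // placeholder row tag
      .load("hdfs:///data/incoming/xml/")     // placeholder input path

    // spark-avro writes standard Avro container files
    df.write
      .format("com.databricks.spark.avro")
      .save("hdfs:///data/avro/my_avro_table/")   // placeholder output path

    // An external Hive table over that directory should make the files readable
    // outside of Spark; the column list has to match the fields spark-avro wrote
    hiveContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS my_avro_table (id STRING, name STRING)
        |STORED AS AVRO
        |LOCATION 'hdfs:///data/avro/my_avro_table/'""".stripMargin)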

I think I will need to define an explicit Avro schema, but I am not sure how to go about this for the XML file: it has multiple namespaces and is quite massive.
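
For a single record with made-up fields, I imagine an explicit schema would look something like the sketch below using the plain Avro API; what I cannot see is how to scale this to the full document with all its namespaces:

    import java.io.File
    import org.apache.avro.Schema
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
    import scala.xml.XML

    // Hand-written Avro schema for one record type (field names are made up)
    val avroSchema = new Schema.Parser().parse(
      """{
        |  "type": "record",
        |  "name": "MyRecord",
        |  "fields": [
        |    {"name": "id",   "type": "string"},
        |    {"name": "name", "type": "string"}
        |  ]
        |}""".stripMargin)

    // Parse one XML record and copy the interesting fields into a GenericRecord
    val xml = XML.loadString("<record><id>1</id><name>foo</name></record>")
    val record: GenericRecord = new GenericData.Record(avroSchema)
    record.put("id",   (xml \ "id").text)
    record.put("name", (xml \ "name").text)

    // Write an Avro container file (locally here, just to show the API)
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](avroSchema))
    writer.create(avroSchema, new File("/tmp/record.avro"))
    writer.append(record)
    writer.close()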

Defcon

1 Answer


If you are on Cloudera (and since you have Flume, you probably are), you can use Morphlines to do the conversion at the record level. It works in both batch and streaming mode. See here for more info.

Ramzy
  • Do you know if Morphlines can read XML and then convert to Avro? – Defcon Jun 01 '16 at 18:33
  • Morphlines can read XML contents and can write Avro, but there isn't a direct conversion command for your use case. You can either dig further into Morphlines usage, or plan a MapReduce/Spark job that reads each file/record and converts it to Avro. Morphlines is readily available for batch processing and for use with Flume. – Ramzy Jun 01 '16 at 21:50
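
A rough sketch of the Spark Streaming job route mentioned in the comment above (Spark 1.6-era APIs; the broker address, topic name, XML fields, and output path are all placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import scala.xml.XML

    case class MyRecord(id: String, name: String)   // made-up fields

    object XmlToAvroStream {
      def main(args: Array[String]): Unit = {
        val sc  = new SparkContext(new SparkConf().setAppName("xml-to-avro-stream"))
        val ssc = new StreamingContext(sc, Seconds(30))

        // Each Kafka message is assumed to be one XML record as a string
        val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc,
          Map("metadata.broker.list" -> "broker1:9092"),   // placeholder broker
          Set("xml-topic"))                                // placeholder topic

        messages.map(_._2).foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
            import sqlContext.implicits._

            // Parse each XML message and keep only the fields mapped into Avro
            val df = rdd.map { xmlString =>
              val xml = XML.loadString(xmlString)
              MyRecord((xml \ "id").text, (xml \ "name").text)
            }.toDF()

            // Append this batch as Avro files under one HDFS directory, which an
            // external Hive table can then sit on top of
            df.write
              .mode(SaveMode.Append)
              .format("com.databricks.spark.avro")
              .save("hdfs:///data/avro/my_avro_table/")    // placeholder path
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }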