
I have streaming data coming into my consumer app that I ultimately want to show up in Hive/Impala. One way would be to use Hive-based APIs to insert the updates in batches into the Hive table.
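
For reference, a minimal sketch of what I mean by the first approach, going through the HiveServer2 JDBC driver as one example of a Hive-based API (host, credentials, table name and schema are made up for illustration):

    // Sketch only: flush a buffered batch through the HiveServer2 JDBC driver.
    // Host, credentials and the events(id BIGINT, payload STRING) table are placeholders.
    import java.sql.DriverManager

    object HiveJdbcBatchInsert {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection(
          "jdbc:hive2://hive-host:10000/default", "etl_user", "")
        val stmt = conn.createStatement()
        try {
          // Stand-in for the records buffered from the stream since the last flush.
          val batch = Seq((1L, "a"), (2L, "b"), (3L, "c"))

          // One multi-row INSERT per flush; per-row inserts are far too slow
          // because every statement becomes its own Hive job.
          val values = batch.map { case (id, payload) => s"($id, '$payload')" }.mkString(", ")
          stmt.execute(s"INSERT INTO events VALUES $values")
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }

Every flush becomes its own Hive job, which is where the latency I mention below shows up.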

The alternative approach is to write the data directly into HDFS as Avro/Parquet files and let Hive detect the new data and pick it up.
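
A rough sketch of the second approach (paths, table and column names are placeholders, and I'm using the Spark 2.x API just for brevity): the job appends Parquet files under a directory that an external, partitioned Hive table already points at, then tells Hive to rescan it.

    import org.apache.spark.sql.SparkSession

    object ParquetToHdfs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("stream-to-hdfs")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Stand-in for one micro-batch of consumed records.
        val batch = Seq((1L, "a", "2015-12-26"), (2L, "b", "2015-12-26"))
          .toDF("id", "payload", "dt")

        // Append into the directory layout the external Hive table expects.
        batch.write
          .mode("append")
          .partitionBy("dt")
          .parquet("hdfs:///warehouse/events")

        // Let Hive discover any partitions that appeared on disk.
        spark.sql("MSCK REPAIR TABLE events")
        spark.stop()
      }
    }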

I tried both approaches in my dev environment, and the 'only' drawbacks I noticed were the high latency of writing to Hive and the failure conditions I need to account for in my code.

Is there an architectural design pattern or set of best practices to follow?

Neel
  • What kind of tools did you use for your tests: Flume, the Kafka connector for HDFS, Spark Streaming, Storm... or good old reinvent-the-wheel Java code? – Samson Scharfrichter Dec 26 '15 at 08:56
  • I am using the Kite SDK to update records in Hive -- don't know if that's good or bad. For writing to HDFS, I am using the Spark libraries and then forcing Hive to load the data using 'msck repair' (rough sketch of that flow below, after these comments). My question is: what's the best way to get data into Hive? Directly (using Kite or other libraries), or HDFS -> Hive? – Neel Dec 26 '15 at 15:52
  • IMHO there are only *niche* solutions, depending on your actual requirements -- e.g. write latency, write consistency *(at least once? at most once? exactly once?)*, availability... -- and the tools that you already have at hand. Then re-evaluate your options every 6 months, given how fast the landscape changes. Anyone claiming to have the one and only "best solution" for streaming is bound to trigger a religious war. Happy New Year 0:-) – Samson Scharfrichter Jan 02 '16 at 21:48
  • Ah, there's also the matter of volume and life cycle *(average / peak input rate, retention, etc)*. In some edge cases you might be interested in HBase/Phoenix, with "fast" SQL queries handled by Phoenix and "batch" queries handled by Hive on HBase snapshots. It all depends. – Samson Scharfrichter Jan 02 '16 at 21:55
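
For context, this is roughly the DDL side of the HDFS -> Hive flow I describe in the comments above (names and paths are placeholders): create the external table once over the landing directory, then register each new partition explicitly, which is cheaper than running a full 'msck repair' over the whole location.

    import org.apache.spark.sql.SparkSession

    object RegisterPartition {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("register-partition")
          .enableHiveSupport()
          .getOrCreate()

        // One-time DDL: external table over the directory the streaming job writes to.
        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
            |PARTITIONED BY (dt STRING)
            |STORED AS PARQUET
            |LOCATION 'hdfs:///warehouse/events'""".stripMargin)

        // After a batch lands under .../dt=2015-12-26, add just that partition
        // instead of rescanning everything with MSCK REPAIR.
        spark.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2015-12-26')")
        spark.stop()
      }
    }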

0 Answers