I was trying to figure out whether joins can be achieved with Apache NiFi or StreamSets, so that I can read from HBase periodically, join with other tables, and write a few fields into a Hive table.

Or is there any other workflow management tool that supports this operation?

Srihari Karanth

1 Answer

I'm not familiar with StreamSets, but I will try to help with NiFi. Is your flat file static? If so, are you looking to do a straight replacement of values? You should be able to use the ReplaceTextWithMapping processor for that. If it's not a straight replacement, you could pre-populate a DistributedMapCache with the values from the flat file, then use FetchDistributedMapCache to do a lookup against the HBase record(s).
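For the ReplaceTextWithMapping route, the mapping lives in a flat file on the NiFi node. As a rough sketch (assuming one tab-separated key/value pair per line; the keys and region values here are made up for illustration), it could look like:

    # lookup key <TAB> replacement value
    cust_001	North
    cust_002	South
    cust_003	East

You would point the processor's mapping-file property at that file, and matched keys in the FlowFile content get replaced with the mapped values.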

If all else fails, and you are comfortable with a scripting language such as Groovy, JavaScript, or Jython, you could write the "join" part yourself using ExecuteScript or InvokeScriptedProcessor.
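To make that concrete, here is a minimal Jython sketch for ExecuteScript (the lookup file path, the comma-separated layout, and the enrichment logic are all hypothetical, not from the question). It loads a key-to-value map from a flat file and appends the looked-up value to each line of the incoming FlowFile:

    from org.apache.commons.io import IOUtils
    from java.nio.charset import StandardCharsets
    from org.apache.nifi.processor.io import StreamCallback

    # Hypothetical flat file with one "key,value" pair per line
    lookup = {}
    with open('/tmp/lookup.csv') as f:
        for line in f:
            key, value = line.strip().split(',', 1)
            lookup[key] = value

    class EnrichCallback(StreamCallback):
        # Rewrites the FlowFile content, appending the mapped value to each line
        def process(self, inputStream, outputStream):
            text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            enriched = []
            for line in text.splitlines():
                key = line.split(',')[0]
                enriched.append(line + ',' + lookup.get(key, ''))
            outputStream.write(bytearray('\n'.join(enriched).encode('utf-8')))

    flowFile = session.get()
    if flowFile is not None:
        flowFile = session.write(flowFile, EnrichCallback())
        session.transfer(flowFile, REL_SUCCESS)

The same idea works in Groovy; the point is that the script only does the lookup/enrichment, while scheduling and routing stay in the flow.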

There is an open Jira case (with some good progress made) on a lookup/enrichment processor that supports CSV files, properties files, and in-memory lookups.

mattyb
  • Thanks, FetchDistributedMapCache seems to be the one I am looking for. Can it also run periodically? E.g. I have a table that keeps getting populated with new rows, and I want to aggregate once every hour over the previous hour's data (with joins on other static tables). Will NiFi remember which hours have already been aggregated and which still need to be picked up? The aggregation will use SUM/AVG on a few columns. – Srihari Karanth May 03 '17 at 15:43
  • It won't do aggregation; the cache is only for lookups. In the upcoming NiFi 1.2.0 release you can use UpdateAttribute to keep a running count/sum as files flow through. – mattyb May 03 '17 at 16:30
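A minimal sketch of the stateful UpdateAttribute idea from that last comment (the property names and the `amount` attribute are hypothetical; this assumes the processor's "Store State" option and the `getStateValue` expression-language function available in stateful mode):

    Store State              = Store state locally
    row.count   (dynamic)    = ${getStateValue('row.count'):plus(1)}
    running.sum (dynamic)    = ${getStateValue('running.sum'):plus(${amount})}

Each FlowFile that passes through would increment the stored count and add its `amount` attribute to the stored sum; the per-hour grouping and the SUM/AVG-with-joins aggregation itself would still need to happen elsewhere (e.g. in Hive).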