
I am trying to implement the workflow below using Apache NiFi:

  1. ExecuteSQL – fetches data from an Oracle database in Avro format
  2. PutHDFS – puts the data into HDFS
  3. ExecuteProcess – executes a bash script in the background which in turn creates the external Hive table (a sketch of such a script follows this list)
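
For concreteness, the script is along these lines; the database, table name, HDFS location, and schema URL below are placeholders, not the real values:

    #!/usr/bin/env bash
    # Hypothetical script run by ExecuteProcess: it registers an external
    # Hive table over the HDFS directory that PutHDFS writes to.
    hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.orders_avro
    STORED AS AVRO
    LOCATION '/data/landing/orders'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/orders.avsc');
    "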

I have a few questions:

Does the ExecuteProcess processor in Apache NiFi take incoming flow files?

I am not able to provide the ExecuteProcess processor with any incoming flow file. If it doesn't accept them, is there another way to run a command against incoming flow files?


2 Answers


ExecuteProcess does not allow incoming flow files. Take a look at the ExecuteStreamCommand processor, which accepts incoming flow files and also executes an external command.
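
For example, a script invoked from ExecuteStreamCommand could look roughly like the sketch below, assuming the default behavior of streaming the flow file content to the command's STDIN; the paths, table, and column names are placeholders:

    #!/usr/bin/env bash
    # The incoming flow file content (the Avro from ExecuteSQL) arrives
    # on STDIN; whatever this script writes to STDOUT becomes the content
    # of the flow file routed to the "output stream" relationship.
    cat > /tmp/incoming.avro
    hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.orders_avro
    (order_id INT, customer STRING)
    STORED AS AVRO
    LOCATION '/data/landing/orders';
    "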

Bryan Bende
  • I need some more clarification. In the workflow I mentioned above, the ExecuteSQL processor will fetch data from the Oracle DB in Avro format and keep it in some location? I need to know where it places those files, because as per the flow the next step is the PutHDFS processor, which will copy files from the local machine to an HDFS location. So which location on the local machine will that be? – Anonymous Jun 13 '16 at 05:11
  • Once the data is in NiFi it is kept in NiFi's internal repositories, which are controlled by properties set in conf/nifi.properties (see the excerpt below). ExecuteSQL will fetch data from the database and create a flow file which stores the records in NiFi's content repository; it will then transfer the flow file to the success relationship connected to PutHDFS, and PutHDFS will read the records from the content repository, so you won't really have to know where it is. – Bryan Bende Jun 14 '16 at 11:44
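
For context, the repositories mentioned in that comment are configured by properties like these in conf/nifi.properties (stock default values shown; an actual install may point elsewhere):

    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.content.repository.directory.default=./content_repository
    nifi.provenance.repository.directory.default=./provenance_repository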

This approach (with ExecuteStreamCommand) should work for the current NiFi version. NiFi 1.0.0 will have a ConvertAvroToORC processor which can translate the Avro records coming from ExecuteSQL into the more Hive-efficient ORC format, and it also generates (into an attribute) the Hive DDL needed to create the table (if it doesn't already exist). There will also be a PutHiveQL processor which can execute that DDL.
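
To give a sense of that: the generated DDL lands in a flow file attribute (hive.ddl) and is roughly of the form below, with the table and column names here invented for illustration; a common pattern is to append a LOCATION clause (e.g. via ReplaceText) before sending it to PutHiveQL:

    CREATE EXTERNAL TABLE IF NOT EXISTS orders (order_id INT, customer STRING)
    STORED AS ORC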

That should remove the need for the ExecuteStreamCommand in the above flow. I will post an example template at https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates when NiFi 1.0.0 is released.

mattyb