
I'm querying a Cassandra table using the QueryCassandra processor, but what I don't understand is how to pass my JSON output file into the ExecutePySpark processor as an input file, and later I need to pass my Spark output data to Hive. Please help me with this. Thanks.

My QueryCassandra properties: [screenshot of the QueryCassandra processor configuration]

My ExecutePySpark properties: [screenshot of the ExecutePySpark processor configuration]

Karthik Mannava

1 Answer


Consider a flow that uses the following 4 processors:

QueryCassandra -> UpdateAttribute -> PutFile -> ExecutePySpark

Step 1: QueryCassandra processor: Execute a CQL query against Cassandra and output the result to a flow file.

Step 2: UpdateAttribute processor: Assign the property filename a value containing a name for a temporary file on disk that will hold the query results. Use NiFi Expression Language to generate the file name so that it is different for each run. Create a property result_directory and assign it a value for a folder on disk that NiFi has write permission to.

  • property: filename
  • value: cassandra_result_${now():toNumber()}
  • property: result_directory
  • value: /tmp
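For illustration, ${now():toNumber()} evaluates to the current epoch time in milliseconds, so each run of the flow produces a distinct file name. The Python equivalent would look roughly like this (the example timestamp in the comment is made up):

```python
# Illustration: ${now():toNumber()} yields epoch milliseconds, so each run
# writes a distinct file, e.g. /tmp/cassandra_result_1521123456789
import time

filename = "cassandra_result_{}".format(int(time.time() * 1000))
```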


Step 3: PutFile processor: Configure the Directory property with the value ${result_directory} populated in Step 2.


Step 4: ExecutePySpark processor: Pass the file name with its location as an argument to the PySpark application via the PySpark App Args processor property. The application can then read data from the file on disk, process it, and write to Hive, as sketched below.

  • property: PySpark App Args
  • value: ${result_directory}/${filename}
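Here is a minimal sketch of such a PySpark application, assuming the QueryCassandra Output Format is JSON and that Spark was built with Hive support; the script name and the target table mydb.mytable are placeholders, and the exact JSON shape of your flow file may require adjusting the read options:

```python
# pyspark_app.py -- minimal sketch; names below are placeholders
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # PySpark App Args passes ${result_directory}/${filename} as argv[1]
    input_path = sys.argv[1]

    spark = (SparkSession.builder
             .appName("CassandraToHive")
             .enableHiveSupport()
             .getOrCreate())

    # Read the temporary file written by PutFile; multiLine handles a
    # single (possibly pretty-printed) JSON document rather than JSON lines
    df = spark.read.option("multiLine", True).json(input_path)

    # ... transformations go here ...

    # Write the result to Hive; mydb.mytable is a placeholder
    df.write.mode("append").saveAsTable("mydb.mytable")

    spark.stop()
```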


Additionally, you could configure more attributes in Step 2 (UpdateAttribute) that could then be passed as arguments in Step 4 (ExecutePySpark) and considered by the PySpark application when writing to Hive (for example, the Hive database and table name).
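For example (the attribute names hive_db and hive_table are hypothetical), PySpark App Args could be set to ${result_directory}/${filename} ${hive_db} ${hive_table}, and the application would pick the extra arguments up from sys.argv:

```python
# Sketch: consuming extra arguments passed via PySpark App Args
# (hive_db and hive_table are hypothetical attributes set in Step 2)
import sys

input_path = sys.argv[1]
hive_db, hive_table = sys.argv[2], sys.argv[3]
target_table = "{}.{}".format(hive_db, hive_table)  # e.g. "mydb.mytable"
# ... read input_path, transform, then:
# df.write.mode("append").saveAsTable(target_table)
```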

Jagrut Sharma
  • Thank you so much Jagrut, it's working and I'm able to write data to Hive through my Spark application, but my Spark application only needs to perform transformations there, and I need a separate processor to write data into Hive. Is there any mechanism in Spark to generate a flow file and pass it to the next processor? – Karthik Mannava Mar 15 '18 at 14:32
  • @KarthikMannava You might try the ExecuteStreamCommand processor so you can read from stdin and write to stdout in your Python script (see the sketch after these comments). I don't think there's any built-in support in Spark to write directly to a NiFi flow file, though. – Greg Hart Mar 15 '18 at 16:37
  • @GregHart How do I read my flow file content in my Spark application with the ExecuteStreamCommand processor? When I try reading stdin directly with Python, flow files get stuck in the queue between the QueryCassandra and ExecuteStreamCommand processors. – Karthik Mannava Mar 20 '18 at 05:28
  • @KarthikMannava You would read from stdin and write to stdout. If that's not working, you should create a new question showing your Python code. – Greg Hart Apr 03 '18 at 22:38
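For reference, the stdin/stdout pattern suggested in the comments above would look roughly like this in a plain Python script run by ExecuteStreamCommand (a sketch under the assumption that the script does its transformations without Spark; submitting a Spark job this way would additionally require spark-submit):

```python
#!/usr/bin/env python
# Sketch for ExecuteStreamCommand: NiFi streams the incoming flow file's
# content to this script's stdin; whatever the script writes to stdout
# becomes the content of the outgoing flow file.
import sys

content = sys.stdin.read()   # e.g. the JSON produced by QueryCassandra
result = content             # placeholder: apply real transformations here
sys.stdout.write(result)
```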