
I have Spark code that appends data from a Hive table to Parquet files partitioned by date. The code runs correctly when executed from the spark shell, and the Parquet files show exactly the same number of rows as the Hive table for the corresponding date.
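The job logic is essentially the following. This is a simplified sketch; the table name, partition column, and output path are placeholders rather than the real values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("HiveToParquetDaily")
      .enableHiveSupport()
      .getOrCreate()

    // the run date is passed in as an argument in the real job
    val runDate = "2018-08-06"

    // read the day's rows from the Hive table and append them to the
    // date-partitioned Parquet output
    spark.sql(s"SELECT * FROM source_db.source_table WHERE load_date = '$runDate'")
      .write
      .mode("append")
      .partitionBy("load_date")
      .parquet("/data/output/parquet_table")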

However, when the same code is packaged into a jar and run via a spark-submit command, scheduled to execute daily at 9 AM through NiFi, the Parquet partition files end up with fewer rows. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:

• Data in the source Hive table is updated by approximately 4 AM.
• Initially our NiFi job was scheduled to run at 4:45 AM, but the record counts did not match. A manual run from the spark shell after 6 AM produced an exact match.
• We therefore rescheduled the job to 7 AM. After this change, the data was updated correctly via the NiFi job only on days with relatively few records (approx. 20,000 on weekends); on weekdays, with 150,000 to more than 200,000 records, the counts still did not match, and a manual run was again needed to backfill the missing data.
• We then postponed the job to 9 AM. There were 2 days when the record counts matched (between 160,000 and 200,000), but since Jul-31 the data has not matched at all, irrespective of the number of records on any given day, and we are having to do a manual backfill every day (the check we run to confirm the mismatch is sketched below).
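For context, the manual check we do from the spark shell to confirm a mismatch looks roughly like this (same placeholder names as the sketch above):

    // the same placeholder date, table name, and path as above
    val runDate = "2018-08-06"

    // row count for the date in the source Hive table
    val hiveCount = spark.sql(
      s"SELECT COUNT(*) FROM source_db.source_table WHERE load_date = '$runDate'"
    ).first().getLong(0)

    // row count for the same date in the Parquet output
    val parquetCount = spark.read
      .parquet("/data/output/parquet_table")
      .filter(s"load_date = '$runDate'")
      .count()

    println(s"hive=$hiveCount parquet=$parquetCount match=${hiveCount == parquetCount}")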

We are unable to figure out any specific reason why the code runs correctly from the spark shell at any time of day, but gives incorrect results when NiFi is simply scheduled to execute the spark-submit command that runs the jar containing the same Spark code.

Please help me understand why this is happening and how I can fix it.

P.S.: I have checked the NiFi log files, and none of the scheduled jobs reported an error.

  • Which version of NiFi and which processor are you using? – Sivaprasanna Sethuraman Aug 08 '18 at 04:56
  • NiFi 1.1.2, Spark 2.2.1, ExecuteProcess processor. For a change, I ran the job yesterday at 9 PM instead of the morning hours, and the number of records in the Parquet data matches the number of records in the Hive table for the Aug-06 data. I will run the job again today at 9 PM for the Aug-07 data and update if anything new comes up – Praveen Sharma Aug 08 '18 at 10:49
