
I have a Spark job that reads millions of records from HDFS, processes them, and writes them back to HDFS in Avro format. I have observed that many of the written files remain in a .avro.tmp state.

I am using the Kite SDK to write the data in Avro format. The environment is CDH 5.5.

Could it be because the Spark job terminates as soon as it has finished reading the records and sending them to the executors (which actually do the writing)?

If that's the case, how do I ensure that the job does not terminate until all the .tmp files have been converted to .avro? Or what else could be the reason?


1 Answer


I got it working after I closed the writer within the call() method itself, after iterating through all the records. The major drawback here is that I am obtaining a new writer for each partition; I need to find a better way.

    import java.util.Iterator;

    import org.apache.spark.api.java.function.VoidFunction;
    import org.apache.spark.sql.Row;
    import org.kitesdk.data.DatasetWriter;

    df.toJavaRDD().foreachPartition(new VoidFunction<Iterator<Row>>() {

        @Override
        public void call(Iterator<Row> iterator) throws Exception {

            final DatasetWriter writer = // obtain writer

            while (iterator.hasNext()) {
                // process the records; write to HDFS using writer
            }

            // closing the writer flushes and finalizes the partition's
            // output, renaming the .avro.tmp file to .avro
            writer.close();
        }
    });
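
As a possible refinement (untested here): assuming the Kite version in CDH 5.5 has a DatasetWriter that implements Closeable, a try-with-resources block (Java 7+) would guarantee the close even if processing throws mid-partition. In this sketch, obtainWriter() is a hypothetical stand-in for however the writer is actually acquired:

    df.toJavaRDD().foreachPartition(new VoidFunction<Iterator<Row>>() {

        @Override
        public void call(Iterator<Row> iterator) throws Exception {

            // obtainWriter() is a hypothetical helper standing in for
            // however the DatasetWriter is actually created
            try (DatasetWriter writer = obtainWriter()) {
                while (iterator.hasNext()) {
                    // process the records; write to HDFS using writer
                }
            } // writer.close() is called automatically here
        }
    });

As for the drawback, one writer per partition is hard to avoid with foreachPartition: the writer holds an open HDFS stream and is not serializable, so it cannot be created once on the driver and shared across executors.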