
I have a use case where I push data from MongoDB to HDFS as an ORC file. The job runs once a day and appends the data to the existing ORC file in HDFS.

My concern is: what if the job fails or is stopped partway while writing to the ORC file? How should I handle that scenario, given that some data has already been written to the file? I want to avoid duplicates in the ORC file.

Snippet for writing in ORC format:

    val df = sparkSession
      .read
      .mongo(ReadConfig(Map("database" -> "dbname", "collection" -> "tableName")))
      .filter($"insertdatetime" >= fromDateTime && $"insertdatetime" <= toDateTime)

    df.write
      .mode(SaveMode.Append)
      .format("orc")
      .save("/path_to_orc_file_on_hdfs")

I don't want to checkpoint the complete RDD, as that would be a very expensive operation. I also don't want to create multiple ORC files; the requirement is to maintain a single file only.

Is there any other solution or approach I should try?

tenderfoot
1 Answer


Hi, one of the best approaches is to write your data as one folder per day under HDFS.

That way, if your ORC-writing job fails, you can simply clean up that day's folder.

The cleanup should happen on the shell side of your job: if the return code is != 0, delete the day's ORC folder and then retry.
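For illustration, here is a hedged Scala sketch of the same idea done inside the Spark job itself; the `basePath`/`runDate` names and the in-job cleanup are my assumptions, the shell-side cleanup described above works just as well:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SaveMode

    // Hypothetical per-day output folder, e.g. /data/mongo_export/2020-02-05
    val basePath  = "/data/mongo_export"
    val runDate   = "2020-02-05"
    val dailyPath = s"$basePath/$runDate"

    try {
      df.write
        .mode(SaveMode.Overwrite) // overwriting a fresh daily folder keeps retries idempotent
        .format("orc")
        .save(dailyPath)
    } catch {
      case e: Throwable =>
        // Delete the partially written folder so the next attempt starts clean
        val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
        fs.delete(new Path(dailyPath), true)
        throw e
    }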

Edit: partitioning by write date will also make your later ORC reads with Spark much more efficient.
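As a sketch of what date partitioning could look like (the `ingest_date` column and base path are illustrative assumptions, reusing the hypothetical `runDate` from the sketch above):

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.SaveMode

    // Write each run into its own ingest_date=... sub-folder under one base path
    df.withColumn("ingest_date", lit(runDate))
      .write
      .mode(SaveMode.Append)
      .format("orc")
      .partitionBy("ingest_date")
      .save("/data/mongo_export")

    // Later reads pick up every daily partition in one shot, and filters on
    // ingest_date benefit from partition pruning instead of scanning everything
    val history = sparkSession.read.orc("/data/mongo_export")
      .filter($"ingest_date" >= "2020-02-01")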

airliquide
  • You mean to say I should write a separate ORC file every day? But, as I mentioned, I don't want to maintain multiple files. I want a single ORC file that I can append to daily. Is there any approach to fulfill this, or should I just go with a separate file per day? – tenderfoot Feb 05 '20 at 05:48
  • Yes. It's not possible to append to an ORC file, but reading will still be easy later: use a wildcard on the reading side so you don't have to list all the dates. This is the best and most efficient way. – airliquide Feb 05 '20 at 08:01