I have the following requirement: I am adding date-wise data to a specific directory in HDFS, and I need to keep the last 3 sets as a backup and remove the rest. Is there a way to set a TTL on the directory so that the data expires automatically after a certain number of days?

If not, is there a way to achieve similar results?

frugalcoder

1 Answer

This feature is not yet available on HDFS.

There was a JIRA ticket created to support this feature: https://issues.apache.org/jira/browse/HDFS-6382

However, the fix is not yet available.

Until then, you need to handle this yourself with a cron job. The job can be a simple shell, Perl, or Python script that periodically deletes data older than a pre-configured period.

This job could:

  • Run periodically (e.g. once an hour or once a day)
  • Take as input the list of folders or files to check, along with the TTL for each
  • Delete any file or folder that is older than its TTL

This is straightforward to script.
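
As a rough illustration of such a job, here is a minimal Python sketch (Python 3.7+). It shells out to `hdfs dfs -ls` and `hdfs dfs -rm -r`, which are standard HDFS shell commands. Everything else is an assumption of mine: the function names (`parse_ls`, `expired`, `beyond_newest`, `cleanup`) are made up for this example, the parser assumes the usual eight-column `-ls` layout with the timestamp shown in the client's local time, and deletion defaults to a dry run so nothing is removed by accident.

```python
import subprocess
from datetime import datetime, timedelta


def parse_ls(output):
    """Parse `hdfs dfs -ls` text into (path, modification_time) pairs.

    Assumes the usual layout:
    permissions, replication, owner, group, size, date, time, path.
    """
    entries = []
    for line in output.splitlines():
        parts = line.split(None, 7)
        if len(parts) != 8:
            continue  # skips the "Found N items" header and blank lines
        try:
            mtime = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # not an entry line in the assumed format
        entries.append((parts[7], mtime))
    return entries


def expired(entries, ttl, now):
    """Paths whose modification time is more than `ttl` before `now`."""
    return [path for path, mtime in entries if now - mtime > ttl]


def beyond_newest(entries, keep):
    """Paths that are NOT among the `keep` most recently modified entries."""
    ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [path for path, _ in ordered[keep:]]


def list_dir(directory):
    """List one HDFS directory (non-recursive) via the hdfs command-line client."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", directory],
        check=True, capture_output=True, text=True,
    )
    return parse_ls(result.stdout)


def delete(paths, dry_run=True):
    """Delete each path recursively, or only print the command when dry_run is set."""
    for path in paths:
        cmd = ["hdfs", "dfs", "-rm", "-r", path]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


def cleanup(directory, ttl_days=None, keep=None, dry_run=True):
    """Delete entries that are older than ttl_days or outside the newest `keep`.

    If both rules are given, an entry is deleted when either rule selects it.
    """
    entries = list_dir(directory)
    doomed = set()
    if ttl_days is not None:
        doomed.update(expired(entries, timedelta(days=ttl_days), datetime.now()))
    if keep is not None:
        doomed.update(beyond_newest(entries, keep))
    delete(sorted(doomed), dry_run=dry_run)
```

For the case in the question, something like `cleanup("/path/to/data", keep=3, dry_run=False)` called from a script that cron runs once a day would keep the three most recently modified sets and remove the rest; the path here is a placeholder. Note that this orders by HDFS modification time, not by any date encoded in the directory name, so adjust the sort key if your sets are named by date and may be rewritten later.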

Manjunath Ballur