I have a file in the distributed cache. Based on the output of a job, the driver class updates this file and starts a new job. The new job needs these updates.

The way I currently do it is to replace the old Distributed Cache file with a new one (the updated one).

Is there a way of broadcasting only the diffs (between the old file and the new one) to all the task trackers that need the file?

Or is it the case that, once a job (the first one, in my case) has finished, all the directories and files specific to that job are deleted, so that it doesn't even make sense to think in this direction?

Razvan

1 Answer

I don't think the distributed cache was built with this scenario in mind. It simply places files on the local disk of each node.
In your case I would suggest putting the file in HDFS and having all interested tasks read it from there.
As an optimization, you can give this file a high replication factor so that it will be local to most of the tasks.
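
A minimal sketch of that suggestion, assuming the Hadoop `FileSystem` API is on the classpath. The class name `SharedSideFile`, the method names, and the replication value passed in are hypothetical and only for illustration; they are not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: keeps one shared side file in HDFS instead of the distributed cache.
public class SharedSideFile {

    // Driver side: after a job finishes, overwrite the shared copy with the
    // updated local file and ask for a higher replication factor.
    public static void publish(FileSystem fs, Path localFile, Path sharedFile,
                               short replication) throws IOException {
        // delSrc = false (keep the local file), overwrite = true (replace the old copy)
        fs.copyFromLocalFile(false, true, localFile, sharedFile);
        fs.setReplication(sharedFile, replication);
    }

    // Task side: read the shared file line by line, for example from Mapper.setup().
    public static List<String> readLines(FileSystem fs, Path sharedFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(sharedFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

In the driver, the `FileSystem` could be obtained with `FileSystem.get(job.getConfiguration())` before submitting the next job. In a task, `FileSystem.get(context.getConfiguration())` inside `setup()` would give the same view of HDFS. Note that this still rewrites the whole file each time; it does not ship diffs.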

David Gruzman