I recently started working on a content repository migration project between two different content management systems.

We have around 11 petabytes of documents in a source repository. We want to migrate all of them, one document at a time, by querying the source system's API and saving through the destination system's API.

We will have a single standalone machine for this migration, and we need to be able to manage (start, stop, resume) the whole process.

What platforms and tools would you suggest for such a task? Is Flink's DataSet API for bounded data suitable for this job?

galmeriol
  • What does "11 pb documents" mean? – David Anderson Mar 30 '18 at 13:21
  • Why isn't this just a simple script in python or ruby? – David Anderson Mar 30 '18 at 13:23
  • @DavidAnderson it means 11 petabytes of documents, and we use Java, so the platform should have a Java API. – galmeriol Mar 30 '18 at 13:40
  • If it's 11 PB of data, and you're doing it "one document at a time", then...won't that take a really, really long time? How many documents are you talking about? – kkrugler Mar 31 '18 at 01:58
  • @kkrugler “One document at a time” is meant to emphasize the independence between documents and may not be the correct way to put it, sorry about that. And we have around 1 billion docs - they are actually stored in a relational DB, not as physical files - so we need some API-to-API streaming approach. – galmeriol Mar 31 '18 at 06:07

2 Answers


Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).

While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.

If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.
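
For illustration, here is a minimal sketch of what such a job could look like with the DataStream API and Async I/O. The names `DocumentIdSource`, `SourceCmsClient`, and `DestinationCmsSink` are placeholders for whatever the two CMS APIs actually provide, not real libraries; only the Flink calls are real:

```java
import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class CmsMigrationJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Periodic checkpoints are what make stop/resume and failure recovery possible.
        env.enableCheckpointing(60_000);

        // DocumentIdSource is a placeholder for a source that enumerates document IDs
        // from the source system (see the other answer for a sketch of such a source).
        DataStream<String> docIds = env.addSource(new DocumentIdSource());

        // Fetch documents from the source CMS with up to 100 concurrent requests,
        // waiting at most 30 seconds per request.
        DataStream<byte[]> documents = AsyncDataStream.unorderedWait(
                docIds, new FetchDocument(), 30, TimeUnit.SECONDS, 100);

        // DestinationCmsSink is a placeholder SinkFunction that calls the destination API.
        documents.addSink(new DestinationCmsSink());

        env.execute("CMS-to-CMS migration");
    }

    /** Fetches one document per ID without blocking a Flink task thread. */
    public static class FetchDocument extends RichAsyncFunction<String, byte[]> {
        // SourceCmsClient is a placeholder for an async-capable client of the source API.
        private transient SourceCmsClient client;

        @Override
        public void open(Configuration parameters) {
            client = new SourceCmsClient();
        }

        @Override
        public void asyncInvoke(String docId, ResultFuture<byte[]> resultFuture) {
            client.fetchAsync(docId)   // assumed to return CompletableFuture<byte[]>
                  .thenAccept(doc -> resultFuture.complete(Collections.singleton(doc)));
        }
    }
}
```

A nice side effect of the bounded capacity on the async operator is built-in back-pressure: the number of in-flight requests against the source system stays capped even if the destination slows down.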

David Anderson
  • That was the answer I needed, actually. There is no dependency between documents in the migration, so I'll need both stream management capabilities and parallel processing. Thank you. – galmeriol Mar 30 '18 at 14:30

Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.

This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".
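
To make that concrete, here is one possible sketch of such a source, assuming a hypothetical `SourceCmsCatalog.listIdsCreatedAfter(timestamp, limit)` paged query against the source system's relational DB. It checkpoints the creation timestamp of the last ID it handed off and resumes from that timestamp after a restart:

```java
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

/** Emits document IDs ordered by creation timestamp; progress survives restarts. */
public class DocumentIdSource extends RichSourceFunction<String>
        implements CheckpointedFunction {

    private volatile boolean running = true;
    private long lastEmittedTimestamp = 0L;                 // progress marker
    private transient ListState<Long> checkpointedTimestamp;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // Hypothetical paged query against the source system's catalog DB.
            List<DocRef> batch =
                    SourceCmsCatalog.listIdsCreatedAfter(lastEmittedTimestamp, 1000);
            if (batch.isEmpty()) {
                return; // every document has been enumerated; the job can finish
            }
            for (DocRef ref : batch) {
                // Emit the ID and advance the progress marker atomically
                // with respect to checkpoints.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(ref.id);
                    lastEmittedTimestamp = ref.createdAt;
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedTimestamp.clear();
        checkpointedTimestamp.add(lastEmittedTimestamp);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedTimestamp = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("lastTimestamp", Long.class));
        for (Long ts : checkpointedTimestamp.get()) {
            lastEmittedTimestamp = ts; // restore progress after a restart
        }
    }

    /** Minimal record the hypothetical catalog query is assumed to return. */
    public static class DocRef {
        public final String id;
        public final long createdAt;
        public DocRef(String id, long createdAt) { this.id = id; this.createdAt = createdAt; }
    }
}
```

A source like this would run with parallelism 1 (so the checkpointed timestamp is unambiguous), with the emitted ID stream rebalanced downstream so the actual fetching and writing happen in parallel. Stopping the job with a savepoint and resuming from it later then picks up from the last checkpointed timestamp instead of re-enumerating from the beginning.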

kkrugler