I have two versions (old/new) of a database table with about 100,000,000 records. They are in files:
trx-old
trx-new
The structure is:
id date amount memo
1 5/1 100 slacks
2 5/1 50 wine
id is the simple primary key, other fields are non-key. I want to generate three files:
trx-removed (ids of records present in trx-old but not in trx-new)
trx-added (records from trx-new whose ids are not present in trx-old)
trx-changed (records from trx-new whose non-key values have changed since trx-old)
I need to do this operation every day in a short batch window. And actually, I need to do this for multiple tables and across multiple schemas (generating the three files for each) so the actual app is a bit more involved. But I think the example captures the crux of the problem.
This feels like an obvious application for mapreduce. Having never written a mapreduce application my questions are:
- is there some EMR application that already does this?
- is there an obvious Pig or maybe Cascading solution lying about?
- is there some other open source example that is very close to this?
PS I saw the diff between tables question but the solutions over there didn't look scalable.
PPS Here is a little Ruby toy that demonstrates the algorithm: Ruby dbdiff