
I have two versions (old/new) of a database table with about 100,000,000 records. They are in files:

trx-old
trx-new

The structure is:

id date amount memo
1  5/1     100 slacks
2  5/1      50 wine

id is the simple primary key; the other fields are non-key. I want to generate three files:

trx-removed (ids of records present in trx-old but not in trx-new)
trx-added   (records from trx-new whose ids are not present in trx-old)
trx-changed (records from trx-new whose non-key values have changed since trx-old)

I need to do this operation every day in a short batch window. In fact, I need to do this for multiple tables across multiple schemas (generating the three files for each), so the actual app is a bit more involved. But I think the example captures the crux of the problem.

This feels like an obvious application for MapReduce. Having never written a MapReduce application, my questions are:

  1. Is there some EMR application that already does this?
  2. Is there an obvious Pig or maybe Cascading solution lying about?
  3. Is there some other open-source example that is very close to this?

PS: I saw the "diff between tables" question, but the solutions over there didn't look scalable.

PPS: Here is a little Ruby toy that demonstrates the algorithm: Ruby dbdiff

Bill Burcham
  • 2. Yes, the Pig solution, at least for the added and removed parts, is a blatantly obvious `LEFT OUTER JOIN` and `FILTER` based on whether the joined column is `null`. As for "changed", my best guess is an inner `JOIN` and a filter based on whether the fields differ. – TC1 May 05 '13 at 20:07

4 Answers


I think it would be easiest just to write your own job, mostly because you'll want to use MultipleOutputs to write to the three separate files from a single reduce step, whereas the typical reducer writes to only one file. You'd also need MultipleInputs to specify a mapper for each table.
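A minimal sketch of that job wiring, assuming Hadoop 2's mapreduce API; OldTableMapper, NewTableMapper and DiffReducer are hypothetical class names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Job job = Job.getInstance(new Configuration(), "trx-diff");

// one mapper per input file, both emitting (id, flagged record)
MultipleInputs.addInputPath(job, new Path("trx-old"),
        TextInputFormat.class, OldTableMapper.class);
MultipleInputs.addInputPath(job, new Path("trx-new"),
        TextInputFormat.class, NewTableMapper.class);

job.setReducerClass(DiffReducer.class);

// named outputs that the reducer addresses via MultipleOutputs.write(...)
MultipleOutputs.addNamedOutput(job, "removed", TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "added", TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "changed", TextOutputFormat.class, Text.class, Text.class);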

Chris Gerken
  • From looking at all the solutions I wonder whether it's better to use the canned CoGroup and maybe some redundant filtering or whether it's better to hand-craft a solution that writes to the MultipleOutputs on the fly. – Bill Burcham May 06 '13 at 00:16

What comes to my mind is this:

Suppose your tables look like this:

Table_old
1    other_columns1
2    other_columns2
3    other_columns3

Table_new 
2    other_columns2
3    other_columns3
4    other_columns4

Append "a" to table_old's elements and "b" to table_new's elements.

When you merge both files, if an element exists in the first file but not in the second, that record was removed:

table_merged
1a    other_columns1
2a    other_columns2
2b    other_columns2
3a    other_columns3
3b    other_columns3
4a    other_columns4

From that file you can do your operations easily.

Also, say your ids are n digits long and you have 10 worker nodes plus 1 master. Your key would be the first digit of the id, so you divide the data evenly across the workers. You would do grouping plus partitioning so that your data arrives sorted.

Example,

table_old
1...0 data
1...1 data
2...2 data

table_new
1...0 data
2...2 data
3...2 data

Your key is the first digit, you group according to that digit, and you partition according to the rest of the id. Then your data will arrive at your workers as

worker1
1...0b data
1...0a data
1...1a data

worker2 
2...2a data
2...2b data and so on.

Note that the "a"/"b" suffixes don't have to be sorted.
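
One way to realize that even spread is a custom Partitioner keyed on the first digit of the id; a minimal sketch (FirstDigitPartitioner is a made-up name, and it assumes every id starts with a digit):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each record by the first digit of its id so the data
// is spread evenly across the reduce workers.
public class FirstDigitPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int firstDigit = key.toString().charAt(0) - '0';
        return firstDigit % numPartitions;
    }
}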

EDIT: The merge would work like this:

FileInputFormat.addInputPath(job, new Path("trx-old"));
FileInputFormat.addInputPath(job, new Path("trx-new"));

MR will take the two inputs and the two files will be merged.

For the appending part, you should create two more map-only jobs before the main MR job. The first map will append "a" to every element of the first list and the second will append "b" to the elements of the second list. The third job (the main one) will then only need a reduce step to collect them, so you will have Map-Map-Reduce.

Appending can be done like this:

// given a key of type Text
new Text(key.toString() + "a")

but there may be other, more efficient ways of appending with Hadoop's Text type.
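
For instance, the first tagging job's mapper might look like this; TagAMapper is a made-up name, and it assumes the id is the first whitespace-separated token of each line:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads "id other_columns" lines from trx-old and emits (id + "a", rest).
// An identical mapper appending "b" would run over trx-new.
public class TagAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\\s+", 2); // id, rest of record
        context.write(new Text(parts[0] + "a"),
                new Text(parts.length > 1 ? parts[1] : ""));
    }
}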

Hope it helps,

smttsp
  • That was helpful, and I think I get the gist. But you assume some MR knowledge I don't possess. In particular, I don't know exactly how to "Append" and then to "merge" as you suggest early on. – Bill Burcham May 06 '13 at 00:09
  • I edited my post. I can say that my code became harder; if there is any easier ready-made code or library for this, you'd better use it. – smttsp May 06 '13 at 08:15

This seems like the perfect problem to solve in Cascading. You have mentioned that you have never written an MR application, and if the intent is to get started quickly (assuming you are familiar with Java) then Cascading is the way to go, IMHO. I'll touch more on this in a second.

It is possible to use Pig or Hive, but they aren't as flexible if you want to perform additional analysis on these columns or change schemas, since in Cascading you can build your Schema on the fly by reading the column headers, or from a mapping file you create to denote the Schema.

In Cascading you would:

  1. Set up your incoming Taps: Tap trxOld and Tap trxNew (these point to your source files)
  2. Connect your taps to Pipes: Pipe oldPipe and Pipe newPipe
  3. Set up your outgoing Taps: Tap trxRemoved, Tap trxAdded and Tap trxChanged
  4. Build your Pipe analysis (this is where the fun (hurt) happens)

trx-removed & trx-added

Pipe trxOld = new Pipe("old-stuff");
Pipe trxNew = new Pipe("new-stuff");
// smallest size Pipe on the right in CoGroup
Pipe oldNnew = new CoGroup("old-N-new", trxOld, new Fields("id1"),
                                        trxNew, new Fields("id2"),
                                        new OuterJoin());

The outer join gives us NULLs where ids are missing from the other Pipe (your source data), so we can use FilterNotNull or FilterNull in the logic that follows to get final pipes that we then connect to Tap trxRemoved and Tap trxAdded respectively.
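
A sketch of that filtering, assuming the CoGroup above leaves the grouping fields named id1 (old side) and id2 (new side); FilterNotNull (from cascading.operation.filter) discards tuples whose argument is non-null, which keeps exactly the rows we want:

// keep rows whose new-side id is null: present only in old, i.e. removed
Pipe removed = new Pipe("removed", oldNnew);
removed = new Each(removed, new Fields("id2"), new FilterNotNull());

// keep rows whose old-side id is null: present only in new, i.e. added
Pipe added = new Pipe("added", oldNnew);
added = new Each(added, new Fields("id1"), new FilterNotNull());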

trx-changed

Here I would first concatenate the fields you are looking for changes in using FieldJoiner, then use an ExpressionFilter to give us the zombies (because they changed), something like:

Pipe valueChange = new Pipe("changed", oldNnew);
valueChange = new Each(valueChange, new Fields("oldValues", "newValues"),
        new ExpressionFilter("oldValues.equals(newValues)", String.class));

This keeps only the differences: if the expression above evaluates to true (the old and new values are equal), the record is discarded. Finally, connect your valueChange pipe to your Tap trxChanged and you will have three outputs with all the data you are looking for, with code that allows for some added analysis to creep in.
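
To finish the wiring, a FlowDef along these lines would bind the heads and tails to the Taps from steps 1 and 3 (the paths and the TextDelimited scheme are assumptions, and removed/added are the pipes from the sketch above):

// FlowDef, Hfs and TextDelimited live in cascading.flow,
// cascading.tap.hadoop and cascading.scheme.hadoop respectively
FlowDef flowDef = FlowDef.flowDef()
    .addSource(trxOld, new Hfs(new TextDelimited(true, "\t"), "trx-old"))
    .addSource(trxNew, new Hfs(new TextDelimited(true, "\t"), "trx-new"))
    .addTailSink(removed, new Hfs(new TextDelimited(true, "\t"), "trx-removed"))
    .addTailSink(added, new Hfs(new TextDelimited(true, "\t"), "trx-added"))
    .addTailSink(valueChange, new Hfs(new TextDelimited(true, "\t"), "trx-changed"));

new HadoopFlowConnector().connect(flowDef).complete();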

Engineiro
  • This looks pretty close to an actual solution. CoGroup is just the magic outer join thingie I need! It makes me wonder what the performance of CoGroup is and whether it can benefit from the fact that my tables are pre-sorted. – Bill Burcham May 06 '13 at 00:11
  • The performance of joins in MapReduce should always be expected to be slow-ish (technical term). The framework is not adept at expediting joins (just ask Hive or Pig), and in general Hadoop should be thought of as an 18-wheeler pulling tons of weight: slow, but better than using 1 courageous car. Now, to your second point about presorted data, I am not aware of its impact on optimizations. I do know CoGroup sorts the group key by natural order, which tells me it's built in. I am compelled to believe that presorted data will run faster than unsorted in the GroupBy pipe. – Engineiro May 06 '13 at 13:47

As @ChrisGerken suggested, you would have to use MultipleOutputs and MultipleInputs in order to generate multiple output files and associate custom mappers to each input file type (old/new).

The mapper would output:

  • key: primary key (id)
  • value: record from input file with additional flag (new/old depending on the input)

The reducer would iterate over all records R for each key and output:

  • to the removed file: if only a record flagged old exists.
  • to the added file: if only a record flagged new exists.
  • to the changed file: if both exist and the records differ.

Since the output of this algorithm is spread across the reducers, you'd most likely need a second job to merge the results into a single file for the final output.
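
A minimal sketch of that reducer, assuming each mapper prefixes the value with an "old\t" or "new\t" flag and that the three named outputs were registered on the job (DiffReducer is a made-up name):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class DiffReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text id, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        String oldRec = null, newRec = null;
        for (Text r : records) {
            String s = r.toString();
            if (s.startsWith("old")) oldRec = s.substring(4); // strip "old\t"
            else newRec = s.substring(4);                     // strip "new\t"
        }
        if (newRec == null) out.write("removed", id, new Text(oldRec));
        else if (oldRec == null) out.write("added", id, new Text(newRec));
        else if (!oldRec.equals(newRec)) out.write("changed", id, new Text(newRec));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}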

harpun
  • There is no answer related to HOW. You've just said that if the record doesn't exist in the second file, count it as removed; that much is already in the question. – smttsp May 05 '13 at 20:18
  • @smttsp: I hope this clears things up. The reducer will get the following input and mark it as removed: `(1, {old1})`, added: `(2, {new2})`, changed: `(3, {old3, new3})`. – harpun May 05 '13 at 20:24
  • I wonder how @Engineiro's solution (using CoGroup) would compare to this solution performance-wise. – Bill Burcham May 06 '13 at 00:14