
In my database I have multiple tables, each representing a different entity type. I have an Avro schema that I use in Hadoop which is a union of all the fields of these entity types, plus an entity-type field.

What I would like to do is something along the lines of setting up a DBInputFormat with a DBWritable for each entity type that maps that entity type to the combined Avro type. Then I'd give each DBInputFormat to something like MultipleInputs so that I can create a composite input format. The composite input format could then be given to my MapReduce job so that all of the data from all the tables could be processed at once by the same mapper class.
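
For illustration, a minimal sketch of what one such per-entity DBWritable might look like (the table, its columns, and the Avro-generated `CombinedEntity` class are hypothetical names, not from any existing library):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// One DBWritable per entity type; this one reads rows of a hypothetical "customer"
// table and knows how to copy them into the combined Avro record.
public class CustomerWritable implements Writable, DBWritable {
    private long id;
    private String name;

    @Override
    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");        // columns of the "customer" table (hypothetical)
        name = rs.getString("name");
    }

    @Override
    public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, name);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        name = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(name);
    }

    /** Copies this row into the combined Avro record and tags it with its entity type. */
    public CombinedEntity toCombined() {
        CombinedEntity record = new CombinedEntity(); // Avro-generated class (hypothetical)
        record.setEntityType("customer");
        record.setCustomerId(id);
        record.setCustomerName(name);
        return record;
    }
}
```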

Data is constantly added to these database tables, so I need to be able to configure the DBInputFormat for each entity type/table to grab only the new data and to compute the splits properly.
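
For a single table this part is already doable with DataDrivenDBInputFormat. A minimal sketch, reusing the `CustomerWritable` from the earlier sketch and assuming an auto-increment `id` column for splitting, a `created_at` column for the incremental filter, and a watermark value tracked outside the job (all hypothetical names); the missing piece is running several such configurations in one job:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

public class CustomerInputSetup {
    public static void configure(Job job, String lastRunTimestamp) {
        // JDBC connection settings (example values).
        DBConfiguration.configureDB(job.getConfiguration(),
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "user", "password");

        // Only rows added since the last run; splits are computed over the "id" column.
        DataDrivenDBInputFormat.setInput(job, CustomerWritable.class,
                "customer",                                 // table (hypothetical)
                "created_at > '" + lastRunTimestamp + "'",  // conditions: new data only
                "id",                                       // split-by column
                "id", "name", "created_at");                // fields to select
    }
}
```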

Basically I need the functionality of DBInputFormat or DataDrivenDBInputFormat but also the ability to make a composite of them similar to what you can do with paths and MultipleInputs.

user533020

1 Answer


Create a view from the N input tables and set the view in DBInputFormat#setInput. According to the Cloudera article quoted below, the data should not be updated in the tables while the job is running.

Hadoop may need to execute the same query multiple times. It will need to return the same results each time. So any concurrent updates to your database, etc, should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, restricting your MapReduce’s query via a clause such as “insert_date < yesterday,” or dumping the data to a temporary table in the database before launching your MapReduce process.
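
A minimal sketch of that approach, assuming a view named `all_entities` that UNIONs the per-entity tables with an added `entity_type` column, and a `CombinedEntityWritable` class (hypothetical, along the lines of the per-entity writables the question describes) that reads the view's columns:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

public class ViewInputSetup {
    // The view would be created once in the database, e.g.:
    //   CREATE VIEW all_entities AS
    //     SELECT id, name, 'customer' AS entity_type, insert_date FROM customer
    //     UNION ALL
    //     SELECT id, name, 'order'    AS entity_type, insert_date FROM orders;
    public static void configure(Job job, String yesterday) {
        DBConfiguration.configureDB(job.getConfiguration(),
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "user", "password");

        DBInputFormat.setInput(job, CombinedEntityWritable.class,
                "all_entities",                        // the view (hypothetical name)
                "insert_date < '" + yesterday + "'",   // freeze the data the job sees
                "id",                                  // order-by column
                "id", "name", "entity_type", "insert_date");
    }
}
```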

Evaluate frameworks which support real-time processing, like Storm, HStreaming, S4, and StreamBase. Some of these sit on top of Hadoop and some don't; some are FOSS and some are commercial.

Praveen Sripati
  • I'm afraid this might be the only answer short of writing my own InputFormat. The problem with this approach is that then ALL of the mappers will be querying ALL of the tables which will be much less efficient than a subset of the mappers querying one table, another subset querying a different table, etc. I'd have to make the window size that each mapper grabs much smaller since it's grabbing that window from every table. Each window grab would be a full table scan of the table so there would be significantly more full table scans. – user533020 Dec 05 '11 at 05:44
  • Or else dump the tables into multiple files with an export utility and use [MultipleInputs](http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/MultipleInputs.html) (a minimal sketch follows below). If you happen to write your own input format then try to contribute it back to Apache. – Praveen Sripati Dec 05 '11 at 07:21
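
A minimal sketch of that export-then-MultipleInputs route, assuming each table has already been dumped to its own HDFS directory and that the per-table mappers translate the exported rows into the combined Avro type (the paths, class names, and mapper bodies are hypothetical):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ExportedTablesJobSetup {

    /** Parses a line from the customer dump into the combined record (body elided). */
    public static class CustomerMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // parse the exported customer row and emit it as the combined Avro type
        }
    }

    /** Parses a line from the orders dump into the combined record (body elided). */
    public static class OrderMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // parse the exported order row and emit it as the combined Avro type
        }
    }

    public static void configure(Job job) {
        // One exported dump per table, each with its own mapper feeding the same job.
        MultipleInputs.addInputPath(job, new Path("/exports/customer"),
                TextInputFormat.class, CustomerMapper.class);
        MultipleInputs.addInputPath(job, new Path("/exports/orders"),
                TextInputFormat.class, OrderMapper.class);
    }
}
```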