Processing unstructured data with Hadoop + MapReduce

Question

I would like to use hadoop to process unstructured CSV files. These files are unstructured in the sense that they contain multiple data values from different types with varying row lengths. In addition, there are hundreds of these files and they are often relatively large in size (> 200Mb).

The structure of each file can be demonstrated like so:

Book     , ISBN          , BookName                     , Authors    , Edition
Book     , 978-1934356081, Programming Ruby 1.9         , Dave Thomas, 1
Book     , 978-0596158101, Programming Python           , Mark Lutz  , 4
...
BookPrice, ISBN          , Store                        , Price
BookPrice, 978-1934356081, amazon.com                   , 30.0
BookPrice, 978-1934356081, barnesandnoble.com           , 30.67
BookPrice, 978-0596158101, amazon.com                   , 39.55
BookPrice, 978-0596158101, barnesandnoble.com           , 44.66
...
Book     , ISBN          , BookName                     , Authors    , Edition
Book     , 978-1449311520, Hadoop - The Definitive Guide, Tom White  , 3
...

The files are generated automatically, and I have no control over the given structure. Basically, there's a header row followed by data rows containing values matching the headers. The type of row can be identified by the first comma separated word. So, from the example, the Book row contains metadata about books (name, isbn, author, edition), and the BookPrice contains the various prices for the books for different outlets/vendors.

I'm trying to understand how to use Map/Reduce to perform some aggregate calculations on the data. Having the data structured the way it is makes it more difficult to understand what key -> value pairs to extract in each phase.

For example, I'd like to calculate the AVERAGE, MAX and MIN prices for each book (can be joined/grouped by ISBN). I realize I can do some pre-processing to extract that data to ordered, one-type CSV files and work from there (using grep, python, awk, etc), but that'll defeat the point of using M/R+Hadoop, and will require a lot of additional work.

I thought about using multiple map stages, but I'm fairly new to all of this and not sure how/where to start.

How do I go about implementing such M/R job (in Java) for the sample file/query? Thanks.

score 3 · Answer 1 · answered Mar 20 '12 at 08:11

3

I faced somwhat similar case and did the following design:
I have developed input format which use OpenCSV parser to actually split records. Then I filled MapWritable as a value. Each map contain one record with "fieldName->field value" entries.
In your case I would make Key something like enumerator containing record type like "price record", "authors records" etc.

Then in you mapper you can write relatively simple code which will recognize records of interest and aggregate them.

A bit more complicated but more rewarding way would be to create SerDe for the Hive which will map the files into table of structure : Record type (described above) and KeyValueMap columns. (Hive support map type for the column). Then you would be able to make SQLs against your semi-structured data.

answered Mar 20 '12 at 08:11

David Gruzman

7,900
1
28
30

very cool approach, thanks. Do you mind sharing some of your actual implementation details/code? – sa125 Mar 20 '12 at 09:19
David how did u seperate the header from the file?http://stackoverflow.com/questions/21040166/aggregation-in-mapreduce – USB Jan 17 '14 at 09:35
Usually I write input format, derived from TextInputFormat, which give special handling to the first string / strings – David Gruzman Jan 17 '14 at 12:44

Processing unstructured data with Hadoop + MapReduce

1 Answers1