
I have a very specific problem in Hadoop.

I have two files, userlist and *raw_data*. raw_data is a pretty big file, and userlist is comparatively much smaller.

I first have to identify the number of mappers, and my userlist has to be broken into pieces equal to the number of mappers. Later the pieces have to be loaded into the distributed cache; then it has to compare with userlist, perform some analytics, and write the result to the reducer.

Please suggest.

thank you.

  • Did you accidentally write `"...it has to compare with userlist and perform some analytics"`, instead of `"...it has to compare with raw-data and perform some analytics"`? – vefthym Feb 15 '14 at 18:12

1 Answer


I do not understand why you want to partition the userlist file. If it is small, just load the entire userlist file into the distributed cache. Then, in the setup method of the map class, every mapper will have access to the entire userlist file. Moreover, you can find out the number of mappers in the setup method and partition it however you prefer there.
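A minimal sketch of that setup-method approach, assuming the Hadoop 2.x `Job.addCacheFile` API and a hypothetical raw_data layout (tab-separated, user id in the first field); class and field names here are placeholders, not from the question:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UserMatchMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> users = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Files registered in the driver with job.addCacheFile(...) are localized
        // next to the task; read the whole userlist into memory once per mapper.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null) {
            for (URI cacheFile : cacheFiles) {
                String localName = new Path(cacheFile.getPath()).getName();
                BufferedReader reader = new BufferedReader(new FileReader(localName));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        users.add(line.trim());
                    }
                } finally {
                    reader.close();
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical raw_data layout: tab-separated, user id in the first field.
        String[] fields = value.toString().split("\t", -1);
        if (fields.length > 0 && users.contains(fields[0])) {
            context.write(new Text(fields[0]), value); // pass matching records to the reducer
        }
    }
}
```

In the driver you would register the file with something like `job.addCacheFile(new URI("hdfs:///path/to/userlist"))` (the path is a placeholder). With the older mapred API, the equivalent calls are `DistributedCache.addCacheFile(...)` and `DistributedCache.getLocalCacheFiles(conf)`.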

– justin waugh