How to process a large file in Hadoop?

Question

This is a newbie question.

I have a Hadoop setup and am thinking of using Giraph or Hama for graph-based computation. I have a large file of the form

3 4
3 7
3 8
5 6

where each column denotes a vertex and each row denotes an edge. In an ordinary program I would read the whole file into a structure like

3: [4,7,8]
5: [6]

which means vertex 3 has edges to 4, 7, and 8, and vertex 5 has an edge to 6.

How do I handle this for a large file in Hadoop? Does reading it this way mean loading the whole contents into RAM? What is the best way to do it in Hadoop?

score 0 · Answer 1 · answered Jun 12 '14 at 03:44

Hadoop provides horizontal parallelism. It divides a large input file into smaller splits (the split size is configurable by the user) and sends those splits to different nodes, so you never need to load the whole input into the limited memory of a single machine. Up to this point the Hadoop framework does the work for you.
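To make the splitting idea concrete, here is a minimal local sketch in plain Python. It is not Hadoop code: Hadoop splits input by byte ranges on HDFS, whereas this illustration splits by line count. The helper name `iter_splits` and the parameter `lines_per_split` are made up for the example. The point is only that each split is small and can be handled without holding the whole file in memory.

```python
import io
from itertools import islice


def iter_splits(lines, lines_per_split):
    """Lazily yield lists of at most `lines_per_split` lines.

    Only one split is held in memory at a time, which mimics (very
    roughly) how each Hadoop task sees only its own part of the input.
    """
    it = iter(lines)
    while True:
        chunk = list(islice(it, lines_per_split))
        if not chunk:
            return
        yield chunk


# Stand-in for a large edge-list file; a real file object opened with
# open(path) would be consumed line by line in the same way.
edge_file = io.StringIO("3 4\n3 7\n3 8\n5 6\n")

# Collected into a list here only so the result can be inspected;
# a real consumer would iterate over the splits one at a time.
splits = list(iter_splits(edge_file, 2))
for number, split in enumerate(splits):
    print(number, [line.strip() for line in split])
```

In a real cluster you would not write this yourself; the framework's input format takes care of it.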

After that you need to implement your business/domain logic. In the map stage you generate key-value pairs from your input. Hadoop then sends all the key-value pairs to the next stage, grouped by key: for each unique key you receive all of its values, and you merge them to produce the final output.
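For the edge list in the question, the flow could look roughly like the sketch below. It simulates the map, grouping, and reduce stages locally in plain Python and does not use any Hadoop API. The function names `map_edge`, `shuffle`, and `reduce_vertex` are hypothetical, and the choice of the source vertex as the key is an assumption based on the adjacency-list form the question asks for.

```python
from collections import defaultdict


def map_edge(line):
    """Map stage: turn one 'src dst' line into a (src, dst) pair."""
    src, dst = line.split()
    yield int(src), int(dst)


def shuffle(pairs):
    """Stand-in for the framework's grouping step: collect values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_vertex(vertex, neighbours):
    """Reduce stage: merge all neighbours of one vertex into a sorted list."""
    return vertex, sorted(neighbours)


lines = ["3 4", "3 7", "3 8", "5 6"]

# Each line is mapped independently, so the lines could live on different nodes.
pairs = [pair for line in lines for pair in map_edge(line)]

adjacency = dict(reduce_vertex(k, vs) for k, vs in shuffle(pairs).items())
print(adjacency)  # {3: [4, 7, 8], 5: [6]}
```

Only the map and reduce functions correspond to code you would write; the grouping in the middle is what the framework does between the two stages.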

Points to note:
1) Hadoop is a framework for the MapReduce paradigm.
2) A large input file doesn't always mean that using Hadoop is practical for your problem. If your problem has no parallelism in it, Hadoop will probably not help you.

