I have written a few MapReduce programs with Apache Hadoop 0.2.x versions - in simple words, I'm a beginner.
I am attempting to process a large (over 10 GB) SegY file on a Linux machine using software called SeismicUnix.
The basic commands that I execute on the Linux machine are listed below:
# Read the SegY file and convert it to the custom SU format (.su file)
segyread tape=input.sgy verbose=1 endian=0 | segyclean > input.su
# Pipe the processing commands, viz. suhilb and suaccor
suhilb < Noise1_10.su | suaccor ntout=1001 sym=0 > output.su
# Create headers for converting back to the SegY format
segyhdrs < output.su bfile=binary hfile=header
# Create the final output file in SegY format
segywrite < output.su tape=output.segy buff=1 conv=1 bfile=binary hfile=header
These steps take a long time on a single machine, so an Apache Hadoop cluster has been set up to speed things up.
My thought process is as follows:
- Split the source SegY file across the cluster (so that a small chunk of the large file is available for processing on every node).
- Possibly use Hadoop Streaming to call the SeismicUnix commands and process the small chunks on every node (see the sketch after this list).
- Aggregate the processed files into one large SegY file, which will be the output.
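For illustration, this is roughly the kind of streaming invocation I have in mind, with the suhilb | suaccor pipeline wrapped in a small shell script (run_su.sh is a hypothetical name) so that both commands run inside one mapper; the streaming jar path and the HDFS directories /user/me/su_chunks and /user/me/su_out are placeholders, and I have not verified that streaming would hand each mapper a valid chunk of the binary SU data:

#!/bin/sh
# run_su.sh (hypothetical wrapper): runs the whole SU pipeline as one mapper
suhilb | suaccor ntout=1001 sym=0

# Submit a map-only streaming job (mapred.reduce.tasks=0), shipping run_su.sh to every node
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input /user/me/su_chunks \
    -output /user/me/su_out \
    -mapper run_su.sh \
    -file run_su.sh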
Technical queries/challenges:
- The source SegY file first needs to be loaded onto HDFS before it is available to the different nodes for processing. How should I do this - create a SequenceFile or something else? (SeismicUnix reads a SegY file, converts it into a custom format and then processes it!)
- As shown in the second command, the different operations (commands) are piped in the order in which they are to be executed, e.g. suhilb | suaccor. Now, can this happen in one mapper, or do I need to create one mapper for suhilb and feed its output to suaccor? I'm highly confused here.
- Assuming the processing is done and that each node now creates its own output.segy (is this assumption correct?), how do I merge those files? I'm totally clueless here; the only commands I have found so far are sketched after this list.
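For the loading and merging steps, the only Hadoop shell commands I have come across so far are the ones below; the HDFS paths are placeholders, and I do not know whether a plain getmerge would preserve the trace order or yield a valid SegY file:

# Copy the 10 GB source file from the local disk onto HDFS
hadoop fs -put input.sgy /user/me/input.sgy

# After the job, pull all the part files from HDFS into one local file
hadoop fs -getmerge /user/me/su_out merged_output.su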
I have read a bit about Google's FlumeJava, thinking of it as a possible solution, but I would like to stick to a Hadoop-only (i.e. no extra libraries) approach for now.
Apologies if my queries are not sufficiently in-depth or are too terse - honestly, I am not able to form a clear picture of the design/code!