I have written a few MapReduce programs with Apache Hadoop 0.2.x versions - in simple words, I'm a beginner.
I am attempting to process a large (over 10 GB) SegY file on a Linux machine using software called SeismicUnix.
The basic commands that I execute on the Linux machine are listed below:
# Read the SegY file and convert it to the custom SU format (.su file)
segyread tape=input.sgy verbose=1 endian=0 | segyclean > input.su
# Pipe the processing commands, viz. suhilb and suaccor
suhilb < Noise1_10.su | suaccor ntout=1001 sym=0 > output.su
# Create headers for converting back to the SegY format
segyhdrs < output.su bfile=binary hfile=header
# Create the final output file in SegY format
segywrite < output.su tape=output.segy buff=1 conv=1 bfile=binary hfile=header
These steps take a long time on a single machine, so an Apache Hadoop cluster has been set up to speed things up.
My thought process is as follows:
- Split the source SegY file across the cluster (so that a small chunk of the large file is available for processing on every node).
- Possibly use Hadoop Streaming to call the SeismicUnix commands and process the small chunks on every node (see the sketch after this list).
- Aggregate the processed files into one large SegY file, which will be the output.
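For illustration, this is roughly the kind of streaming invocation I have in mind, with the suhilb | suaccor pipeline wrapped in a small shell script (run_su.sh is a hypothetical name) so that both commands run inside one mapper; the streaming jar path and the HDFS directories /user/me/su_chunks and /user/me/su_out are placeholders, and I have not verified that streaming would hand each mapper a valid chunk of the binary SU data:

#!/bin/sh
# run_su.sh (hypothetical wrapper): runs the whole SU pipeline as one mapper
suhilb | suaccor ntout=1001 sym=0

# Submit a map-only streaming job (mapred.reduce.tasks=0), shipping run_su.sh to every node
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input /user/me/su_chunks \
    -output /user/me/su_out \
    -mapper run_su.sh \
    -file run_su.sh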
Technical queries/challenges:
- The source SegY file first needs to be loaded onto HDFS before it is available to the different nodes for processing. How should I do this - create a SequenceFile or something else? (SeismicUnix reads a SegY file, converts it into a custom format and then processes it!)
- As shown in the second command, the different operations (commands) are piped in the order in which they are to be executed, e.g. suhilb | suaccor. Now, can this happen in one mapper, or do I need to create one mapper for suhilb and feed its output to suaccor? I'm highly confused here.
- Assuming the processing is done and that each node now creates its own output.segy (is this assumption correct?), how do I merge those files? I'm totally clueless here; the only commands I have found so far are sketched after this list.
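For the loading and merging steps, the only Hadoop shell commands I have come across so far are the ones below; the HDFS paths are placeholders, and I do not know whether a plain getmerge would preserve the trace order or yield a valid SegY file:

# Copy the 10 GB source file from the local disk onto HDFS
hadoop fs -put input.sgy /user/me/input.sgy

# After the job, pull all the part files from HDFS into one local file
hadoop fs -getmerge /user/me/su_out merged_output.su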
I have read a bit about Google's FlumeJava, thinking of it as a possible solution, but I would like to stick to a Hadoop-only (i.e. no extra libraries) approach for now.
Apologies if my queries are not sufficiently in-depth or are too terse - honestly, I am not able to form a clear picture of the design/code!