I have many millions of small, one-line S3 files that I'm looking to merge together. I have the s3distcp syntax down; however, I've discovered that after merging the files, the merged output contains no newlines.

I was wondering whether s3distcp includes an option to force a newline in, or whether there is another way to accomplish this without modifying the source files directly (or copying them and doing the same).

maxymoo

2 Answers

If your text files begin/end with a unique sequence of characters, you can first merge them into a single file with s3distcp (I did this by setting --targetSize to a very large number), then use sed with Hadoop streaming to add the newlines. In the following example, each file contains a single JSON object (the filenames all begin with 0), and the sed command inserts a newline between each instance of }{:

hadoop fs -mkdir hdfs:///tmpoutputfolder/
hadoop fs -mkdir hdfs:///finaloutputfolder/
hadoop jar lib/emr-s3distcp-1.0.jar \
               --src s3://inputfolder \
               --dest hdfs:///tmpoutputfolder \
               --targetSize 1000000000 \
               --groupBy ".*(0).*"
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
               -D mapred.reduce.tasks=1 \
               --input hdfs:///tmpoutputfolder \
               --output hdfs:///finaloutputfolder \
               --mapper /bin/cat \
               --reducer '/bin/sed "s/}{/}\n{/g"'
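
To check the substitution before running it on a cluster, the same sed expression can be tried locally on a small sample. This is only a sketch: split_objects is a hypothetical helper name, the sample data is made up, and it assumes GNU sed (which treats \n in the replacement as a newline). It also relies on }{ never appearing inside a single record.

```shell
# Hypothetical helper: insert a newline between back-to-back JSON objects.
# Assumes GNU sed, where \n in the replacement text means a newline.
split_objects() {
  sed 's/}{/}\n{/g'
}

# Made-up sample: three one-line JSON files merged with no separator.
merged='{"id":1}{"id":2}{"id":3}'

# Prints one JSON object per line.
printf '%s' "$merged" | split_objects
```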
maxymoo

I have the same problem, and the sed command breaks the folder structure.
s3distcp is just a MapReduce program.
So another way is to write a MapReduce program that merges the files and adds a newline, plus any other behavior you like.

Or you can search the internet or GitHub for such a MapReduce program. I created one of them: github.com/ksmaxeed/s3distcp.
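
The per-file logic such a program needs is small: write each file's contents, then add a newline if the file does not already end with one. Here is a local sketch of just that step in plain shell. It is not a MapReduce job and not the code from the repository above; merge_with_newlines and the sample file names are hypothetical.

```shell
# Hypothetical helper: concatenate files, ending each one with a newline.
merge_with_newlines() {
  for f in "$@"; do
    cat "$f"
    # $(...) strips a trailing newline, so the result is non-empty only
    # when the file's last byte is something other than a newline.
    if [ -n "$(tail -c 1 "$f")" ]; then
      printf '\n'
    fi
  done
}

# Made-up sample data: two one-line files without trailing newlines.
tmpdir=$(mktemp -d)
printf '%s' '{"id":1}' > "$tmpdir/part-0"
printf '%s' '{"id":2}' > "$tmpdir/part-1"

# Prints each file's contents on its own line.
merge_with_newlines "$tmpdir/part-0" "$tmpdir/part-1"
rm -rf "$tmpdir"
```

Files that already end with a newline pass through unchanged, so the merged output never gains blank lines.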

ks maxeed