If your text files begin or end with a unique sequence of characters, you can first merge them into a single file with s3distcp (I did this by setting --targetSize to a very large number), then use sed with Hadoop streaming to insert the newlines. In the following example, each input file contains a single JSON object (the filenames all begin with 0), and the sed command inserts a newline between each occurrence of }{:
hadoop fs -mkdir hdfs:///tmpoutputfolder/
hadoop fs -mkdir hdfs:///finaloutputfolder/
hadoop jar lib/emr-s3distcp-1.0.jar \
--src s3://inputfolder \
--dest hdfs:///tmpoutputfolder \
--targetSize 1000000000 \
--groupBy ".*(0).*"
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=1 \
  -input hdfs:///tmpoutputfolder \
  -output hdfs:///finaloutputfolder \
  -mapper /bin/cat \
  -reducer '/bin/sed "s/}{/}\n{/g"'
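You can sanity-check the sed expression locally before launching the streaming job. The JSON objects below are made up for illustration, and the example assumes GNU sed, which interprets \n in the replacement as a newline (BSD/macOS sed does not):

```shell
# Three concatenated one-line JSON objects, as they would appear
# after s3distcp merges the files with no separator between them.
printf '{"a":1}{"b":2}{"c":3}' | sed 's/}{/}\n{/g'
# {"a":1}
# {"b":2}
# {"c":3}
```

If the output shows one object per line, the same expression will work as the streaming reducer.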