How to get s3distcp to merge with newlines

Question

I have many millions of small one-line S3 files that I want to merge together. I have the s3distcp syntax down; however, I've discovered that after merging, the merged files contain no newlines between the contents of the original files.

Does s3distcp include an option to force a newline in, or is there another way to accomplish this without modifying the source files directly (or copying them and modifying the copies)?

Looks like you need to add new line to the one line file itself. Isn't that an option? — user1452132, Jul 14 '15 at 12:51
Yeah, I was hoping to avoid that and get that "for free" from s3distcp directly, but doesn't sound like I can avoid that unfortunately — isueightynine, Jul 15 '15 at 17:48

score 2 · Answer 1 · answered Aug 28 '15 at 00:52

If your text files begin and end with a unique sequence of characters, you can first merge them into a single file with s3distcp (I did this by setting --targetSize to a very large number), then use sed with Hadoop streaming to add the newlines. In the following example, each file contains a single JSON object (the filenames all begin with 0), and the sed command inserts a newline between each instance of }{:

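# Create the intermediate and final HDFS folders.
# Caution: MapReduce jobs normally fail if their output directory already
# exists, so skip the second mkdir if the streaming step reports that error.
hadoop fs -mkdir hdfs:///tmpoutputfolder/
hadoop fs -mkdir hdfs:///finaloutputfolder/
# Step 1: merge every file whose name matches the groupBy pattern.
hadoop jar lib/emr-s3distcp-1.0.jar \
               --src s3://inputfolder \
               --dest hdfs:///tmpoutputfolder \
               --targetSize 1000000000 \
               --groupBy ".*(0).*"
# Step 2: pass the merged data through a single reducer that runs sed,
# inserting a newline between each "}{" pair.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
               -D mapred.reduce.tasks=1 \
               --input hdfs:///tmpoutputfolder \
               --output hdfs:///finaloutputfolder \
               --mapper /bin/cat \
               --reducer '/bin/sed "s/}{/}\n{/g"'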
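
To sanity-check the substitution before launching the streaming job, it can be tried locally on a small sample. This is only a sketch: the sample JSON and the function name split_json_objects are made up for illustration, and it assumes GNU sed, which treats \n in the replacement as a newline. Keep in mind that a }{ sequence occurring inside a JSON string value would be split as well.

```shell
# Same substitution as the reducer above, wrapped in a function so it can
# be tested on its own. Assumes GNU sed (\n in the replacement = newline).
split_json_objects() {
    sed 's/}{/}\n{/g'
}

# Hypothetical sample: three one-line JSON files merged with no separator.
printf '%s' '{"id":1}{"id":2}{"id":3}' | split_json_objects
```

If each object lands on its own line here, the same expression should behave the same way inside the reducer.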

score 0 · Answer 2 · ks maxeed · 2021-08-30T00:35:39.050

I had the same problem, and the sed approach breaks the folder structure.
s3distcp is just a MapReduce program.
So another way is to write your own MapReduce program that merges the files and adds a newline, along with any other behavior you want.

Or you can search for such a MapReduce program on the internet or on GitHub. I created one of them: github.com/ksmaxeed/s3distcp.
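
The core logic of such a program is small: write out each file's contents and add a newline whenever the file does not already end with one. As a rough local illustration of that logic only (this is not the code in the linked repository, and it is not a MapReduce job), here is a sketch using a hypothetical helper named cat_with_newlines and made-up sample files:

```shell
# Concatenate the files named on the command line, making sure each one
# ends with a newline. cat_with_newlines is a hypothetical helper name.
cat_with_newlines() {
    for f in "$@"; do
        cat "$f"
        # tail -c 1 prints the last byte of the file. Command substitution
        # strips a trailing newline, so a non-empty result means the file
        # does not end with a newline and one must be added.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n'
        fi
    done
}

# Hypothetical sample files standing in for the one-line S3 objects.
tmpdir=$(mktemp -d)
printf '%s' '{"id":1}' > "$tmpdir/0001"     # no trailing newline
printf '%s\n' '{"id":2}' > "$tmpdir/0002"   # already has a newline
printf '%s' '{"id":3}' > "$tmpdir/0003"     # no trailing newline
cat_with_newlines "$tmpdir"/000*
```

Checking the last byte rather than always appending a newline avoids blank lines when some source files already end with one.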

edited Aug 30 '21 at 00:35

answered Aug 28 '21 at 11:30