
Following up on this question: Splitting out a large file.

I would like to stream large gzipped files from an Amazon s3:// bucket and process them with an awk command.

Sample file to process

...
  {"captureTime": "1534303617.738","ua": "..."}
...

Script to optimize

aws s3 cp s3://path/to/file.gz - \
 | gzip -d \
 | awk -F'"' '{date=strftime("%Y%m%d%H",$4); print > ("splitted." date) }'

gzip splitted.*
# make some visual checks here before copying to S3

aws s3 cp . s3://path/to/splitted/ --recursive --exclude "*" --include "splitted.*.gz"

Do you think I can wrap everything in the same pipeline to avoid writing files locally?

I could use the approach from Using gzip to compress files to transfer with aws command to gzip and copy on the fly, but gzipping from inside awk would be great.
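Something along these lines is what I have in mind (an untested sketch, keeping the same placeholder paths as above), where awk pipes each line into a per-hour gzip process so the uncompressed splitted.YYYYMMDDHH files are never written:

aws s3 cp s3://path/to/file.gz - \
 | gzip -d \
 | awk -F'"' '{
     hour = strftime("%Y%m%d%H", $4)
     # one gzip process per distinct hour; awk reuses the same pipe for every
     # line of that hour and closes all open pipes when it exits
     print | ("gzip -c > splitted." hour ".gz")
   }'

If there are many distinct hours I would probably have to close() each pipe when the hour changes (assuming the input is roughly time-ordered) to avoid hitting the open-process limit. This still writes files locally, but only the compressed ones.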

Thank you.

Michel Hua
  • Am I understanding correctly that you would like to do all this processing without ever creating a single temporary file? I.e. no `splitted.YYYYMMDDHH`? – kvantour Aug 23 '18 at 12:39
  • the `gzip splitted.*` line seems to cost time and space. Maybe I can `print` to `gzip` inside awk, writing `splitted.YYYYMMDDHH.gz` directly instead of the uncompressed `splitted.YYYYMMDDHH`. – Michel Hua Aug 23 '18 at 13:09
  • Cost comes from writing the temporary decompressed files. – Michel Hua Aug 23 '18 at 13:45
  • You still haven't told us what `# make some visual checks here before copying to S3` means. Are you visually checking something inside the `splitted.YYYYMMDDHH` or `splitted.YYYYMMDDHH.gzip` files or something else? – Ed Morton Aug 23 '18 at 14:12
  • I would like to know if the files were split correctly (see if the dates match) and then manually cp to s3 if everything is ok (just looking at the filenames, without viewing their contents). I have about 5 batches of 200 GB to do, and I don't want the files to overlap or their names to collide. – Michel Hua Aug 23 '18 at 15:09
  • Just puzzled why this was reopened. I thought this was a duplicate of: https://stackoverflow.com/questions/21698296/awk-gzip-output-to-multiple-files?noredirect=1&lq=1 – kvantour Aug 24 '18 at 15:16

2 Answers


It took me a bit to understand that your pipeline creates one `splitted.<date>` file for each distinct hour in the source file. Since shell pipelines operate on byte streams and not files, while S3 operates on objects (files), you must turn your byte stream into a set of files on local storage before sending them back to S3. So a pipeline by itself won't suffice.

But I'll ask: what's the larger purpose you're trying to accomplish?

You're on the path to generating lots of S3 objects, one for each hour of data in your "large gzipped files". Is this using S3 as a key-value store? Is that the best design for the goal of your effort? In other words, is S3 the best repository for this information, or is there some other store (DynamoDB, or another NoSQL database) that would be a better solution?

All the best

tpgould
  • You guessed right. The overall pipeline wasn't well designed. We are considering consuming the S3 data split by hour with Redshift, but that's another question. – Michel Hua Aug 27 '18 at 17:54

Two possible optimizations:

  • On large or numerous files it helps to use all the cores to gzip them: use xargs, pigz or GNU parallel (see the sketch below)

Gzip with all cores
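For example, any one of the following could replace the single-threaded gzip splitted.* step (a sketch, assuming the splitted.* files from the question and that pigz / GNU parallel are installed):

# pigz is a drop-in, multi-threaded replacement for gzip
pigz splitted.*

# GNU parallel: one gzip process per file, spread across all cores
parallel gzip ::: splitted.*

# xargs: same idea; -P bounds the number of concurrent gzip processes
printf '%s\n' splitted.* | xargs -P "$(nproc)" -n 1 gzip

pigz parallelizes the compression of each individual file, while the parallel and xargs variants compress several files at once, which fits better when there are many split files of similar size.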

Michel Hua
  • Glad you solved your problem, but I won't upvote a link-only answer. Why not edit your A to show the actual code that enhances your pipeline, and the commands you are using to "use all the cores"? I will upvote that ;-) Good luck. – shellter Aug 23 '18 at 21:02
  • I don't see how using all the cores to gzip the files will reduce the number of files your code generates. – Ed Morton Aug 23 '18 at 21:29
  • It will reduce the time spent running the code. I am trying to optimize both computational cost and space cost. I think the `awk` command can be parallelized too. If I rank the script's steps by CPU time, it's `gzip`, then `aws s3 cp`, then `awk`. `aws s3` is network time, so it is a bit different though. – Michel Hua Aug 24 '18 at 05:50