
I have an EBS (Elastic Block Store) volume on AWS holding a large number of files (1000 files of 25 GB each). I would like to calculate a SHA256 sum for each file.

Which EC2 instance type would be best for such a task (CPU optimized? many cores? storage optimized?). Can I somehow hash the files in parallel? How can I optimize performance?

MLu
ECII
  • Throw some spot instances at it. It shouldn't take that long even with micro instances. – Michael Hampton Aug 12 '20 at 17:30
  • You can't hash one file in parallel, but you can hash different files in parallel. You don't need storage at all. I would estimate that you want as many cores as possible, but I don't know whether there's a limit on EBS bandwidth. Some instances are EBS optimized. – user253751 Aug 12 '20 at 17:49
  • The question is not clear on this, but I think it's 25TB of files total. Initially I thought it was 25GB of files the way it's worded. – Tim Aug 13 '20 at 07:47

1 Answer


You will be limited by EBS throughput.

Smaller m5 / m5a / m6g instances have up to 4,750 Mbps of EBS throughput, i.e. about 600 MB/s at most. Larger instances like m5.24xlarge can go up to 19,000 Mbps, or about 2.4 GB/s. But that only holds if your EBS volume can handle it, i.e. it will probably have to be a Provisioned IOPS volume (io1 type) to sustain this throughput.

That means your 25 TB of data (1000 files x 25 GB each) can be read from the EBS volume in somewhere between 3 and 12 hours under ideal conditions. In reality it will probably be slower. And that's just reading the files.
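
The 3 to 12 hour range follows directly from the two throughput figures above. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 1000 MB), a sustained read rate equal to the instance limit, and no filesystem or volume overhead; the function names are only illustrative:

```python
def mbps_to_mb_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


def read_hours(total_gb: float, throughput_mbps: float) -> float:
    """Hours needed to read total_gb (decimal GB) at a sustained throughput."""
    seconds = total_gb * 1000 / mbps_to_mb_per_s(throughput_mbps)
    return seconds / 3600


TOTAL_GB = 1000 * 25  # 1000 files x 25 GB each, as in the question

for label, mbps in [("4,750 Mbps", 4750), ("19,000 Mbps", 19000)]:
    rate = mbps_to_mb_per_s(mbps)
    hours = read_hours(TOTAL_GB, mbps)
    print(f"{label}: {rate:.0f} MB/s, about {hours:.1f} hours")
```

This gives roughly 11.7 hours at 4,750 Mbps and 2.9 hours at 19,000 Mbps, which is where the 3 and 12 hour bounds come from.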

The CPU doesn't really matter: any current CPU can compute SHA-256 at this speed, so use a cheaper architecture like M6g (ARM-based). If the instance comes with multiple cores (e.g. m6g.xlarge with 4 vCPUs), you can hash 4 files in parallel, though that may not cut the time 4x as you might expect because of the EBS throughput bottleneck.
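
Hashing several files at once could look something like the sketch below. It is only one possible approach, using Python's standard `hashlib` with a thread pool; `sha256_file`, `hash_directory` and the 8 MiB chunk size are illustrative choices, not anything from AWS. Threads are enough here because CPython's `hashlib` releases the GIL while hashing larger buffers, so the workers can run on separate cores.

```python
import hashlib
import os
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Optional, Tuple

# Read in 8 MiB chunks (an arbitrary choice); a 25 GB file must never be
# loaded into memory in one go.
CHUNK_SIZE = 8 * 1024 * 1024


def sha256_file(path: str) -> Tuple[str, str]:
    """Stream one file through SHA-256 and return (path, hex digest)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            digest.update(chunk)
    return path, digest.hexdigest()


def hash_directory(directory: str, workers: Optional[int] = None) -> Dict[str, str]:
    """Hash every regular file in `directory`, `workers` files at a time."""
    files = sorted(str(p) for p in Path(directory).iterdir() if p.is_file())
    # Assumption: one worker per CPU core is a reasonable starting point.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(sha256_file, files))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python hash_all.py /path/to/mounted/volume
    for path, hexdigest in hash_directory(sys.argv[1]).items():
        print(f"{hexdigest}  {path}")
```

With the worker count set to the number of cores, adding more workers is unlikely to help once the volume's throughput limit is reached.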

I would run an m6g.4xlarge as a Spot Instance and expect it to take about a day to hash the files. If your EBS volume is in a region where M6g (ARM) is not available, use M5a (AMD): it's cheaper than M5 (Intel) and still powerful enough for the hashing.

Hope that helps :)

MLu
  • Thank you. Does hashing automatically use all CPUs? Or do I need to hash files in parallel? – ECII Aug 13 '20 at 15:08
  • @ECII AFAIK sha256 uses a single thread only. Hash more files in parallel to utilise all CPU cores. – MLu Aug 13 '20 at 21:31