14

Consider a scenario where Spark (or any other Hadoop framework) reads a large (say 1 TB) file from S3. How do multiple Spark executors read such a very large file in parallel from S3? In HDFS this very large file would be distributed across multiple nodes, with each node holding a block of the data. In object storage I presume the entire file sits on a single node (ignoring replicas). This should drastically reduce the read throughput/performance.

Similarly, large file writes should also be much faster in HDFS than in S3, because writes in HDFS are spread across multiple hosts, whereas in S3 all the data has to go through one host (ignoring replication for brevity).

So does this mean the performance of S3 is significantly worse than HDFS in the big data world?

rogue-one
  • You're forgetting where your cluster is located. If you're running in AWS, then you might not see much penalty, vs running in your own datacenter. – tk421 Jan 15 '19 at 19:19
  • @tk421 I am comparing writing data to an HDFS cluster in AWS vs. S3 in AWS. I am only talking about the performance penalty of S3 not having fully parallel writes. – rogue-one Jan 15 '19 at 19:36
  • https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html says 6x, but I imagine in real-life workloads it might be less significant, especially if most of the time is spent in data processing vs I/O. – tk421 Jan 15 '19 at 19:52
  • S3 is slower on all counts from what I see, but for reads you do that once and then work in memory unless you spill. Use SSDs for spilling. – thebluephantom Jan 15 '19 at 21:19

1 Answer

19

Yes, S3 is slower than HDFS, but it's interesting to look at why, and how to mitigate the impact. Key thing: if you are reading a lot more data than you write, then read performance is critical; the S3A connector in Hadoop 2.8+ really helps there, as it was tuned for reading Parquet/ORC files based on traces of real benchmarks. Write performance also suffers, and the more data you generate the worse it gets. People complain about that, when they should really be worrying about the fact that, without special effort, you may actually end up with invalid output. That's generally the more important issue; it's just less obvious.

Read performance

Reading from S3 suffers due to

  • bandwidth between S3 and your VM: the more you pay for an EC2 VM, the more network bandwidth you get, and the better this is.
  • latency of HEAD/GET/LIST requests, especially all those used to make the object store look like a filesystem with directories. This can particularly hurt the partitioning phase of a query, when all the source files are listed and the ones to actually read are identified.
  • the cost of seek() being awful if the HTTP connection for a read is aborted and a new one renegotiated. Without a connector that has optimised seek() for this, ORC and Parquet input suffers badly. The S3A connector in Hadoop 2.8+ does precisely this if you set fs.s3a.experimental.fadvise to random.
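
As a rough illustration (not the only way to do it), here's how that read-side S3A option might be wired up from Spark. It assumes the hadoop-aws/S3A connector from Hadoop 2.8+ is on the classpath; the app name and bucket path are made up:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: tune the S3A connector for columnar (ORC/Parquet) reads.
// fadvise=random makes seek() issue ranged GETs instead of aborting and
// reopening the HTTP connection on every backwards seek.
val spark = SparkSession.builder()
  .appName("s3a-read-tuning")                                   // illustrative
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random") // Hadoop 2.8+
  .getOrCreate()

// Hypothetical dataset; any Parquet/ORC data on S3 would do.
val df = spark.read.parquet("s3a://my-bucket/warehouse/events/")
```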

Spark will split up work on a file if the format is splittable and whatever compression codec is used is also splittable (gz isn't, snappy is). It will split on the block size, which is something you can configure/tune for a specific job (fs.s3a.block.size). If more than one client reads the same file, then yes, you get some overload of the disk IO to that file, but generally it's minor compared to the rest. One little secret: for multipart-uploaded files, reading the separate parts seems to avoid this, so upload and download with the same configured block size.
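
For the splitting part, a minimal sketch, assuming a spark-shell style session where `spark` already exists and a hypothetical uncompressed file: the S3A client has no real blocks, but the block size it reports is what Hadoop input formats use when planning splits.

```scala
// Sketch: set the block size the S3A client reports; Hadoop input formats
// split large splittable files into tasks of roughly this size.
val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.block.size", (128 * 1024 * 1024).toString) // 128 MB

// Hypothetical path; must be uncompressed or use a splittable codec.
val lines = sc.textFile("s3a://my-bucket/logs/huge-file.csv")
println(s"splits planned: ${lines.getNumPartitions}")
```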

Write Performance

Write performance suffers from

  • caching of some/many MB of data in blocks before upload, with the upload not starting until the write is completed. S3A on Hadoop 2.8+: set fs.s3a.fast.upload = true (see the sketch after this list).
  • Network upload bandwidth, again a function of the VM type you pay for.
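
A sketch of that write-side setting, again assuming an existing `spark` session; the buffer type and part size shown are illustrative choices:

```scala
// Sketch: enable incremental multipart uploads so blocks are pushed to S3
// while the task is still writing, instead of buffering everything until close().
val hconf = spark.sparkContext.hadoopConfiguration
hconf.set("fs.s3a.fast.upload", "true")         // Hadoop 2.8+; always on in 3.x
hconf.set("fs.s3a.fast.upload.buffer", "disk")  // buffer pending blocks on local disk
hconf.set("fs.s3a.multipart.size", "64M")       // size of each uploaded part
```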

Commit performance and correctness

When output is committed by rename() of the files written to a temporary location, each object is copied to its final path at about 6-10 MB/s.
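
To put a rough number on that (illustrative arithmetic, not a benchmark): committing, say, 50 GB of output at ~8 MB/s of server-side copy is 50,000 MB / 8 MB/s ≈ 6,250 seconds, i.e. well over an hour and a half spent just on the rename.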

A bigger issue is that it is very bad at handling inconsistent directory listings or failures of tasks during the commit process. You cannot safely use S3 as a direct destination of work with the normal commit-by-rename algorithm without something to give you a consistent view of the store (consistent EMRFS, s3mper, S3Guard).

For maximum performance and safe committing of work, you need an output committer optimised for S3. Databricks have their own thing there; Apache Hadoop 3.1 adds the "S3A output committer", and EMR now apparently has something here too.
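
As a hedged sketch of what wiring Spark up to the Hadoop 3.1+ S3A committers can look like (it assumes Spark's hadoop-cloud integration module and hadoop-aws are on the classpath; the committer choice and output path are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: route Spark's commit protocol to the S3A "directory" committer
// instead of the rename-based FileOutputCommitter.
val spark = SparkSession.builder()
  .appName("s3a-committer-example")  // illustrative
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// Hypothetical output path; the write commits without a bulk copy-rename.
spark.range(1000).write.mode("overwrite").parquet("s3a://my-bucket/output/table/")
```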

See A zero rename committer for the details on that problem. After which, hopefully, you'll either move to a safe commit mechanism or use HDFS as a destination of work.

stevel
  • So, good post for sure, but I cannot glean whether a large file can be read by N executors in Spark. Your second point alludes to that... – thebluephantom Jul 08 '19 at 19:59
  • @thebluephantom it's a bit disappointing to have seen you go through so many of my posts and critique them. I did discuss it in an earlier post and opted not to provide duplicate detail here. Sorry – stevel Jul 09 '19 at 17:51
  • Really? Well, I was looking into this topic in quite some detail, but could find only general stuff except for yours. When I googled, you came up a lot, so nothing untoward meant, but you seem to be an expert. Not sure what I said that was wrong, to be quite honest. And I also did not note initially that they were all yours. In any event, I'm still not totally clear on the topic. Sorry if it makes you feel better. – thebluephantom Jul 09 '19 at 18:16
  • 2
    Noted. Returning to the question "are large files splittable?", the answer is, as I've tried to explain elsewhere: yes, if the format and the compression allow it; it'll be split on the filesystem block size, which for s3a is configurable in the client – stevel Jul 11 '19 at 10:29
  • Also noted, Steve. – thebluephantom Jul 11 '19 at 11:28
  • @stevel Great answer! Do these problems still persist? Have the read and write speeds finally caught up with HDFS? – pacman May 05 '23 at 05:28
  • write? nope. rename? nope. read: vectored IO is in Hadoop 3.3.5 and, with modified Parquet/ORC, stripe read performance is way better. You will need to patch those libs yourself though, as they still use Hadoop 2 APIs in the Apache branches. – stevel May 05 '23 at 10:46
  • also, you are still limited to 3500 writes and 5000 reads per second, a single bulk delete request of 1000 objects counts as 1000 writes, etc. S3 is not a filesystem, and the classic filesystem API calls on directories which we have to mimic hurt. Fix: don't do that. – stevel May 05 '23 at 10:49