7

I have a large dataset (>40 GB) that I want to store in S3 and then query with Athena.

As suggested by this blog post, I could store my data in the following hierarchical directory structure to enable using MSCK REPAIR TABLE to automatically add partitions when creating a table from my dataset.

s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/
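For reference, a minimal sketch of the Athena DDL I have in mind (the table name, columns, file format, and the dt partition column are all hypothetical placeholders):

    -- Table layout, column names, and delimiter are placeholders for illustration.
    CREATE EXTERNAL TABLE my_table (
      id string,
      payload string
    )
    PARTITIONED BY (dt string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://yourBucket/pathToTable/';

    -- Scans the prefixes under LOCATION and registers every dt=<value>/ folder as a partition.
    MSCK REPAIR TABLE my_table;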

However, this requires me to split my dataset into many smaller data files, each of which will be stored under a nested prefix determined by the partition keys.

Although using partitions could reduce the amount of data Athena has to scan and therefore speed up a query, would managing a large number of small files cause performance issues for S3? Is there a tradeoff here I need to consider?
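To make the tradeoff concrete, the kind of partition-pruned query I expect to run looks roughly like this (again using the hypothetical dt partition column); Athena should only read the objects under the matching prefix:

    -- Only the s3://yourBucket/pathToTable/dt=2018-10-01/ prefix should be scanned,
    -- because dt is a partition column. Names and the date value are illustrative.
    SELECT count(*)
    FROM my_table
    WHERE dt = '2018-10-01';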

panc
  • Hi @panc, did you finally implement this approach? I'm also deciding how to partition my data, and one option would be to do it with minute granularity, which results in many small files. In the tests I've done (just 1 GB of data), having small files obviously reduced the amount of scanned data but greatly increased the time to answer simple queries, so I'd say that it matters. – Gonzalo Solera Oct 22 '18 at 12:12
  • @Gonzalo we did not use this approach. We didn't even test it the way you did. – panc Oct 23 '18 at 13:08
  • Okay thanks, I added an answer to help other people with the same question based on my findings. – Gonzalo Solera Oct 23 '18 at 15:23

2 Answers

4

Yes, you may experience a significant decrease in efficiency with small files and lots of partitions.

Here is a good explanation with suggestions on file sizes and the number of partitions: files should be larger than 128 MB to compensate for the overhead.
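If your raw files come out much smaller than that, one option (which I have not benchmarked myself) is to rewrite them into fewer, larger files with an Athena CTAS query; the table names, output location, format, and bucketing settings below are placeholder assumptions:

    -- Hypothetical CTAS that compacts the source table into larger Parquet files.
    -- bucket_count caps the number of files written per partition.
    CREATE TABLE my_table_compacted
    WITH (
      format = 'PARQUET',
      external_location = 's3://yourBucket/pathToTableCompacted/',
      partitioned_by = ARRAY['dt'],
      bucketed_by = ARRAY['id'],
      bucket_count = 4
    ) AS
    SELECT id, payload, dt   -- the partition column must come last in the SELECT
    FROM my_table;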

Also, I performed some experiments on a very small dataset (1 GB), partitioning my data by minute, hour, and day. The amount of scanned data decreases as you make the partitions smaller, but the time spent on the query increases a lot (40 times slower in some of my experiments).

Gonzalo Solera
2

I will try to answer this without veering too much into the realm of opinion.

For the use cases in which I have used Athena, 40 GB is actually a very small dataset by the standards of what the underlying technology (Presto) is designed to handle. According to the Presto web page, Facebook uses the underlying technology to query their 300 PB data warehouse. I routinely use it on datasets between 500 GB and 1 TB in size.

Considering the underlying S3 technology, S3 has been used to host Dropbox and Netflix, so I doubt most enterprises could come anywhere near taxing the storage infrastructure. Where you may have heard about performance issues with S3, it usually relates to websites storing many small pieces of static content in files scattered across S3. In that case, a delay in retrieving one of these small pieces of content might affect the user experience on the larger site.

Related Reading: Presto

Zerodf