I have a large dataset (>40 GB) that I want to store in S3 and then query with Athena.
As suggested by this blog post, I could store my data in the following hierarchical directory structure, which would let me use MSCK REPAIR TABLE
to add partitions automatically when creating a table from my dataset:
s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/
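To make the layout concrete, here is a minimal sketch of how I would build such a prefix. The helper name `partition_prefix` and the partition columns `year` and `month` are placeholders I made up for illustration; they are not from the blog post.

```python
def partition_prefix(bucket, table_path, partitions):
    """Build a Hive-style S3 prefix from ordered (column, value) pairs."""
    # Each partition level becomes one "<column>=<value>" path segment.
    segments = "/".join(f"{col}={val}" for col, val in partitions)
    return f"s3://{bucket}/{table_path.strip('/')}/{segments}/"


# Hypothetical partition columns, outermost level first.
prefix = partition_prefix("yourBucket", "pathToTable",
                          [("year", "2017"), ("month", "01")])
print(prefix)  # s3://yourBucket/pathToTable/year=2017/month=01/
```

The order of the pairs matters, since it determines the nesting of the folders.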
However, this requires me to split my dataset into many smaller data files, each stored under a nested folder determined by its partition keys.
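As a rough illustration of that split, assuming the records are plain dictionaries and `year` and `month` are the (hypothetical) partition columns, the grouping would look something like this. Every distinct combination of partition values ends up as its own folder holding at least one file, which is where the file count grows.

```python
from collections import defaultdict


def split_by_partition(rows, partition_cols):
    """Group rows (dicts) by the folder their partition values map to."""
    groups = defaultdict(list)
    for row in rows:
        # One "<column>=<value>" segment per partition column, in order.
        folder = "/".join(f"{col}={row[col]}" for col in partition_cols)
        groups[folder].append(row)
    return dict(groups)


# Made-up sample records, only to show the shape of the result.
rows = [
    {"year": 2017, "month": 1, "value": 10},
    {"year": 2017, "month": 1, "value": 11},
    {"year": 2017, "month": 2, "value": 12},
]
groups = split_by_partition(rows, ["year", "month"])
for folder, members in groups.items():
    print(folder, len(members))
```

With real data, the number of folders is the number of distinct value combinations across all partition columns, so adding a high-cardinality column multiplies the number of files.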
Although partitioning could reduce the amount of data Athena scans and therefore speed up queries, would managing a large number of small files cause performance issues for S3? Is there a tradeoff I need to consider here?