I am trying to use Apache Drill with an S3 bucket, but it is incredibly slow.
I have about 20,000 JSON files. I can get results from them locally in a few seconds, e.g.:
> select count(*) from dfs.`/path/to/my/files/*.json`;
This returns in less than 2 seconds.
Running the exact same query on the exact same files in an S3 bucket fails to complete even after 10 minutes:
> select count(*) from s3.`releases`;
Why is this? I thought the whole point of Drill was that it was fast on big datasets.
My S3 connection itself is OK: for example, SHOW FILES lists my available folders in a reasonable amount of time, and there's nothing wrong with my network connection either.