I am trying to use Apache Drill with an S3 bucket, but it is incredibly slow.

I have about 20,000 JSON files. I can get results from them locally in a few seconds, e.g.:

> select count(*) from dfs.`/path/to/my/files/*.json`;

returns after less than 2 seconds.

Running the exact same query on the exact same files in an S3 bucket fails to complete even after 10 minutes:

> select count(*) from s3.`releases`;

Why is this? I thought the whole point of Drill was that it was fast on big datasets.

My S3 connection itself is OK: SHOW FILES lists my available folders in a reasonable amount of time, and there is nothing wrong with my network connection either.

Richard
  • S3 is not a file system! – Henry Jul 04 '17 at 14:54
  • I know S3 is not a file system. However, from the Drill docs, I assumed I could use it as a fast data source for Drill - but maybe not? – Richard Jul 04 '17 at 15:53
  • Why did you expect it to be a _fast data source_? – Frederic Henri Jul 04 '17 at 16:30
  • I'm an idiot, I guess. I assumed Drill would do some magic to make it fast. – Richard Jul 04 '17 at 18:12
  • You're certainly not an idiot. The S3 storage plugin for Drill is nice, but it is by no means fast, especially if you have to run against many files in S3. – Frederic Henri Jul 04 '17 at 21:12
  • @FrédéricHenri thanks for the reassurance :) Do you know if there are any fast in-cloud storage options for Drill? I like Drill because it doesn't require you to specify a schema upfront, but my data needs to live in the cloud rather than locally. – Richard Jul 04 '17 at 23:04
  • I will continue to say that AWS Athena is worth looking at in such cases. – Frederic Henri Jul 05 '17 at 05:49
  • @FrédéricHenri sadly Athena doesn't work for me, I need something that's schema-on-discovery rather than needing to specify the schema upfront. – Richard Jul 05 '17 at 09:34
  • AWS Glue crawlers can help to auto-discover schema in such cases – madhead Sep 04 '18 at 22:06

1 Answer

It's not a direct answer to your question, but you should look at Athena if you want to query an S3 bucket and you have a large dataset.
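
The remark in the comments about running against many files in S3 can be illustrated with a rough back-of-envelope model. This is only a sketch under stated assumptions: the 50 ms per-object round trip is an assumed figure, not a measurement, the function name `estimated_fetch_seconds` is hypothetical, and Drill's real read behaviour may differ from this simple model. It only shows how per-object request latency adds up over 20,000 small files.

```python
import math


def estimated_fetch_seconds(num_objects, per_request_latency_s, parallel_requests=1):
    """Rough estimate of time spent waiting on per-object round trips.

    Ignores transfer time, parsing and planning; it models latency only.
    """
    if num_objects < 0 or per_request_latency_s < 0 or parallel_requests < 1:
        raise ValueError("counts and latency must be non-negative, parallelism >= 1")
    # Each batch of `parallel_requests` objects costs roughly one round trip.
    batches = math.ceil(num_objects / parallel_requests)
    return batches * per_request_latency_s


if __name__ == "__main__":
    # ASSUMPTION: 50 ms per object request; 20,000 files as in the question.
    for workers in (1, 8, 64):
        secs = estimated_fetch_seconds(20_000, 0.05, workers)
        print(f"{workers:>3} parallel requests: ~{secs:.0f} s")
```

Under these assumptions, fully sequential reads would spend on the order of a thousand seconds on round trips alone, whereas a local disk has no comparable per-file cost. Fewer, larger files or more parallelism shrink that estimate.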

Frederic Henri