How to access Columnar URL INDEX using Amazon Athena

Question

I am new to AWS and I'm following this tutorial to access the columnar dataset (the URL index) in Common Crawl. I executed this query:

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC

And I keep getting this error:

Error opening Hive split s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-05/subset=warc/part-00082-248eba37-08f7-4a53-a4b4-d990640e4be4.c000.gz.parquet (offset=0, length=33554432): com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ZSRS4FD2ZTNJY9PV; S3 Extended Request ID: IvDfkWdbDYXjjOPhmXSQD3iVkBiE2Kl1/K3xaFc1JulOhCIcDbWUhnbww7juthZIUm2hZ9ICiwg=; Proxy: null), S3 Extended Request ID: IvDfkWdbDYXjjOPhmXSQD3iVkBiE2Kl1/K3xaFc1JulOhCIcDbWUhnbww7juthZIUm2hZ9ICiwg=

What's the reason? And how do I resolve it?

score 0 · Answer 1 · answered Jan 08 '23 at 15:29

You are hitting the S3 request rate limit because your query is trying to read too many Parquet files at the same time. Consider compacting the underlying files into fewer, larger files.

Robert Kossendey

How do I set a limit so that the query scans fewer GB of data? – Gladiator Jan 08 '23 at 15:43
Is your table partitioned? – Robert Kossendey Jan 08 '23 at 16:10
There was a temporary issue accessing data on s3://commoncrawl/, see [here](https://groups.google.com/g/common-crawl/c/JvAt1PoTY8E/m/yUgWoVS6EgAJ). The table is partitioned. Given the two partition conditions in the WHERE clause, 300 Parquet files need to be read, but for all except 1 or 2 of them the statistics in the footer are sufficient. So, the query should run efficiently. – Sebastian Nagel Jan 10 '23 at 10:12
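
Since the 503 SlowDown response is a throttling signal and the last comment describes the problem as temporary, retrying the query after a pause may be enough. The following is a minimal sketch of retrying with exponential backoff, not part of the original answer. `run_query` is a hypothetical zero-argument callable that submits the Athena query and raises an exception containing the error text on failure. The attempt count and delays are assumptions to adjust.

```python
import random
import time


def is_throttled(error: Exception) -> bool:
    """Return True if the error text looks like an S3 throttling response."""
    message = str(error)
    return "SlowDown" in message or "reduce your request rate" in message


def run_with_backoff(run_query, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call run_query() until it succeeds, backing off on throttling errors.

    run_query is a hypothetical callable (an assumption of this sketch) that
    submits the query and raises an exception carrying the error message.
    """
    for attempt in range(max_attempts):
        try:
            return run_query()
        except Exception as error:
            # Re-raise errors that are not throttling, or when out of attempts.
            if not is_throttled(error) or attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: about 2s, 4s, 8s, ... plus up to 1s.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
```

Errors that are not throttling responses (for example a SQL syntax error) are raised immediately, so only the SlowDown case is retried.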