Fastest way to get exact count of rows for a 100GB CSV file stored on S3

Question

What is the fastest way to get an exact row count for a 100GB CSV file stored on Amazon S3 without using Athena, Fargate, or an EC2 VM? I can't use Athena because the CSV file isn't clean enough for it. I can't use Fargate or EC2 VMs because I need a purely serverless solution. I can't use third-party services like Snowflake either (native AWS services only).

Also, 100GB is too large to fit within a Lambda function's /tmp (limited to 10GB). I could try to run something like DuckDB (or any other streaming database engine) on a Lambda and scan the entire file with a `SELECT COUNT(*) FROM "s3://myBucket/myFile.csv"` query, but the Lambda is quite likely to time out: its read bandwidth from S3 is 100MB/s at best, so scanning 100GB would take at least 1,000s, and it cannot run for more than 15 minutes (900s).

I know the approximate size of the file.

Note: I have an inaccurate estimate of the number of rows provided by AWS Glue Data Catalog's crawler, with an error margin of -50%/+100%. This could be used for some kind of iterative or dichotomous process, but I could not figure one out. For example, I tried adding an `OFFSET` with a value lower than but close to the number of rows to the aforementioned query, but the Lambda running DuckDB timed out. That was disappointing and somewhat surprising, because a query like `SELECT * FROM "s3://myBucket/myFile.csv" LIMIT 10 OFFSET 10000000` worked well.

S3 lets you read a byte range, so you can invoke parallel Lambdas and add up their results. — kdgregory, Dec 03 '22 at 14:05
That's exactly the solution that we're looking at right now. Line endings in CSVs generated by Google Sheets seem to be `\r\n`, while line breaks embedded within cells are only `\n`. We need to try with other CSV files. — Ismael Ghalimi, Dec 03 '22 at 14:10
Note the advice [here](https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html): "If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance. GET requests can directly address individual parts; for example, GET ?partNumber=N". So, assuming you have control over how the original file is uploaded, you might benefit from controlling the part size. — jarmod, Dec 03 '22 at 17:08
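
The byte-range idea from the comments above can be sketched as follows. This is a minimal sketch rather than a tested production setup: it assumes rows end with `\r\n` and embedded line breaks are a bare `\n` (as observed for Google Sheets exports), it uses threads in one process as a stand-in for parallel Lambda invocations, and the function names, chunk size, and worker count are hypothetical. `s3` is expected to be a boto3 S3 client.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte offsets covering the whole object."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size) - 1


def count_crlf(blocks):
    """Count b'\\r\\n' pairs in a stream of byte blocks, including pairs
    that are split across two consecutive blocks."""
    total = 0
    prev_cr = False
    for block in blocks:
        if not block:
            continue
        if prev_cr and block[:1] == b"\n":
            total += 1
        total += block.count(b"\r\n")
        prev_cr = block.endswith(b"\r")
    return total


def count_crlf_in_range(s3, bucket, key, start, end):
    """Count row terminators whose '\\n' byte lies in [start, end].

    One extra byte before `start` is fetched so that a '\\r\\n' pair
    straddling two ranges is counted exactly once (by the later range).
    """
    first = max(start - 1, 0)
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={first}-{end}")
    return count_crlf(resp["Body"].iter_chunks(1024 * 1024))


def count_rows(s3, bucket, key, chunk_size=64 * 1024 * 1024, workers=16):
    """Sum per-range counts; chunk_size and workers are arbitrary defaults."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    def work(byte_range):
        start, end = byte_range
        return count_crlf_in_range(s3, bucket, key, start, end)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, byte_ranges(size, chunk_size)))
```

With `s3 = boto3.client("s3")`, a call like `count_rows(s3, "myBucket", "myFile.csv")` would return the number of `\r\n` terminators. A header line is counted as a row, and a final row without a terminator is not, so the result may need adjusting by one in either direction.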

score 0 · Answer 1 · answered Dec 03 '22 at 14:42

The fastest solution is probably to use SelectObjectContent with ScanRange to parallelize the request on chunks of 50MB or so.
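
A rough sketch of what this could look like with boto3 is shown below. It rests on several assumptions that are not confirmed in this thread: that rows end with `\r\n` (set through `RecordDelimiter`), that S3 Select counts a record in the scan range where it starts, that the `End` offset is inclusive, and that the file has no quoted record delimiters (scan ranges on CSV are documented as not supporting them). The function names, the 50MB chunk size, and the worker count are hypothetical, and threads stand in for parallel Lambda invocations.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_ranges(total_size, chunk_size):
    """Yield non-overlapping inclusive (start, end) scan ranges."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size) - 1


def select_count(s3, bucket, key, start, end):
    """Run COUNT(*) over the records that start inside one scan range."""
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT COUNT(*) FROM S3Object",
        # FileHeaderInfo NONE means a header line is counted as a row.
        InputSerialization={
            "CSV": {"FileHeaderInfo": "NONE", "RecordDelimiter": "\r\n"}
        },
        OutputSerialization={"CSV": {}},
        # Assumption: End is inclusive, so consecutive ranges do not overlap.
        ScanRange={"Start": start, "End": end},
    )
    # The response is an event stream; the count arrives in Records events.
    payload = b"".join(
        event["Records"]["Payload"]
        for event in resp["Payload"]
        if "Records" in event
    )
    return int(payload.decode().strip() or 0)


def count_rows(s3, bucket, key, chunk_size=50 * 1000 * 1000, workers=16):
    """Sum the per-range counts over the whole object."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    def work(scan_range):
        start, end = scan_range
        return select_count(s3, bucket, key, start, end)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, scan_ranges(size, chunk_size)))
```

With `s3 = boto3.client("s3")`, a call like `count_rows(s3, "myBucket", "myFile.csv")` would issue one `SelectObjectContent` request per 50MB range and add up the results. It is worth checking the total against a small file with a known row count first, since the boundary behavior described above is an assumption.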

Ismael Ghalimi

score 0 · Answer 2 · answered Dec 08 '22 at 15:17

Have you tried [AWS S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html)? It lets you run queries on S3 files. I use the service to get basic insight into any file on S3 (provided it can be queried).

Ram Manoj

Yes. Same as `SelectObjectContent`, as mentioned in the other reply. – Ismael Ghalimi Dec 08 '22 at 17:07
