I have files on S3 where two columns contain only positive integers which can be of 10^26
. Unfortunately, according to AWS Docs Athena only supports values in a range up to 2^63-1
(approx 10^19
). So at the moment these column are represented as a string.
When it comes to filtering it is not that big of an issue, as I can use regex. For example, if I want to get all records between 5e^21
and 6e^21
my query would look like:
SELECT *
FROM database.table
WHERE (regexp_like(col_1, '^5[\d]{21}$'))
I have approx 300M rows (approx 12GB in parquet) and it takes about 7 seconds, so performance wise it ok.
However, sometimes I would like to perform some math operation on these two big columns, e.g subtract one big column from another. Casting these records to DOUBLE wouldn't work due to approximation error. Ideally, I would want to stay within Athena. At the moment, I have about 100M rows that are greater then 2^63-1
, but this number can grow in a future.
What would be the right way to approach problem of having numerical records that exceed available range? Also what are your thoughts on using regex for filtering? Is there a better/more appropriate way to do it?