1

I have files on S3 where two columns contain only positive integers which can be of 10^26. Unfortunately, according to AWS Docs Athena only supports values in a range up to 2^63-1 (approx 10^19). So at the moment these column are represented as a string.

When it comes to filtering it is not that big of an issue, as I can use regex. For example, if I want to get all records between 5e^21 and 6e^21 my query would look like:

SELECT *
FROM database.table
WHERE (regexp_like(col_1, '^5[\d]{21}$'))

I have approx 300M rows (approx 12GB in parquet) and it takes about 7 seconds, so performance wise it ok.

However, sometimes I would like to perform some math operation on these two big columns, e.g subtract one big column from another. Casting these records to DOUBLE wouldn't work due to approximation error. Ideally, I would want to stay within Athena. At the moment, I have about 100M rows that are greater then 2^63-1, but this number can grow in a future.

What would be the right way to approach problem of having numerical records that exceed available range? Also what are your thoughts on using regex for filtering? Is there a better/more appropriate way to do it?

Ilya Kisil
  • 2,490
  • 2
  • 17
  • 31
  • How/why did the columns ever get stored in scientific notation anyways? Integers, even bigintegers, are normally stored with all their digits in some convenient base like 10 or 16. – President James K. Polk Oct 16 '19 at 19:14
  • This is how historical data arrived from the provider) Initially, I didn't even notice it. It came up when I attempted to repartition data and convert it to parquet with [CTAS queries](https://docs.aws.amazon.com/athena/latest/ug/ctas.html) – Ilya Kisil Oct 16 '19 at 19:33
  • I guess my suspicion is that the data is already only approximate, and possibly very low precision. You would likely lose nothing by converting to double. – President James K. Polk Oct 16 '19 at 19:37

1 Answers1

3

You can cast numbers of the form 5e21 to an approximate 64bit double or an exact numeric 128bit decimal. First you'll need to remove the caret ^, with the replace function. Then a simple cast will work:

SELECT CAST(replace('5e^21', '^', '') as DOUBLE);
 _col0  
--------
 5.0E21 

or

SELECT CAST(replace('5e^21', '^', '') as DECIMAL);
         _col0          
------------------------
 5000000000000000000000 

If you are going to this table often, I would rewrite it the new data type to save processing time.

Dain Sundstrom
  • 2,699
  • 15
  • 14
  • 1
    thanks, don't know why I haven't tried to cast it as decimal) But what would be a good solution when you have to be precise let's say up to 50 decimal places? – Ilya Kisil Oct 17 '19 at 09:15