1

I'm dealing with a dataset that contains number that are in the 10*6 to 10**80 scale. The value column that holds this data is of string type.

One of the common queries performed is a sum across the value column, for 100,000+ rows. However, there are strict requirements on precision so the double conversion that Snowflake performs causes it to lose precision.

Is it possible to create a Java UDF that performs an aggregate sum in the select part of a query? I tried looking up the documentation and couldn't really find a good example. The closest thing I found was UDTFs but it doesn't look like I can call them in the select part of a query.

If not, if anyone knows of strategies to deal with datasets with numbers much larger than the database can handle while maintaining data accuracy, that would be really helpful too.

Leech
  • 11
  • 1
  • 3

1 Answers1

1

A UDTF is the right answer, until Snowflake supports the type of UDFs you are asking for.

This Java UDTF does the job:

create or replace function add_java_udtf(id string, x string)
returns table(id string, big_number string)
language java
handler='MyClass'
as
$$
import java.util.stream.Stream;
import java.math.BigDecimal;

class OutputRow {
    public final String id;
    public final String big_number;
    
    public OutputRow(String id, String outputValue) {
        this.id = id;
        this.big_number = outputValue;
    }
}
class MyClass {
    private BigDecimal _x;
    private String _id;
    
    public MyClass() {
        _x = new BigDecimal(0);
    }
    
    public static Class getOutputClass() {
        return OutputRow.class;
    }
    
    public Stream<OutputRow> process(String id, String inputValue) {
        _id = id;
        _x = _x.add(new BigDecimal(inputValue));
        // Return nothing until whole partition computed.
        return Stream.empty(); 
    }
    
    public Stream<OutputRow> endPartition() {
        return Stream.of(new OutputRow(_id, _x.toString()));
    }
}   
$$;

Sample use:

with data(id, n) as (select $1, $2::string from values ('a', 1),('b', 2),('a', 5))

select r.id, r.big_number
from data, table(add_java_udtf(id, n) over(partition by id)) r
;

enter image description here

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325