
I have a Parquet file generated with the parquet-avro library, in which one of the fields is a primitive double array, created using the following schema type:

Schema.createArray(Schema.create(Schema.Type.DOUBLE))
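
For context, the file is written roughly like this (a minimal sketch; the record name Sample, the field name values, and the output path are placeholders, not my real schema):

import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteDoubleArray {
    public static void main(String[] args) throws Exception {
        // Record schema with a single field whose type is the array-of-double
        // schema shown above (Schema.createArray(Schema.create(Schema.Type.DOUBLE))).
        Schema recordSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Sample\",\"fields\":"
          + "[{\"name\":\"values\",\"type\":{\"type\":\"array\",\"items\":\"double\"}}]}");

        GenericRecord record = new GenericData.Record(recordSchema);
        record.put("values", Arrays.asList(1.0, 2.0, 3.0));

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/sample.parquet"))
                     .withSchema(recordSchema)
                     .build()) {
            writer.write(record);
        }
    }
}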

I read this Parquet data from Spark and apply a UDAF (User Defined Aggregate Function) on it. Within the UDAF (org.apache.spark.sql.expressions.UserDefinedAggregateFunction), I am trying to access this field from the org.apache.spark.sql.Row object passed to public void update(MutableAggregationBuffer mutableAggBuff, Row dataRow). However, I cannot access the primitive double array; what I get back instead is a Double[], the boxed object representation of the primitive doubles. This is a very expensive object conversion of the primitive double array data.

When I retrieve the double array, I get a boxed java.lang.Double array instead of the primitive double array. Somewhere in the Parquet reader code, the primitive array is converted into the memory-inefficient Double object array. How do I prevent this costly conversion and get the primitive double array intact? I can write code to convert it back to a primitive array (a helper like the one sketched after the API list below), but by then the Double objects have already been created and are putting GC pressure on the VM.
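Roughly, the UDAF looks like the sketch below (the class name, the column name values, and the sum aggregation are placeholders for my real logic); the boxing shows up in update():

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SumDoubleArray extends UserDefinedAggregateFunction {

    @Override
    public StructType inputSchema() {
        return DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("values",
                DataTypes.createArrayType(DataTypes.DoubleType, false), false)));
    }

    @Override
    public StructType bufferSchema() {
        return DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("sum", DataTypes.DoubleType, false)));
    }

    @Override
    public DataType dataType() { return DataTypes.DoubleType; }

    @Override
    public boolean deterministic() { return true; }

    @Override
    public void initialize(MutableAggregationBuffer buffer) { buffer.update(0, 0.0); }

    @Override
    public void update(MutableAggregationBuffer buffer, Row dataRow) {
        // The array column comes back as boxed java.lang.Double elements,
        // not as a primitive double[].
        List<Double> boxed = dataRow.getList(0);
        double sum = buffer.getDouble(0);
        for (Double d : boxed) {   // every element is an already-allocated Double
            sum += d;
        }
        buffer.update(0, sum);
    }

    @Override
    public void merge(MutableAggregationBuffer buffer, Row other) {
        buffer.update(0, buffer.getDouble(0) + other.getDouble(0));
    }

    @Override
    public Object evaluate(Row buffer) { return buffer.getDouble(0); }
}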

The only relevant APIs on org.apache.spark.sql.Row are:

// I can cast this list to Double later
List<Double> myArrList = row.getList(0);
WrappedArray<Double> wr = row.getAs(0);
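
With those APIs, the best I can do is copy the data back into a primitive array by hand, along these lines (a sketch; RowDoubles and toPrimitive are made-up names):

import java.util.List;

import org.apache.spark.sql.Row;

public final class RowDoubles {
    // By the time the Row hands the column back, every element is already
    // a boxed java.lang.Double, so this copy only adds more allocations.
    public static double[] toPrimitive(Row row, int ordinal) {
        List<Double> boxed = row.getList(ordinal);
        double[] primitive = new double[boxed.size()];
        for (int i = 0; i < primitive.length; i++) {
            primitive[i] = boxed.get(i); // unboxing copy; adds GC pressure
        }
        return primitive;
    }
}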

We need a way to get the primitive double[] array without any further conversions. For example:

WrappedArray<scala.Double> wr = row.getAs(0);
double[] myPrimArray = wr.array();

Questions:

  1. Can I customize the Hadoop Parquet reader so that it reads the double array as a primitive double[] array?
  2. Does the Spark/Parquet-Hadoop reader have any way to do this without custom code?