scala rapids using an opaque UDF for a single column dataframe that produces another column

Question

I am trying to acquaint myself with RAPIDS Accelerator-based computation using Spark (3.3) with Scala. The primary contention in being able to use GPU appears to arise from the blackbox nature of UDFs. An automatic solution would be the Scala UDF compiler. But it won't work with cases where there are loops.

Doubt: Would I be able to get GPU contribution if my dataframe has only one column and produces another column, as this is a trivial case. If so, at least in some cases, even with no change in Spark code, the GPU performance benefit can be attained, even in case where the size of data is much higher than GPU memory. This would be great as sometimes it would be easy to simply merge all columns into one making a single column of WrappedArray using concat_ws that a UDF can simply convert into an Array. For all practical purposes to the GPU then the data is already in columnar fashion and only negligible overhead for row (on CPU) to column (on GPU) needs to be done.The case I am referring to would look like:

val newDf = df.withColumn(colB, opaqueUdf(col("colA")))

Resources: I tried to find good sources/examples to learn Spark-based approach for using RAPIDS, but it seems to me that only Python-based examples are given. Is there any resource/tutorial that gives some sample examples in coversion of Spark UDFs to make them RAPIDS compatible.

abellina · Accepted Answer · 2022-07-28T16:15:11.490

Yes @Quiescent, you are right. The Scala UDF -> Catalyst compiler can be used for simple UDFs that have a direct translation to Catalyst. Supported operations can be found here: https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html. Loops are definitely not supported in this automatic translation, because there isn't a direct expression that we can translate it to.

It all depends on how heavy opaqueUdf is, and how many rows are in your column. The GPU is going to be really good if there are many rows and the operation in the UDF is costly (say it's doing many arithmetic or string operations successively on that column). I am not sure why you want to "merge all columns into one", so can you clarify why you want to do that? On the conversion to Array, is that the purpose of the UDF, or are you wanting to take in N columns -> perform some operation likely involving loops -> produce an Array?
Another approach to accelerating UDFs with GPUs is to use our RAPIDS Accelerated UDFs. These are java or scala UDFs that you implement purposely, and they use the cuDF API directly. The Accelerated UDF document also links to our spark-rapids-examples repo, which has information on how to write Java or Scala UDFs in this way, please take a look there as well.

Thank you and your understanding of my problem is correct. My dataframe is 100Mx10K and the operations on each cell are light weight. I am about to try RAPIDS. From the link I posted suggests that the GPU may not be engaged due to loop. Nevertheless as there is only one self-sufficient cell computation, I may not only benefit from GPU, but also from overhead of translation between rows and columns between the CPU and GPU requirements. The computation is the form of `row.map(r_i->output.append(udf(r1,r2,r3,r_i))).mkString(",")`. — Quiescent, Jul 30 '22 at 20:24

scala rapids using an opaque UDF for a single column dataframe that produces another column

1 Answers1