
I am new to RAPIDS and I am having trouble understanding the supported operations.

I have data in the following format:

+------------+----------+
|        kmer|source_seq|
+------------+----------+
|TGTCGGTTTAA$|         4|
|ACCACCACCAC$|         8|
|GCATAATTTCC$|         1|
|CCGTCAAAGCG$|         7|
|CCGTCCCGTGG$|         6|
|GCGCTGTTATG$|         2|
|GAGCATAGGTG$|         5|
|CGGCGGATTCT$|         0|
|GGCGCGAGGGT$|         3|
|CCACCACCAC$A|         8|
|CACCACCAC$AA|         8|
|CCCAAAAAAAAA|         0|
|AAGAAAAAAAAA|         5|
|AAGAAAAAAAAA|         0|
|TGTAAAAAAAAA|         0|
|CCACAAAAAAAA|         8|
|AGACAAAAAAAA|         7|
|CCCCAAAAAAAA|         0|
|CAAGAAAAAAAA|         5|
|TAAGAAAAAAAA|         0|
+------------+----------+

I am trying to find out which `kmer`s have which `source_seq`s, using the following code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

val w = Window.partitionBy("kmer")
x.withColumn("source_seqs", collect_list("source_seq").over(w))

// Result is something like this:
+------------+----------+-----------+                                           
|        kmer|source_seq|source_seqs|
+------------+----------+-----------+
|AAAACAAGACCA|         2|        [2]|
|AAAACAAGCAGC|         4|        [4]|
|AAAACCACGAGC|         3|        [3]|
|AAAACCGCCAAA|         7|        [7]|
|AAAACCGGTGTG|         1|        [1]|
|AAAACCTATATC|         5|        [5]|
|AAAACGACTTCT|         6|        [6]|
|AAAACGCGCAAG|         3|        [3]|
|AAAAGGCCTATT|         7|        [7]|
|AAAAGGCGTTCG|         3|        [3]|
|AAAAGGCTGTGA|         1|        [1]|
|AAAAGGTCTACC|         2|        [2]|
|AAAAGTCGAGCA|         7|     [7, 0]|
|AAAAGTCGAGCA|         0|     [7, 0]|
|AAAATCCGATCA|         0|        [0]|
|AAAATCGAGCGG|         0|        [0]|
|AAAATCGTTGAA|         7|        [7]|
|AAAATGGACAAG|         1|        [1]|
|AAAATTGCACCA|         3|        [3]|
|AAACACCGCCGT|         3|        [3]|
+------------+----------+-----------+

The Spark RAPIDS supported operators documentation mentions that collect_list is supported only for windowing, which is what I am doing in my code, as far as I know.

However, looking at the query plan, it is easy to see that collect_list is not executed on the GPU:

scala> x.withColumn("source_seqs", collect_list("source_seq").over(w)).explain
== Physical Plan ==
Window [collect_list(source_seq#302L, 0, 0) windowspecdefinition(kmer#301, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS max_source#658], [kmer#301]
+- GpuColumnarToRow false
   +- GpuSort [kmer#301 ASC NULLS FIRST], false, RequireSingleBatch, 0
      +- GpuCoalesceBatches RequireSingleBatch
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(kmer#301, 200), ENSURE_REQUIREMENTS, [id=#1496]
               +- GpuFileGpuScan csv [kmer#301,source_seq#302L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/cloud-user/phase1/example/1620833755/part-00000], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<kmer:string,source_seq:bigint>

This is unlike a similar query with a different function, where we can see the windowing executed on the GPU:

scala> x.withColumn("min_source", min("source_seq").over(w)).explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuWindow [gpumin(source_seq#302L) gpuwindowspecdefinition(kmer#301, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS max_source#648L], [kmer#301], false
   +- GpuSort [kmer#301 ASC NULLS FIRST], false, RequireSingleBatch, 0
      +- GpuCoalesceBatches RequireSingleBatch
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(kmer#301, 200), ENSURE_REQUIREMENTS, [id=#1431]
               +- GpuFileGpuScan csv [kmer#301,source_seq#302L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/cloud-user/phase1/example/1620833755/part-00000], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<kmer:string,source_seq:bigint>

Am I misunderstanding the supported operations documentation somehow, or have I written the code the wrong way? Any help would be appreciated.

3 Answers


Kenny, may I please know what version of the rapids-4-spark plugin you're using, and the version of Spark?

The initial GPU implementation of COLLECT_LIST() was disabled by default because its behaviour did not match Spark's with respect to null values. (The GPU version kept nulls in the aggregated array rows, while Spark removed them.) Edit: the behaviour was corrected in the 0.5 release.
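
For reference, here is a minimal sketch of the CPU behaviour described above (hypothetical data, assuming a spark-shell session where implicits are in scope):

import org.apache.spark.sql.functions.collect_list

// Hypothetical sample: one null among the values.
val df = Seq(("a", Some(1L)), ("a", None), ("a", Some(2L))).toDF("kmer", "source_seq")

df.groupBy("kmer").agg(collect_list("source_seq")).show()
// Spark on the CPU omits the null row, so the result is [1, 2], not [1, null, 2].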

If you have no nulls in your aggregation column (and are using rapids-4-spark 0.4), you might try enabling the operator by setting spark.rapids.sql.expression.CollectList=true.
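
For example, in spark-shell (a sketch; the same key can be passed via --conf to spark-submit):

// Opt in to the GPU collect_list on rapids-4-spark 0.4,
// keeping in mind the null-handling difference described above.
spark.conf.set("spark.rapids.sql.expression.CollectList", "true")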

In general, one can examine why an operator didn't run on the GPU by setting spark.rapids.sql.explain=NOT_ON_GPU. That should print the reason to the console.
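
For example, a sketch reusing the w and x from the question:

// Ask the plugin to log why expressions could not be placed on the GPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
x.withColumn("source_seqs", collect_list("source_seq").over(w)).explain()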

If you still experience difficulty or incorrect behaviour with the rapids-4-spark plugin, please feel free to raise a bug on the project's GitHub. We'd be happy to investigate further.

Mithun RK
  • Hey, thanks for the help, it indeed was a problem with the version; upgrading from 0.4 to 0.5 fixed the issue. As far as I can see, `collect_set` is still not supported, however? – Kenny Abinski May 21 '21 at 09:00

Yes, as Mithun mentioned, spark.rapids.sql.expression.CollectList defaults to true starting from the 0.5 release. However, it is false in the 0.4 release: https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/docs/configs.md

Here is the plan I tested on a 0.5+ version:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

val w = Window.partitionBy("name")
val resultdf = dfread.withColumn("values", collect_list("value").over(w))
resultdf.explain

== Physical Plan ==
GpuColumnarToRow false
+- GpuWindow [collect_list(value#134L, 0, 0) gpuwindowspecdefinition(name#133, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS values#138], [name#133], false
   +- GpuCoalesceBatches RequireSingleBatch
      +- GpuSort [name#133 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@28e73bd1
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(name#133, 200), ENSURE_REQUIREMENTS, [id=#563]
               +- GpuFileGpuScan csv [name#133,value#134L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/tmp/df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:string,value:bigint>
Hao

`collect_set` for aggregation and windowing will be supported in the upcoming 21.08 release (RAPIDS Spark is moving to calendar versioning).
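
Once it lands, usage should mirror collect_list; here is a sketch reusing the w and x from the question:

import org.apache.spark.sql.functions.collect_set

// collect_set behaves like collect_list but de-duplicates,
// so repeated source_seq values within a kmer appear only once.
x.withColumn("source_seqs", collect_set("source_seq").over(w))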