I'm running a SQL query against a CSV file generated with tpch-dbgen. For simplicity I run it with a single thread/task, and I see gaps in the timeline, as shown in the attached image. Are these disk operations? Can this overhead be reduced or optimized? How can I find out exactly what is taking place there?
This could be a combination of two things: buffering the file input from the distributed filesystem before GPU processing, and Spark compressing and writing the task outputs to disk as part of a shuffle (from the portion of the profile shown, it's unclear whether this query has a shuffle).
There are some Java-level NVTX ranges in both the RAPIDS Accelerator and cudf jars that can provide more visibility. Add
--conf spark.executor.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true"
to the Spark command line to enable these NVTX ranges, which should then show up in collected GPU profiles.
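One practical detail: `spark.executor.extraJavaOptions` holds a single string, so if the job already sets executor JVM options, the NVTX flag has to be appended to that string rather than passed as a second `--conf` for the same key. A minimal sketch of building such a command line follows. The helper names, the `query.py` script, and the `-XX:+UseG1GC` option are hypothetical placeholders, not part of the RAPIDS Accelerator or Spark.

```python
import shlex

# Flag quoted from the answer above; everything else here is illustrative.
NVTX_FLAG = "-Dai.rapids.cudf.nvtx.enabled=true"


def with_nvtx(existing_java_opts=""):
    """Return the executor JVM options with the NVTX flag appended exactly once."""
    opts = existing_java_opts.split()
    if NVTX_FLAG not in opts:
        opts.append(NVTX_FLAG)
    return " ".join(opts)


def spark_submit_command(app, existing_java_opts=""):
    """Build a spark-submit argument list; `app` is a hypothetical application script."""
    return [
        "spark-submit",
        "--conf",
        f"spark.executor.extraJavaOptions={with_nvtx(existing_java_opts)}",
        app,
    ]


if __name__ == "__main__":
    # Print a shell-quoted command that can be pasted into a terminal.
    print(shlex.join(spark_submit_command("query.py", "-XX:+UseG1GC")))
```

Building the argument list first and quoting it with `shlex.join` keeps the space-separated JVM options together as one `--conf` value.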
See also https://nvidia.github.io/spark-rapids/docs/tuning-guide.html for tips on tuning the RAPIDS Accelerator for Apache Spark.

Jason Lowe