
This may be a silly question, but I noticed that:

val aggDF = df.groupBy("id").pivot("col1")

causes a job to be invoked. Running this in a Databricks notebook, I get the following:

(1) Spark Jobs
    Job 4 View     (Stages: 3/3)
       Stage 12:     8/8
       Stage 13:     200/200
       Stage 14:     1/1

As far as I can tell from the docs, pivot is not an Action.

As usual, I cannot find a suitable reference in the docs to explain this, but it likely has to do with pivot being treated as an Action, or calling some part of Spark that is an Action.

baitmbarek
thebluephantom
  • The sharp distinction between actions and transformations has been a thing of the past for a while now (note that they are not even mentioned in the SQL docs, and only by accident in the SS docs). And even before, many "transformations" were at least partially eager(ish). Pivot is a trivial example (and already [covered by canonical thread](https://stackoverflow.com/a/35676755/10465355)), but all types of actions can be triggered when the physical plan is generated (not that it is common in practice). – 10465355 Dec 28 '19 at 18:27
  • @10465355saysReinstateMonica The same goes for the JSON-related stuff; the docs could be better. – thebluephantom Dec 28 '19 at 18:37
  • Beyond the basics, which are quite intuitive, such things can hardly be documented: the details are fuzzy, as they depend on multiple moving factors, and most large deployments use some in-house modifications anyway. – 10465355 Jan 08 '20 at 14:58
  • @10465355saysReinstateMonica Made your point – thebluephantom Jan 08 '20 at 15:07

1 Answer


There are two versions of pivot in RelationalGroupedDataset.

If you pass only the pivot column, Spark has to fetch all of its distinct values in order to generate the output columns, which it does by performing a collect. That is the job you are seeing.

The other version is the recommended one, but it requires you to know in advance the possible values from which the columns are generated.

You can take a look at the source code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

def pivot(pivotColumn: Column): RelationalGroupedDataset

vs

def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
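
To make the difference concrete, here is a minimal sketch of both variants using the String overloads (the same ones as in the question). The sample data, the `value` column, the object name `PivotSketch` and the helper names are made up for illustration, and a local SparkSession is assumed; in a Databricks notebook you would use the existing `spark` instead.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical example object; names and data are assumptions, not from the question.
object PivotSketch {

  // Small made-up dataset with the column names used in the question.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, "a", 10), (1, "b", 20), (2, "a", 30)).toDF("id", "col1", "value")
  }

  // No values given: Spark has to compute the distinct values of col1
  // right here, so a job shows up before any aggregation is requested.
  def pivotInferred(df: DataFrame): RelationalGroupedDataset =
    df.groupBy("id").pivot("col1")

  // Values given up front: nothing needs to be collected at this point.
  def pivotExplicit(df: DataFrame): DataFrame =
    df.groupBy("id").pivot("col1", Seq("a", "b")).agg(sum("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-sketch")
      .getOrCreate()

    val df = sample(spark)
    pivotInferred(df).agg(sum("value")).show()
    pivotExplicit(df).show()

    spark.stop()
  }
}
```

If the set of values is known, passing it explicitly also fixes the output columns, so they do not change when new values appear in the data.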
baitmbarek