-1

Let's say I have a DataFrame created as follows:

  val posts = spark.read
    .option("rowTag","row")
    .option("attributePrefix","")
    .schema(Schemas.postSchema)
    .xml("src/main/resources/Posts.xml")

What is the advantage of converting it to a Column, i.e. using posts.select($"Id"), over posts.select("Id")?

Aravind Yarram
  • 78,777
  • 46
  • 231
  • 327

3 Answers

4

df.select("col") operates on the column name directly, while $"col" creates a Column instance. You can also create Column instances with the col function. Column instances can be composed into complex expressions, which can then be passed to any of the DataFrame functions that accept a Column.
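For illustration, here is a minimal, self-contained sketch (a toy in-memory DataFrame stands in for the XML-backed posts, and the local SparkSession is an assumption for this example) showing the three ways to reference a column and how Column instances compose:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("columns").getOrCreate()
import spark.implicits._ // enables the $"..." interpolator and toDF

// Toy stand-in for the XML-backed posts DataFrame
val posts = Seq((1, 10), (2, 20)).toDF("Id", "Score")

// Three equivalent ways to select a single column:
posts.select("Id")      // by name (String)
posts.select($"Id")     // $ interpolator -> Column
posts.select(col("Id")) // col function   -> Column

// Column instances compose into complex expressions:
posts
  .select(($"Score" * 2).as("DoubledScore"))
  .filter($"DoubledScore" > 20)
  .show()
```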

You can also find examples and more usages on the Scaladoc of Column class.

Ref - https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Column

Ashwanth Kumar
  • 667
  • 4
  • 10
2

There is no particular advantage; the conversion is automatic anyway. But not all methods in Spark SQL perform this conversion, so sometimes you have to pass a Column object built with $.
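A short sketch of the distinction (the toy data and local SparkSession are illustrative assumptions): select accepts either a name or a Column, while withColumn requires a Column as its second argument, so the string form cannot be used there:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("implicit-conv").getOrCreate()
import spark.implicits._

val posts = Seq(1, 2, 3).toDF("Id")

// select() accepts both forms, so there is no advantage either way:
posts.select("Id")
posts.select($"Id")

// withColumn() only accepts a Column as its second argument;
// posts.withColumn("NextId", "Id" + 1) would not compile.
posts.withColumn("NextId", $"Id" + 1).show()
```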

Paul
  • 1,939
  • 1
  • 18
  • 27
  • is there any advantage converting them to Column type? Or in which scenarios we should be converting them? – Aravind Yarram May 25 '19 at 15:47
  • Like I said, when the API accepts both the column object and the column name, such as select(), there is no advantage. When the API does not accept the column name, then you need to use the column object to avoid a compilation error. – Paul May 25 '19 at 18:44
  • Is there any rule of thumb to know when the column class is needed? I have trouble keeping track of when I need to use it and when I don't. – Blaisem Dec 02 '21 at 21:14
1

There is not much difference, but some functionality is available only when you use $ to turn the column name into a Column.

Example: when we want to sort by this column in descending order, calling .desc on the bare column name does not compile, because String has no desc method:

Window.orderBy("Id".desc)

But if you put $ before the column name, $"Id" is a Column, which does have a desc method:

Window.orderBy($"Id".desc)
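A runnable sketch of the point above (the toy DataFrame, the local SparkSession, and the choice of row_number are illustrative assumptions, not part of the original question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local[*]").appName("window-desc").getOrCreate()
import spark.implicits._

val posts = Seq((1, "scala"), (2, "scala"), (3, "spark")).toDF("Id", "Tag")

// Window.orderBy("Id".desc) would not compile: String has no desc method.
// $"Id" is a Column, and Column does have desc:
val w = Window.partitionBy($"Tag").orderBy($"Id".desc)

posts.withColumn("rn", row_number().over(w)).show()
```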

Sarath Subramanian
  • 20,027
  • 11
  • 82
  • 86