I am taking my first steps in the Azure Databricks world and therefore I have to learn how to use SparkR.

[I am coming from data.table]

Although I have read a lot of documentation, I think I am missing something about SparkDataFrame.

To create a new column, I learned that we can do something like:

sdf$new <- sdf$old * 0.5

But if I try to use a basic R function, I get an error and I can't figure out why:

sdf <- sql("select * from database.table")
sdf$new <- strsplit(sdf$old, "-")[1]

Error in strsplit((sdf$old), "-") : 
  non-character argument

What am I missing?

Thanks.

Discus23

1 Answer

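Instead of strsplit, you need to use Spark-specific functions, which are listed in the SparkR API documentation. strsplit is a base R function that expects a character vector held in local memory, whereas sdf$old is a SparkR Column (a reference to a column expression, not the data itself), which is why you get the "non-character argument" error. Specifically, use the split_string function combined with the getItem function (note the L suffix, which forces the index to be an integer; 0L selects the first element):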

new_df <- withColumn(sdf, "new_id", getItem(split_string(sdf$old, "-"), 0L))
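
To try the same pattern outside Databricks, here is a minimal sketch. It assumes a local Spark installation with the SparkR package available, and the sample data and the names local_df and result are made up for illustration. It uses the "-" separator from the question; note that split_string treats the pattern as a regular expression.

```r
library(SparkR)

# Assumption: a local session for experimenting.
# On Databricks a session already exists, so this call is not needed there.
sparkR.session(master = "local[1]")

# Hypothetical sample data standing in for database.table
local_df <- data.frame(old = c("a-b-c", "x-y"), stringsAsFactors = FALSE)
sdf <- createDataFrame(local_df)

# split_string() returns a Column of arrays;
# getItem() picks one element by zero-based position (0L = first element)
parts <- split_string(sdf$old, "-")
sdf <- withColumn(sdf, "new", getItem(parts, 0L))

# Bring the (small) result back to a local data.frame to inspect it
result <- collect(sdf)
print(result)

sparkR.session.stop()
```

Both split_string and getItem only build a column expression; nothing is computed until an action such as collect or head runs.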
Alex Ott
  • Your detailed explanations help me see more clearly. I had started to discover the Spark-specific functions, but I believe that I would never have found `getItem`. SparkR seems like a new language to me! – Discus23 Nov 29 '21 at 19:55
  • For an introduction to Spark, I recommend the free book Learning Spark, 2nd edition; you can get it from the Databricks site. – Alex Ott Nov 29 '21 at 20:08