
Need to add quotes around all values in a Spark DataFrame

Input:

val someDF = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70")
).toDF("user_id", "course_id", "lesson_name", "score")

Actual Output:

+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
|  user1|     math|  algebra-1|   90|
|  user1|  physics|    gravity|   70|
+-------+---------+-----------+-----+

Expected output of someDF.show():

+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
|"user1"|  "math" |"algebra-1"| "90"|
|"user1"|"physics"| "gravity" | "70"|
+-------+---------+-----------+-----+

1 Answer


You have two options here. The first is to add the quotes to the data itself while creating the DataFrame, like:

import sparkSession.implicits._ // needed for .toDF on an RDD

sparkSession.sparkContext.parallelize(Seq(
  ("\"user1\"", "\"math\"", "\"algebra-1\"", "\"90\""),
  ("\"user1\"", "\"physics\"", "\"gravity\"", "\"70\"")
)).toDF("user_id", "course_id", "lesson_name", "score")

which is not very convenient. The second method is to concatenate quotes onto every column. First, grab a reference to the DataFrame (as a var, since we will reassign it) and the list of its column names:

var df1 = someDF // a var, so it can be reassigned in the loop below
val cols = df1.columns

Then we loop through them, adding a quote before and after each column value:

import org.apache.spark.sql.functions.{col, concat, lit}

for (column <- cols) {
  // Wrap each value in literal double quotes
  df1 = df1.withColumn(column, concat(lit("\""), col(column), lit("\"")))
}

Final output:

+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
|"user1"|   "math"|"algebra-1"| "90"|
|"user1"|"physics"|  "gravity"| "70"|
+-------+---------+-----------+-----+
  • `withColumn` should not be called in a loop, as mentioned in the official doc https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame – Pradip Sodha Aug 05 '22 at 12:01
  • I think this is an easy way to understand what is actually going on. For a scalable application I would use `withColumns`, but I don't think that's needed for this example, as the asker seems to be new. Other than that, I totally agree that looping over Spark transformations is not recommended in production. – vilalabinot Aug 05 '22 at 12:15
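
For reference, a minimal loop-free sketch of what the comments suggest, built with a single select (used here instead of withColumns, which requires Spark 3.3+). It assumes the question's someDF; the quoted name is just for illustration:

import org.apache.spark.sql.functions.{col, concat, lit}

// Build one projection that wraps every column in literal double quotes
val quoted = someDF.select(
  someDF.columns.map(c => concat(lit("\""), col(c), lit("\"")).alias(c)): _*
)
quoted.show()

This produces the same output as the loop but builds the whole plan as a single projection, rather than one withColumn analysis step per column, which is what the linked doc warns about.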