45

I have a DataFrame df that contains one column of type array.

df.show() looks like

+--+-------------+---+------+
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 | [A,B,D]     |22 | F    |
|2 | [A,Y]       |42 | M    |
|3 | [X]         |60 | F    |
+--+-------------+---+------+

I try to dump that DataFrame to a CSV file as follows:

val dumpCSV = df.write.csv(path="/home/me/saveDF")

It does not work because of the column ArrayOfString. I get the error:

CSV data source does not support array<string> data type

The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!

What would be the best way to dump this DataFrame to CSV, including the column ArrayOfString (ArrayOfString should be written as a single column in the CSV file)?

marc_s
S12000

6 Answers

35

No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:

import org.apache.spark.sql.functions._
val dumpCSV = df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
                .write
                .csv(path="/home/me/saveDF")
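
If you'd rather pick the delimiter yourself than rely on the default string rendering of cast, array_join (in org.apache.spark.sql.functions since Spark 2.4, if I remember correctly) gives the same one-column result. A minimal sketch, with an illustrative ";" separator:

import org.apache.spark.sql.functions._

// Join the array elements with a chosen delimiter instead of casting the whole array
df.withColumn("ArrayOfString", array_join(col("ArrayOfString"), ";"))
  .write
  .csv("/home/me/saveDF")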

Hope that helps.

meniluca
Eb Abadi
  • This should be the top answer; it uses a built-in function to solve the question. – Ceren Apr 05 '22 at 18:48
33

The reason why you are getting this error is that the CSV file format doesn't support array types; you'll need to express the array as a string to be able to save it.

Try the following:

import org.apache.spark.sql.functions._

val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _    => s"""[${vs.mkString(",")}]"""
})

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)

or

import org.apache.spark.sql.Column

def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
eliasah
  • Hello, thanks a lot for the answer. I understand what those lines do. However, I am a bit confused by the syntax s"""[${vs.mkString(",")}]""". Could you explain a bit more about the s and the triple """? Thanks. – S12000 Nov 05 '16 at 16:59
  • http://docs.scala-lang.org/overviews/core/string-interpolation.html `"""` because there is embedded `"` –  Nov 05 '16 at 17:19
  • 1
    Hum I understand better the meaning of "s" thanks to the doc you send me. However I still not understand why 3 quotes. Why can't I write s"[${vs.mkString(",")}]" By the way using 1 quote also works on my side. So why 3 quotes? – S12000 Nov 05 '16 at 18:14
29

PySpark implementation.

In this example, the array field column_as_array is converted to a string column column_as_str before saving.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())

df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))

Then you can drop the old column (array type) before saving.

df.drop("column_as_array").write.csv(...)
plfrick
  • I have 2 columns, "Antecedent" and "Consequent", which have lists as input. How can I modify this code to do the same? – DreamerP Jul 02 '19 at 17:50
  • 4
    It worked for me (and one could even automatize it by creating a list `all_mappings = [x for (x,y) in df_p300.dtypes if y == 'map' ]` ). Small detail: the last line of the first block of code should read `df["column_as_array"]` instead of `d["column_as_array"]`. – Lionel Trebuchon Aug 27 '19 at 12:17
  • 1
    Can someone fix the `d` to `df` as @LionelTrebuchon mentioned? SO does not allow one character edit, but this example works, but this one typo. – zeh Aug 29 '20 at 02:17
  • 1
    character typo fixed, thanks to LionelTrebuchon and @zeh – plfrick Sep 01 '20 at 15:59
11

Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def stringifyArrays(dataFrame: DataFrame): DataFrame = {
  val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)
  colsToStringify.foldLeft(dataFrame)((df, c) => {
    df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
  })
}

Also, it doesn't use a UDF.
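
For example, reusing the question's DataFrame (the output path is just illustrative):

// Stringify every array column, then write the CSV as usual
stringifyArrays(df).write.csv("/home/me/saveDF")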

Julien Poulin
1

CSV is not the ideal export format, but if you just want to visually inspect your data, this quick-and-dirty Scala solution will work:

import spark.implicits._  // needed for .toDF on an RDD; assumes a SparkSession named spark, as in spark-shell

case class Example(id: String, ArrayOfString: String, Age: String, Gender: String)

df.rdd.map { line => Example(line(0).toString, line(1).toString, line(2).toString, line(3).toString) }
  .toDF.write.csv("/tmp/example.csv")
Ashish Markanday
1

To answer DreamerP's question (from one of the comments):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())

df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
df = df.drop("Consequent")
df = df.drop("Antecedent")
df.write.csv("foldername")
NRB