45

I have a DataFrame df that contains one column of type array.

df.show() looks like

+--+-------------+---+------+
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 | [A,B,D]     |22 | F    |
|2 | [A,Y]       |42 | M    |
|3 | [X]         |60 | F    |
+--+-------------+---+------+

I try to dump that DataFrame to a CSV file as follows:

val dumpCSV = df.write.csv(path="/home/me/saveDF")

It does not work because of the column ArrayOfString. I get the error:

CSV data source does not support array<string> data type

The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!

What would be the best way to dump this DataFrame to CSV, including the column ArrayOfString (ArrayOfString should be written as a single column in the CSV file)?

marc_s
S12000

6 Answers

35

No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:

import org.apache.spark.sql.functions._
val dumpCSV = df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
                .write
                .csv(path="/home/me/saveDF")
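
If you'd rather pick the delimiter yourself than rely on the default string rendering of cast, array_join (in org.apache.spark.sql.functions since Spark 2.4, if I remember correctly) gives the same one-column result. A minimal sketch, with an illustrative ";" separator:

import org.apache.spark.sql.functions._

// Join the array elements with a chosen delimiter instead of casting the whole array
df.withColumn("ArrayOfString", array_join(col("ArrayOfString"), ";"))
  .write
  .csv("/home/me/saveDF")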

Hope that helps.

meniluca
Eb Abadi
  • This should be the top answer; it uses a built-in function to solve the question. – Ceren Apr 05 '22 at 18:48
33

The reason why you are getting this error is that the CSV file format doesn't support array types; you'll need to express the array as a string to be able to save it.

Try the following:

import org.apache.spark.sql.functions._

val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _    => s"""[${vs.mkString(",")}]"""
})

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)

or

import org.apache.spark.sql.Column

def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
eliasah
  • Hello, thanks a lot for the answer. I understand what those lines do. However, I am a bit confused by the syntax s"""[${vs.mkString(",")}]""". Could you explain a bit more about the s and the triple """? Thanks. – S12000 Nov 05 '16 at 16:59
  • http://docs.scala-lang.org/overviews/core/string-interpolation.html `"""` because there is embedded `"` –  Nov 05 '16 at 17:19
  • 1
    Hum I understand better the meaning of "s" thanks to the doc you send me. However I still not understand why 3 quotes. Why can't I write s"[${vs.mkString(",")}]" By the way using 1 quote also works on my side. So why 3 quotes? – S12000 Nov 05 '16 at 18:14
29

PySpark implementation.

In this example, the array field column_as_array is converted to a string column column_as_str before saving.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())

df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))

Then you can drop the old column (array type) before saving.

df.drop("column_as_array").write.csv(...)
plfrick
  • I have 2 columns, "Antecedent" and "Consequent", which have lists as input. How can I modify this code to do the same? – DreamerP Jul 02 '19 at 17:50
  • 4
    It worked for me (and one could even automatize it by creating a list `all_mappings = [x for (x,y) in df_p300.dtypes if y == 'map' ]` ). Small detail: the last line of the first block of code should read `df["column_as_array"]` instead of `d["column_as_array"]`. – Lionel Trebuchon Aug 27 '19 at 12:17
  • 1
    Can someone fix the `d` to `df` as @LionelTrebuchon mentioned? SO does not allow one character edit, but this example works, but this one typo. – zeh Aug 29 '20 at 02:17
  • 1
    character typo fixed, thanks to LionelTrebuchon and @zeh – plfrick Sep 01 '20 at 15:59
11

Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def stringifyArrays(dataFrame: DataFrame): DataFrame = {
  val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)
  colsToStringify.foldLeft(dataFrame)((df, c) => {
    df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
  })
}

Also, it doesn't use a UDF.
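
For example, reusing the question's DataFrame (the output path is just illustrative):

// Stringify every array column, then write the CSV as usual
stringifyArrays(df).write.csv("/home/me/saveDF")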

Julien Poulin
1

CSV is not the ideal export format, but if you just want to visually inspect your data, this quick-and-dirty Scala solution will work:

import spark.implicits._  // needed for .toDF on an RDD; assumes a SparkSession named spark, as in spark-shell

case class Example(id: String, ArrayOfString: String, Age: String, Gender: String)

df.rdd.map { line => Example(line(0).toString, line(1).toString, line(2).toString, line(3).toString) }
  .toDF.write.csv("/tmp/example.csv")
Ashish Markanday
1

To answer DreamerP's question (from one of the comments):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())

df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
df = df.drop("Consequent")
df = df.drop("Antecedent")
df.write.csv("foldername")
NRB