I have a CSV file as shown:

name,age,languages,experience
'Alice',31,['C++', 'Java'],2
'Bob',34,['Java', 'Python'],2
'Smith',35,['Ruby', 'Java'],3
'David',36,['C', 'Java', 'R'],4

When I load the data, every column is read as a string by default.

scala> val df = spark.read.format("csv").option("header",true).load("data.csv")
df: org.apache.spark.sql.DataFrame = [name: string, age: string ... 2 more fields]

scala> df.show()
+-------+---+------------------+----------+
|   name|age|         languages|experience|
+-------+---+------------------+----------+
|'Alice'| 31|   ['C++', 'Java']|         2|
|  'Bob'| 34|['Java', 'Python']|         2|
|'Smith'| 35|  ['Ruby', 'Java']|         3|
|'David'| 36|['C', 'Java', 'R']|         4|
+-------+---+------------------+----------+

scala> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- languages: string (nullable = true)
 |-- experience: string (nullable = true)

So I defined a custom schema with String, Integer, Array, and Integer data types:

import org.apache.spark.sql.types.{StructField, StructType, StringType, ArrayType, IntegerType}

val custom_schema = new StructType(Array(StructField("name", StringType), StructField("age", IntegerType), StructField("languages", ArrayType(StringType)), StructField("experience", IntegerType)))

When I load the data using the custom schema, it throws an error:

scala> val df = spark.read.format("csv").option("header",true).schema(custom_schema).load("data.csv")
org.apache.spark.sql.AnalysisException: CSV data source does not support array<string> data type.
  at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$verifySchema$1(DataSourceUtils.scala:67)
  at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$verifySchema$1$adapted(DataSourceUtils.scala:65)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
  at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSourceUtils.scala:65)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:445)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
  ... 47 elided

How can I load the data into a Spark DataFrame so that one column is read as an array?

Prakash
  • Does this answer your question? [Read Array of Strings as Array in Pyspark from CSV](https://stackoverflow.com/questions/59303043/read-array-of-strings-as-array-in-pyspark-from-csv) – blackbishop Nov 24 '21 at 21:41
  • I am using Scala with Spark, and that link uses Python (PySpark) – Prakash Nov 25 '21 at 04:11

1 Answer

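You can transform the column into an array after reading the file: remove the square brackets ([ and ]) with regexp_replace, then split the remaining string on commas (,) with split. The elements keep their single quotes and any leading space, so clean those up separately if needed. For example: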

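import org.apache.spark.sql.functions.{col, regexp_replace, split}

// Read every column as a string; the CSV source cannot produce array columns directly.
val df = spark.read.format("csv").option("header", true).load("data.csv")

// Drop the square brackets, then split what is left on commas to get array<string>.
val transformedDf = df.withColumn("languages",
  split(
    regexp_replace(col("languages"), "\\[|\\]", ""),
    ","
  )
)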
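
Splitting on a bare comma leaves the single quotes and the space after each comma inside the elements. Below is a minimal sketch of two ways to get clean values, assuming the column looks exactly like the sample rows in the question. The object name LanguagesColumn and the helpers viaSplit and viaJson are hypothetical names chosen for illustration, and the sample rows are built in memory rather than read from data.csv. The from_json variant relies on Spark's JSON parser accepting single-quoted strings by default and on from_json accepting an array-of-strings schema, which I believe holds from Spark 2.4 onward; it is worth confirming on your version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, regexp_replace, split}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Hypothetical helper object; these names are illustrative, not part of Spark.
object LanguagesColumn {

  // Strip brackets and single quotes, then split on a comma plus optional whitespace.
  def viaSplit(df: DataFrame): DataFrame =
    df.withColumn("languages",
      split(regexp_replace(col("languages"), "[\\[\\]']", ""), ",\\s*"))

  // Treat the bracketed list as a JSON array of strings (assumes single quotes are allowed).
  def viaJson(df: DataFrame): DataFrame =
    df.withColumn("languages", from_json(col("languages"), ArrayType(StringType)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("languages-column")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the string columns the CSV reader produces.
    val df = Seq(
      ("'Alice'", "['C++', 'Java']"),
      ("'David'", "['C', 'Java', 'R']")
    ).toDF("name", "languages")

    viaSplit(df).printSchema() // languages should now be array<string>
    viaSplit(df).show(false)
    viaJson(df).show(false)

    spark.stop()
  }
}
```

Neither variant handles language names that themselves contain a comma or an apostrophe, and viaSplit turns an empty list ([]) into an array holding one empty string. To keep age and experience as integers, you can still pass a custom schema to the reader as long as languages is declared as StringType, then apply one of these helpers afterwards.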
ggordon