
I have some large (~150 GB) CSV files that use a semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand (&amp;). The semicolon inside it is getting picked up as a column separator, so I need a way to escape it or to replace &amp; with & while loading the dataframe.

As an example, I have the following csv file:

ID;FirstName;LastName
1;Chandler;Bing
2;Ross &amp; Monica;Geller

I load it using the following notebook:

df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()

The result I'm getting is:

+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1| Chandler|    Bing|
|  2|Ross &amp|  Monica|
+---+---------+--------+

Whereas what I'm looking for is:

+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+

I have tried using .option("escape", "&amp;"), but that escaping only works on a single character.
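
For reference, that attempt looked roughly like this (just a sketch; it is the same load as above with only the escape option added, which only accepts a single character):

# attempted: treat &amp; as an escape sequence, but the escape option only accepts one character
df = spark.read.option("delimiter", ";").option("header","true").option("escape", "&amp;").csv('/mnt/input/AMP test.csv')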

Update

I have a hacky workaround using RDDs that works at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the dataframe.

rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))

rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")

df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()


2 Answers


I think there isn't a way to escape this multi-character sequence &amp; using only spark.read.csv, and the solution is essentially what you did in your workaround:

  • rdd.map: this function replaces the value &amp; with & in all columns.
  • It is not necessary to save your RDD to a temporary path; just pass it as the csv parameter:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))

df = spark.read.csv(rdd, header=True, sep=";")
df.show()

+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+

You can do that with dataframes directly. It helps if you have at least one file that you know does not contain any &amp;, so you can retrieve the schema from it.

Let's assume such a file exists and its path is "valid.csv".

from pyspark.sql import functions as F

# I read a valid file without the &amp; problem to get a proper schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema


df = spark.read.text("/mnt/input/AMP test.csv")

# I assume you have several files, so I remove all the headers.
# I do not need them as I already have my schema in schm.
header = df.first().value
df = df.where(F.col("value") != header)


# I replace "&amp;" with "&", and split the column
df = df.withColumn(
    "value", F.regexp_replace(F.col("value"), "&amp;", "&")
).withColumn(
    "value", F.split("value", ";")
)

# I expand the array into several columns and apply the types from schm defined previously
df = df.select(
    *(
        F.col("value").getItem(i).cast(col.dataType).alias(col.name)
        for i, col in enumerate(schm)
    )
)

Here is the result:

df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+

df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)