
I have some large (~150 GB) CSV files that use a semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand (&amp;). The semicolon inside it is getting picked up as a column separator, so I need a way to escape it or to replace &amp; with & while loading the dataframe.

As an example, I have the following csv file:

ID;FirstName;LastName
1;Chandler;Bing
2;Ross &amp; Monica;Geller

I load it using the following notebook:

df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()

The result I'm getting is:

+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1| Chandler|    Bing|
|  2|Ross &amp|  Monica|
+---+---------+--------+

Whereas what I'm looking for is:

+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+

I have tried using .option("escape", "&amp;"), but that escaping only works on a single character.
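
For reference, that attempt looked roughly like this (just a sketch; it is the same load as above with only the escape option added, which only accepts a single character):

# attempted: treat &amp; as an escape sequence, but the escape option only accepts one character
df = spark.read.option("delimiter", ";").option("header","true").option("escape", "&amp;").csv('/mnt/input/AMP test.csv')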

Update

I have a hacky workaround using RDDs that works at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the dataframe.

rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))

rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")

df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()


2 Answers


I think there isn't a way to escape this multi-character sequence &amp; using only spark.read.csv, and the solution is essentially what you did in your workaround:

  • rdd.map: this function replaces the value &amp; with & in all columns.
  • It is not necessary to save your RDD to a temporary path; just pass it as the csv parameter:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))

df = spark.read.csv(rdd, header=True, sep=";")
df.show()

+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+

You can do that with dataframes directly. It helps if you have at least one file that you know does not contain any &amp;, so you can retrieve the schema from it.

Let's assume such a file exists and its path is "valid.csv".

from pyspark.sql import functions as F

# I read a valid file without the &amp; problem to get a proper schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema


df = spark.read.text("/mnt/input/AMP test.csv")

# I assume you have several files, so I remove all the headers.
# I do not need them as I already have my schema in schm.
header = df.first().value
df = df.where(F.col("value") != header)


# I replace "&amp;" with "&", and split the column
df = df.withColumn(
    "value", F.regexp_replace(F.col("value"), "&amp;", "&")
).withColumn(
    "value", F.split("value", ";")
)

# I expand the array into several columns and apply the types from schm defined previously
df = df.select(
    *(
        F.col("value").getItem(i).cast(col.dataType).alias(col.name)
        for i, col in enumerate(schm)
    )
)

Here is the result:

df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+

df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)