I have some large (~150 GB) CSV files that use a semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand, &amp;. The semicolon inside &amp; is getting picked up as a column separator, so I need a way to escape it, or to replace &amp; with & while loading the dataframe.
As an example, I have the following CSV file:
ID;FirstName;LastName
1;Chandler;Bing
2;Ross &amp; Monica;Geller
I load it in a notebook using the following code:
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()
The result I'm getting is:
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1| Chandler|    Bing|
|  2|Ross &amp|  Monica|
+---+---------+--------+
Whereas what I'm looking for is:
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
I have tried using .option("escape", "&"), but that option only accepts a single character, so it cannot handle the multi-character &amp; sequence.
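One direction I have been exploring (just a sketch; the header handling and hard-coded column names below are my own assumptions, and I have not tried this on the full 150 GB files) is to read each line as plain text, decode the entity with regexp_replace, and only then split on the semicolon:

from pyspark.sql import functions as F

# Read every line as a single string column named "value"
raw = spark.read.text('/mnt/input/AMP test.csv')

# Decode the HTML entity before the delimiter is interpreted
cleaned = raw.select(F.regexp_replace('value', '&amp;', '&').alias('line'))

# Split on the semicolon and project the array into named columns
cols = ['ID', 'FirstName', 'LastName']
parts = cleaned.select(F.split('line', ';').alias('p'))
df = (parts
      .filter(F.col('p')[0] != 'ID')  # crude header skip; assumes no data row has ID == 'ID'
      .select(*[F.col('p')[i].alias(c) for i, c in enumerate(cols)]))
df.show()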
Update
I have a hacky workaround using RDDs that works, at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the dataframe.
rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))
rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()
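If the extra write in the workaround above becomes a problem at 150 GB, I believe spark.read.csv can also accept an RDD of strings directly (from Spark 2.2 onwards, if I remember correctly), which would skip the intermediate file:

rdd = sc.textFile('/mnt/input/AMP test.csv')
cleaned = rdd.map(lambda line: line.replace('&amp;', '&'))

# DataFrameReader.csv also takes an RDD of CSV rows,
# so the cleaned lines can be parsed without a second pass over disk
df = spark.read.option("delimiter", ";").option("header", "true").csv(cleaned)
df.show()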