
I'm trying to replace an escape character with NULL in a PySpark dataframe. The data in the dataframe looks like this:

Col1|Col2|Col3
1|\026\026|abcd026efg.

Col2 contains garbage data that I'm trying to replace with NULL. I tried the replace and regexp_replace functions to replace '\026' with a null value, but because of the escape character ("\"), the data is not replaced:

replace(Col2, "026", 'abcd')
replace(Col2, "\026", 'abcd')
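
To match the literal backslash, it has to be escaped for the regex engine (and escaped again for Python unless a raw string is used). A minimal sketch (assuming the cell really contains the text \026 rather than a control character):

import pyspark.sql.functions as F

# r'\\026' is the regex \\026: a literal backslash followed by 026
df.withColumn('Col2', F.regexp_replace('Col2', r'\\026', 'abcd')).show()

Even with the escaping fixed, regexp_replace substitutes a string, so it does not directly produce a NULL.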

Finally, I want my data to look like this:

Col1|Col2|Col3
1|NULL|abcd026efg.

Any thoughts on how to resolve this scenario are highly appreciated.

Thanks -EVR



1 Answer


Replace all digits and the preceding non-digit character:

import pyspark.sql.functions as F

# \D\d+ matches a non-digit followed by one or more digits
df.withColumn('col2', F.regexp_replace('col2', r'\D\d+', None)).show()

+----+----+-----------+
|col1|col2|       col3|
+----+----+-----------+
|   1|null|abcd026efg.|
+----+----+-----------+
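
Note that a None replacement makes regexp_replace return null for the whole column, whether or not the pattern matches. If only the garbage values should become null, a more targeted sketch (assuming they are the literal text \026, possibly repeated) is:

import pyspark.sql.functions as F

# Null out col2 only when the entire value is one or more \026 sequences
df.withColumn(
    'col2',
    F.when(F.col('col2').rlike(r'^(\\026)+$'), None).otherwise(F.col('col2'))
).show()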
wwnde
  • Thanks for your reply. I got an error: "name 'F' is not defined". Do we need to import any library? Can you please give me a snippet of the code – EVR Mar 07 '22 at 03:41
  • 1
  • Hahh.. IT WORKED!!! Thanks for your help... Can you brief me on what's happening with '\026' and explain what '\D\d+' means – EVR Mar 07 '22 at 05:50
  • I am using Python's regular expressions. This is how it works: \D matches a non-digit, \d matches a digit, and + greedily matches one or more of the preceding element – wwnde Mar 07 '22 at 06:25
  • My data is a little more complex than before: '\026' is randomly spread across all the columns, and I have to replace '\026' with NULL in every column. Below is my sample input data:
    col1|col2|col3
    1|\026\026|abcd026efg
    2|\026\026|\026\026\026
    3|ad026eg|\026\026
    4|ad026eg|xyad026
    and my output data should be:
    col1|col2|col3
    1|NULL|abcd026efg
    2|NULL|NULL
    3|ad026eg|NULL
    4|ad026eg|xyad026
    – EVR Mar 07 '22 at 06:40
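
For the follow-up above, a minimal sketch that applies the same test to every column, nulling out any cell that consists only of \026 sequences (again assuming the cells hold the literal text \026):

import pyspark.sql.functions as F

# Null any cell made up entirely of \026 sequences, across all columns
garbage = r'^(\\026)+$'
df.select([
    F.when(F.col(c).rlike(garbage), None).otherwise(F.col(c)).alias(c)
    for c in df.columns
]).show()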