
I'm trying to replace an escape character with NULL in a PySpark dataframe. The data in the dataframe looks like this:

Col1|Col2|Col3
1|\026\026|abcd026efg.

Col2 contains garbage data that I'm trying to replace with NULL. I tried the replace and regexp_replace functions to replace '\026' with a null value, but because of the escape character ("\"), the data is not replaced:

replace(Col2, "026", 'abcd')
replace(Col2, "\026", 'abcd')
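
To match the literal backslash, it has to be escaped for the regex engine (and escaped again for Python unless a raw string is used). A minimal sketch (assuming the cell really contains the text \026 rather than a control character):

import pyspark.sql.functions as F

# r'\\026' is the regex \\026: a literal backslash followed by 026
df.withColumn('Col2', F.regexp_replace('Col2', r'\\026', 'abcd')).show()

Even with the escaping fixed, regexp_replace substitutes a string, so it does not directly produce a NULL.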

Finally, I want my data to look like this:

Col1|Col2|Col3
1|NULL|abcd026efg.

Any thoughts on how to resolve this scenario are highly appreciated.

Thanks -EVR



1 Answer


Replace all digits and the preceding non-digit character:

import pyspark.sql.functions as F

# \D\d+ matches a non-digit followed by one or more digits
df.withColumn('col2', F.regexp_replace('col2', r'\D\d+', None)).show()

+----+----+-----------+
|col1|col2|       col3|
+----+----+-----------+
|   1|null|abcd026efg.|
+----+----+-----------+
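
Note that a None replacement makes regexp_replace return null for the whole column, whether or not the pattern matches. If only the garbage values should become null, a more targeted sketch (assuming they are the literal text \026, possibly repeated) is:

import pyspark.sql.functions as F

# Null out col2 only when the entire value is one or more \026 sequences
df.withColumn(
    'col2',
    F.when(F.col('col2').rlike(r'^(\\026)+$'), None).otherwise(F.col('col2'))
).show()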
wwnde
  • Thanks for your reply. I got an error: "name 'F' is not defined". Do we need to import any library? Can you please give me a snippet of the code – EVR Mar 07 '22 at 03:41
  • 1
  • Hahh.. IT WORKED!!! Thanks for your help... Can you brief me on what's happening with '\026' and explain what '\D\d+' means – EVR Mar 07 '22 at 05:50
  • I am using Python's regular expressions. This is how it works: \D matches a non-digit, \d matches a digit, and + greedily matches one or more of the preceding element – wwnde Mar 07 '22 at 06:25
  • My data is a little more complex than before: '\026' is randomly spread across all the columns, and I have to replace '\026' with NULL in every column. Below is my sample input data:
    col1|col2|col3
    1|\026\026|abcd026efg
    2|\026\026|\026\026\026
    3|ad026eg|\026\026
    4|ad026eg|xyad026
    and my output data should be:
    col1|col2|col3
    1|NULL|abcd026efg
    2|NULL|NULL
    3|ad026eg|NULL
    4|ad026eg|xyad026
    – EVR Mar 07 '22 at 06:40
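
For the follow-up above, a minimal sketch that applies the same test to every column, nulling out any cell that consists only of \026 sequences (again assuming the cells hold the literal text \026):

import pyspark.sql.functions as F

# Null any cell made up entirely of \026 sequences, across all columns
garbage = r'^(\\026)+$'
df.select([
    F.when(F.col(c).rlike(garbage), None).otherwise(F.col(c)).alias(c)
    for c in df.columns
]).show()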