
How to handle escape characters in PySpark? I am trying to replace an escape character with NULL.

'\026' is randomly spread throughout all the columns, and I have to replace '\026' with NULL across all columns.

Below is my sample input data:

col1,col2,col3,col4
1,\026\026,abcd026efg,1|\026\026|abcd026efg
2,\026\026,\026\026\026,2|\026\026|\026\026\026
3,ad026eg,\026\026,3|ad026eg|\026\026
4,ad026eg,xyad026,4|ad026eg|xyad026

And my output data should be:

col1,col2,col3,col4
1,NULL,abcd026efg,1||abcd026efg|
2,NULL,NULL,2|NULL|NULL|
3,ad026eg,NULL,3|ad026eg|NULL|
4,ad026eg,xyad026,4|ad026eg|xyad026|

Note: col4 is a combination of col1, col2, and col3, delimited with |.

I tried:

    df.withColumn('col2', F.regexp_replace('col2', '\D\d+', None)).show()

This runs, but it replaces every cell value in col2 with NULL.
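For reference, here is a minimal DataFrame-only sketch, assuming '\026' denotes the ASCII SYN control character (octal 026, hex 0x16) and that all columns are string-typed: remove the character from every column, then turn cells that end up empty into NULL.

    import pyspark.sql.functions as F

    SYN = '\x16'  # the \026 control character (octal 026 == hex 0x16, ASCII SYN)

    cleaned = df
    for c in df.columns:
        stripped = F.regexp_replace(F.col(c), SYN, '')
        # A cell that contained only \026 characters is now '', so map it to NULL.
        cleaned = cleaned.withColumn(c, F.when(stripped == '', F.lit(None)).otherwise(stripped))

    cleaned.show()

If col4 then has to be rebuilt from the cleaned columns, F.concat_ws('|', 'col1', 'col2', 'col3') is one option; note that concat_ws skips NULLs, so matching the expected output above would require coalescing NULL cells to the literal string 'NULL' first.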

1 Answer


Try this if you want to do it with an RDD:

    import re

    # First pass: strip whitespace and delete every \026 control character
    # (r"\026" is a regex octal escape for that character).
    # Second pass: turn fields that are now empty into None, i.e. NULL.
    rddd = df.rdd \
        .map(lambda x: [re.sub(r"\026", "", x[i].strip()) for i in range(len(x))]) \
        .map(lambda x: [None if x[i] == "" else x[i].strip() for i in range(len(x))])

    df2 = rddd.toDF(["a", "b", "c", "d"])

    df2.show()
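In short: the first map removes the \026 control character from each trimmed field, and the second map replaces fields that ended up empty with None, which Spark renders as null in df2.show(). The column names passed to toDF are arbitrary placeholders for col1 through col4.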


  • As it's currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 11 '22 at 02:57
  • Can you please explain the logic to me? I didn't understand. – EVR Mar 14 '22 at 03:28