
How to handle escape characters in PySpark? I am trying to replace an escape character with NULL.

'\026' is randomly spread throughout all the columns, and I have to replace '\026' with NULL across all columns.

Below is my sample input data:

col1,col2,col3,col4
1,\026\026,abcd026efg,1|\026\026|abcd026efg
2,\026\026,\026\026\026,2|\026\026|\026\026\026
3,ad026eg,\026\026,3|ad026eg|\026\026
4,ad026eg,xyad026,4|ad026eg|xyad026

And my output data should be:

col1,col2,col3,col4
1,NULL,abcd026efg,1||abcd026efg|
2,NULL,NULL,2|NULL|NULL|
3,ad026eg,NULL,3|ad026eg|NULL|
4,ad026eg,xyad026,4|ad026eg|xyad026|

Note: col4 is a combination of col1, col2, and col3, delimited with |.

I tried:

    df.withColumn('col2', F.regexp_replace('col2', '\D\d+', None)).show()

This runs, but it replaces every cell value in col2 with NULL.
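For reference, here is a minimal DataFrame-only sketch, assuming '\026' denotes the ASCII SYN control character (octal 026, hex 0x16) and that all columns are string-typed: remove the character from every column, then turn cells that end up empty into NULL.

    import pyspark.sql.functions as F

    SYN = '\x16'  # the \026 control character (octal 026 == hex 0x16, ASCII SYN)

    cleaned = df
    for c in df.columns:
        stripped = F.regexp_replace(F.col(c), SYN, '')
        # A cell that contained only \026 characters is now '', so map it to NULL.
        cleaned = cleaned.withColumn(c, F.when(stripped == '', F.lit(None)).otherwise(stripped))

    cleaned.show()

If col4 then has to be rebuilt from the cleaned columns, F.concat_ws('|', 'col1', 'col2', 'col3') is one option; note that concat_ws skips NULLs, so matching the expected output above would require coalescing NULL cells to the literal string 'NULL' first.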

1 Answer


Try this if you want to do it with an RDD:

    import re

    # First pass: strip whitespace and delete every \026 control character
    # (r"\026" is a regex octal escape for that character).
    # Second pass: turn fields that are now empty into None, i.e. NULL.
    rddd = df.rdd \
        .map(lambda x: [re.sub(r"\026", "", x[i].strip()) for i in range(len(x))]) \
        .map(lambda x: [None if x[i] == "" else x[i].strip() for i in range(len(x))])

    df2 = rddd.toDF(["a", "b", "c", "d"])

    df2.show()
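In short: the first map removes the \026 control character from each trimmed field, and the second map replaces fields that ended up empty with None, which Spark renders as null in df2.show(). The column names passed to toDF are arbitrary placeholders for col1 through col4.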


  • As it's currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 11 '22 at 02:57
  • Can you please explain the logic to me? I didn't understand. – EVR Mar 14 '22 at 03:28