I have a file lying in an Azure Data Lake Gen 2 file system. I want to read the contents of the file and make some low-level changes, i.e., remove a few characters from a few fields in the records. To be more explicit: some fields have a backslash ('\') as their last character. Since the value is enclosed in the text qualifier ("), the backslash escapes the closing '"' character, and the field value runs on to swallow the value of the next field as well.
For example, the text file contains the following two records (ignore the header):
-------------------------------------------------------------------
Name | Address | Description | Remark
-------------------------------------------------------------------
"ABC" | "UK" | "descrption 1" | "remark1"
"DEF" | "USA" | "description2\" | "remark2"
When I read the above into a PySpark data frame, it comes out like the following:
-------------------------------------------------------------------
Name | Address | Description | Remark
-------------------------------------------------------------------
"ABC" | "UK" | "descrption 1" | "remark1"
"DEF" | "USA" | "description2|remark2" | null
So, my objective is to read the above file using the usual Python file handling, such as the following, get rid of the '\' character in the records that contain it, and write the rows back into a new file:
with open("test.txt", 'r', encoding='utf-8') as f_in, \
     open("test_fixed.txt", 'w', encoding='utf-8') as f_out:
    for line in f_in:                           # read the lines
        f_out.write(line.replace('\\"', '"'))   # remove the '\' and write the line back
But since the file is lying in the ADLS Gen 2 file system (an HDFS-like file system), the usual Python file handling won't work here. What is the way out for handling files on the ADLS Gen 2 file system?
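From the documentation it looks like the azure-storage-file-datalake package might be the way to go. Below is a minimal sketch of what I have in mind; the account URL, key, container, and file names are placeholders, not my real setup. Is this the right direction?

from azure.storage.filedatalake import DataLakeServiceClient

# placeholders for the real account and credentials
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>")
fs = service.get_file_system_client("<container>")

# download the file, strip the escaping backslash, upload to a new file
text = fs.get_file_client("test.txt").download_file().readall().decode("utf-8")
fixed = text.replace('\\"', '"')
fs.get_file_client("test_fixed.txt").upload_data(fixed.encode("utf-8"), overwrite=True)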
Or is there a way to solve this problem using the Spark DataFrame APIs?
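For instance, would something like this work? (Again just a sketch; the abfss:// paths are placeholders, and I am assuming I can treat each record as a raw text line and fix it with regexp_replace before parsing.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# read the raw lines, drop the backslash that escapes the closing quote, write back
raw = spark.read.text("abfss://<container>@<account>.dfs.core.windows.net/test.txt")
cleaned = raw.withColumn("value", F.regexp_replace("value", r'\\"', '"'))
cleaned.write.mode("overwrite").text("abfss://<container>@<account>.dfs.core.windows.net/test_fixed")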