
I have a file sitting in an Azure Data Lake Gen 2 filesystem. I want to read the contents of the file and make some low-level changes, i.e. remove a few characters from a few fields in the records. To be more explicit: some fields have a backslash ('\') as their last character. Since the value is enclosed in the text qualifier ("), that backslash escapes the closing '"' character, and the field value goes on to swallow the value of the next field as part of the current field.

For example, the text file contains the following 2 records (ignore the header):

-------------------------------------------------------------------
Name          | Address   | Description               | Remark
-------------------------------------------------------------------
"ABC"         | "UK"      | "description 1"           | "remark1"

"DEF"         | "USA"     | "description2\"           | "remark2"

When I read the above into a PySpark DataFrame, it comes out something like the following:

-------------------------------------------------------------------
Name          | Address   | Description               | Remark
-------------------------------------------------------------------
"ABC"         | "UK"      | "description 1"           | "remark1"

"DEF"         | "USA"     | "description2|remark2"    | null

So my objective is to read the above file using the usual file handling in Python, such as the following, get rid of the '\' character in the records that have it, and write the rows back into a new file.

f = open("test.txt",'r',encoding = 'utf-8')
//read the lines
//remove the '\' character 
//write the line back

But since the file sits in the ADLS Gen 2 file system (an HDFS-like file system), the usual Python file handling won't work here. What is the way out for file handling against the ADLS Gen 2 file system?

Or is there a way to solve this problem using the Spark DataFrame APIs?

Kamal Nandan

2 Answers


The Databricks documentation has information about handling connections to ADLS here. Depending on the details of your environment and what you're trying to do, there are several options available. For our team, mounting the ADLS container was a one-time setup; after that, anyone working in Databricks could access it easily.
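As a rough sketch (the storage account, container, directory, secret scope, and mount names below are placeholders, and this assumes you authenticate with a service principal), the mount looks roughly like this:

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

# one-time mount of the ADLS Gen 2 container into the workspace
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)

After that the file is visible under the local /dbfs path, so the plain Python file handling from the question works on it, e.g. open("/dbfs/mnt/<mount-name>/test.txt", 'r', encoding='utf-8').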

DavidP

Generate a SAS URL (with token) for the file that needs to be read, and fetch the contents over that URL.

import urllib.request

source = 'SAS URL with Token'
lines = urllib.request.urlopen(source).read().decode('utf-8').splitlines()  # open() cannot fetch a URL, so read it over HTTPS

Reference: https://medium.com/@meetcpatel906/read-csv-file-from-azure-blob-storage-to-directly-to-data-frame-using-python-83d34c4cbe57

nl09