Background:
I currently have large files being uploaded to AWS S3. These files contain newline characters in some columns, which causes them to be read incorrectly. However, the files do have a very specific column separator, ~#~. To remove the incorrect newlines I'm currently streaming the files through AWS Glue and combining each line with the next line whenever the current row doesn't have the number of columns it should.
Example:
take the row: "val1"~#~"va\nl\n2"~#~"val3"
which gets presented like this:
"val1"~#~"va
l
2"~#~"val3"
by going line by line, using:
colnum = 3
for row in f:
    # keep appending the next physical line until the row has the expected number of columns
    while not len(row.split('~#~')) == colnum:
        row += next(f)
    cleanrow = row.replace('\n', '. ') + '\n'
For the example above, cleanrow would then contain everything on one row. Expected output:
"val1"~#~"va. l. 2"~#~"val3"
Issue:
It currently takes way too long to stream these large files through a computer to clean them, even when the computer is on the AWS network. I've therefore looked into using pyspark for this and tried to set a custom newline character like this:
spark._jsc.hadoopConfiguration().set("textinputformat.record.delimiter", "\"\n")
but the problem is that it turns out we can have '"\n' inside a text field as well, meaning not all rows get fixed.
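For completeness, this is a minimal sketch of how I understand a custom record delimiter can be applied to a read in pyspark (the newAPIHadoopFile pattern and the S3 path are assumptions on my part, not necessarily my actual Glue job):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Split records on '"\n' instead of plain '\n' by handing the delimiter to the Hadoop input format.
# The S3 path below is a placeholder.
rdd = sc.newAPIHadoopFile(
    "s3://my-bucket/path/file.txt",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\"\n"}
).map(lambda kv: kv[1])  # keep only the record text, drop the byte offset key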
I'm new to pyspark, so I'm not really sure where to start. I've tried map, flatMap and reduce, but they don't seem to be what I'm after, since they appear to either only work on the current line or combine all lines into one. The closest I've found on SO is a post that uses a sliding function, but that question is a bit different from what I'm trying to achieve, and I can't find any documentation for sliding in pyspark, only scala.
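To make it concrete, here is a minimal sketch of the kind of line-stitching I'd like to express, using mapPartitions (the path is a placeholder, and it assumes a logical row never straddles a partition boundary, which I don't think is guaranteed for large splittable files):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def stitch(lines, colnum=3, sep='~#~'):
    # Buffer physical lines until the logical row has the expected number of columns,
    # joining continuation lines with '. ' (textFile already strips the trailing '\n').
    buf = None
    for line in lines:
        buf = line if buf is None else buf + '. ' + line
        if len(buf.split(sep)) == colnum:
            yield buf
            buf = None
    if buf is not None:
        yield buf  # leftover partial row at the end of the partition

rdd = sc.textFile("s3://my-bucket/path/file.txt")  # placeholder path
fixed = rdd.mapPartitions(stitch)

But since textFile can split a file into partitions at arbitrary points, I'm not sure this is safe, which is part of why I'm asking.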
Other suggestions on how to solve the newline issue using other tools that can be implemented in AWS Glue and that don't involve streaming the dataset would be welcome. (The file is too large to fit in memory.)