
Background:

I currently have large files being uploaded to AWS S3. These files contain newline characters in some columns, which leads to them being read incorrectly. However, the files do have a very specific column separator, ~#~. To remove the stray newlines, I'm currently streaming the files through AWS Glue and combining each line with the next one whenever the current row doesn't have the number of columns it should.

Example:

Take the row "val1"~#~"va\nl\n2"~#~"val3", which gets presented like this:

"val1"~#~"va
l
2"~#~"val3"

I fix it by going line by line, using:

# f is the open file object for the messy file
colnum = 3
for row in f:
    # keep appending the next line until the row has the expected number of columns
    while not len(row.split('~#~')) == colnum:
        row += next(f)
    cleanrow = row.replace('\n', '. ') + '\n'

cleanrow would then return the example above on a single row. Expected output:

"val1"~#~"va. l. 2"~#~"val3"

Issue:

It currently takes way too long to stream these large files through a single machine to clean them, even when that machine is on the AWS network. I've therefore looked into using PySpark for this and tried setting a custom newline character, roughly as sketched below, with spark._jsc.hadoopConfiguration().set("textinputformat.record.delimiter","\"\n"). The problem is that '"\n' can occur inside a text field as well, so not all rows get fixed. I'm new to PySpark, so I'm not really sure where to start. I've tried map, flatMap and reduce, but they don't seem to be what I'm after, as they either only use the current line or combine all lines into one. The closest I've found on SO is this post that uses a sliding function, but the question is a bit different from what I'm trying to achieve, and I can't find any documentation for it in PySpark, only Scala.
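For reference, the custom-delimiter attempt looked roughly like this (MessyFile is a placeholder path; the config has to be set before reading the file):

# Treat '"\n' as the record separator -- this breaks whenever '"\n' also appears inside a quoted text field
spark._jsc.hadoopConfiguration().set("textinputformat.record.delimiter", "\"\n")
rdd = spark.sparkContext.textFile(MessyFile)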

Other suggestions on how to solve the newline issue using other tools that can be implemented in AWS Glue, and that don't involve streaming the dataset through a single machine, would be welcome. (The file is too large to fit in memory.)

Pureluck

1 Answer


I managed to solve my issue:

# First I read in the data (functools.reduce is needed further down for the column renaming)
from functools import reduce

rdd = spark.sparkContext.textFile(MessyFile)

# The first line is expected to have the correct number of columns (no line breaks within a column)
cols = len(rdd.first().split("~#~"))

# I store the already correct rows in one RDD and the incorrect ones in another
correct = rdd.filter(lambda x: len(x.split("~#~"))==cols)
wrong = rdd.filter(lambda x: len(x.split("~#~"))!=cols)

# There are few enough incorrect rows that they fit in driver memory, so I collect them and iterate over the resulting list
fix = iter(wrong.collect())
fixed = []

# I then iterate over the incorrect rows, appending the next row until the row has the expected number of columns, adding '. ' to mark where there was a line break
# The repaired rows are collected in a new list called fixed
for item in fix:
    row = item
    while len(row.split("~#~"))!=cols:
        row+='. '+next(fix)
    fixed.append(row)

# I then union the already correct rows with the newly fixed rows and split every row into its columns
new = correct.union(spark.sparkContext.parallelize(fixed)) \
        .map(lambda row: row.split("~#~"))

# I then create a dataframe, assign the first row as the header and write it out as a Parquet file
header = new.first()
df = new.filter(lambda line: line != header).toDF()
oldcols = df.columns

df = reduce(lambda df, idx:df.withColumnRenamed(oldcols[idx],header[idx]),range(len(oldcols)),df)

df.coalesce(10).write.parquet(CleanFile,mode='overwrite') 
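As a side note, the withColumnRenamed loop could also be written with DataFrame.toDF, which takes the new column names directly; a minimal sketch, assuming header is the list of column names from above:

# Equivalent renaming in one call: toDF(*names) replaces all the default _1, _2, ... names at once
df = new.filter(lambda line: line != header).toDF().toDF(*header)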

The only issues I can think of with the above are if the number of incorrect rows is larger than what fits in memory (unlikely), or if there is a line break in the first or last column (unlikely in my files).
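For what it's worth, a couple of quick sanity checks, reusing the RDD names from the code above (just a sketch, not part of the solution itself):

# Gauge how many rows actually need fixing before collecting them to the driver
print(wrong.count())

# After the fix, every parsed row should have exactly `cols` fields
assert new.filter(lambda fields: len(fields) != cols).count() == 0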

Pureluck