During the consumption of large datasets, it’s very common to encounter malformed or invalid lines. We don’t want to stop the pipeline every time a bad line is found; instead, we need to store the malformed records in a different path so the pipeline can keep running. Without implementing any code, explain how you would approach a solution to this problem.
1 Answer
This is extremely dependent on the situation. That being said, you could try:
Before writing the data, validate each line. If the line is valid, write it as you normally would; otherwise, write the invalid data to a different path.
So, if I have my raw data, I would pass each line through a validation function first, something like "isLineValid(line)":
def isLineValid(line):
    # ...logic to check validity, e.g. field count, types, encoding
    return len(line.split(",")) == 5  # example check: expect 5 comma-separated fields (assumed count)
If it is not valid, I can redirect it.
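A minimal sketch of that routing, assuming plain text input and hypothetical paths ("output.txt" for clean records, "bad_records.txt" for rejected ones), could look like this:

with open("input.txt") as src, \
     open("output.txt", "w") as good, \
     open("bad_records.txt", "w") as bad:
    for line in src:
        if isLineValid(line):
            good.write(line)   # valid line goes to the normal output path
        else:
            bad.write(line)    # malformed line is quarantined; the pipeline keeps going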
Another way to handle this is with a try-except.
You can try to parse and write each line, and if that fails, handle the bad line in the except clause.
for line in data:
    try:
        parse(line)             # parse() (assumed helper) raises ValueError on a malformed line
        output.write(line)      # ...write as you normally would
    except ValueError:          # a specific exception beats a bare except, so real bugs still surface
        bad_output.write(line)  # ...write to a different location
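Either way, the key idea is the same: the malformed line is captured in an inspectable location (a quarantine file or an errors/ path, for example) instead of aborting the job, so the pipeline keeps processing the remaining lines and the bad records can be reviewed or reprocessed later.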

LearnIT
https://stackoverflow.com/questions/54199303/using-pyspark-how-to-reject-bad-malformed-records-from-csv-file-and-save-these – Hrishikesh Mawal Aug 11 '22 at 00:29