During the consumption of large datasets, it’s very common to encounter malformed/invalid lines. However, we don’t want to stop the pipeline every time a bad line is found. We need to react to this issue by storing the malformed records in a different path without stopping the pipeline. Without implementing any code, explain how you would approach a solution to this problem.

LearnIT

1 Answer

This depends heavily on the situation. That said, here are two approaches you could try:

Before writing the data, validate each line. If the line is valid, write it as you normally would; otherwise, write it to a different path.

So, if I have my raw data, I would first pass each line through a validation function, something like "isLineValid(line)":

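def isLineValid(line):
  # Placeholder rule (an assumption): treat blank lines as invalid.
  # Replace this with the real validity checks for your data.
  return bool(line.strip())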

If the line is not valid, I can redirect it to the separate path for bad records.
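To make the validate-then-route idea concrete, here is a minimal sketch. The validity rule (each line must parse as a JSON object) and the names `is_line_valid` and `route_lines` are assumptions chosen for illustration; your real rule will depend on your data format.

```python
import json


def is_line_valid(line: str) -> bool:
    """Hypothetical rule: a line is valid if it parses as a JSON object."""
    try:
        return isinstance(json.loads(line), dict)
    except json.JSONDecodeError:
        return False


def route_lines(lines, good_path, bad_path):
    """Write valid lines to good_path and invalid ones to bad_path.

    Returns a (valid_count, invalid_count) tuple.
    """
    valid = invalid = 0
    with open(good_path, "w", encoding="utf-8") as good, \
         open(bad_path, "w", encoding="utf-8") as bad:
        for line in lines:
            line = line.rstrip("\n")
            if is_line_valid(line):
                good.write(line + "\n")
                valid += 1
            else:
                # The bad line is stored, and the loop simply moves on.
                bad.write(line + "\n")
                invalid += 1
    return valid, invalid
```

Returning the counts makes it easy to log or alert when the share of rejected lines gets unusually high.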

Another way to handle this is with a try-except.

You can try to write the data, and if that fails, handle the bad line in the except clause so the loop keeps going.

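for line in data:
  try:
    write(line)           # hypothetical normal write; assumed to raise ValueError on a bad line
  except ValueError:      # catch the specific error rather than using a bare except
    write_rejected(line)  # hypothetical helper: write to a different location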
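
The try-except approach could look roughly like the sketch below. The `id,amount` record format and the names `parse_record` and `process` are assumptions made for the example. It also stores the line number and the error message next to each rejected line, which tends to help when you review the bad records later.

```python
import io


def parse_record(line: str) -> dict:
    """Hypothetical parser: expects 'id,amount' with an integer id and a float amount."""
    fields = line.split(",")
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}")
    return {"id": int(fields[0]), "amount": float(fields[1])}


def process(lines, good_sink, bad_sink):
    """Parse each line; on failure, record the line and the error, then continue."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        try:
            record = parse_record(line)
        except ValueError as exc:
            # Store line number, reason, and raw line; do not stop the loop.
            bad_sink.write(f"{number}\t{exc}\t{line}\n")
            continue
        good_sink.write(f"{record['id']},{record['amount']}\n")


# Usage with in-memory sinks; real code would pass open file handles instead.
good, bad = io.StringIO(), io.StringIO()
process(["1,2.5\n", "bad line\n", "2,4\n"], good, bad)
```

Catching only `ValueError` is deliberate here: unexpected errors (for example, a full disk) still surface instead of being silently treated as bad data.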
LearnIT