During the consumption of large datasets, it’s very common to encounter malformed or invalid lines. We don’t want to stop the pipeline every time a bad line is found; instead, we need to store the malformed records in a different path so the pipeline can keep running. Without implementing any code, explain how you would approach a solution to this problem.
1 Answer
This is extremely dependent on the situation. That being said, you could try:
Before writing the data, validate each line. If the line is valid, write it as you normally would; otherwise, write the invalid data to a different path.
So, if I have my raw data, I would pass each line through a validation function first, something like "isLineValid(line)":
def isLineValid(line):
    # ...logic to check validity, e.g. field count, types, encoding
    return len(line.split(",")) == 5  # example check: expect 5 comma-separated fields (assumed count)
If it is not valid, I can redirect it.
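A minimal sketch of that routing, assuming plain text input and hypothetical paths ("output.txt" for clean records, "bad_records.txt" for rejected ones), could look like this:

with open("input.txt") as src, \
     open("output.txt", "w") as good, \
     open("bad_records.txt", "w") as bad:
    for line in src:
        if isLineValid(line):
            good.write(line)   # valid line goes to the normal output path
        else:
            bad.write(line)    # malformed line is quarantined; the pipeline keeps going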
Another way to handle this is with a try-except.
You can try to parse and write each line, and if that fails, handle the bad line in the except clause.
for line in data:
    try:
        parse(line)             # parse() (assumed helper) raises ValueError on a malformed line
        output.write(line)      # ...write as you normally would
    except ValueError:          # a specific exception beats a bare except, so real bugs still surface
        bad_output.write(line)  # ...write to a different location
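Either way, the key idea is the same: the malformed line is captured in an inspectable location (a quarantine file or an errors/ path, for example) instead of aborting the job, so the pipeline keeps processing the remaining lines and the bad records can be reviewed or reprocessed later.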

LearnIT
https://stackoverflow.com/questions/54199303/using-pyspark-how-to-reject-bad-malformed-records-from-csv-file-and-save-these – Hrishikesh Mawal Aug 11 '22 at 00:29