I have a simple Apache Beam pipeline which reads compressed bz2 files and writes them out to text files.
import apache_beam as beam
p1 = beam.Pipeline()
(p1
| 'read' >> beam.io.ReadFromText('bad_file.bz2')
| 'write' >> beam.io.WriteToText('file_out.txt')
)
p1.run()
The problem occurs when the pipeline encounters a bad file (example). Most of my bad files are malformed, not in bz2 format, or simply empty, which confuses the decompressor and raises OSError: Invalid data stream.
How can I tell ReadFromText to pass on these?
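To make the goal concrete, one workaround I can imagine is checking each file before it ever reaches the pipeline. The following is only a minimal sketch, not a Beam feature: it assumes the files live on the local filesystem, and the helper names is_readable_bz2 and filter_readable are hypothetical. It uses Python's standard bz2 module and decompresses each file once in full, so it costs an extra read per file.

```python
import bz2


def is_readable_bz2(path, chunk_size=1 << 20):
    """Return True if the file at `path` decompresses as bz2 without error.

    Hypothetical helper: reads the whole stream in chunks so that
    corruption anywhere in the file is detected, not just a bad header.
    """
    try:
        with bz2.open(path, 'rb') as f:
            # Read until EOF; the data itself is discarded.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        # OSError: not bz2 data or a corrupt stream ("Invalid data stream").
        # EOFError: empty or truncated file.
        return False


def filter_readable(paths):
    """Split `paths` into (good, bad) lists based on is_readable_bz2."""
    good, bad = [], []
    for path in paths:
        (good if is_readable_bz2(path) else bad).append(path)
    return good, bad
```

The surviving paths could then be handed to the pipeline, for example with beam.Create(good) followed by beam.io.ReadAllFromText() instead of a single ReadFromText pattern. I would prefer to have this handled inside the read step itself, though, which is what the question is about.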