I'm still new to Beam, but how exactly do you read CSV files that live in GCS buckets? Essentially, I want to transform these files into a pandas DataFrame with Beam and then fit an sklearn model on that data. Most of the examples I've seen pre-define the header; I want this pipeline to generalize to any file, since the headers will definitely differ between files. There's a library called beam_utils that does what I want, but importing it raises this error: AttributeError: module 'apache_beam.io.fileio' has no attribute 'CompressionTypes'. My guess is that beam_utils targets an older Beam release where CompressionTypes still lived in apache_beam.io.fileio (it appears to be in apache_beam.io.filesystem now), but I haven't found a workaround.
Code Example:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# The error occurs on this import
from beam_utils.sources import CsvFileSource
options = {
    'project': 'my-project',
    'runner': 'DirectRunner',
    'streaming': False,
}
pipeline_options = PipelineOptions(flags=[], **options)
class Printer(beam.DoFn):
    # Simple DoFn that prints each element, just to inspect what gets read.
    def process(self, element):
        print(element)
# Create the pipeline with the specified options. The context manager
# runs the pipeline and waits for it to finish on exit, so a separate
# p.run() / result.wait_until_finish() isn't needed.
with beam.Pipeline(options=pipeline_options) as p:
    data = (p
            | 'Read File From GCS' >> beam.io.textio.ReadFromText('gs://my-csv-files'))
    _ = (data | 'Print the data' >> beam.ParDo(Printer()))
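For context, here's roughly the kind of header-agnostic parsing I'm hoping to end up with. This is an untested sketch using fileio.MatchFiles / fileio.ReadMatches from the Beam SDK itself (no beam_utils); the gs:// pattern is just a placeholder:

import csv
import io

import apache_beam as beam
from apache_beam.io import fileio


class ParseCsvWithOwnHeader(beam.DoFn):
    # Parse each matched file as CSV, treating its own first row as the header.
    def process(self, readable_file):
        # readable_file is a fileio.ReadableFile; read_utf8() loads the
        # whole file into memory, which should be fine for modest CSVs.
        reader = csv.DictReader(io.StringIO(readable_file.read_utf8()))
        for row in reader:
            yield dict(row)  # one dict per row, keyed by that file's header


with beam.Pipeline(options=pipeline_options) as p:
    rows = (p
            | 'Match CSVs' >> fileio.MatchFiles('gs://my-csv-files/*.csv')
            | 'Read matches' >> fileio.ReadMatches()
            | 'Parse with own header' >> beam.ParDo(ParseCsvWithOwnHeader()))

From there I'd want to collect the rows into a pandas DataFrame (something like pd.DataFrame(list_of_rows)) before fitting the sklearn model, but I'm not sure how to get from a PCollection to a DataFrame, which is part of what I'm asking.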