
I'm still new to Beam, but how exactly do you read CSV files from a GCS bucket? I essentially want to transform these files into a pandas DataFrame using Beam and then apply an sklearn model to "train" on this data. Most of the examples I've seen pre-define the header; I want this Beam pipeline to generalize to any files, where the headers will definitely be different. There's a library called beam_utils that does what I want to do, but then I run into this error: AttributeError: module 'apache_beam.io.fileio' has no attribute 'CompressionTypes'

Code Example:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The error occurs in this import
from beam_utils.sources import CsvFileSource

options = {
    'project': 'my-project',
    'runner': 'DirectRunner',
    'streaming': False
}

pipeline_options = PipelineOptions(flags=[], **options)

class Printer(beam.DoFn):
    def process(self, element):
        print(element)

with beam.Pipeline(options=pipeline_options) as p:  # Create the Pipeline with the specified options.

    data = (p
            | 'Read File From GCS' >> beam.io.textio.ReadFromText('gs://my-csv-files')
            )

    _ = (data | "Print the data" >> beam.ParDo(Printer()))

# Note: the `with` block runs the pipeline and waits for it to finish on exit,
# so an explicit p.run() / wait_until_finish() afterwards is redundant.
  • have you tried this? https://stackoverflow.com/questions/41170997/how-to-convert-csv-into-a-dictionary-in-apache-beam-dataflow/41171867#41171867 – Pablo Jan 30 '20 at 23:17
  • Is there a particular reason you want to use beam for creating the pandas dataframes? Check [this answer](https://stackoverflow.com/a/48837706/7517757) that talks about the use of pandas in apache beam – Tlaquetzal Feb 07 '20 at 02:31

1 Answer


The Apache Beam module fileio has recently been modified with backward-incompatible changes, and the beam_utils library hasn't been updated yet.

I went through the question suggested by @Pablo and the source code of beam_utils (also written by Pablo) to replicate the behavior using the filesystems module.

Below are two versions of the code using pandas to generate the DataFrame(s).

CSV file used for the example:

a,b
1,2
3,4
5,6

Reading the CSV and creating the DataFrame with all its content

import apache_beam as beam
import pandas as pd
import csv
import io

def create_dataframe(readable_file):

    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)

    # Read it as csv, you can also use csv.reader
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))

    # Create the DataFrame
    dataFrame = pd.DataFrame(csv_dict)
    print(dataFrame.to_string())

p = beam.Pipeline()
(p | beam.Create(['gs://my-bucket/my-file.csv'])
   | beam.FlatMap(create_dataframe)
)

p.run()

Resulting DataFrame

   a  b
0  1  2
1  3  4
2  5  6
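
Since the question mentions training an sklearn model, here's a minimal sketch of how the DataFrame built in create_dataframe could be fed into one. It uses an in-memory CSV as a stand-in for the file object that FileSystems.open returns, and treating the last column as the target is an assumption made up for illustration:

```python
import csv
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# In-memory stand-in for the file object that FileSystems.open() returns
gcs_file = io.BytesIO(b"a,b\n1,2\n3,4\n5,6\n")

# Same parsing step as in create_dataframe above
rows = csv.DictReader(io.TextIOWrapper(gcs_file))
df = pd.DataFrame(rows).astype(float)

# Hypothetical split: last column is the target, the rest are features
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

model = LinearRegression().fit(X, y)
```

Note that inside the pipeline each call to create_dataframe would train on a single file's worth of data; combining files first is a separate concern.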

Reading the CSV and creating the DataFrames in another transform

def get_csv_reader(readable_file):

    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)

    # Return the csv reader
    return csv.DictReader(io.TextIOWrapper(gcs_file))

p = beam.Pipeline()
(p | beam.Create(['gs://my-bucket/my-file.csv'])
   | beam.FlatMap(get_csv_reader)
   | beam.Map(lambda x: pd.DataFrame([x])) # Create the DataFrame from each csv row
   | beam.Map(lambda x: print(x.to_string()))
)

Resulting DataFrames

   a  b
0  1  2
   a  b
0  3  4
   a  b
0  5  6
  • Interesting, but it looks like this creates a single partition, which means you won't get the parallel processing power of Apache Beam / Dataflow. So I'm not sure it even makes sense to use it to solve the above problem in the first place. Though it can work great if you have multiple CSV files and not one large file. – chhantyal Jun 04 '20 at 15:16