Some of the fields in my CSV file contain newlines within the text, and this breaks my parsing. My current code is as follows:

# Python's CSV and regular expression libraries (csv is used by parse_file below)
import csv
import re
import sys

# Beam and interactive Beam imports
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib


p = beam.Pipeline(InteractiveRunner())

def print_row(element):
    print(element)

def parse_file(element):
    for line in csv.reader([element], quotechar='"', delimiter=',', lineterminator='\n', quoting=csv.QUOTE_ALL, skipinitialspace=True):
        return line

parsed_csv = p | 'Read input file' >> beam.io.ReadFromText("gs://ny-data/AB_NYC_2019.csv")| 'Parse file' >> beam.Map(parse_file) 

split = parsed_csv | beam.Map(lambda x: x[0]) | beam.Map(print)

p.run()

The problem is that some of the text fields span more than one line, like this:

The BLUE OWL:
VEGETARIAN WBURG W PATIO & BACKYARD!

Any thoughts on how to proceed?

Ryan Tom

1 Answer

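ReadFromText reads its input one line at a time, so a quoted field that contains a newline is split across several elements before csv.reader ever sees it. As suggested before, you can use the DataFrame read_csv, or you can create a PCollection of file paths and open and read each file in a DoFn.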

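For example, you could write:

import csv
import io

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.io.filesystems import FileSystems

def read_csv_file(file_metadata):
  # FileSystems.open returns a binary stream; csv.reader needs text.
  # UTF-8 is an assumption here; use the file's actual encoding.
  with FileSystems.open(file_metadata.path) as fin:
    for row in csv.reader(io.TextIOWrapper(fin, encoding='utf-8', newline='')):
      yield row

rows = (
    p
    | fileio.MatchFiles('/pattern/to/files/*.csv')  # emits FileMetadatas
    | beam.FlatMap(read_csv_file))                  # emits rows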
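
To see why reading the whole file matters, here is a minimal standard-library sketch (no Beam involved) that contrasts the two approaches. The sample bytes in RAW and the helper names parse_per_line and parse_whole_stream are made up for illustration, and UTF-8 is assumed as the encoding. The first helper mimics line-at-a-time reading; the second hands csv.reader the whole stream, as the DoFn approach does.

```python
import csv
import io

# Hypothetical sample data: the name field of record 1 contains a newline.
RAW = (
    b'id,name\r\n'
    b'1,"The BLUE OWL:\nVEGETARIAN WBURG W PATIO & BACKYARD!"\r\n'
    b'2,"Plain name"\r\n'
)


def parse_per_line(raw):
    """Mimic line-at-a-time reading: split first, then parse each line alone."""
    rows = []
    for line in raw.decode("utf-8").splitlines():
        rows.extend(csv.reader([line]))
    return rows


def parse_whole_stream(raw):
    """Parse the entire byte stream so quoted newlines stay inside their field."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
    return list(csv.reader(text))


broken = parse_per_line(RAW)      # the multi-line record is split in two
intact = parse_whole_stream(RAW)  # one row per logical record

print(len(broken), len(intact))
```

The per-line version yields one more row than there are records, because the quoted field is cut at its embedded newline. Parsing the full stream keeps the record together, which is why the file has to be opened and read inside the DoFn rather than pre-split by ReadFromText.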
robertwb
  • Thanks for the response. I am trying to do this without the use of beam dataframes, although they seem very useful. Can you elaborate on what you mean by creating a PColl of paths and open/read them in a DoFn? – Ryan Tom Jul 02 '21 at 05:16