How to handle newlines when loading a CSV into Apache Beam?

Question

Some fields in my CSV contain newlines within the text, which breaks my parsing. My current code is as follows:

# Standard library imports (csv is needed by parse_file below)
import csv
import re
import sys

# Beam and interactive Beam imports
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib


p = beam.Pipeline(InteractiveRunner())

def print_row(element):
    print(element)

def parse_file(element):
    for line in csv.reader([element], quotechar='"', delimiter=',', lineterminator='\n', quoting=csv.QUOTE_ALL, skipinitialspace=True):
        return line

parsed_csv = (
    p
    | 'Read input file' >> beam.io.ReadFromText("gs://ny-data/AB_NYC_2019.csv")
    | 'Parse file' >> beam.Map(parse_file))

split = parsed_csv | beam.Map(lambda x: x[0]) | beam.Map(print)

p.run()

The problem is that some field values span multiple lines, like this:

The BLUE OWL:
VEGETARIAN WBURG W PATIO & BACKYARD!

Any thoughts on how to proceed?

robertwb · Answer 1 · 2021-07-02T21:16:01.907

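ReadFromText reads its input one line at a time, so a record with a newline inside a quoted field is split across several elements before your parser ever sees it. As suggested before, you can use the Beam DataFrame read_csv, or you could create a PCollection of file paths and open and read each file in a DoFn.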
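To see why line-at-a-time reading is the problem, here is a minimal sketch using only Python's csv module, with no Beam involved. The sample data and the two helper names are hypothetical; the first helper mimics parsing each physical line on its own, the second lets csv.reader see the whole stream.

```python
import csv
import io

# Hypothetical sample: the second record has a newline inside a quoted field.
SAMPLE = (
    'id,name\n'
    '1,"The BLUE OWL:\nVEGETARIAN WBURG W PATIO & BACKYARD!"\n'
    '2,Plain name\n'
)

def parse_line_by_line(text):
    """Parse each physical line separately, as a per-line Map would."""
    rows = []
    for line in text.splitlines():
        rows.extend(csv.reader([line]))
    return rows

def parse_whole_stream(text):
    """Give csv.reader the whole stream so quoted newlines stay in one field."""
    return list(csv.reader(io.StringIO(text, newline='')))

broken = parse_line_by_line(SAMPLE)
intact = parse_whole_stream(SAMPLE)
print(len(broken), len(intact))  # prints: 4 3
```

The per-line version produces four rows because the quoted field is cut in two, while the whole-stream version keeps the three logical records. Both options mentioned in the answer work because they hand the parser a whole file rather than single lines.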

For example, you could write

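import csv
import io

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.io.filesystems import FileSystems

def read_csv_file(file_metadata):
  # FileSystems.open returns a binary stream, so wrap it as text for csv.reader.
  # The utf-8 encoding is an assumption; newline='' keeps quoted newlines intact.
  with FileSystems.open(file_metadata.path) as fin:
    for row in csv.reader(io.TextIOWrapper(fin, encoding='utf-8', newline='')):
      yield row

rows = (
    p
    | fileio.MatchFiles('/pattern/to/files/*.csv')  # emits FileMetadata objects
    | beam.FlatMap(read_csv_file))                  # emits one list per CSV row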
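The per-file reading logic can be tried outside a pipeline. The sketch below is not part of the answer: read_csv_rows is a hypothetical helper, io.BytesIO stands in for the binary file handle that FileSystems.open returns, and the utf-8 default is an assumption about how the files are encoded.

```python
import csv
import io

def read_csv_rows(binary_stream, encoding='utf-8'):
    """Yield CSV rows from a binary file-like object.

    The encoding default is an assumption; use whatever the files are stored in.
    """
    # csv.reader needs text, so decode the bytes first. newline='' leaves line
    # endings untouched so newlines inside quoted fields are preserved.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline='')
    for row in csv.reader(text_stream):
        yield row

# Stand-in for an opened file: bytes with a newline inside a quoted field.
fake_file = io.BytesIO(b'id,name\n1,"The BLUE OWL:\nVEGETARIAN WBURG"\n')
parsed = list(read_csv_rows(fake_file))
```

Here parsed holds two rows, and the name field of the second row still contains its embedded newline. Inside a pipeline, the same generator body would receive the handle opened from each matched file.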

edited Jul 02 '21 at 21:16

answered Jul 01 '21 at 23:04

robertwb


Thanks for the response. I am trying to do this without the use of beam dataframes, although they seem very useful. Can you elaborate on what you mean by creating a PColl of paths and open/read them in a DoFn? – Ryan Tom Jul 02 '21 at 05:16
