open_file in beam.io FileBasedSource issue with python 3

Question

I am using CSVRecordSource to read a CSV file in an Apache Beam pipeline. Its read_records function calls open_file.

With Python 2 everything worked fine, but after migrating to Python 3 it fails with the error below:

next(csv_reader)
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

By default, the open_file method opens the file in binary mode, so csv.reader receives bytes instead of str.
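A minimal reproduction of the error outside Beam, as a sketch: io.BytesIO is only an assumed stand-in for the binary file object that open_file returns, and the sample data is made up for illustration.

```python
import csv
import io

# Hypothetical stand-in for the binary file object returned by open_file;
# no Beam import is needed to trigger the same csv error.
binary_file = io.BytesIO(b"id,name\n1,alice\n")

try:
    next(csv.reader(binary_file))
    error_message = None
except csv.Error as exc:
    # In Python 3, csv.reader requires an iterator of str, not bytes.
    error_message = str(exc)

print(error_message)
```

The exact wording of the message varies between Python versions, but it always points at the bytes-versus-text mismatch.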

So I changed it to use the built-in open in text mode:

with open(filename, "rt") as f:

but that fails when I run the pipeline on Google Cloud Dataflow, because it cannot find the file:

FileNotFoundError: [Errno 2] No such file or directory

Below is my code:

with self.open_file(filename) as f:
    csv_reader = csv.reader(f, delimiter=self.delimiter, quotechar=self.quote_character)
    header = next(csv_reader)

How can I use CSVRecordSource with Python 3?

Where are you using this function? In a DoFn? — Sach, Oct 23 '19 at 02:49
I am using it in Read(CSVRecordSource(input)) in a Beam pipeline. — tank, Oct 23 '19 at 08:40

score 0 · Answer 1 · answered Oct 23 '19 at 05:02

Are you using the open_file method defined here: https://github.com/apache/beam/blob/6f6feaaeebfc82302ba83c52d087b06a12a5b119/sdks/python/apache_beam/io/filebasedsource.py#L166?

If so, I think you can just call the underlying FileSystems.open() with 'application/octet-stream' replaced by 'text/plain'.

score 0 · Accepted Answer · answered Oct 23 '19 at 08:43

I solved it by using codecs.iterdecode, which incrementally decodes the bytes yielded by an iterator, so csv.reader receives str lines:

csv.reader(codecs.iterdecode(f, "utf-8"), delimiter=self.delimiter, quotechar=self.quote_character)
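A self-contained sketch of the same approach, runnable without Beam. The helper name read_csv_rows, its default arguments, and the io.BytesIO sample are assumptions for illustration; in the pipeline, the binary file object would come from self.open_file(filename).

```python
import codecs
import csv
import io


def read_csv_rows(binary_file, delimiter=",", quote_character='"'):
    """Return (header, rows) parsed from a file object opened in binary mode.

    Hypothetical helper: codecs.iterdecode wraps the bytes iterator and
    yields decoded str chunks, which is what csv.reader expects in Python 3.
    """
    text_lines = codecs.iterdecode(binary_file, "utf-8")
    reader = csv.reader(text_lines, delimiter=delimiter, quotechar=quote_character)
    header = next(reader)
    return header, list(reader)


# Made-up sample data standing in for the file returned by open_file.
sample = io.BytesIO('id,name\n1,"Doe, Jane"\n2,Zoë\n'.encode("utf-8"))
header, rows = read_csv_rows(sample)
print(header)
print(rows)
```

Decoding on the fly keeps the file open in binary mode, so the file handling that already works on Dataflow does not need to change.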

tank