I am using CSVRecordSource to read a CSV file in an Apache Beam pipeline. Its read_records function uses open_file to open the file.

With Python 2 everything worked fine, but after migrating to Python 3 it fails with the error below:

next(csv_reader)
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

By default, the open_file method opens the file in binary mode.

So I changed it to use

with open(filename, "rt") as f:

but this fails when I run the pipeline on Dataflow in Google Cloud, because it cannot find the file and raises:

FileNotFoundError: [Errno 2] No such file or directory

Below is my code:

with self.open_file(filename) as f:
    csv_reader = csv.reader(f, delimiter=self.delimiter, quotechar=self.quote_character)
    header = next(csv_reader)

How can I use CSVRecordSource with Python 3?

tank

2 Answers

Are you using the open_file method defined here: https://github.com/apache/beam/blob/6f6feaaeebfc82302ba83c52d087b06a12a5b119/sdks/python/apache_beam/io/filebasedsource.py#L166?

If so, I think you can just call the underlying FileSystems.open() with 'application/octet-stream' replaced by 'text/plain'.

Yueyang Qiu

I solved it by using codecs.iterdecode, which incrementally decodes the bytes yielded by the file iterator into strings:

csv.reader(codecs.iterdecode(f, "utf-8"), delimiter=self.delimiter, quotechar=self.quote_character)
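
For reference, here is a minimal self-contained sketch of the same idea. It makes a few assumptions: io.BytesIO stands in for the binary handle that open_file returns, read_csv_rows is a hypothetical helper name (not part of Beam), and the data is UTF-8 encoded.

```python
import codecs
import csv
import io


def read_csv_rows(binary_file, delimiter=",", quotechar='"', encoding="utf-8"):
    """Yield CSV rows (lists of str) from a file-like object opened in binary mode.

    Hypothetical helper for illustration only; it is not part of Apache Beam.
    """
    # csv.reader needs an iterator of str in Python 3. codecs.iterdecode wraps
    # the bytes iterator and lazily decodes each chunk with an incremental decoder.
    text_lines = codecs.iterdecode(binary_file, encoding)
    yield from csv.reader(text_lines, delimiter=delimiter, quotechar=quotechar)


# Assumption: io.BytesIO stands in for the binary handle returned by open_file.
sample = io.BytesIO(b"id,name\n1,alice\n2,bob\n")
rows = list(read_csv_rows(sample))
header, records = rows[0], rows[1:]
```

Because the decoding is lazy, the file is still read line by line rather than loaded into memory, and the handle keeps coming from open_file, so remote paths should continue to work the way they did under Python 2.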
tank