I am reading a CSV file whose fields are enclosed in double quotes, and some fields also contain commas within their values, for example: "abc","def,ghi","jkl"

Is there a way we can read this file into a PCollection using Apache Beam?

Stephen
vaibhav v
    The question doesn't appear to include any attempt at all to solve the problem. Please edit the question to show what you've tried, and show a specific roadblock you're running into with a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). For more information, please see [How to Ask](https://stackoverflow.com/help/how-to-ask). – liakoyras Sep 11 '19 at 13:25

1 Answer

Here is a sample CSV file with fields enclosed in double quotes:

"AAA", "BBB", "Test, Test", "CCC" 
"111", "222, 333", "XXX", "YYY, ZZZ"

You can use the csv module from the standard library:

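import csv

import apache_beam as beam

def print_row(element):
    # Print each parsed row (a list of field strings).
    print(element)

def parse_file(element):
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    for line in csv.reader([element], quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True):
        # Only one line was passed in, so return the first (and only) row.
        return line

# p (a beam.Pipeline) and input_filename are assumed to be defined elsewhere.
parsed_csv = (
    p
    | 'Read input file' >> beam.io.ReadFromText(input_filename)
    | 'Parse file' >> beam.Map(parse_file)
    | 'Print output' >> beam.Map(print_row)
)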

This gives the following output:

['AAA', 'BBB', 'Test, Test', 'CCC']
['111', '222, 333', 'XXX', 'YYY, ZZZ']

One thing to watch out for: csv.reader expects an iterable that yields one string per line (such as a file object or a list), not a single string. If you pass a bare string to csv.reader(), it iterates over the string character by character and treats each character as a separate line. Wrapping the line in a list, as above, avoids this. You then iterate over the reader to get the parsed row, which is a list of field strings.
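To make the difference concrete, here is a small sketch using only the standard library (no Beam needed). The variable names are just for illustration; the sample line is taken from the file above.

```python
import csv

line = '"111", "222, 333", "XXX", "YYY, ZZZ"'

# Wrapped in a list: the reader sees exactly one line and yields one row.
rows_from_list = list(csv.reader([line], skipinitialspace=True))
print(rows_from_list)

# Bare string: the reader iterates over the characters of the string,
# treating each one as its own line, so the result is not a single row.
rows_from_string = list(csv.reader('a,b'))
print(rows_from_string)
```

The first call yields a single row with four fields, while the second splits the input into several fragmentary rows, which is why the line is wrapped in a list inside parse_file.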

Faizan Saeed
  • Thanks Faizan, this is what I was looking for. – vaibhav v Sep 13 '19 at 15:33
  • Awesome answer. I like something like `mycsvreader = csv.reader(...)` followed by `return next(mycsvreader)` to clarify in the code that only one row is going to get returned, since the movement of rows through Beam can be confusing for beginners. Basically, mycsvreader will be an iterator and you get the next one since that's all there is. – Stephen Feb 25 '21 at 20:12
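
The variant suggested in the last comment might look roughly like the sketch below. The function name parse_line is hypothetical; the reader options are copied from the answer's parse_file. It would be used the same way, for example with beam.Map(parse_line).

```python
import csv

def parse_line(element):
    # Build a reader over a one-element list, so it yields exactly one row.
    reader = csv.reader(
        [element],
        quotechar='"',
        delimiter=',',
        quoting=csv.QUOTE_ALL,
        skipinitialspace=True,
    )
    # next() makes it explicit that only a single row is returned per line.
    return next(reader)

print(parse_line('"abc","def,ghi","jkl"'))
```

This behaves like the for/return version for single-line input and only makes the one-row-per-element intent more visible.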