
I have a dictionary of values that I would like to write to GCS as a valid .CSV file using the Python SDK. I can write the dictionary out as a newline-separated text file, but I can't seem to find an example of converting the dictionary to a valid .CSV. Can anybody suggest the best way to generate CSVs within a Dataflow pipeline? The answers to this question address reading from CSV files, but don't really address writing to CSV files. I recognize that CSV files are just text files with rules, but I'm still struggling to convert the dictionary of data to a CSV that can be written using WriteToText.

Here is a simple example dictionary that I would like to turn into a CSV:

test_input = [{'label': 1, 'text': 'Here is a sentence'},
              {'label': 2, 'text': 'Another sentence goes here'}]


test_input  | beam.io.WriteToText(path_to_gcs)

The above would result in a text file that had each dictionary on a newline. Is there any functionality within Apache Beam that I can take advantage of (similar to csv.DictWriter)?

reese0106

2 Answers


Generally you will want to write a function that can convert your original dict data elements into a csv-formatted string representation.

That function can be written as a DoFn that you can apply to your Beam PCollection of data, which would convert each collection element into the desired format; you can do this by applying the DoFn to your PCollection via ParDo. You can also wrap this DoFn in a more user-friendly PTransform.

You can learn more about this process in the Beam Programming Guide.

Here is a simple, translatable non-Beam example:

# Our example list of dictionary elements
test_input = [{'label': 1, 'text': 'Here is a sentence'},
             {'label': 2, 'text': 'Another sentence goes here'}]

def convert_my_dict_to_csv_record(input_dict):
    """ Turns dictionary values into a comma-separated value formatted string """
    return ','.join(map(str, input_dict.values()))

# Our converted list of elements
converted_test_input = [convert_my_dict_to_csv_record(element) for element in test_input]

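The converted_test_input will look like the following. Note that the column order follows the dictionary's iteration order: the output shown here is from Python 2, where that order is arbitrary, while on Python 3.7+ dictionaries keep insertion order and the first element would be '1,Here is a sentence':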

['Here is a sentence,1', 'Another sentence goes here,2']
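One caveat with the plain ','.join(...) approach is that it does not quote values containing the separator or a double quote, so such rows no longer parse back into the same columns. A minimal sketch of the difference, using only the standard library csv module on Python 3 (the helper name to_csv_line is just for illustration and is not part of Beam):

```python
import csv
import io


def to_csv_line(values):
    """Format one row of values as a single CSV line (illustrative helper)."""
    buf = io.StringIO()
    # lineterminator='' keeps the returned string free of a trailing newline,
    # since a text sink would normally add one per element.
    csv.writer(buf, lineterminator='').writerow(values)
    return buf.getvalue()


row = [1, 'Hello, world', 'She said "hi"']

naive = ','.join(map(str, row))   # the comma inside the text splits the column
quoted = to_csv_line(row)         # csv.writer quotes and escapes as needed

print(naive)
print(quoted)

# Reading the quoted line back recovers the original three columns (as strings).
print(next(csv.reader([quoted])))
```

The same quoting rules apply when csv.DictWriter is used instead of csv.writer.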

Beam DictToCSV DoFn and PTransform example using DictWriter

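from csv import DictWriter
from csv import excel
from cStringIO import StringIO  # Python 2 only; on Python 3 use: from io import StringIO

from apache_beam import DoFn, ParDo, PTransform

...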

def _dict_to_csv(element, column_order, missing_val='', discard_extras=True, dialect=excel):
    """ Additional properties for delimiters, escape chars, etc via an instance of csv.Dialect
        Note: This implementation does not support unicode
    """

    buf = StringIO()

    writer = DictWriter(buf,
                        fieldnames=column_order,
                        restval=missing_val,
                        extrasaction=('ignore' if discard_extras else 'raise'),
                        dialect=dialect)
    writer.writerow(element)

    return buf.getvalue().rstrip(dialect.lineterminator)


class _DictToCSVFn(DoFn):
    """ Converts a Dictionary to a CSV-formatted String

        column_order: A tuple or list specifying the name of fields to be formatted as csv, in order
        missing_val: The value to be written when a named field from `column_order` is not found in the input element
        discard_extras: (bool) Behavior when additional fields are found in the dictionary input element
        dialect: Delimiters, escape-characters, etc can be controlled by providing an instance of csv.Dialect

    """

    def __init__(self, column_order, missing_val='', discard_extras=True, dialect=excel):
        self._column_order = column_order
        self._missing_val = missing_val
        self._discard_extras = discard_extras
        self._dialect = dialect

    def process(self, element, *args, **kwargs):
        result = _dict_to_csv(element,
                              column_order=self._column_order,
                              missing_val=self._missing_val,
                              discard_extras=self._discard_extras,
                              dialect=self._dialect)

        return [result,]

class DictToCSV(PTransform):
    """ Transforms a PCollection of Dictionaries to a PCollection of CSV-formatted Strings

        column_order: A tuple or list specifying the name of fields to be formatted as csv, in order
        missing_val: The value to be written when a named field from `column_order` is not found in an input element
        discard_extras: (bool) Behavior when additional fields are found in the dictionary input element
        dialect: Delimiters, escape-characters, etc can be controlled by providing an instance of csv.Dialect

    """

    def __init__(self, column_order, missing_val='', discard_extras=True, dialect=excel):
        self._column_order = column_order
        self._missing_val = missing_val
        self._discard_extras = discard_extras
        self._dialect = dialect

    def expand(self, pcoll):
        return pcoll | ParDo(_DictToCSVFn(column_order=self._column_order,
                                          missing_val=self._missing_val,
                                          discard_extras=self._discard_extras,
                                          dialect=self._dialect)
                             )

To use the example, you would put your test_input into a PCollection, and apply the DictToCSV PTransform to the PCollection; you can take the resulting converted PCollection and use it as input for WriteToText. Note that you must provide a list or tuple of column names, via the column_order argument, corresponding to keys for your dictionary input elements; the resulting CSV-formatted string columns will be in the order of the column names provided. Also, the underlying implementation for the example does not support unicode.
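As a rough sketch of how the pieces could fit together on Python 3, the snippet below formats rows with csv.DictWriter and io.StringIO and builds a header line from the same column list. The helper name dict_to_csv_line and the variable names are assumptions made for illustration. The pipeline part is left as a comment because it needs apache_beam installed and a real output path; it relies on the header argument of WriteToText, which is written at the start of each output file.

```python
import csv
import io


def dict_to_csv_line(element, column_order, missing_val=''):
    """Format one dict as a CSV line, with columns in `column_order` (illustrative helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf,
                            fieldnames=column_order,
                            restval=missing_val,      # used when a key is absent
                            extrasaction='ignore',    # drop keys not in column_order
                            lineterminator='')        # no trailing newline
    writer.writerow(element)
    return buf.getvalue()


columns = ['label', 'text']

# The header is formatted through the same writer so it gets the same quoting rules.
header_line = dict_to_csv_line({name: name for name in columns}, columns)

test_input = [{'label': 1, 'text': 'Here is a sentence'},
              {'label': 2, 'text': 'Another sentence goes here'}]
lines = [dict_to_csv_line(element, columns) for element in test_input]

print(header_line)
for line in lines:
    print(line)

# Sketch of the pipeline usage, assuming `import apache_beam as beam`, the
# DictToCSV transform from this answer, and an output prefix in `path_to_gcs`:
#
# with beam.Pipeline() as p:
#     (p
#      | beam.Create(test_input)
#      | DictToCSV(column_order=columns)
#      | beam.io.WriteToText(path_to_gcs,
#                            file_name_suffix='.csv',
#                            header=header_line))
```

Because the output may be split into several shards, each shard would get its own copy of the header line.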

Andrew Mo
  • Thanks, Andrew! This makes sense to me - think that I understand the mechanics and what I was wondering is if anything like ConverDictToCSVFn() existed within Apache-Beam or if it had to be written from scratch. Writing this type of function is not trivial, because if the sentence contains a comma (or whatever your separator is) then you would typically need to surround the entire sentence with double quotes "". I'm guessing this response suggests that there is nothing within Apache-Beam already set up to handle these cases? – reese0106 Oct 17 '17 at 19:17
  • From what I can tell `textio` doesn't appear to have this convenience available -- In the meantime, I believe this could be implemented in Python by combining the `csv` module's `DictWriter` with the Python `StringIO` module. – Andrew Mo Oct 17 '17 at 20:42
  • Could you provide any more guidance on what you are suggesting? Perhaps with a separate answer? This is really the crux of my question as I have not been able to find a way to make use of DictWriter... – reese0106 Oct 18 '17 at 15:26
  • Sure -- I'll update the response above with some code that implements this using `DictWriter` and `StringIO` -- I'll be working with the Beam SDK team to see if we can get this added via a Pull Request too. – Andrew Mo Oct 18 '17 at 17:37
  • Thanks! I added an answer below based on your earlier suggestions, but I haven't figured out how to make use of DictWriter yet so that would be great. – reese0106 Oct 18 '17 at 18:02
  • Updated my earlier answer with an implementation using `DictWriter` as part of a `DoFn` and `PTransform` :) – Andrew Mo Oct 18 '17 at 18:19
  • how to write csv header? – akram Aug 14 '20 at 13:55
  • Do you know if there has been any progress on getting the ability to write out CSV values? In my case I don't even have dicts, just tuples in my `PCollection`. I think what's needed is something like `WriteToCSV` similar to `WriteToText`. – Stephen Dec 10 '20 at 22:36

Based on Andrew's suggestion, here is a ConvertDictToCSV function that I created:

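def ConvertDictToCSV(input_dict, fieldnames, separator=",", quotechar='"'):
  value_list = []
  for field in fieldnames:
    value = input_dict[field]
    # Only None becomes empty; a plain truthiness test would also blank out 0 and False
    field_value = "" if value is None else str(value)
    # Quote fields containing the separator, the quote character, or a line break,
    # doubling any embedded quote characters
    if any(ch in field_value for ch in (separator, quotechar, "\n", "\r")):
      field_value = quotechar + field_value.replace(quotechar, quotechar * 2) + quotechar
    value_list.append(field_value)

  return separator.join(value_list)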

This appears to be working well, but it would certainly be safer to make use of csv.DictWriter if possible.

reese0106