How to convert csv into a dictionary in apache beam dataflow

Question

I would like to read a csv file and write it to BigQuery using apache beam dataflow. In order to do this I need to present the data to BigQuery in the form of a dictionary. How can I transform the data using apache beam in order to do this?

My input csv file has two columns, and I want to create a subsequent two column table in BigQuery. I know how to create data in BigQuery, thats straight forward, what I don't know is how to transform the csv into a dictionary. The below code is not correct but should give an idea of what i'm trying to do.

# Standard imports
import apache_beam as beam
# Create a pipeline executing on a direct runner (local, non-cloud).
p = beam.Pipeline('DirectPipelineRunner')
# Create a PCollection with names and write it to a file.
(p
| 'read solar data' >> beam.Read(beam.io.TextFileSource('./sensor1_121116.csv'))
# How do you do this??
| 'convert to dictionary' >> beam.Map(lambda (k, v): {'luminosity': k, 'datetime': v})
| 'save' >> beam.Write(
   beam.io.BigQuerySink(
   output_table,
   schema='month:INTEGER, tornado_count:INTEGER',
   create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
   write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run()

Pablo · Accepted Answer · 2019-05-28T17:56:49.363

28

Edit: as of version 2.12.0, Beam comes with new fileio transforms that allow you to read from CSV without having to reimplement a source. You can do this like so:

def get_csv_reader(readable_file):
  # You can return whichever kind of reader you want here
  # a DictReader, or a normal csv.reader.
  if sys.version_info >= (3, 0):
    return csv.reader(io.TextIOWrapper(readable_file.open()))
  else:
    return csv.reader(readable_file.open())

with Pipeline(...) as p:
  content_pc = (p
                | beam.io.fileio.MatchFiles("/my/file/name")
                | beam.io.fileio.ReadMatches()
                | beam.Reshuffle()  # Useful if you expect many matches
                | beam.FlatMap(get_csv_reader))

I recently wrote a test for this for Apache Beam. You can take a look on the Github repository.

The old answer relied on reimplementing a source. This is no longer the main recommended way of doing this : )

The idea is to have a source that returns parsed CSV rows. You can do this by subclassing the FileBasedSource class to include CSV parsing. Particularly, the read_records function would look something like this:

class MyCsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
  def read_records(self, file_name, range_tracker):
    self._file = self.open_file(file_name)

    reader = csv.reader(self._file)

    for rec in reader:
      yield rec

edited May 28 '19 at 17:56

answered Dec 15 '16 at 19:25

Pablo

10,425
1
44
67

2

Many thanks Pablo, this works really well! Here is a code snippet in case people are looking for completeness (p | 'read solar data' >> beam.Read(CsvFileSource('./sensor1_121116.csv')) | 'save' >> beam.Write(beam.io.TextFileSink('./greetings_solar'))) – user1753640 Dec 16 '16 at 10:37
1

I'm trying to write the results into BigQuery, but no luck, the table gets created but no data. Can you tell whats going on? Here is a snippet (p | 'read solar data' >> beam.Read(CsvFileSource('./sensor1_121116.csv')) | 'save' >> beam.Write( beam.io.BigQuerySink( output_table, schema='lumosity:INTEGER, time:INTEGER', create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED, write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))) – user1753640 Dec 16 '16 at 12:55
@user1753640: I had the same problem and had to use a dictionary matching the schema before storing datas into GBQ. – vdolez Nov 16 '17 at 10:07
I'm trying to `from beam_utils.sources import CsvFileSource`, but it's complaining about not being able to `from apache_beam.io import fileio`. I've done `pip install beam_utils` and `pip install apache_beam` - is there something else I need to install? If so, why didn't `beam_utils` specify that in its dependencies? – Tomas Aschan Mar 29 '18 at 09:41
The pip package is not up to date with the recent commits to the repo. I used `pip install git+https://github.com/pabloem/beam_utils` to get the latest (working) version. – dumkar Mar 30 '18 at 15:33
This code works. However, it is being executed twice. Why? – Jaison Jul 04 '18 at 10:53
2

Maybe a little late, but after playing with this for a while, I discovered that the _completeness_ code should be this `(p | 'read solar data' >> beam.io.Read(CsvFileSource('./sensor1_121116.csv')) | 'save' >> beam.io.Write(beam.io.TextFileSink('./greetings_solar')))` Notice the **io** inserted after _beam_ references in the apply's invocations – jquinter Nov 19 '18 at 19:25

score 5 · Answer 2 · answered Dec 19 '17 at 14:11

As a supplement to Pablo's post, I'd like to share a little change I made myself to his sample. (+1 for you!)

Changed: reader = csv.reader(self._file) to reader = csv.DictReader(self._file)

The csv.DictReader uses the first row of the CSV file as Dict keys. The other rows are used to populate a dict per row with it's values. It'll automatically put the right values to the correct keys based on column order.

One little detail is that every value in the Dict is stored as string. This may conflict your BigQuery schema if you use eg. INTEGER for some fields. So you need to take care of proper casting afterwards.

this doesn't quite work: https://issues.apache.org/jira/browse/BEAM-10573 — Travis Webb, Dec 14 '20 at 06:27

score 0 · Answer 3 · answered Jun 13 '23 at 11:47

0

AttributeError: module 'apache_beam.io' has no attribute 'TextFileSink' -> is there an update to this code?

answered Jun 13 '23 at 11:47

km_jsh12

11
1

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jun 13 '23 at 16:41
This does not really answer the question. If you have a different question, you can ask it by clicking [Ask Question](https://stackoverflow.com/questions/ask). To get notified when this question gets new answers, you can [follow this question](https://meta.stackexchange.com/q/345661). Once you have enough [reputation](https://stackoverflow.com/help/whats-reputation), you can also [add a bounty](https://stackoverflow.com/help/privileges/set-bounties) to draw more attention to this question. - [From Review](/review/late-answers/34539709) – Chenmunka Jun 16 '23 at 20:44

How to convert csv into a dictionary in apache beam dataflow

3 Answers3

Linked