
I used the code snippet below to read CSV files into the pipeline as Dicts.

import csv

import apache_beam as beam


class MyCsvFileSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        self._file = self.open_file(file_name)

        # MyCustomDialect is a csv.Dialect subclass defined elsewhere.
        reader = csv.DictReader(self._file, dialect=MyCustomDialect)

        for rec in reader:
            yield rec

This snippet is almost literally copied from Pablo's answer to How to convert csv into a dictionary in apache beam dataflow.

Later I noticed that this works fine with relatively small files (e.g. 35k lines). But with larger files, e.g. 700k rows, I saw duplicates being generated in the output (BigQuery), roughly by a factor of 5, so I ended up with over 3M rows.

I took a closer look at beam.io.filebasedsource.FileBasedSource and noticed its splittable argument, which defaults to True.

The documentation says this:

splittable (bool): whether :class:`FileBasedSource` should try to
logically split a single file into data ranges so that different parts
of the same file can be read in parallel. If set to :data:`False`,
:class:`FileBasedSource` will prevent both initial and dynamic splitting
of sources for single files. File patterns that represent multiple files
may still get split into sources for individual files. Even if set to
:data:`True` by the user, :class:`FileBasedSource` may choose to not
split the file, for example, for compressed files where currently it is
not possible to efficiently read a data range without decompressing the
whole file.

When this argument is set to True, different parts of the source file can be read in parallel.

I noticed that if I set this argument to False, the file is read correctly and I get no duplicates.

Currently I keep this splittable argument set to False since it keeps the duplicates out, but I'm not sure whether this is future-proof as my files grow in lines.
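For intuition, here is a minimal, Beam-free sketch of what seems to happen: the runner splits the file into several ranges, but since the read_records() above ignores range_tracker and always yields every row, each split re-emits the whole file. (The 5 splits below are a made-up number, chosen only to mirror the factor-5 duplication observed.)

```python
import csv
import io

# Hypothetical illustration (not Beam code): if the runner splits a file
# into N ranges but the reader ignores the range and yields every row,
# each row is emitted once per split.
def read_all_rows(text):
    return list(csv.DictReader(io.StringIO(text)))

csv_text = "id,name\n1,a\n2,b\n3,c\n"
num_splits = 5  # e.g. the runner decided on 5 bundles for a large file

rows = []
for _ in range(num_splits):               # each split runs the reader once...
    rows.extend(read_all_rows(csv_text))  # ...and reads the whole file

print(len(rows))  # 15 rows instead of 3: every record duplicated 5x
```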

Could there be an issue with reading source files in parallel? Is there something I overlooked or didn't handle the right way?

Andrew Nguonly
Martin van Dam

1 Answer


To support splitting without duplicates, you have to use the range_tracker object passed to your source when reading from it. For example, you have to invoke try_claim() to claim unique positions of the file you are reading.
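To illustrate the pattern, here is a minimal, self-contained sketch. StubOffsetRangeTracker is a hypothetical stand-in for Beam's real tracker (apache_beam.io.range_trackers.OffsetRangeTracker), kept stdlib-only so the idea is visible: each record is claimed by the byte offset where it starts, so every record is read by exactly one range even when the file is split.

```python
import io


class StubOffsetRangeTracker:
    """Minimal stand-in for Beam's OffsetRangeTracker, for illustration only."""

    def __init__(self, start, stop):
        self._start, self._stop = start, stop

    def start_position(self):
        return self._start

    def try_claim(self, position):
        # A real tracker also handles dynamic work rebalancing; here a
        # claim simply succeeds while the position is inside the range.
        return position < self._stop


def read_records(file_obj, range_tracker):
    start = range_tracker.start_position()
    if start > 0:
        # A record straddling the range boundary belongs to the previous
        # range, so skip forward to the next line start.
        file_obj.seek(start - 1)
        file_obj.readline()
    else:
        file_obj.seek(0)
    # Claim each record at its start offset; stop once a claim fails.
    while range_tracker.try_claim(file_obj.tell()):
        line = file_obj.readline()
        if not line:
            break
        yield line.rstrip("\n")


data = "1,a\n2,b\n3,c\n4,d\n"  # 4 records of 4 bytes each

records = []
for start, stop in [(0, 8), (8, 16)]:  # two simulated bundles of one file
    tracker = StubOffsetRangeTracker(start, stop)
    records.extend(read_records(io.StringIO(data), tracker))

print(records)  # ['1,a', '2,b', '3,c', '4,d'] — each record exactly once
```

Without the try_claim() calls, both simulated bundles would emit all four records, which is exactly the duplication described in the question.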

Please see the following for more information: https://beam.apache.org/documentation/sdks/python-custom-io/

chamikara