
TL;DR

How do I correctly trigger count windows with the Python SDK?

Problem

I'm trying to build a pipeline that transforms and indexes a Wikipedia dump. The objective is to:

  1. Read from a compressed file in a single process, in a streaming fashion, since the file doesn't fit in RAM
  2. Process each element in parallel (ParDo)
  3. Group these elements with a count window (GroupBy on a single key, to go from streaming to batch) in a single process, in order to save them to a DB

Development

For that, I created a simple source class that returns tuples of the form (index, data, counter):

import gzip

import apache_beam as beam


class CountingSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, offset_range_tracker):
        k = 0
        with gzip.open(file_name, "rt", encoding="utf-8", errors="strict") as f:
            # File structure: index, page, index, page, ...
            line = f.readline()
            while line:
                # Emit (index, page, counter) and advance the counter.
                yield line, f.readline(), k
                k += 1
                line = f.readline()
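
For reference, the source can be sanity-checked on its own by just printing what it emits, roughly:

with beam.Pipeline() as p:
    _ = (
        p
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        # Each element is a (index_line, page_line, counter) tuple.
        | "Print" >> beam.Map(print)
    )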

And I made the pipeline:


_beam_pipeline_args = [
    "--runner=DirectRunner",
    "--streaming",
    # "--direct_num_workers=5",
    # "--direct_running_mode=multi_processing",
]


with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    pipeline = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "With timestamps" >> beam.Map(lambda data: beam.window.TimestampedValue(data, data[-1]))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        # * Not working: the pipeline gets stuck at the group step, the window never triggers
        | "window"
        >> beam.WindowInto(
            beam.window.GlobalWindows(),
            trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(10)),
            accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING,
        )
        | "Map to tuple" >> beam.Map(lambda data: (None, data))
        # | "Print" >> beam.Map(lambda data: print(data))
        | "Group all per window" >> beam.GroupByKey()
        | "Discard key" >> beam.Values()
        | "Index data" >> beam.Map(index_data)
    )

If I remove the window and go directly from "Filter nones" to "Index data", the pipeline works, but it indexes the elements individually. Also, if I uncomment the print step, I can see that I still have data after the "Map to tuple" step, but it hangs on "Group all per window" without any log output. I also tried time-based triggering, changing the window to

        >> beam.WindowInto(
            beam.window.FixedWindows(10))

but this changed nothing (it should behave the same way, since I create a "count timestamp" during data extraction, so every 10 elements the timestamps advance by 10 "units"). Am I misunderstanding something about the windowing? The objective is simply to index the data in batches.

Alternative

I can "hack" this last step using a custom do.Fn like:

class BatchIndexing(beam.DoFn):
    def __init__(self, connection_string, batch_size=50000):
        self._connection_string = connection_string
        self._batch_size = batch_size
        self._total = 0

    def setup(self):
        from sqlalchemy import create_engine
        from sqlalchemy.orm import sessionmaker
        from scripts.wikipedia.wikipedia_articles.beam_module.documents import Base

        engine = create_engine(self._connection_string, echo=False)
        self.session = sessionmaker(bind=engine)(autocommit=False, autoflush=False)
        Base.metadata.create_all(engine)

    def start_bundle(self):
        # Buffer of processed elements waiting to be written.
        self._lines = []

    def process(self, element):
        # The input element is the processed pair.
        self._lines.append(element)
        if len(self._lines) >= self._batch_size:
            self._flush_batch()

    def finish_bundle(self):
        # Flush whatever is left in the buffer before the bundle finishes.
        if self._lines:
            self._flush_batch()

    def _flush_batch(self):
        self._total += len(self._lines)
        self.index_data(self._lines)
        # Clear the buffer.
        self._lines = []

    def index_data(self, entries_to_index):
        """
        Index batch of data.
        """
        print(f"Indexed {self._total} entries")
        self.session.add_all(entries_to_index)
        self.session.commit()
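
A teardown to close the session when the worker retires the DoFn could probably be added as well, something along these lines:

    def teardown(self):
        # Close the SQLAlchemy session that was opened in setup().
        self.session.close()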


Then I change the pipeline to:

with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    pipeline = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        | "Unroll" >> beam.FlatMap(lambda data: data)
        | "Index data" >> beam.ParDo(BatchIndexing(connection_string, batch_size=10000))
    )

Which "works" but do the last step in parallel (thus, overwhelming de database or generating locked database problems with sqlite) and I would like to have just one Sink to communicate with the database.

1 Answer


Triggering in Beam is not a hard requirement; my guess would be that the trigger simply does not manage to fire before the input ends. An early trigger of 10 elements means the runner is allowed to trigger after 10 elements, but it does not have to (this relates to how Beam splits inputs into bundles).

FixedWindows(10) uses a fixed 10-second interval, and your data will all have the same timestamp, so that is not going to help either.

If your goal is to group data into batches, there is a very handy transform for exactly that: GroupIntoBatches. It should work for this use case and has additional features, like limiting how long a record can wait in a batch before being processed.
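
A rough sketch of how it could fit into your pipeline, reusing your existing transforms (the batch size and buffering duration are just example values, and the timestamp trick is not needed for this):

with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    _ = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        | "Map to tuple" >> beam.Map(lambda data: (None, data))
        # Emits lists of up to 10000 elements per key; a partial batch is
        # flushed after at most 60 seconds of buffering.
        | "Batch" >> beam.GroupIntoBatches(10000, max_buffering_duration_secs=60)
        | "Discard key" >> beam.Values()
        | "Index data" >> beam.Map(index_data)
    )

Because everything shares the single None key, the batching happens in one place, and each call to index_data then writes a whole batch, which should avoid the per-element parallel writes that were overwhelming the database.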

Jan Lukavsky
  • 1 - The input is definitely not ending: it's 30 GB and I don't have the RAM for that :). I'm doing streaming specifically for that reason. 2 - I create a custom timestamp, k: every 2 lines I read I increase k, so every 10 elements (or 20 lines) there is a difference of 10 "units" between the records' timestamps, which should trigger the FixedWindows(10), since I have the line `beam.window.TimestampedValue(data, data[-1])`, right? 3 - If I use GroupIntoBatches, wouldn't I need to read all the data into memory first? I can't do that. Maybe I can group within a window? – Giovani Merlin Jan 14 '22 at 09:14
  • Thinking again about GroupIntoBatches: would it be possible to change my approach to something like "read 100,000 lines - send a batch - process and save this batch - read another 100,000 lines"? The problem is I don't know whether this closed-loop approach is possible, or whether it would be efficient... – Giovani Merlin Jan 14 '22 at 09:19
  • There might be an issue with how you use the `offset_range_tracker`, linking with [mailing list answer](https://lists.apache.org/thread/xpf11qlgvts2s1phf7b3gntn86kf51gh) for reference. – Jan Lukavsky Jan 17 '22 at 10:53