GroupByKey to fill values and then ungroup apache beam

Question

I have csv files that have missing values per groups formed by primary keys (for every group, there's only 1 value populated for 1 field, and I need that field to be populated for all records of the group). I'm processing the entire file with apache beam and therefore, I want to use GroupByKey to fill up the field for each group, and then ungroup it to restore the original data, now with filled data. The equivalent in pandas would be:

dataframe[column_to_be_filled] = dataframe.groupby(primary_key)[column_to_be_filled].ffill().bfill()

I don't know how to achieve this with apache beam. I first used apache beam dataframe, but that'd take a lot of memory.

I have, but it consumed a lot of memory hence not desirable. Also they fixed the sharding for csv name output but it hasnt been released yet: https://stackoverflow.com/questions/73498119/apache-beam-dataframe-write-csv-to-gcs-without-shard-name-template — pa-nguyen, Aug 31 '22 at 05:42
Could you expand on "consumed a lot of memory"? Was it not properly parallelizing things? — robertwb, Sep 07 '22 at 16:35

score 3 · Answer 1 · answered Aug 31 '22 at 09:22

It's better to process your elements with a pcollection instead of a dataframe to avoid memory issues.

First read your CSV as a pcollection and then you can use GroupByKey and process the grouped elements and yield the results with a separate transformation.

It could be something like this

(pcollection | 'Group by key' >> beam.GroupByKey()
                    | 'Process grouped elements' >> beam.ParDo(UngroupElements()))

The input pcollection should be list of tuples each one contains the key you want to group with and the element.

And the ptransformation would look like this:

class UngroupElements(beam.ParDo):
    
    def process(element):
        k, v = element
        for elem in list(v):
            # process your element 
            yield elem

score 1 · Answer 2 · answered Aug 31 '22 at 19:15

You can try to use exactly the same code as Pandas in Beam: https://beam.apache.org/documentation/dsls/dataframes/overview/

You can use read_csv to read your data into a dataframe, and then apply the same code that you would use in Pandas. Not all Pandas operations are supported (https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/), but that specific case with the group by key should work.

GroupByKey to fill values and then ungroup apache beam

2 Answers2