I have csv files that have missing values per groups formed by primary keys (for every group, there's only 1 value populated for 1 field, and I need that field to be populated for all records of the group). I'm processing the entire file with apache beam and therefore, I want to use GroupByKey
to fill up the field for each group, and then ungroup it to restore the original data, now with filled data. The equivalent in pandas would be:
dataframe[column_to_be_filled] = dataframe.groupby(primary_key)[column_to_be_filled].ffill().bfill()
I don't know how to achieve this with apache beam. I first used apache beam dataframe, but that'd take a lot of memory.