0

From this question: How to group data and construct a new column - python pandas?, I know how to groupby multiple columns and construct a new unique id by using pandas, but if I want to use Apache beam in Python to achieve the same thing that is described in that question, how can I achieve it and then write the new data to a newline delimited JSON format file (each line is one unique_id with an array of objects that belong to that unique_id)?

Assuming the dataset is stored in a csv file.

I'm new to Apache beam, here's what I have now:

import pandas
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    df = p | read_csv("example.csv", names=cols)
    agg_df = df.insert(0, 'unique_id',
          df.groupby(['postcode', 'house_number'], sort=False).ngroup())
    agg_df.to_csv('test_output')        

This gave me an error:

NotImplementedError: 'ngroup' is not yet supported (BEAM-9547)

This is really annoying, I'm not very familiar with Apache beam, can someone help please...

(ref: https://beam.apache.org/documentation/dsls/dataframes/overview/)

wawawa
  • 2,835
  • 6
  • 44
  • 105

1 Answers1

0

Assigning consecutive integers to a set is not something that's very amenable to parallel computation. It's also not very stable. Is there any reason another identifier (e.g. the tuple (postcode, house_number) or its hash would not be suitable?

robertwb
  • 4,891
  • 18
  • 21
  • Hi I don't follow your question, are you asking why I need to construct another identifier? – wawawa Aug 16 '21 at 18:28
  • Yes, or why this identifier can't be an (injective) function of postcode and house_number. – robertwb Aug 16 '21 at 18:32
  • The value for the identifier doesn't matter, it's only used to identify the same property. – wawawa Aug 16 '21 at 18:33
  • In that case, simply concatenate the postcode and house number and use that as an identifier (e.g. `df['unique_id'] = df['postcode'] + '-' + df['house_number ']`) – robertwb Aug 16 '21 at 19:49