
I'm quite new to Apache Beam. I'm trying to group rows by multiple columns (for example with GroupByKey) and add a new column at index 0 that holds a unique ID for each group. How can I achieve this?

I also want to write the new data to a newline-delimited JSON file, where each line is one unique_id with an array of the objects that belong to that unique_id.

I've currently written:

import apache_beam as beam

# The context manager runs the pipeline when the block exits.
with beam.Pipeline() as pipe:
    rows = (
        pipe
        | beam.io.ReadFromText('data.csv')
        | beam.Map(lambda x: x.split(","))
        | beam.Map(print)
    )

This converts each row into a list of strings.

This post has the sample input data, and its solutions use pandas. How do I achieve the same in a pipeline using Beam?
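One caveat with `x.split(",")`: it also splits on commas inside quoted fields. A minimal sketch of a per-line parser built on the standard `csv` module is below; `parse_csv_line` is a hypothetical helper name, and it assumes each line is a complete CSV record with no embedded newlines.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV line into a list of fields, keeping quoted commas intact."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))
```

It should drop into the pipeline in place of the lambda, as `beam.Map(parse_csv_line)`.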

Thank you!

sk97

2 Answers

0

Have you tried CombinePerKey like this?

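import apache_beam as beam

# The context manager runs the pipeline when the block exits.
with beam.Pipeline() as p:
    test = (
        p
        | beam.Create([(0, "ttt"), (0, "ttt1"), (0, "ttt2"), (1, "xxx"), (1, "xxx2"), (2, "yyy")])
        # Join all values that share a key into one comma-separated string.
        | beam.CombinePerKey(lambda v: ",".join(v))
        | beam.Map(print)
    )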
XQ Hu
  • Thanks for your input. Could you please tell me how I would implement this on an existing csv PCollection? @XQ Hu – sk97 Feb 26 '23 at 09:43
  • Assuming t.txt has data like this:
    ```
    0, "ttt"
    0, "ttt1"
    0, "ttt2"
    1, "xxx"
    1, "xxx2"
    2, "yyy"
    ```
    I could do this:
    ```
    import apache_beam as beam

    p = beam.Pipeline()

    test = (
        p
        | beam.io.ReadFromText("t.txt")
        | beam.Map(lambda x: x.split(","))
        | beam.CombinePerKey(lambda v: ",".join(v))
        | beam.Map(print)
    )
    ```
    – XQ Hu Feb 27 '23 at 14:34
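For lines shaped like the t.txt sample above, a plain `split(",")` returns a list and leaves the space and quotes on the value. A small sketch of a parser that returns a clean 2-tuple, the shape keyed transforms such as `CombinePerKey` work with, could look like this; `to_key_value` is a hypothetical helper, and it assumes exactly one key followed by one quoted value per line.

```python
def to_key_value(line):
    """Turn a line such as '0, "ttt"' into the pair ('0', 'ttt')."""
    # Split on the first comma only, so commas inside the value survive.
    key, value = line.split(",", 1)
    # Strip surrounding whitespace, then the double quotes around the value.
    return key.strip(), value.strip().strip('"')
```

It would replace the lambda in the pipeline: `beam.Map(to_key_value)`.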
0

Is it important to you to have the unique IDs be integers from 0 to n_groups like in your linked example?

If not, then I don't think there's any need to use a grouping operation here. Consider the following:

import apache_beam as beam

def make_unique_id(row):
  """
  Example function for extracting a unique ID from the row.

  You could pass the value to uuid.uuid5 to get a more standard format for the ID.
  """
  return ",".join([row[0], row[1]])

# The context manager runs the pipeline when the block exits.
with beam.Pipeline() as pipe:
  rows = (
      pipe
      | beam.io.ReadFromText('data.csv')
      | beam.Map(lambda x: x.split(","))
      # Prepend the ID so it becomes the column at index 0.
      | beam.Map(lambda x: [make_unique_id(x)] + x)
      | beam.Map(print)
  )
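For the UUID idea mentioned in the docstring above, here is a minimal sketch using `uuid.uuid5`, which hashes an arbitrary string into a deterministic UUID. The name `make_uuid_id` and the choice of `uuid.NAMESPACE_URL` as the namespace are assumptions for illustration; any fixed namespace UUID would work, as long as it stays the same across runs.

```python
import uuid

# Arbitrary fixed namespace (an assumption); it only needs to stay constant.
ID_NAMESPACE = uuid.NAMESPACE_URL


def make_uuid_id(row):
    """Derive a deterministic UUID string from the first two columns of a row."""
    name = ",".join([row[0], row[1]])
    # uuid5 hashes the namespace and name, so equal inputs give equal IDs.
    return str(uuid.uuid5(ID_NAMESPACE, name))
```

Rows that share the same first two columns get the same ID without any grouping step, so this still works element by element inside `beam.Map`.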

Jeff Klukas
  • Hi. Thanks for your response. Yes, I want the unique IDs to be integers like the ones shown in the post. Could you suggest a different way to achieve that? Thanks. – sk97 Feb 28 '23 at 13:15