Use Apache Beam to GroupByKey and read_csv and export to ndjson

Question

I have a csv file without any headers that contains unique transaction IDs of various properties that have been bought and sold. Details of the dataset can be found here under 2022. I am trying to use Apache Beam to construct a new key that groups the columns as done in this post. Although Beam has the DataFrame API, I am unsure whether it is scalable and to what extent I can use Pandas commands to conduct data transformation on the pipeline. I used Beams transformation GroupByKey as follows:

import apache_beam as beam

p = beam.Pipeline()

id = (p
      |beam.io.ReadFromText('pp-2022.csv') #Reads the file as a PCollection
            |beam.Map(lambda x:x.split(",")) #Splits it based on the commas and returns a list of strings
            |beam.Map(lambda x: (x[3] + ',' + x[7],x)) #Groups the 3rd column which is the 'Postcode' and 7th column 'House No. or PAON'
            |beam.GroupByKey()
            |beam.Map(print)
)

Upon running the pipeline I get this PCollection :

('"TA24 5SZ","THE HOPCOTT"', [['"{EC7AD09A-350F-9200-E053-6C04A8C0E306}"', '"139000"', '"2022-08-26 00:00"', '"TA24 5SZ"', '"F"', '"N"', '"L"', '"THE HOPCOTT"', '"FLAT 8"', '"HOPCOTT ROAD"', '""', '"MINEHEAD"', '"SOMERSET WEST AND TAUNTON"', '"SOMERSET"', '"A"', '"A"'], 
['"{E53EDD2E-3DD4-83EC-E053-6B04A8C03A59}"', '"162500"', '"2022-06-17 00:00"', '"TA24 5SZ"', '"F"', '"N"', '"L"', '"THE HOPCOTT"', '"FLAT 11"', '"HOPCOTT ROAD"', '""', '"MINEHEAD"', '"SOMERSET WEST AND TAUNTON"', '"SOMERSET"', '"A"', '"A"']])

Although this groups it accurately, I'm wondering how I would be able to:

Deal with missing values as one of the columns I'm grouping with has multiple missing values?
I want the pipeline to be able to convert this to a newline delimited JSON format or .ndjson where the output would look something like

"TA24 5SZ","THE HOPCOTT", [{"{EC7AD09A-350F-9200-E053-6C04A8C0E306}", "139000", "2022-08-26 00:00", "TA24 5SZ", "F", "N", "L", "THE HOPCOTT", "FLAT 8", "HOPCOTT ROAD", "", "MINEHEAD", '"SOMERSET WEST AND TAUNTON"', '"SOMERSET"', '"A"', "A"}, 
      {{E53EDD2E-3DD4-83EC-E053-6B04A8C03A59}"', '"162500"', '"2022-06-17 00:00"', '"TA24 5SZ"', '"F"', '"N"', '"L"', '"THE HOPCOTT"', '"FLAT 11"', '"HOPCOTT ROAD"', '""', '"MINEHEAD"', '"SOMERSET WEST AND TAUNTON"', '"SOMERSET"', '"A"', '"A"'}]

This means, for property "TA24 5SZ","THE HOPCOTT", it has two transactions records in an array, each transaction record is in a {} (this format can vary, as long as it's a newline delimited JSON file).

Is there a way to obtain this using the Beam DataFrame API and is scalable? If so, how?

I have never used Beam before so any input/advice will help! Thank you.

Use Apache Beam to GroupByKey and read_csv and export to ndjson

0 Answers0