I have a large dataset in GCS, stored as JSON, that I need to load into BigQuery. The problem is that the data is not newline-delimited JSON (NDJSON) but a few large JSON files, where each top-level key should really become a field of its own record.
For example, the following JSON:
{
  "johnny": {
    "type": "student"
  },
  "jeff": {
    "type": "teacher"
  }
}
should be converted into
[
  {
    "name": "johnny",
    "type": "student"
  },
  {
    "name": "jeff",
    "type": "teacher"
  }
]
I am trying to solve this with Google Cloud Dataflow and Apache Beam, but the performance is terrible, since each worker ends up doing a huge amount of work for a single element:
import json

import apache_beam as beam

class JsonToNdJsonDoFn(beam.DoFn):
    def __init__(self, pk_field_name):
        self.__pk_field_name = pk_field_name

    def process(self, line):
        # `line` is the entire contents of one JSON file, so a single element can carry millions of keys.
        for key, record in json.loads(line).items():
            record[self.__pk_field_name] = key
            yield record
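For context, this is roughly how the DoFn is wired into the pipeline today (the bucket paths and step names here are just placeholders); each matched file is read in full and handed to the DoFn as a single element:

import json

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (
        p
        | "Match files" >> fileio.MatchFiles("gs://my-bucket/raw/*.json")
        | "Read matches" >> fileio.ReadMatches()
        | "Read whole file" >> beam.Map(lambda f: f.read_utf8())
        | "Explode records" >> beam.ParDo(JsonToNdJsonDoFn(pk_field_name="name"))
        | "To NDJSON lines" >> beam.Map(json.dumps)
        | "Write NDJSON" >> beam.io.WriteToText("gs://my-bucket/ndjson/out")
    )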
I know that this can somehow be solved by implementing it as a SplittableDoFn, but the Python implementation examples I have found are not really clear. How should I build this DoFn as a splittable one, and how would it be used as part of the pipeline?
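For reference, this is the rough skeleton I have pieced together so far (the class names are my own, and the whole file still gets parsed inside initial_restriction just to count the keys, and again in process). I am not sure it is correct, or whether this is the intended way to use the restriction tracker:

import json

import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
from apache_beam.transforms.core import RestrictionProvider

class JsonKeysRestrictionProvider(RestrictionProvider):
    # Restriction over the index range of the top-level keys of one file.
    def initial_restriction(self, element):
        # Parses the whole file once just to count its keys.
        return OffsetRange(0, len(json.loads(element)))

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()

class JsonToNdJsonSplittableDoFn(beam.DoFn):
    def __init__(self, pk_field_name):
        self.__pk_field_name = pk_field_name

    def process(self, element,
                tracker=beam.DoFn.RestrictionParam(JsonKeysRestrictionProvider())):
        # The file is parsed again for every claimed restriction, which seems wasteful.
        items = list(json.loads(element).items())
        restriction = tracker.current_restriction()
        for position in range(restriction.start, restriction.stop):
            if not tracker.try_claim(position):
                return
            key, record = items[position]
            record[self.__pk_field_name] = key
            yield record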