
I would like to join two files (expeditions- 2010s.csv and peaks.csv) on the key "peakid" using CoGroupByKey. However, when I sink the result to BigQuery the load job fails with:

RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_88_215864ba592a2e01f0c4e2157cc60c47_86e3562707f348c29b2a030cb6ed7ded failed. Error Result: <ErrorProto location: 'gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://bucket-name/input/temp/bq_load/ededcfb43cda4d16934011481e2fd774/project_name.dataset.expeditions/9fe30f70-8473-44bc-86d5-20dfdf59f502' reason: 'invalid'> [while running 'Write To BigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs'].

Please review the code below:

import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def read_csv_pd_input1(readable_file):
    # Imports stay inside the function so they are available on Dataflow workers.
    import apache_beam as beam
    import pandas as pd
    import csv
    import io

    # Read the CSV from GCS and keep only the columns needed downstream.
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'bcdate', 'smtdate']]
    
    # Key each row by peakid: yields (peakid, (bcdate, smtdate)) pairs.
    a = df.set_index('peakid')[['bcdate', 'smtdate']].apply(tuple, axis=1).to_dict()
    a = tuple(a.items())
    
    # Attempt: returns only the column names
    # a = df.agg(lambda x: (x.values)).apply(tuple)

    # Attempt: returns only the values, but not in the expected format
    # a = [tuple(x) for x in df.values]
    # a = tuple(a)
    return a

def read_csv_pd_input3(readable_file):
    # Imports stay inside the function so they are available on Dataflow workers.
    import apache_beam as beam
    import pandas as pd
    import csv
    import io

    # Read the CSV from GCS and keep only the columns needed downstream.
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    df = pd.DataFrame(csv_dict)
    df = df[['peakid', 'pkname', 'heightm']] 
    
    # Key each row by peakid: yields (peakid, (pkname, heightm)) pairs.
    a = df.set_index('peakid')[['pkname', 'heightm']].apply(tuple, axis=1).to_dict()
    a = tuple(a.items())
    
    return a


def run(argv=None):

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to read. This can be a local file or '
        'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/expeditions- 2010s.csv')
    
    parser.add_argument(
        '--input3',
        dest='input3',
        required=False,
        help='Input_p3 file to read. This can be a local file or '
        'a file in a Google Storage Bucket.',
        default='gs://bucket-name/input/peaks.csv')
     
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)

    p = beam.Pipeline(options=pipeline_options)
    input_p1 = (
        p
         | 'Read From GCS input1' >> beam.Create([known_args.input])
         | 'Key expeditions by peakid' >> beam.FlatMap(read_csv_pd_input1)
         # | beam.Map(print)
        
    )
    input_p3 = (
        p
         | 'Read From GCS input3' >> beam.Create([known_args.input3])
         | 'Key peaks by peakid' >> beam.FlatMap(read_csv_pd_input3)
    )
    # CoGroupByKey performs a relational join of two or more keyed PCollections; it accepts a dict of tagged PCollections.
    output = (
        {'input_p1': input_p1, 'input_p3': input_p3} 
        | 'Join' >> beam.CoGroupByKey()
        | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
           table='project_name:dataset.expeditions',
           schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
           method='FILE_LOADS',
           custom_gcs_temp_location='gs://bucket-name/input/temp',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)    
    )
    p.run().wait_until_finish()
    # runner = DataflowRunner()
    # runner.run_pipeline(p, options=options)

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
Nhu Dao

2 Answers


This part of the pipeline is wrong:

| 'Join' >> beam.CoGroupByKey()
| 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(...

The output of CoGroupByKey will have the format (key, {'input_p1': [list_of_p1_elems_with_key], 'input_p3': [list_of_p3_elems_with_key]}). You need to process that output to map it to the schema expected by the BigQuery sink.

Because the schema of the data does not match the schema specified in the BigQuery sink, the ingestion of data fails.

The Beam programming guide has an example of how to process the output of CoGroupByKey, and the transform catalog has an example too.
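For example, with the inputs from the question, one grouped element would look roughly like this (the peakid, dates, name, and height are made up for illustration; note that csv.DictReader leaves every value as a string):

('EVER',
 {'input_p1': [('2010-04-05', '2010-05-23')],   # (bcdate, smtdate) tuples from expeditions
  'input_p3': [('Everest', '8849')]})           # (pkname, heightm) tuples from peaks

Each such element has to be flattened into one dictionary per output row, with keys matching the column names of the sink schema.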

Israel Herraiz
  • Yes, I know. I just don't know how to format it the way the schema expects. – Nhu Dao Sep 09 '22 at 01:37
  • You can produce dictionaries with the schema's keys (`peakid`, `bcdate`, etc.). For the string and integer columns, just make sure you are writing strings or integers. For the date, you can produce strings in [YYYY-MM-DD format](https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_functions#cast_as_string) (see the conversion rules table in that link), and those will be [coerced](https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules#coercion) by BigQuery into dates at ingestion time. – Israel Herraiz Sep 09 '22 at 07:01
  • Could you please give me an example? – Nhu Dao Sep 13 '22 at 01:40
  • See the answer below with the code properly formatted; code in comments cannot be formatted. – Israel Herraiz Sep 13 '22 at 08:06

I am not sure exactly how the columns of p1 and p3 are meant to populate the BigQuery table, but other than that, after the beam.CoGroupByKey you could apply a beam.FlatMap with a function similar to this one (FlatMap rather than Map, because the function yields multiple rows per key):

def process_group(kv):
  key, values = kv
  input_p1_list = values['input_p1']
  input_p3_list = values['input_p3']
  for p1 in input_p1_list:
    for p3 in input_p3_list:
      # The dictionary keys must match the column names in the BigQuery schema.
      row_for_bq = {'peakid': key, 'bcdate': p1['something'], 'heightm': p3['something']}
      yield row_for_bq
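To wire that into the pipeline from the question, a sketch could look like the following (the 'To BQ rows' label is just a placeholder name; the WriteToBigQuery arguments are copied from the question). Make sure bcdate ends up as a 'YYYY-MM-DD' string and heightm as an integer so they match the DATE and INTEGER columns of the schema:

output = (
    {'input_p1': input_p1, 'input_p3': input_p3}
    | 'Join' >> beam.CoGroupByKey()
    | 'To BQ rows' >> beam.FlatMap(process_group)
    | 'Write To BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
        table='project_name:dataset.expeditions',
        schema='peakid:STRING,bcdate:DATE,pkname:STRING,heightm:INTEGER',
        method='FILE_LOADS',
        custom_gcs_temp_location='gs://bucket-name/input/temp',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

Since the reader functions in the question emit (bcdate, smtdate) and (pkname, heightm) tuples rather than dicts, the p1['something'] / p3['something'] lookups above would become tuple unpacking in your case, e.g. bcdate, smtdate = p1.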
Israel Herraiz