
A sample code skeleton is as follows: I read an RDD from BigQuery and select all data points where the my_field_name value is null.

    JavaPairRDD<String, GenericData.Record> input = sc
            .newAPIHadoopRDD(hadoopConfig, AvroBigQueryInputFormat.class, LongWritable.class, GenericData.Record.class)
            .mapToPair(tuple -> {
                GenericData.Record record = tuple._2;
                // Problematic!! This should read the my_field_name value of this BQ row,
                // but what comes back makes no sense.
                Object rawValue = record.get(my_field_name);
                String partitionValue = rawValue == null ? "EMPTY" : rawValue.toString();
                return new Tuple2<String, GenericData.Record>(partitionValue, record);
            }).cache();
    JavaPairRDD<String, GenericData.Record> emptyData =
            input.filter(tuple -> StringUtils.equals("EMPTY", tuple._1));
    emptyData.values().saveAsTextFile(my_file_path);

However, the output RDD is totally unexpected. In particular, the value of my_field_name looks random. After a little debugging, it seems the filtering does what is expected; the problem is that the value I extract from the GenericData.Record (basically record.get(my_field_name)) looks random.
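For illustration, a probe along these lines is enough to show the mismatch (record.toString() serializes the whole Avro row as JSON, so its my_field_name value should agree with the extracted key, but it doesn't):

    input.take(5).forEach(tuple -> {
        // The key was extracted via record.get(my_field_name) in mapToPair above;
        // toString() re-serializes the full record, so the two should agree.
        System.out.println("extracted key: " + tuple._1);
        System.out.println("full record:   " + tuple._2);
    });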

After I switched from AvroBigQueryInputFormat to GsonBigQueryInputFormat to read the BigQuery rows as JSON instead, the same code works correctly.
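For reference, here is a minimal sketch of the JSON variant (assuming GsonBigQueryInputFormat's standard LongWritable/JsonObject key/value types; only the field extraction really changes):

    import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat;
    import com.google.gson.JsonElement;
    import com.google.gson.JsonObject;

    JavaPairRDD<String, JsonObject> jsonInput = sc
            .newAPIHadoopRDD(hadoopConfig, GsonBigQueryInputFormat.class, LongWritable.class, JsonObject.class)
            .mapToPair(tuple -> {
                JsonObject row = tuple._2;
                // With Gson, absent fields come back as null and SQL NULLs as JsonNull.
                JsonElement rawValue = row.get(my_field_name);
                String partitionValue = (rawValue == null || rawValue.isJsonNull())
                        ? "EMPTY" : rawValue.getAsString();
                return new Tuple2<String, JsonObject>(partitionValue, row);
            }).cache();

With this variant, row.get(my_field_name) returns the value I expect for every row.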

Ideally, though, I really want to use Avro (which should be much faster than handling JSON), but its current behavior in my code is baffling. Am I just using AvroBigQueryInputFormat wrong?

