A sample skeleton of the code follows: I am basically reading an RDD from BigQuery and selecting out all data points where the my_field_name value is null.
import com.google.cloud.hadoop.io.bigquery.AvroBigQueryInputFormat;
import org.apache.avro.generic.GenericData;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, GenericData.Record> input = sc
    .newAPIHadoopRDD(hadoopConfig, AvroBigQueryInputFormat.class,
        LongWritable.class, GenericData.Record.class)
    .mapToPair(tuple -> {
        GenericData.Record record = tuple._2;
        // Problematic!! I want my_field_name of this BQ row, but what comes back makes no sense
        Object rawValue = record.get(my_field_name);
        String partitionValue = rawValue == null ? "EMPTY" : rawValue.toString();
        return new Tuple2<>(partitionValue, record);
    })
    .cache();
JavaPairRDD<String, GenericData.Record> emptyData =
    input.filter(tuple -> StringUtils.equals("EMPTY", tuple._1));
emptyData.values().saveAsTextFile(my_file_path);
However, the output RDD is completely unexpected. In particular, the value of my_field_name looks essentially random. After a little debugging, it seems the filtering does what is expected, but the problem is that the value I extract from the GenericData.Record (basically record.get(my_field_name)) appears to be random.
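For what it's worth, this is roughly the sanity check I used while debugging (take(10) is just an arbitrary sample): the partition key computed inside mapToPair says "EMPTY", yet re-reading the same field from the cached record prints some unrelated value.

// Rough debugging sketch against the cached `input` RDD from above.
// Expectation: for key "EMPTY" the field should read back as null,
// but the value actually printed looks random.
for (Tuple2<String, GenericData.Record> t : input.take(10)) {
    System.out.println("key=" + t._1 + ", field=" + t._2.get(my_field_name));
}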
After I switched from AvroBigQueryInputFormat to GsonBigQueryInputFormat to read the BigQuery rows as JSON instead, the same logic works correctly.
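For reference, the working Gson-based variant looks roughly like this (same sc, hadoopConfig, and my_field_name as above; I am not sure whether a null BigQuery value surfaces as a missing member or as JsonNull in the JsonObject, so I check both to be safe):

import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;

JavaPairRDD<String, JsonObject> jsonInput = sc
    .newAPIHadoopRDD(hadoopConfig, GsonBigQueryInputFormat.class,
        LongWritable.class, JsonObject.class)
    .mapToPair(tuple -> {
        JsonObject row = tuple._2;
        // With the Gson format the field reads back as expected; a null BQ
        // value may show up as a missing member or JsonNull, so check both.
        JsonElement rawValue = row.get(my_field_name);
        String partitionValue = (rawValue == null || rawValue.isJsonNull())
            ? "EMPTY" : rawValue.getAsString();
        return new Tuple2<>(partitionValue, row);
    })
    .cache();

With this version, the "EMPTY" partition contains exactly the rows whose my_field_name is null, as expected.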
Ideally, though, I really want to use Avro (which should be much faster than handling JSON), but its current behavior in my code is baffling. Am I just using AvroBigQueryInputFormat wrong?