
I am trying to convert Bigtable table data to Avro GenericRecords using Dataflow. After the conversion is done, I have to compare the result with another dataset in a bucket. Below is my pseudocode; for the pipeline I have used:

    pipeline
        .apply("Read from Bigtable", BigtableIO.read())
        .apply("Transform Bigtable rows to Avro GenericRecords",
            ParDo.of(new TransformAvro(out.toString())))
        .apply("Compare to existing Avro file")    // not implemented yet
        .apply("Write back the data to Bigtable")  // not implemented yet

// The DoFn below is meant to convert each Bigtable element to a GenericRecord

    import com.google.bigtable.v2.Mutation;
    import com.google.protobuf.ByteString;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    public class BigTableToAvroFunction
        extends DoFn<KV<ByteString, Iterable<Mutation>>, GenericRecord> {

      @ProcessElement
      public void processElement(ProcessContext context) {
        KV<ByteString, Iterable<Mutation>> element = context.element();
        ByteString key = element.getKey();
        Iterable<Mutation> value = element.getValue();
        // TODO: build a GenericRecord from key/value and emit it
        // with context.output(record)
      }
    }

I am stuck here.

  • I don't think a [GenericRecord](https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/generic/GenericData.Record.html) is quite what you want to convert to here. If you already have Avro files you can generate a Java class from them and then do the transform from [HBase Result](https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html) to that class, or create your own record class for comparing. – Dan Jan 11 '19 at 14:48

1 Answer


It is unclear what you mean by comparing to existing data in a bucket. It depends on how you want to do the comparison, what the file size is, and probably other things. Examples of input vs output would help.

For example, if what you're trying to do is similar to a Join operation, you can try using CoGroupByKey (link to the doc) to join two PCollections, one read from Bigtable, the other read from Avro files in GCS, as in the sketch below.
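A minimal sketch of the join step, assuming both sides have already been keyed by a common String key (for example, the Bigtable row key decoded as UTF-8); the method name compareSides and the String output are made up for illustration:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.join.CoGbkResult;
    import org.apache.beam.sdk.transforms.join.CoGroupByKey;
    import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TupleTag;

    static PCollection<String> compareSides(
        PCollection<KV<String, GenericRecord>> fromBigtable,
        PCollection<KV<String, GenericRecord>> fromAvro) {
      // Tags identify which input each grouped record came from.
      final TupleTag<GenericRecord> bigtableTag = new TupleTag<>();
      final TupleTag<GenericRecord> avroTag = new TupleTag<>();
      return KeyedPCollectionTuple.of(bigtableTag, fromBigtable)
          .and(avroTag, fromAvro)
          .apply(CoGroupByKey.create())
          .apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              CoGbkResult result = c.element().getValue();
              // All records sharing this key, one Iterable per side.
              Iterable<GenericRecord> bigtableRows = result.getAll(bigtableTag);
              Iterable<GenericRecord> avroRows = result.getAll(avroTag);
              // Compare the two sides here and c.output(...) a report.
            }
          }));
    }

The GCS side can come from AvroIO.readGenericRecords(schema).from("gs://your-bucket/path/*.avro") followed by a WithKeys transform that extracts the same key.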

Or alternatively, if the file has a reasonable size (fits in memory), you can probably model it as a side input (link to the doc), along these lines:
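A minimal sketch, assuming the whole Avro file fits in a worker's memory; pipeline, schema, and the keyed bigtableRecords collection are assumed to exist already, and "key" is a hypothetical record field:

    import java.util.Map;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollectionView;

    // Read the existing Avro file and turn it into a Map-shaped side input.
    PCollectionView<Map<String, GenericRecord>> avroSide =
        pipeline
            .apply(AvroIO.readGenericRecords(schema)
                .from("gs://your-bucket/existing/*.avro"))
            .apply(WithKeys.of(new SerializableFunction<GenericRecord, String>() {
              @Override
              public String apply(GenericRecord record) {
                return record.get("key").toString(); // hypothetical key field
              }
            }))
            .apply(View.asMap());

    // Look up each Bigtable-derived record in the side input and compare.
    bigtableRecords.apply(ParDo.of(
        new DoFn<KV<String, GenericRecord>, GenericRecord>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            Map<String, GenericRecord> existing = c.sideInput(avroSide);
            GenericRecord match = existing.get(c.element().getKey());
            // Compare c.element().getValue() with match, output as needed.
          }
        }).withSideInputs(avroSide));

Note that View.asMap expects unique keys; if the file can contain duplicates, View.asMultimap is the safer choice.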

Or, ultimately, you can always use the raw GCS API to query the data in a ParDo and do everything manually, for example:
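A minimal sketch of that, using the google-cloud-storage client; the bucket and object names are made up, and the parsing/comparison is left as a comment:

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.DoFn;

    public class CompareAgainstGcsFn extends DoFn<GenericRecord, GenericRecord> {
      private transient Storage storage;

      @Setup
      public void setup() {
        // Create the GCS client once per DoFn instance, not per element.
        storage = StorageOptions.getDefaultInstance().getService();
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        // Hypothetical object; fetch whatever you need for the comparison.
        byte[] existing =
            storage.readAllBytes(BlobId.of("your-bucket", "existing/data.avro"));
        // Parse 'existing' (e.g. with Avro's DataFileReader), compare
        // against c.element(), then output the result.
        c.output(c.element());
      }
    }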

Anton
  • I am trying to write a function which reads data from Bigtable and converts the ByteString to a GenericRecord. – user10881000 Jan 07 '19 at 22:37
  • How do I join the Bigtable and Avro PCollections, since reading Bigtable gives a different datatype while the Avro side is GenericRecord? Any example would be very useful. – user10881000 Jan 08 '19 at 12:18