
I am setting up a Java Pipeline in DataFlow to read a .csv file and to create a bunch of BigTable rows based on the content of the file. I see in the BigTable documentation the note that connecting to BigTable is an 'expensive' operation and that it's a good idea to do it only once and to share the connection among the functions that need it.

However, if I declare the Connection object as a public static variable in the main class and first connect to BigTable in the main function, I get a NullPointerException when I subsequently try to reference the connection from the processElement() function of my DoFn sub-classes as part of the DataFlow pipeline.

Conversely, if I declare the Connection as a static variable in the actual DoFn class, then the operation works successfully.

What is the best-practice or optimal way to do this?

I'm concerned that if I implement the second option at scale, I will be wasting a lot of time and resources. If I keep the variable as static in the DoFn class, is it enough to ensure that the APIs don't try to re-establish the connection every time?

I realize there is a special BigTable I/O call to sync DataFlow pipeline objects with BigTable, but I think I need to write my own DoFn so I can build some special logic into the processElement() function...

This is what the "working" code looks like:

class DigitizeBT extends DoFn<String, String>{
    private static Connection m_locConn;

    @Override
    public void processElement(ProcessContext c)
    {       
        try
        {
            // A new connection is created for every element here -- this is the expensive part
            m_locConn = BigtableConfiguration.connect("projectID", "instanceID");
            Table tbl = m_locConn.getTable(TableName.valueOf("TableName"));

            String rowKey = c.element();  // use the incoming element as the row key
            Put put = new Put(Bytes.toBytes(rowKey));

            put.addColumn(
                Bytes.toBytes("CF1"),
                Bytes.toBytes("SomeName"),
                Bytes.toBytes("SomeValue"));

            tbl.put(put);
        }
        catch (IOException e)
        {
            e.printStackTrace();
            System.exit(1);
        }
    }
}
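
To avoid reconnecting on every element (the concern above), what I have in mind is lazily initializing the static connection, something like the sketch below. This is only a sketch: the project/instance/table IDs are placeholders, and it assumes the Dataflow 1.x SDK plus the bigtable-hbase client.

import java.io.IOException;

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class DigitizeBT extends DoFn<String, String>
{
    // Shared across all instances of this DoFn in the same worker JVM
    private static Connection m_locConn;

    // Connect only once per JVM; subsequent calls reuse the existing connection
    private static synchronized Connection getConn()
    {
        if (m_locConn == null)
        {
            m_locConn = BigtableConfiguration.connect("projectID", "instanceID");
        }
        return m_locConn;
    }

    @Override
    public void processElement(ProcessContext c) throws IOException
    {
        Table tbl = getConn().getTable(TableName.valueOf("TableName"));

        Put put = new Put(Bytes.toBytes(c.element()));
        put.addColumn(
            Bytes.toBytes("CF1"),
            Bytes.toBytes("SomeName"),
            Bytes.toBytes("SomeValue"));
        tbl.put(put);
    }
}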

This is what the updated code looks like, FYI:

    public void SmallKVJob()
    {
        CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
                .withProjectId(DEF.ID_PROJ)
                .withInstanceId(DEF.ID_INST)
                .withTableId(DEF.ID_TBL_UNITS)
                .build();

        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setProject(DEF.ID_PROJ);
        options.setStagingLocation(DEF.ID_STG_LOC);
//      options.setNumWorkers(3);
//      options.setMaxNumWorkers(5);        
//      options.setRunner(BlockingDataflowPipelineRunner.class);
        options.setRunner(DirectPipelineRunner.class);
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.Read.from(DEF.ID_BAL))
        .apply(ParDo.of(new DoFn1()))
        .apply(ParDo.of(new DoFn2()))
        .apply(ParDo.of(new DoFn3(config)));

        m_log.info("starting to run the job");
        p.run();
        m_log.info("finished running the job");
    }
}

class DoFn1 extends DoFn<String, KV<String, Integer>>
{
    @Override
    public void processElement(ProcessContext c)
    {
        String[] parts = c.element().split(",");
        c.output(KV.of(parts[0], Integer.valueOf(parts[1])));
    }
}

class DoFn2 extends DoFn<KV<String, Integer>, KV<String, Integer>>
{
    @Override
    public void processElement(ProcessContext c)
    {
        int max = c.element().getValue();
        String name = c.element().getKey();
        for(int i = 0; i<max;i++)
            c.output(KV.of(name,  1));
    }
}

class DoFn3 extends AbstractCloudBigtableTableDoFn<KV<String, Integer>, String>
{   
    public DoFn3(CloudBigtableConfiguration config)
    {
        super(config);
    }

    @Override
    public void processElement(ProcessContext c) 
    {
        try
        {
            Integer max = c.element().getValue();
            for(int i = 0; i<max; i++)
            {
                String owner = c.element().getKey();
                String rnd = UUID.randomUUID().toString();  

                Put p = new Put(Bytes.toBytes(owner+"*"+rnd));
                p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
                getConnection().getTable(TableName.valueOf(DEF.ID_TBL_UNITS)).put(p);
                c.output("Success");
            }
        } catch (IOException e)
        {
            c.output(e.toString());
            e.printStackTrace();
        }
    }
}

The input .csv file looks something like this:
Mary,3000
John,5000
Peter,2000

So, for each row in the .csv file, I have to put x rows into BigTable, where x is the number in the second cell of that row...

  • As an update, I just did a test -- it took >15 mins to put in 10,000 simple rows into BigTable via a DataFlow pipeline, which had to read only 5 rows from a .csv file (each row read from .csv had to insert a few thousand rows into BigTable). DataFlow had to upgrade the job to 5 workers to accommodate. Clearly this is not optimal.... – VS_FF Dec 12 '16 at 11:26

1 Answer

We built AbstractCloudBigtableTableDoFn (Source & Docs) for this purpose. Extend that class instead of DoFn, and call getConnection() instead of creating a Connection yourself.

10,000 small rows should take a second or two of actual work.

EDIT: As per the comments, BufferedMutator should be used instead of Table.put() for best throughput.
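
As a rough sketch (not tested here), the DoFn3 from the question could be reworked along these lines, reusing its placeholders (DEF.ID_TBL_UNITS, DEF.ID_CF1) and the same imports, plus org.apache.hadoop.hbase.client.BufferedMutator and java.io.IOException:

class DoFn3 extends AbstractCloudBigtableTableDoFn<KV<String, Integer>, String>
{
    public DoFn3(CloudBigtableConfiguration config)
    {
        super(config);
    }

    @Override
    public void processElement(ProcessContext c) throws IOException
    {
        String owner = c.element().getKey();
        int max = c.element().getValue();

        // One BufferedMutator per element; mutations are buffered and sent to
        // BigTable in larger, asynchronous RPCs instead of one RPC per Put.
        try (BufferedMutator mutator =
                getConnection().getBufferedMutator(TableName.valueOf(DEF.ID_TBL_UNITS)))
        {
            for (int i = 0; i < max; i++)
            {
                Put p = new Put(Bytes.toBytes(owner + "*" + UUID.randomUUID()));
                p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
                mutator.mutate(p);
                c.output("Success");
            }
        }   // close() flushes any remaining buffered mutations
    }
}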

– Solomon Duskis
  • Tried this, still quite slow. Takes ~10 min to put in 10,000 rows as a DataFlow job using 3 workers. Surprisingly, it takes ~2 min to put in the same 10,000 rows as a local DataFlow job running from my own computer -- still pretty slow. What's interesting is that BigTable never showed a write load above 70 per sec, so it's not like a bigger cluster is needed. – VS_FF Dec 15 '16 at 18:17
  • I included the full updated code in the original post, including both DataFlow and BigTable components, in case you can see that I'm still doing something extremely inefficient... – VS_FF Dec 15 '16 at 18:23
  • I'd change `getConnection().getTable(TableName.valueOf(DEF.ID_TBL_UNITS))` to `BufferedMutator mutator = getConnection().getBufferedMutator(TableName.valueOf(DEF.ID_TBL_UNITS))`, created once outside of the for loop, and then call `mutator.mutate()` instead of `table.put()`. After the loop, call `mutator.close()` to send all of the data. Please be aware of the work needed to create the DataFlow workers (GCE instances); it will be significantly higher than the BigTable work done by the workers. It would be better to test with 1,000,000 rows rather than 10,000 for this. – Solomon Duskis Dec 15 '16 at 21:59
  • Table.put(put) takes 3-5 milliseconds each (conservatively) if you're doing small operations in GCE in the same zone. Mutator.mutate() combines multiple puts into a single RPC and performs the RPC asynchronously. If you're looking for overall throughput (i.e. the volume of operations matters more than the total time of each individual operation), BufferedMutator is the right solution. You can also use table.put(List) for smaller batches (up to 500?); a short sketch of that variant appears after these comments. – Solomon Duskis Dec 16 '16 at 03:26
  • Just to update for others' sake: changing to BufferedMutator brings an amazing performance improvement: 10,000 rows in 3.5 sec on average, running the job locally against the remote cluster. Haven't even bothered to test it as a server-based DataFlow job yet. You are right, of course, that starting the workers would be the longest operation, and that it only makes sense once the actual load is much higher. – VS_FF Dec 16 '16 at 10:30
  • one more question please -- is BufferedMutator safe for DataFlow? I see in documentation it's thread-safe, but is there any coordination among various DataFlow workers? If I do BufferedMutator.mutate(new Put(XYZ)) on worker1, is there a possibility that a similar operation on XYZ may be performed by another DataFlow worker before I do BufferedMutator.close() on worker1? – VS_FF Dec 28 '16 at 11:11
  • Each buffered mutator acts independently. Updating the same exact operation on two different buffered mutators in the same VM would be the same as running that same operation on different VMs. The end result is generally two different copies of the same data with different timestamps. The older copy is mostly ignored, and is removed when a major compaction occurs. – Solomon Duskis Dec 28 '16 at 15:10
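
For completeness, the Table.put(List) alternative mentioned in the comments might look roughly like this inside DoFn3's processElement() (a sketch only: it reuses the question's placeholders and the owner/max variables, needs java.util.List and java.util.ArrayList imports, and the 500-row batch size is just the rough figure from the comment, not a hard limit):

// Fragment of processElement() in DoFn3, sketching the Table.put(List) variant.
Table table = getConnection().getTable(TableName.valueOf(DEF.ID_TBL_UNITS));
List<Put> batch = new ArrayList<>();
for (int i = 0; i < max; i++)
{
    Put p = new Put(Bytes.toBytes(owner + "*" + UUID.randomUUID()));
    p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
    batch.add(p);
    if (batch.size() >= 500)
    {
        table.put(batch);   // one RPC for the whole batch
        batch.clear();
    }
}
if (!batch.isEmpty())
{
    table.put(batch);       // flush the remainder
}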