2

Is there any API available to delete a specific HBase cell using Spark (Scala)? We are able to read and write using the Spark-HBase Connector. Any suggestion for cell deletion is highly appreciated.

1 Answer

2

Here is an implementation for deleting HBase Cell objects using Spark (I demonstrated it using parallelize; you can adjust it to your RDD of Cells).

General idea: removal in chunks. It iterates through each RDD partition, splits the partition into chunks of 100,000 Cells, converts each Cell to an HBase Delete object, and then calls table.delete() to perform the deletion from HBase.

import java.io.IOException;
import java.util.List;

import com.google.common.collect.Iterators;
import com.google.common.collect.Lists;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.spark.api.java.JavaSparkContext;

public void deleteCells(List<Cell> cellsToDelete) {

    JavaSparkContext sc = new JavaSparkContext();

    sc.parallelize(cellsToDelete)
        .foreachPartition(cellsIterator -> {
            int chunkSize = 100000; // Will contact HBase only once per 100,000 records

            Configuration config = HBaseConfiguration.create();
            config.set("hbase.zookeeper.quorum", "YOUR-ZOOKEEPER-HOSTNAME");

            // try-with-resources ensures the connection and table are closed for each partition
            try (Connection connection = ConnectionFactory.createConnection(config);
                 Table table = connection.getTable(TableName.valueOf(config.get("YOUR-HBASE-TABLE")))) {

                // Split the given cells iterator to chunks
                Iterators.partition(cellsIterator, chunkSize)
                    .forEachRemaining(cellsChunk -> {
                        // Convert each Cell to a Delete that targets exactly that cell version
                        List<Delete> deletions = Lists.newArrayList(cellsChunk
                                .stream()
                                .map(cell -> new Delete(cell.getRowArray(), cell.getRowOffset(), cell.getRowLength())
                                        .addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell), cell.getTimestamp()))
                                .iterator());

                        try {
                            table.delete(deletions);
                        } catch (IOException e) {
                            logger.error("Failed to delete a chunk due to inner exception: " + e);
                        }
                    });
            } catch (IOException e) {
                logger.error("Failed to connect to HBase due to inner exception: " + e);
            }
        });
}

Disclaimer: this exact snippet was not tested, but I have used the same method for removal of billions of HBase Cells using Spark.
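
Since the question asks for Scala, here is a rough Scala sketch of the same chunked approach. It is a sketch only (not tested), assuming the HBase 1.x client API and that your Cell objects can be serialized by Spark (e.g. with Kryo); the ZooKeeper quorum and table name are placeholders:

import java.io.IOException

import org.apache.hadoop.hbase.{Cell, CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete}
import org.apache.spark.SparkContext

def deleteCells(sc: SparkContext, cellsToDelete: Seq[Cell]): Unit = {
  sc.parallelize(cellsToDelete).foreachPartition { cellsIterator =>
    val chunkSize = 100000 // contact HBase only once per 100,000 records

    val config = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", "YOUR-ZOOKEEPER-HOSTNAME") // placeholder

    val connection = ConnectionFactory.createConnection(config)
    val table = connection.getTable(TableName.valueOf("YOUR-HBASE-TABLE")) // placeholder

    try {
      // Split the partition's iterator into chunks and issue one batched delete per chunk
      cellsIterator.grouped(chunkSize).foreach { cellsChunk =>
        val deletions = new java.util.ArrayList[Delete]()
        cellsChunk.foreach { cell =>
          deletions.add(
            new Delete(cell.getRowArray, cell.getRowOffset, cell.getRowLength)
              .addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell), cell.getTimestamp))
        }
        try {
          table.delete(deletions)
        } catch {
          case e: IOException =>
            // log and continue instead of failing the whole partition
            System.err.println(s"Failed to delete a chunk: $e")
        }
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}

Note that addColumn(family, qualifier, timestamp) deletes the exact version carrying that timestamp, which is why the sketch passes cell.getTimestamp rather than the current time.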

imriqwe
  • Thank you!! I will try the same in Scala. – knowledge-seeker Mar 24 '16 at 05:00
  • Sorry for taking so long to try it. I tried the code below in Scala, which executes without any error; however, it does not delete any data. I am wondering what I missed. `import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.client.Delete; import org.apache.hadoop.hbase.client.HTable; import org.apache.hadoop.hbase.util.Bytes; val conf = HBaseConfiguration.create(); val table = new HTable(conf, "mytable"); val delete = new Delete(Bytes.toBytes(1)); delete.deleteColumn(Bytes.toBytes("mycf"), Bytes.toBytes("name")); table.delete(delete);` – knowledge-seeker Apr 17 '16 at 14:13
  • Try to perform a Get action and see if you can retrieve the cells from HBase. – imriqwe Apr 17 '16 at 17:58
  • The Get operation works fine; I tried the code below. `val g = new Get(Bytes.toBytes("1")); val result = table.get(g); val value = result.getValue(Bytes.toBytes("mycf"),Bytes.toBytes("name")); val name = Bytes.toString(value);` The data in my table is as follows: `hbase(main):001:0> scan 'mytable' ROW COLUMN+CELL 1 column=mycf:name, timestamp=1460540729352, value=Name1 1 column=mycf:prg, timestamp=1460540729352, value=1` – knowledge-seeker Apr 18 '16 at 05:22
  • Good. Did you provide a timestamp when performing the Delete? – imriqwe Apr 18 '16 at 05:42
  • I tried the code below for providing the timestamp; still I get the same result. I am executing the code through spark-shell with CDH 5.5. I hope there are no issues with that. `val delete = new Delete(Bytes.toBytes(1)); delete.deleteColumn(Bytes.toBytes("mycf"), Bytes.toBytes("name"),1460540729352l); table.delete(delete);` – knowledge-seeker Apr 18 '16 at 06:01
  • And the value still exists if you perform a Get? Weird.. Looks ok. – imriqwe Apr 18 '16 at 06:08
  • I am able to delete the table but not the row or cells. :) – knowledge-seeker Apr 18 '16 at 06:19
  • And if you try to delete the entire row? – imriqwe Apr 18 '16 at 06:21
  • I tried with `val delete = new Delete(Bytes.toBytes(1));table.delete(delete)` as well as `val delete = new Delete(Bytes.toBytes(1),1460540729352l);table.delete(delete)`, but still no luck with deleting the row. – knowledge-seeker Apr 18 '16 at 06:32
  • Finally I am able to delete the cells and rows. I added the code below for cell deletion. `val marker = new KeyValue(rowKey, family, Bytes.toBytes("name"), HConstants.LATEST_TIMESTAMP, KeyValue.Type.DeleteColumn); table.delete(new Delete(rowKey).addDeleteMarker(marker));` Thanks a lot for all the help. :) – knowledge-seeker Apr 18 '16 at 08:31
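
For reference, the fix from the last comment can be written as a self-contained spark-shell snippet roughly like the sketch below (untested; the table name, column family, qualifier and row key mirror the ones shown in the comments, and the HBase 1.x Connection API is assumed):

import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, KeyValue, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("mytable"))

val rowKey = Bytes.toBytes("1") // the string key "1", matching the Get that worked in the comments above
val family = Bytes.toBytes("mycf")
val qualifier = Bytes.toBytes("name")

// A DeleteColumn marker at LATEST_TIMESTAMP masks every version of mycf:name for this row
val marker = new KeyValue(rowKey, family, qualifier,
  HConstants.LATEST_TIMESTAMP, KeyValue.Type.DeleteColumn)
table.delete(new Delete(rowKey).addDeleteMarker(marker))

table.close()
connection.close()

The DeleteColumn marker at HConstants.LATEST_TIMESTAMP masks all versions of the column for that row, not just the latest one.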