
Hi, I have developed an application where I have to store about 5 TB of data as an initial load, followed by roughly 20 GB of monthly incremental data (inserts/updates/deletes) arriving as XML that must be applied on top of that 5 TB. Finally, on request, I have to generate a full snapshot of all the data and create 5K text files based on business logic, so that each record lands in its respective file.

I have built this with HBase. I created 35 HBase tables, with anywhere from 10 to 500 regions each. My data sits in HDFS, and using MapReduce I bulk load it into the respective HBase tables.
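For reference, the bulk-load driver follows the usual HFileOutputFormat2 pattern. The input/output paths and the pipe-delimited line format assumed by the mapper below are placeholders, not my exact code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // Hypothetical mapper: assumes pipe-delimited lines of rowkey|qualifier|value
    public static class LineToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws java.io.IOException, InterruptedException {
            String[] f = line.toString().split("\\|", 3);
            Put put = new Put(Bytes.toBytes(f[0]));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(f[1]), Bytes.toBytes(f[2]));
            ctx.write(new ImmutableBytesWritable(Bytes.toBytes(f[0])), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "BulkLoadFundamentalAnalytic");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(LineToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path("/data/initial-load"));   // placeholder input dir
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));        // HFiles are written here

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("FundamentalAnalytic"));
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("FundamentalAnalytic"))) {
            // Sets the partitioner and reducer so each reducer writes HFiles for exactly one region
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            job.waitForCompletion(true);
        }
        // Afterwards the HFiles are handed to the region servers, e.g. with:
        // hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles FundamentalAnalytic
    }
}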

After that, I have a SAX parser application written in Java that parses every incoming incremental XML file and updates the HBase tables. The XML files arrive at roughly 10 files per minute, about 2000 updates in total, and the incremental messages are strictly in order.
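Roughly, the handler looks like the sketch below; the element names (record, column, value), the attributes (key, op) and the single column family cf are simplified placeholders for my real schema:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one <record key="..." op="..."> element per insert/update/delete
public class IncrementalXmlHandler extends DefaultHandler {
    private final Table table;
    private final StringBuilder text = new StringBuilder();
    private String rowKey, column, value, operation;

    public IncrementalXmlHandler(Connection conn) throws java.io.IOException {
        this.table = conn.getTable(TableName.valueOf("FundamentalAnalytic"));
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        text.setLength(0);
        if ("record".equals(qName)) {
            rowKey = attrs.getValue("key");
            operation = attrs.getValue("op");   // e.g. "upsert" or "delete"
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        try {
            if ("column".equals(qName)) {
                column = text.toString();
            } else if ("value".equals(qName)) {
                value = text.toString();
            } else if ("record".equals(qName)) {
                if ("delete".equals(operation)) {
                    table.delete(new Delete(Bytes.toBytes(rowKey)));
                } else {
                    Put put = new Put(Bytes.toBytes(rowKey));
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(column), Bytes.toBytes(value));
                    table.put(put);
                }
            }
        } catch (java.io.IOException e) {
            throw new SAXException(e);
        }
    }
}

The handler is driven by a plain javax.xml.parsers SAXParser, one file at a time, which keeps the updates in strict arrival order.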

Finally, on request, I run my last MapReduce application to scan all the HBase tables, create the 5K text files, and deliver them to the client.
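That job is essentially a TableMapper wired up with TableMapReduceUtil.initTableMapperJob, using MultipleOutputs to route each row to the file it belongs to. A simplified sketch, where the column family, qualifier and the fileFor routing rule are placeholders for the real business logic:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SnapshotExportMapper extends TableMapper<NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context ctx) {
        out = new MultipleOutputs<>(ctx);
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context ctx)
            throws IOException, InterruptedException {
        String line = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("payload")));
        String fileName = fileFor(row);   // decides which of the ~5K files gets this row
        out.write(NullWritable.get(), new Text(line), fileName);
    }

    private String fileFor(ImmutableBytesWritable row) {
        // placeholder: the real logic derives the file name from the rowkey / business rules
        return String.format("file-%04d", (row.hashCode() & Integer.MAX_VALUE) % 5000);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        out.close();
    }
}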

All three steps work fine, but when I went to deploy the application on the production server, which is a shared cluster, the infrastructure team would not allow it because I do a full table scan on HBase.

I am on a 94-node cluster, and the biggest HBase table I have holds approximately 2 billion rows. All the other tables have less than a million rows each.

The MapReduce job that scans everything and creates the text files takes about 2 hours in total.

Now I am looking for some other way to implement this.

I cannot use plain Hive tables, because I have record-level inserts/updates/deletes that have to be applied very precisely.

I have also tried integrating HBase and Hive tables, so that the HBase table handles the incremental data and Hive handles the full table scan. But because Hive goes through the HBase storage handler, I cannot create partitions on the Hive table, and that makes the Hive full table scan very slow, as much as 10 times slower than the HBase full table scan.

I cannot think of any solution right now and am kind of stuck. Please help me with some other solution where HBase is not involved.

Can I use Avro or Parquet files in this use case? I am not sure how Avro would support record-level updates, though.

Sudarshan kumar

2 Answers


I will answer my own question. My issue is that I do not want to perform a full table scan on HBase, because it affects region server performance; on a shared cluster especially, it hurts the read/write performance of HBase.

So my solution keeps using HBase, because it is very good for updates, especially delta updates, i.e. column-level updates.

To avoid the full table scan, I take a snapshot of the HBase table, export it to HDFS, and then run the full scan over the snapshot instead of the live table.

Here are the detailed steps of the process.

Create snapshot

snapshot 'FundamentalAnalytic','FundamentalAnalyticSnapshot'

Export the snapshot to HDFS

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot FundamentalAnalyticSnapshot -copy-to /tmp -mappers 16

Driver job configuration to run MapReduce on the HBase snapshot

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

String snapshotName = "FundamentalAnalyticSnapshot";
Path restoreDir = new Path("hdfs://quickstart.cloudera:8020/tmp");   // where the exported snapshot was copied
String hbaseRootDir = "hdfs://quickstart.cloudera:8020/hbase";       // HBase root dir, kept for reference

Scan scan = new Scan();   // Scan instance to control CF and attribute selection
Job job = Job.getInstance(HBaseConfiguration.create(), "ScanFundamentalAnalyticSnapshot");

TableMapReduceUtil.initTableSnapshotMapperJob(
        snapshotName,          // snapshot name
        scan,                  // scan instance
        DefaultMapper.class,   // mapper class
        NullWritable.class,    // mapper output key
        Text.class,            // mapper output value
        job,
        true,                  // add HBase dependency jars to the job
        restoreDir);           // temp directory where the snapshot is restored for the scan
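For completeness, a minimal version of the DefaultMapper referenced above could look like this; the column family and qualifier are placeholders, and it simply matches the NullWritable/Text output types declared in the driver:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Simplified sketch; the real mapper formats the rows destined for the 5K output files
public class DefaultMapper extends TableMapper<NullWritable, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context ctx)
            throws IOException, InterruptedException {
        String line = Bytes.toString(row.copyBytes()) + "|" +
                Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("payload")));
        ctx.write(NullWritable.get(), new Text(line));
    }
}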

Because the MapReduce job reads the snapshot files directly from HDFS, it skips the scan on the live HBase table entirely, so there is no impact on the region servers.

Sudarshan kumar

The key to using HBase efficiently is DESIGN. With a good design you will never have to do a full scan; that is not what HBase was made for. Instead you could have been doing a scan with a Filter, something HBase was built to handle efficiently.

I cannot review your design here, but I think you may have to revisit it.

The idea is not to design an HBase table the way you would design an RDBMS table; the key is designing a good rowkey. If your rowkey is well built, you should never need a full scan.
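To make that concrete, here is a sketch of the access pattern I mean (the table name, rowkey layout and column names are made up): with a composite rowkey such as entity|date you scan only one slice of the table and can narrow it further with a server-side filter instead of touching every region:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Table table = conn.getTable(TableName.valueOf("FundamentalAnalytic"))) {

    Scan scan = new Scan();
    // Rowkey designed as <entity>|<date> lets us read just one entity's slice
    scan.setStartRow(Bytes.toBytes("ENTITY123|20170101"));
    scan.setStopRow(Bytes.toBytes("ENTITY123|20171231"));
    // Optional server-side filter to narrow the result further before rows are shipped back
    scan.setFilter(new SingleColumnValueFilter(
            Bytes.toBytes("cf"), Bytes.toBytes("status"),
            CompareFilter.CompareOp.EQUAL, Bytes.toBytes("ACTIVE")));

    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
            // process only the matching rows
        }
    }
}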

You may also want to use a project like Apache Phoenix if you need to access your table by columns other than the rowkey. It also performs well; I have had good experience with Phoenix.
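For example (the ZooKeeper quorum, table and column names below are only illustrative), Phoenix lets you query the same HBase data with plain SQL over JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative only: the connection string and schema are placeholders
try (Connection conn = DriverManager.getConnection("jdbc:phoenix:quickstart.cloudera:2181:/hbase");
     PreparedStatement ps = conn.prepareStatement(
             "SELECT ROW_KEY, STATUS FROM FUNDAMENTAL_ANALYTIC WHERE STATUS = ?")) {
    ps.setString(1, "ACTIVE");
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getString(1) + " -> " + rs.getString(2));
        }
    }
}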

okmich
  • I do not have a performance problem. I get good performance; I have tried with 2 billion records (about 400 GB of data) and my job completes in 12 minutes. My concern is whether a full table scan will impact the cluster's performance if my application is deployed on a shared cluster. – Sudarshan kumar May 21 '17 at 14:36
  • Of course, a full table scan will affect performance. You want to try redesigning the rowkey and then using Filters. Or deploy Apache Phoenix, then access the HBase tables like you would any table in an RDBMS. – okmich May 21 '17 at 19:00