I am using HBase as the storage for data crawled by Apache Nutch. The storage is located at /data/hbase/webpage, where I can see many folders such as:
64b2feb30073eec24d9dba65d421e7f
482062bc554bd45bf198d9edea971a30
7c8a6eec12d9f6926a1d912be9a0ca81
c1f682541b8d1c0559de6df14ae84e2b
083b28ee75babc718cc28e66b98c9ff5
809eb4bb5f2be087e2c84a2f51d26653
and more...
Each of these folders contains further subfolders such as:
f h il mk mtdt ol p recovered.edits s
But that is not so important.
I am writing my own indexer for Nutch to move the crawled data from HBase to Solr. I need to send it to Solr in batches, because when I index everything at once I get an OutOfMemory exception.
Is it possible to get the batch ids from my HBase storage, so that I know which batch ids I have and can then pass them to the indexer?
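To make the goal concrete, here is a rough sketch in plain Java of what I am trying to do. Everything in it is hypothetical: the class BatchIndexSketch, the Page type, the method names and the sample batch ids are made up for illustration. In my real indexer the rows would come from the HBase webpage table and the sink would post documents to Solr. How to obtain the batch ids from HBase is exactly the part I am missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Consumer;

public class BatchIndexSketch {

    /** Hypothetical stand-in for one crawled row: its key and the batch id stored with it. */
    public static final class Page {
        final String key;
        final String batchId; // may be null if the row carries no batch id

        public Page(String key, String batchId) {
            this.key = key;
            this.batchId = batchId;
        }
    }

    /** Collects the distinct, non-null batch ids seen while streaming over the rows. */
    public static SortedSet<String> distinctBatchIds(Iterable<Page> rows) {
        SortedSet<String> ids = new TreeSet<>();
        for (Page p : rows) {
            if (p.batchId != null) {
                ids.add(p.batchId);
            }
        }
        return ids;
    }

    /**
     * Sends the keys belonging to one batch id to the sink in chunks of at most
     * chunkSize, so that only one chunk is held in memory at a time.
     * Returns the number of keys sent.
     */
    public static int indexBatch(Iterable<Page> rows, String batchId,
                                 int chunkSize, Consumer<List<String>> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> buffer = new ArrayList<>(chunkSize);
        int sent = 0;
        for (Page p : rows) {
            if (!batchId.equals(p.batchId)) {
                continue; // row belongs to another batch (or has no batch id)
            }
            buffer.add(p.key);
            if (buffer.size() == chunkSize) {
                sink.accept(new ArrayList<>(buffer)); // hand over a copy, then reuse the buffer
                sent += buffer.size();
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer)); // flush the last, partially filled chunk
            sent += buffer.size();
        }
        return sent;
    }

    public static void main(String[] args) {
        // Made-up sample rows; real rows would be read from HBase.
        List<Page> rows = Arrays.asList(
                new Page("row-1", "batch-A"),
                new Page("row-2", "batch-B"),
                new Page("row-3", "batch-A"));

        for (String id : distinctBatchIds(rows)) {
            // The sink here only prints; a real one would send the chunk to Solr.
            int n = indexBatch(rows, id, 2, chunk -> System.out.println(id + " -> " + chunk));
            System.out.println("sent " + n + " rows for " + id);
        }
    }
}
```

The point of the chunk size is that memory use is bounded by one chunk instead of the whole table, which is what I hope will avoid the OutOfMemory problem.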