Every day I need to import a file containing the previous day's snapshot of a database. To import it, I run the following command in the shell:

./bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    '-Dimporttsv.separator=|' \
    -Dimporttsv.columns=HBASE_ROW_KEY,info:date,info:author,info:text \
    tableName \
    inputFile.tsv

The problem is that each line contains all the values, not just the updated ones, so every column ends up with several versions that all hold the same value.

Is there another way to import this daily snapshot that ignores the duplicate values? Or any suggestion for working around this?

Thank you!

nmc

1 Answer

I guess that if you really want to ignore existing values, you'd need to write your own map/reduce job instead of using the import program.

However, what's the problem with multiple versions? First, you can set the number of versions HBase keeps (when you define a column family). Second, when you read, you can read just the latest version. Lastly, if you are worried about storage, you can set up HBase to use compression.
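
A lighter alternative to a custom map/reduce job may be to filter the snapshot before handing it to ImportTsv. The sketch below is only an illustration: the function name `changed_rows` and the file names are made up, and it assumes both snapshots are available as plain local files that fit in memory. It keeps the lines of today's file that do not appear verbatim in yesterday's. A changed line is still imported whole, so its unchanged columns get a new version, and rows that disappeared from the snapshot are not handled.

```shell
# changed_rows OLD NEW (hypothetical helper): print the lines of NEW that
# do not appear verbatim in OLD, keeping the order they have in NEW.
changed_rows() {
    # First file: remember every line. Second file: print unseen lines only.
    # FILENAME == ARGV[1] is used instead of NR == FNR so that an empty OLD
    # file does not cause NEW to be swallowed as well.
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' "$1" "$2"
}

# Possible usage (file names are placeholders):
#   changed_rows snapshot-yesterday.tsv snapshot-today.tsv > changed.tsv
# and then run the ImportTsv command from the question on changed.tsv.
```

For snapshots too large to hold in memory, sorting both files and comparing them with `comm -13` is one possible substitute for the awk step.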

Arnon Rotem-Gal-Oz
  • I'm getting closer to that answer: writing my own map/reduce job to import my data. The multiple versions are not a real problem; as you said, I can save every version just by expanding my disk capacity. My question was how to avoid saving the same value twice, which leaves several versions of the same value in the same cell. – nmc Aug 21 '12 at 21:22
  • What I said is that if you use compression you don't have to worry much about capacity, and if you set versions to 1 for the column family, HBase will remove duplicates upon compaction. – Arnon Rotem-Gal-Oz Aug 22 '12 at 03:58
  • My concern is not about disk space. I'm working on a project that needs to know the state of the "world" at some point in time, e.g. "for this cell, what was the value a month ago?". For this I need the cell to be versioned. But if I update my cell every single day without it really changing (because the value is the same, for example), I'm wasting versions :/ Do you think it is not a problem, in terms of HBase performance, to set the maximum versions of a column family to, let's say, 365 (enough to hold a year of history)? Thank you for your help! – nmc Aug 22 '12 at 09:19
  • For HBase, every "cell" (key + column family + qualifier + timestamp + version + value) is a new row, and as far as I can tell it shouldn't matter. In any event, you can always add the update time (max - update time, actually) as part of the key and not use HBase versions. A get by prefix would get you the latest, a full key would get you a specific version, and you won't be limited to 365 versions (or you can set a time to live of 365 days and HBase will clear old values). – Arnon Rotem-Gal-Oz Aug 22 '12 at 20:01
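
The "max - update time" key from the last comment could be built roughly as follows. This is a sketch under assumptions, not something taken from the thread: the helper name `versioned_key`, the `_` separator (chosen because `|` is already the ImportTsv field separator in the question), and the use of epoch milliseconds with the largest signed 64-bit value as "max" are all illustrative choices. The reversed timestamp is zero-padded to 19 digits so that, for a given row id, newer updates sort first in byte order.

```shell
# versioned_key ROW_ID EPOCH_MILLIS (hypothetical helper): print a row key
# of the form ROW_ID_<reversed timestamp>.
versioned_key() {
    # Assumption: "max" is the largest signed 64-bit integer, so a later
    # update time gives a smaller reversed value.
    reversed=$((9223372036854775807 - $2))
    # Zero-pad to a fixed 19 digits so string order matches numeric order.
    printf '%s_%019d\n' "$1" "$reversed"
}

# Possible usage: key for row "row1" updated at 1345665660000 ms since epoch.
#   versioned_key row1 1345665660000
```

With keys like this, a scan starting at the `row1_` prefix reaches the most recent update first, and each daily import would only need to write a new key when the value actually changed.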