
There is a dataset about 5 GB in size, with one key-value pair per line. The values now need to be looked up by key a few billion times.

I have already tried the disk-based approach of MapDB, but it throws a ConcurrentModificationException and doesn't seem mature enough to be used in a production environment yet.

I also don't want to keep it in a DB and make the call a billion times (though a certain level of in-memory caching could be done there).

Basically, I need to access this key-value dataset in the mapper/reducer of a Hadoop job step.

Amar
  • How many keys are there? What's the format, simple text like `key=value` or something binary? How does the data look, numeric/string keys? – Philipp Reichart Dec 04 '12 at 15:38
  • It's not that big. It will even fit in memory on an adequate machine. In that case you could put it in the distributed cache. – Jorge González Lorenzo Dec 04 '12 at 15:40
  • @PhilippReichart : To be precise, there are 103302034 keys. And yes, it is a CSV with one `key,value` pair per line. – Amar Dec 04 '12 at 15:45
  • @JorgeGonzálezLorenzo yeah, but distributed cache will only make a copy of this file in every node, it will not help me in accessing the values readily right? – Amar Dec 04 '12 at 15:46
  • How do the keys look? Are there duplicate keys? How exactly do you want to query them, i.e. read/access patterns? If keys can be represented as 32-bit integers, the whole key set would fit into <400 MB. – Philipp Reichart Dec 04 '12 at 16:20
  • No, there aren't duplicate keys. There might be duplicate values, but I don't think we can leverage that. I have the keys and would like to query directly with them. How do you propose to represent it as 32-bit integers? – Amar Dec 04 '12 at 16:22
  • If the keys are actually numbers (and just happen to be strings because of the CSV format), you could just `parseInt()` them and use some map optimized for int to do all this in memory. If you need to access the values by key, how do the values look (content/data format, any exploitable properties)? Could you describe how you would access the map using the keys: what goes in, what should come out? Maybe there's another more compact data structure available. – Philipp Reichart Dec 04 '12 at 16:37
  • Keys are mix of numbers and letters. A sample key would be like `AAAAA111111`, and the value would be like `abcdefghijklmno pqrstuvw`, a simple text with one or 2 words. – Amar Dec 04 '12 at 16:56
  • @Amar You may have a look at LinkedIn's Voldemort. If only lookup is needed you can create a Voldemort 'read-only store' from your data with a Hadoop job. We've used this set-up several times from a MR job as a KV lookup without any problem – Lorand Bendig Dec 04 '12 at 17:11
  • @Amar : you can populate an in-memory map in the setup of the task reading from the dist cache file. Of course this solution doesn't scale out as other solutions proposed here. – Jorge González Lorenzo Dec 04 '12 at 17:57
  • @LorandBendig : How easily would it blend in with Java code? – Amar Dec 05 '12 at 20:10
  • @JorgeGonzálezLorenzo : That is exactly what we do not want to do here, i.e. have everything in memory, as we do not have that much memory available. Thanks anyways. – Amar Dec 05 '12 at 20:12
  • @Amar once you have a running Voldemort cluster you can initialize the DB connection in setup(), do the lookup in map(), and close the connection in cleanup(). An example of how to establish a connection: https://github.com/voldemort/voldemort/blob/master/example/java/voldemort/examples/ClientExample.java – Lorand Bendig Dec 05 '12 at 20:32
  • @LorandBendig : We did not want to have a dedicated cluster always running for this purpose. – Amar Dec 13 '12 at 17:50

4 Answers


So after trying out a bunch of things we are now using SQLite.

Following is what we did:

  1. We load all the key-value pair data into a pre-defined database file (indexed on the key column; it increased the file size, but was worth it).
  2. Store this file (key-value.db) in S3.
  3. This file is then passed to the Hadoop jobs via the distributed cache.
  4. In the configure method of the Mapper/Reducer, a connection is opened to the db file (it takes around 50 ms); see the sketch below.
  5. In the map/reduce method, query this db with the key (the lookup time was negligible; we didn't even need to profile it).
  6. Close the connection in the cleanup method of the Mapper/Reducer.
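
A minimal sketch of the Mapper side, assuming the sqlite-jdbc driver is on the classpath and using placeholder names: the cache file is assumed to be available locally as key-value.db, the table as `kv` with columns `k` and `v`, and the input record's value is treated as the lookup key (the new mapreduce API's setup()/cleanup() stand in for the old API's configure()/close()):

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<Object, Text, Text, Text> {

    private Connection conn;
    private PreparedStatement lookup;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            // key-value.db is the SQLite file shipped via the distributed cache (path is illustrative)
            conn = DriverManager.getConnection("jdbc:sqlite:key-value.db");
            lookup = conn.prepareStatement("SELECT v FROM kv WHERE k = ?");
        } catch (SQLException e) {
            throw new IOException("Could not open SQLite lookup db", e);
        }
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            lookup.setString(1, value.toString()); // assumes the record value is the lookup key
            try (ResultSet rs = lookup.executeQuery()) {
                if (rs.next()) {
                    context.write(value, new Text(rs.getString(1)));
                }
            }
        } catch (SQLException e) {
            throw new IOException("Lookup failed", e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        try {
            if (lookup != null) lookup.close();
            if (conn != null) conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}
```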
Amar

Try Redis. It seems this is exactly what you need.
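
For reference, a per-key lookup with the Jedis client would look roughly like this (the host, port, and sample key are placeholders; as the comment below notes, Redis keeps the whole dataset in memory):

```java
import redis.clients.jedis.Jedis;

public class RedisLookup {
    public static void main(String[] args) {
        // assumes a Redis instance loaded with the key-value pairs and reachable from the task nodes
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            String value = jedis.get("AAAAA111111"); // sample key from the question
            System.out.println(value);
        }
    }
}
```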

AlexR
  • Redis is good, but AFAIK it doesn't have a lookup table built in. It is like memcache, isn't it? But what if you do not have much memory available? Apart from supporting certain data structures and atomic operations like increment/decrement, it won't help. Let me know if you think otherwise. – Amar Dec 04 '12 at 15:53

I would try Oracle Berkeley DB Java Edition. It supports Maps and is both mature and scalable.
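
A minimal read-only lookup sketch with Berkeley DB JE (com.sleepycat.je), assuming the store was pre-built on local disk; the environment path and database name are placeholders:

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class BdbLookup {
    public static void main(String[] args) {
        // open an existing environment read-only (path is a placeholder)
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setReadOnly(true);
        Environment env = new Environment(new File("/local/path/bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setReadOnly(true);
        Database db = env.openDatabase(null, "kv", dbConfig);

        // look up one key (sample key from the question)
        DatabaseEntry key = new DatabaseEntry("AAAAA111111".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry();
        if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(value.getData(), StandardCharsets.UTF_8));
        }

        db.close();
        env.close();
    }
}
```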

Peter Lawrey
  • How would it be useful in a mapred environment? Would I set up the DB every time the cluster is spawned? It works as follows here: a cluster of 10, 20, etc. machines is fired up, does some jobs, and all of them die once the jobs are done. I need to access these key-value pairs inside these jobs. And if we have a permanent DB set up somewhere else (in a different box), then the network latency, however small it is, will be difficult to contain, as the lookup happens a few billion times. – Amar Dec 04 '12 at 16:01
  • You are right that an RDBMS is not the best idea, but Berkeley DB doesn't work that way, which is why I suggested it. There are many other solutions (http://nosql-database.org/), but this is one of the most mature ones. – Peter Lawrey Dec 04 '12 at 16:04
  • Thanks Peter, let me check this out. – Amar Dec 04 '12 at 16:23
  • Out of interest, how many keys do you have and what is their average length? – Peter Lawrey Dec 04 '12 at 16:33
  • There are 103302034 keys, and they are all text with 11 characters. – Amar Dec 04 '12 at 16:53
  • Try HBase on your Hadoop nodes. – Thomas Jungblut Dec 04 '12 at 17:11
  • Could you store them in sorted order as fixed-length records and load them via memory-mapped files? To do a lookup you could do a binary search, which might take around a couple of microseconds. Or do you need to be able to modify the data as well? – Peter Lawrey Dec 04 '12 at 17:22
  • 1
    @PeterLawrey Thanks Peter :) your suggestion was very useful and we ended up with using SQLite and as you may know it uses memory mapped files internally. – Amar Dec 13 '12 at 17:55
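
For completeness, a sketch of the memory-mapped-file suggestion from the comment above, assuming the data was pre-sorted into fixed-length records (an 11-byte key followed by a fixed-width, space-padded value) and, for simplicity, that the file fits in a single mapping (under 2 GB; a larger file would need to be mapped in chunks):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLookup {
    private static final int KEY_LEN = 11;    // keys like "AAAAA111111"
    private static final int VALUE_LEN = 32;  // assumed fixed width, space-padded
    private static final int RECORD_LEN = KEY_LEN + VALUE_LEN;

    private final MappedByteBuffer buffer;
    private final long records;

    public MappedLookup(String path) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            // single mapping, so this sketch only handles files up to Integer.MAX_VALUE bytes
            buffer = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            records = ch.size() / RECORD_LEN;
        }
    }

    /** Binary search over the sorted fixed-length records; returns null if the key is absent. */
    public String get(String key) {
        byte[] target = key.getBytes(StandardCharsets.US_ASCII);
        byte[] probe = new byte[KEY_LEN];
        long lo = 0, hi = records - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            buffer.position((int) (mid * RECORD_LEN));
            buffer.get(probe);
            int cmp = compare(probe, target);
            if (cmp == 0) {
                byte[] value = new byte[VALUE_LEN];
                buffer.get(value); // value bytes follow the key in the same record
                return new String(value, StandardCharsets.US_ASCII).trim();
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }

    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }
}
```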

I noticed you tagged this with elastic-map-reduce... if you're running on AWS, maybe DynamoDB would be appropriate.

Also, I'd like to clarify: is this dataset going to be the input to your MapReduce job, or is it a supplementary dataset that will be accessed randomly during the MapReduce job?
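
If DynamoDB were the lookup store, a per-key read from the job would look roughly like this with the AWS SDK for Java v1 (the table name kv-lookup and attribute names k/v are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import com.amazonaws.services.dynamodbv2.model.GetItemResult;

public class DynamoLookup {
    public static void main(String[] args) {
        // uses the default credentials/region chain available on EMR nodes
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        Map<String, AttributeValue> key = new HashMap<>();
        key.put("k", new AttributeValue().withS("AAAAA111111")); // sample key from the question

        GetItemResult result = client.getItem(
                new GetItemRequest().withTableName("kv-lookup").withKey(key));
        if (result.getItem() != null) {
            System.out.println(result.getItem().get("v").getS());
        }
    }
}
```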

Joe K
  • We already tried it, but populating DynamoDB was taking way too much time! If you have tried it and know that 103302034 records can be inserted within a feasible time limit, please share how. – Amar Dec 05 '12 at 20:04
  • I believe they can be. You just need to provision very high write throughput and use well-threaded code to do it. DynamoDB definitely supports at least 10,000 writes/s, and possibly even higher if you contact AWS and request it. Just be sure the code that is populating it is either asynchronous or properly threaded and that it is writing records with a uniform distribution of keys (i.e., the order of keys is random). – Joe K Dec 05 '12 at 20:54
  • Yeah Joe, but we did try this in a threaded manner, and even used EMR to do it as explained in the following: http://stackoverflow.com/questions/10683136/amazon-elastic-mapreduce-mass-insert-from-s3-to-dynamodb-is-incredibly-slow But even with a write throughput of 1000 it wrote just 40K records in some 16 hours! – Amar Dec 13 '12 at 17:49