1

Is there a hash implementation around that doesn't remember key values? I have to make a giant hash but I don't care what the keys are.

Edit:

Ruby's hash implementation stores the key's value. I would like a hash that doesn't remember the key's value. It just uses the hash function to store your value and forgets the key. The reason for this is that I need to make a hash for about 5 GB of data and I don't care what the key values are after creating it. I only want to be able to look up the values based on other keys.

Edit Edit:

The language is kind of confusing. By key's value I mean this:

hsh['value'] = data

I don't care what 'value' is after the hash function stores data in the hash.

Edit^3:

Okay so here's what I am doing: I am generating every 35-letter (nucleotide) kmer for a set of multiple genes. Each gene has an ID. The hash looks like this:

kmers = { 'A...G' => [1, 5, 3], 'G...T' => [4, 9, 9, 3]  }

So the hash key is the kmer, and the value is an array containing IDs for the gene(s)/string(s) that have that kmer.

I am querying the hash for kmers in another dataset to quickly find matching genes. I don't care what the hash keys are, I just need to get the array of numbers from a kmer.
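A minimal sketch of how the index described above could be built with an ordinary Ruby Hash, assuming the genes are available as an ID-to-sequence map. The `genes` sample data, the `build_kmer_index` helper, and the short `k` are hypothetical and only keep the example small; note that this plain Hash still stores every kmer as a key.

```ruby
# Build a kmer => [gene IDs] index.
# `genes` maps a gene ID to its nucleotide string (hypothetical sample data).
def build_kmer_index(genes, k)
  index = Hash.new { |hash, kmer| hash[kmer] = [] }
  genes.each do |id, sequence|
    # Slide a window of length k over the sequence.
    (0..sequence.length - k).each do |start|
      ids = index[sequence[start, k]]
      ids << id unless ids.last == id  # record each gene once per kmer
    end
  end
  index
end

genes = { 1 => 'ACGTAC', 2 => 'CGTACG' }
kmers = build_kmer_index(genes, 4)  # k = 4 here; the question uses k = 35
p kmers['CGTA']            #=> [1, 2]
p kmers.fetch('TTTT', [])  #=> []
```

Using `fetch` with a default for queries avoids adding an empty entry to the index for every kmer that is not found.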

>> kmers['A...G']
=> [1, 5, 3]

>> kmers.keys.first
=> "Sorry Dave, I can't do that"
Austin Richardson
  • Could you explain what you want exactly? And in which way arrays fail to provide what you want? – sepp2k May 26 '11 at 17:33
  • 2
    Why don't you use an array instead? – bor1s May 26 '11 at 17:34
  • 3
    Are you looking for a set? Perhaps an array? – maerics May 26 '11 at 17:35
  • 1
    Do you only want to check if an element is in there/iterate all or do you still need to access specific elements by their unique (unknown) keys? – J-_-L May 26 '11 at 17:37
  • 1
    Are you looking for a container that allows you to look up by key but won't store the key? I don't think this will work with a hash: even though it stores the values according to the hash values of the keys, the hash value alone is not enough for lookup because hash collisions are usually inevitable. – Tilman Vogel May 26 '11 at 17:50
  • Is there an associative bloom filter? – Austin Richardson May 26 '11 at 17:57
  • 1
    This is still really confusing. Can you provide a real example of what you need to store and the operations that you need to make over that collection ? – Roberto Decurnex May 26 '11 at 17:59
  • 1
    Yes, please describe what you want to do with the hash after it has been populated. – Tilman Vogel May 26 '11 at 18:02

5 Answers

4

I guess you want a set, although it stores unique keys and no values. It has the fast lookup time of a hash. Set is included in the standard library.

require 'set'
s = Set.new
s << 'aaa'
p s.merge(['ccc', 'ddd'])  #=> #<Set: {"aaa", "ccc", "ddd"}>
steenslag
2

Even if there was an oddball hash that just recorded existence (which is how I understand the question) you probably wouldn't want to use it, as the built-in Hash would be simpler, faster, not require a gem, etc. So just set...

 h[k] = k

...and call it a day...

DigitalRoss
  • I have to hash every unique 35-letter kmer of a 5 gig string. I was hoping there'd be a nice C implementation already made for me :) But I guess I will just have to try writing one now. – Austin Richardson May 26 '11 at 17:56
  • 1
    5 GB? Wow. In any case, look at Ruby Inline for a neat gem that should make writing your custom C code easier. A good example of the use of Ruby Inline is ImageScience. – DigitalRoss May 26 '11 at 18:07
  • Got it, you need a hash map implementation with a hash function instead of these simple stored keys. That's really nice actually. I will try to dig a little more and see if there's something like that for Ruby. – Roberto Decurnex May 26 '11 at 18:14
  • @NeX, I don't know Ruby (apart from its name). Are you saying that Ruby hashes do not use a hash function by default? I.e. a hash is nothing but an array of pairs? Unless it's doing some other trick, performance would be very slow. – Tilman Vogel May 26 '11 at 18:42
  • Performance is not really the problem here. Memory usage may be drastically improved if the key is not stored but just used to generate some sort of unique reference. Again, some constraints must exist in order to make this happen. Using any object as a key is really great but does not really apply to these scenarios. – Roberto Decurnex May 26 '11 at 20:03
  • @Tilman Vogel: You are right, Ruby Hashes do use a hash function. And they are [much faster](http://stackoverflow.com/questions/5551168/performance-of-arrays-and-hashes-in-ruby/5552062#5552062) than arrays. – steenslag May 26 '11 at 21:05
  • @DigitalRoss: Based on Austin's mention of kmers, I think he's talking about looking at the entire genome of an organism. – Andrew Grimm May 26 '11 at 22:59
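One way to read the "unique reference under constraints" idea from the comments above is to pack each kmer into an integer, 2 bits per base. The sketch below is an illustration only, not something proposed in the thread: the `BASE_CODES` table and `encode_kmer` helper are hypothetical names, and it assumes every kmer has the same length and contains only A, C, G and T.

```ruby
# Pack a nucleotide kmer into an Integer, 2 bits per base.
# Assumes fixed-length kmers made only of A, C, G and T (no N, no lowercase).
BASE_CODES = { 'A' => 0, 'C' => 1, 'G' => 2, 'T' => 3 }.freeze

def encode_kmer(kmer)
  kmer.each_char.reduce(0) do |packed, base|
    (packed << 2) | BASE_CODES.fetch(base)  # raises KeyError on other letters
  end
end

kmers = Hash.new { |hash, key| hash[key] = [] }
kmers[encode_kmer('ACGT')] << 1

p encode_kmer('ACGT')         #=> 27
p kmers[encode_kmer('ACGT')]  #=> [1]
```

For a fixed length, distinct kmers map to distinct integers, so no collisions are introduced. A 35-base kmer needs 70 bits, which is more than Ruby keeps in an immediate integer on typical 64-bit builds, so the memory saving in pure Ruby is not guaranteed; the encoding is more likely to pay off in a C extension or a packed array.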
1

I assume the 5 GB string is a genome, and the kmers are 35-base-pair nucleotide sequences.

What I'd probably do (slightly simplified) is:

require 'set'

human_genome = File.read("human_genome.txt")
human_kmers = Set.new
# String has no each_cons, so slide a 35-character window over each_char
human_genome.each_char.each_cons(35) do |chars|
  human_kmers << chars.join  # Set#<< already ignores duplicates
end
unknown_gene = File.read("unknown_gene.txt")
related_to_humans = unknown_gene.each_char.each_cons(35).any? do |chars|
  human_kmers.include?(chars.join)
end
Andrew Grimm
  • It's a good real-life example (sadly we are still waiting for Austin to provide more information). The only problem here is that it looks like this guy will have keys that do not match the stored values, so we still need a key => value collection :S – Roberto Decurnex May 27 '11 at 04:34
  • Close, it's a database of genes. I need to know which specific gene has a kmer. So I can't just check if a set includes it or not. I could create sets for each gene but that would take too long. – Austin Richardson May 27 '11 at 17:29
0

I have to make a giant hash but I don't care what the keys are.

That is called an array. Just use an array. A hash without keys is not a hash at all and loses its value. If you don't need key-value lookup then you don't need a hash.

Ed S.
  • No, that is called a hash. Ruby's hash implementation stores the key. I need a hash that doesn't store the key but instead just hashes a query and returns the value. – Austin Richardson May 26 '11 at 17:48
  • @Austin: You can't just "hash the query and return the value" - you still need to check that the keys are actually equal (and equal hash value does not tell you that) and for that you need to store the keys. – sepp2k May 26 '11 at 17:54
  • 1
    Yeah, the question just doesn't make any sense. I don't know how to help you. If you need to look up a value by a key you will obviously need to store the key. – Ed S. May 26 '11 at 18:08
  • That's not true. In the current Ruby implementation of Hash the key needs to be stored (since it accepts any object as a key, the object reference will exist until the hash is destroyed), but Austin seems to be looking for another kind of key => value collection. Leaving aside how Hash works in Ruby, we can talk about a hash key, a hash function and a hash table. Given a hash function we should be able to submit a key (with some constraints), process it dynamically with the hash function and access a record of the collection based on the hash function result. – Roberto Decurnex May 27 '11 at 13:17
  • @NeX: Sure, you could put it into an array index based on the hash and then discard it. You have to remember that my response was based off of the original post before any edits. – Ed S. May 27 '11 at 15:24
  • @NeX: And how would that approach handle hash collisions without remembering the key? – sepp2k May 27 '11 at 15:34
  • @sepp2k: It wouldn't; it would happily stomp out the object as long as the hash gave the same result. – Ed S. May 27 '11 at 15:37
  • Exactly, but if we define some key constraints (let's suppose string keys with a max length of 8) we will be saving memory + preventing the collision. Again we are discussing without a defined scenario. – Roberto Decurnex May 27 '11 at 15:48
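The collision problem raised in the comments above can be sketched as follows. This is a toy illustration, not an existing Ruby library: the `ForgetfulTable` class and its deliberately weak byte-sum hash are hypothetical, chosen so that a collision is easy to reproduce.

```ruby
# A toy table that hashes the key to a slot and then forgets the key.
class ForgetfulTable
  def initialize(slots)
    @values = Array.new(slots)
  end

  def []=(key, value)
    @values[slot_for(key)] = value
  end

  def [](key)
    @values[slot_for(key)]
  end

  private

  # Deliberately weak hash (sum of bytes) so collisions are easy to show.
  def slot_for(key)
    key.bytes.sum % @values.length
  end
end

table = ForgetfulTable.new(16)
table['AC'] = [1, 5, 3]
table['CA'] = [4, 9]  # same bytes, same slot: silently overwrites 'AC'
p table['AC']         #=> [4, 9]
p table['BB']         #=> [4, 9], although 'BB' was never stored
```

Without the stored key there is nothing to compare against, so the table can neither detect the overwrite nor reject a key that was never inserted. A stronger hash function makes this rarer but does not remove it, unless the keys are constrained so that the mapping to slots is one-to-one.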
0

Use an Array. An Array indexes by integers instead of keys. http://www.ruby-doc.org/core/classes/Array.html

a = []
a << "hello"
p a #=> ["hello"]
Nathan Kleyn