I have a very big hash and I want to iterate over it. Hash#each seems to be too slow. Is there any efficient way to do this?

How about converting this hash to an array?


In each loop I'm doing very simple string stuff:

name_hash.each {|name, str|
  record += name.to_s + "|" + str + "\n"
}

and the hash uses people's names as keys and some related content as values:

name_hash = {:"jose garcia" => "ca:tw#2@1,2@:th#1@3@;ar:tw#1@4@:fi#1@5@;ny:tw#1@6@;"}
the Tin Man
Bruce Lin
  • I tried .each; for a 1M-record hash it takes over 5 hours – Bruce Lin Aug 15 '12 at 23:01
  • I'm asking if you tried what you perceived to be the solution. – Dave Newton Aug 15 '12 at 23:10
  • Hash iteration itself should be "fast". In Ruby 2x it is implemented with a "linked chain" (for the nice order-keeping properties). What is being done during iteration? (1M - 1 million?? - is a "fair amount" of items, so if even each item takes 0.01 seconds or, 100/second, it would take 2.7 hours. That is, the issue is likely *inside* the `each` block and not the each method/iteration itself. Perhaps there is a better way to solve this problem?) –  Aug 15 '12 at 23:13
  • Please include the *full relevant code*. As djconnel has shown in an answer, the actual iteration is very fast. Thus it is highly suspect that what is done *inside* the `each` block is the bottleneck. Also DigitalRoss suggested that there might be a better/different solution entirely, assuming that the hash data comes from or utilizes the database/model itself .. –  Aug 15 '12 at 23:26
  • Inside the each block I just do some string manipulation, as simple as name_hash.each {|name, str| record += name.to_s + "|" + str + "\n" } – Bruce Lin Aug 15 '12 at 23:29
  • The hash uses people's names as keys and some related content as values, e.g. name_hash = {:"jose garcia" => "ca:tw#2@1,2@:th#1@3@;ar:tw#1@4@:fi#1@5@;ny:tw#1@6@;"} – Bruce Lin Aug 15 '12 at 23:30
  • @BruceXindaLin Please put the extra/new information in the post, along with how long it is taking. (It will get lost in a comment!) –  Aug 15 '12 at 23:30
  • 1
    A question that hasn't been asked is, how much free RAM is on the machine you're using? Five hours for 1 million records seems long unless the machine you're on is constrained by memory and is swapping. – the Tin Man Aug 16 '12 at 04:17

6 Answers

Consider the following example, which uses a hash of 1 million elements:

#! /usr/bin/env ruby
require 'benchmark'

h = {}
1_000_000.times do |n|
  h[n] = rand
end

puts Benchmark.measure { h.each { |k, v| } }

a = nil
puts Benchmark.measure { a = h.to_a }
puts Benchmark.measure { a.each { |k, v| } }

I ran this on my system at work (running Ruby 1.8.5) and got:

  0.350000   0.020000   0.370000 (  0.380571)
  0.300000   0.020000   0.320000 (  0.307207)
  0.160000   0.040000   0.200000 (  0.198388)

So iterating over the array is indeed faster (0.16 seconds versus 0.35 seconds for the hash), but it took 0.30 seconds to generate the array, so the net process is slower: 0.46 seconds versus 0.35 seconds.

So it seems it's best just to iterate over the hash, at least in this test case.

djconnel
  • And *all* of the numbers posted are well short of 5 hours ;-) +1 for the tiny benchmarks; while micro they clearly indicate that the problem is *not* with the implementation of `each` (of either Hash or Array) .. which then would imply the performance bottleneck is from what is done *inside* the `each` block. –  Aug 15 '12 at 23:20
  • Woah, Ruby 1.8.5?! Why so old? – Andrew Marshall Aug 16 '12 at 01:01
  • Redhat Enterprise Linux version 5... our code is fairly mature, and since I work in support rather than development, we upgrade slowly. – djconnel Aug 16 '12 at 16:15
A more idiomatic way to do that in Ruby:

record = name_hash.map{|k,v| "#{k}|#{v}"}.join("\n")

I don't know how that compares in speed, but part of the problem might be that you keep appending a little bit onto a string, creating a new (ever longer) string object with each iteration. The join is done in C and might perform better.
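To see the difference, here is a rough benchmark sketch (the sample data is made up; `+=` allocates a new, ever-longer string each pass, `<<` mutates in place, and `map`/`join` builds the pieces and joins them in one C-level pass):

```ruby
require 'benchmark'

# Hypothetical sample data standing in for the question's 1M-record name_hash.
name_hash = {}
10_000.times { |n| name_hash[:"name #{n}"] = "ca:tw#2@1,2@;" }

Benchmark.bm(8) do |x|
  # Original approach: += creates a brand-new string on every iteration.
  x.report('+=') do
    record = ''
    name_hash.each { |name, str| record += "#{name}|#{str}\n" }
  end
  # In-place append: << keeps mutating the same string object.
  x.report('<<') do
    record = ''
    name_hash.each { |name, str| record << "#{name}|#{str}\n" }
  end
  # Build all the pieces, then join them once.
  x.report('map.join') do
    name_hash.map { |name, str| "#{name}|#{str}" }.join("\n")
  end
end
```

All three build the same record text; only the allocation pattern differs, which is what matters at a million entries.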

DGM
String#+ is slow. This should improve it:

record = name_hash.map{|line| line.join("|")}.join("\n")

If you are using this to output to somewhere, you should not create one huge string but rather write line by line to the output.
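For instance, a minimal sketch of the line-by-line approach (the sample hash and the output.txt filename are made up for illustration):

```ruby
# Hypothetical sample data in the shape described in the question.
name_hash = { :"jose garcia" => "ca:tw#2@1,2@:th#1@3@;" }

# Stream each record straight to the destination instead of
# accumulating one huge string in memory first.
File.open('output.txt', 'w') do |f|
  name_hash.each { |name, str| f.puts "#{name}|#{str}" }
end
```

This keeps memory flat regardless of how many records the hash holds, since each line is flushed to the file as it is produced.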

sawa
  • 1
    I like that even better than my answer! – DGM Aug 15 '12 at 23:53
  • 1
    For any wondering what the difference is between this and my answer, `map` called with one param yields an array, [key,value]. The two param example I posted just assigns the key/value directly. `#{}` is faster than `String#+`, but I'm not sure which version of the block param is faster. – DGM Aug 16 '12 at 00:04
Iterating over large collections is slow; the `each` method is not what's throttling it. What are you doing in your loop that's so slow? If you need to convert to an array, you can do that by calling `some_hash.to_a`.

sgrif
Probably "by making a single db query"

Converting a large Hash to an Array will require creating a large object and will require two iterations, albeit with one of them being internal to the interpreter and probably very fast.

This is unlikely to be faster than just iterating over the Hash, but it might be for large objects.

Check out the Standard Library Benchmark package for an easy way to measure runtime.

I would also venture a guess that the real problem here is that you have a Hash-like ActiveRecord object that imposes a round-trip to your db server in each cycle of the enumeration. It's possible that what you really want is to bypass AR and run a native query to retrieve everything at once in a single round-trip.

DigitalRoss
  • Why would converting a Hash into an Array *require* a "vast number of new Object[s]"? Also, there is no [solid] indication in the post that the data comes from a [relational] database .. –  Aug 15 '12 at 23:18
  • Hmm, now that you mention it, most of the objects will be reused or they will be immutable inline values. I'll update the answer. And as for the db, well, he did tag the question with Rails and the reported times seem way too slow to be anything else. – DigitalRoss Aug 15 '12 at 23:26
  • I'm not saying it's not something .. silly like that :) Hopefully my comment on the main post will elicit more information. –  Aug 15 '12 at 23:28
  • @DigitalRoss I didn't use AR. It's just processing a txt file. – Bruce Lin Aug 15 '12 at 23:37
  • 2
    Then why is this tagged ruby-on-rails? – DGM Aug 15 '12 at 23:57
I had thought Ruby 1.9.x made hash iteration faster, but I could be wrong. If it's simple structures, you could try a different hash implementation, like https://github.com/rdp/google_hash, which is one I hacked up to make #each more reliable...

rogerdpack