I have the following scenario:
I need to figure out the unique list of ids across a very large set.
For example, I have 6,000 arrays of ids (follower lists), each ranging in size from 1 to 25,000 (one person's follower list).
I want the unique list of ids across all of these arrays (the unique followers of followers). Once that is done, I need to subtract out another list of ids (another person's follower list) and get a final count.
The final set of unique ids grows to around 60,000,000 records. In Ruby, when adding the arrays to the big array, it starts to get very slow around a couple of million: adding to the set takes 0.1 seconds at first, then grows to over 4 seconds at 2 million (nowhere near where I need to go).
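As a minimal sketch of the operation described above (using tiny hypothetical sample data, not my real lists), Ruby's Set from the standard library does the union and subtraction with hashed lookups:

```ruby
require 'set'

# Hypothetical data: each inner array is one person's follower list.
follower_lists = [[1, 2, 3], [2, 3, 4], [5]]
# The other person's follower list to subtract at the end.
other_persons_followers = [3, 5]

# Merge every list into one set; duplicates are dropped automatically.
unique_ids = Set.new
follower_lists.each { |list| unique_ids.merge(list) }

# Subtract the other person's followers and count what remains.
final = unique_ids - other_persons_followers
puts final.size # => 3 (ids 1, 2, 4)
```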
I wrote a test program in Java, and it does the whole thing in less than a minute.
Maybe I am doing this inefficiently in Ruby, or there is another way. Since my main code is proprietary, I have written a simple test program to simulate the issue:
big_array = []
loop_counter = 0
start_time = Time.now

# final target size of the big array
while big_array.length < 60000000
  loop_counter += 1
  # target size of one person's follower list
  random_size_of_followers = rand(5000)
  follower_list = []
  follower_counter = 0
  while follower_counter < random_size_of_followers
    follower_counter += 1
    # make ids very large so we get good spread and only some amount of dupes
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
  end
  # combine the big list with this list
  big_array = big_array | follower_list
  end_time = Time.now
  # every 100 iterations, report the average time per loop since the last report
  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f / 100
    puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
    start_time = Time.now
  end
end
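For comparison, here is a sketch of the same benchmark using a Set instead of Array#| (with a smaller 1,000,000-id target so it finishes quickly). Set#merge only hashes the incoming ids, whereas Array#| rescans the entire accumulated array on every iteration:

```ruby
require 'set'

big_set = Set.new
loop_counter = 0
start_time = Time.now

# smaller target size so this sketch runs in seconds
while big_set.size < 1000000
  loop_counter += 1
  # one random follower list, same parameters as the test above
  follower_list = Array.new(rand(5000)) { rand(240000000) + 100000 }
  # merge hashes only the new ids; no rescan of the whole set
  big_set.merge(follower_list)
  if loop_counter % 100 == 0
    puts "elapsed #{Time.now - start_time}s, total size of big_set is #{big_set.size}"
    start_time = Time.now
  end
end
```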
Any suggestions? Is it time to switch to JRuby and move stuff like this to Java?