I have the following scenario:
I need to figure out the unique list of ids across a very large set.
For example, I have 6,000 arrays of ids (follower lists), each ranging in size from 1 to 25,000 (one person's follower list).
I want the unique list of ids across all of these arrays (the unique followers of followers). Once that is done, I need to subtract out another list of ids (another person's follower list) and get a final count.
The final set of unique ids grows to around 60,000,000 records. In Ruby, when adding the arrays to the big array, it starts to get very slow around a couple of million: adding to the set takes 0.1 seconds at first, then grows to over 4 seconds at 2 million (nowhere near where I need to go).
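As a minimal sketch of the operation described above (using tiny hypothetical sample data, not my real lists), Ruby's Set from the standard library does the union and subtraction with hashed lookups:

```ruby
require 'set'

# Hypothetical data: each inner array is one person's follower list.
follower_lists = [[1, 2, 3], [2, 3, 4], [5]]
# The other person's follower list to subtract at the end.
other_persons_followers = [3, 5]

# Merge every list into one set; duplicates are dropped automatically.
unique_ids = Set.new
follower_lists.each { |list| unique_ids.merge(list) }

# Subtract the other person's followers and count what remains.
final = unique_ids - other_persons_followers
puts final.size # => 3 (ids 1, 2, 4)
```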
I wrote a test program in Java, and it does the whole thing in less than a minute.
Maybe I am doing this inefficiently in Ruby, or there is another way. Since my main code is proprietary, I have written a simple test program to simulate the issue:
big_array = []
loop_counter = 0
start_time = Time.now

# final target size of the big array
while big_array.length < 60000000
  loop_counter += 1
  # target size of one person's follower list
  random_size_of_followers = rand(5000)
  follower_list = []
  follower_counter = 0
  while follower_counter < random_size_of_followers
    follower_counter += 1
    # make ids very large so we get good spread and only some amount of dupes
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
  end
  # combine the big list with this list
  big_array = big_array | follower_list
  end_time = Time.now
  # every 100 iterations, report the average time per loop since the last report
  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f / 100
    puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
    start_time = Time.now
  end
end
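For comparison, here is a sketch of the same benchmark using a Set instead of Array#| (with a smaller 1,000,000-id target so it finishes quickly). Set#merge only hashes the incoming ids, whereas Array#| rescans the entire accumulated array on every iteration:

```ruby
require 'set'

big_set = Set.new
loop_counter = 0
start_time = Time.now

# smaller target size so this sketch runs in seconds
while big_set.size < 1000000
  loop_counter += 1
  # one random follower list, same parameters as the test above
  follower_list = Array.new(rand(5000)) { rand(240000000) + 100000 }
  # merge hashes only the new ids; no rescan of the whole set
  big_set.merge(follower_list)
  if loop_counter % 100 == 0
    puts "elapsed #{Time.now - start_time}s, total size of big_set is #{big_set.size}"
    start_time = Time.now
  end
end
```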
Any suggestions? Is it time to switch to JRuby and move stuff like this to Java?