Ruby - array intersection (with duplicates)

Question

I have two arrays, array1 and array2. How can I get array3 from them?

array1 = [2,2,2,2,3,3,4,5,6,7,8,9]

array2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

array3 = [2,2,2,3,4,8]

array1 & array2 returns [2,3,4,8], but I need to hold onto the duplicates.

By duplicate do you mean, values at the same position in both the arrays? — Sahil, Jun 24 '16 at 18:59
I think he does not, because there is a 3 in `array3` but the values do not line up in `array1` and `array2`. — Ben Visness, Jun 24 '16 at 19:42
You should specify whether the order of elements in the result is important and whether you need minimum or maximum number of matches, that is, if the arrays comparison order is important. — Nic Nilov, Jun 24 '16 at 20:25

Cary Swoveland · Accepted Answer · 2016-06-25T00:26:35.927

15

(array1 & array2).flat_map { |n| [n]*[array1.count(n), array2.count(n)].min }
  #=> [2,2,2,3,4,8]

The steps:

a = array1 & array2 
  #=> [2, 3, 4, 8]

The first element of a (2) is passed to the block and assigned to the block variable:

n = 2

and the block calculation is performed:

[2]*[array1.count(2), array2.count(2)].min
  #=> [2]*[4,3].min
  #=> [2]*3
  #=> [2,2,2]

so 2 is mapped to [2,2,2]. The calculations are similar for the remaining three elements of a. As I am using flat_map, this returns [2,2,2,3,4,8].
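To make the "remaining three elements" concrete, here is a small sketch that records, for each element of array1 & array2, its count in each array and the array the block would produce. The variable name steps is mine and does not appear in the answer.

```ruby
array1 = [2,2,2,2,3,3,4,5,6,7,8,9]
array2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

# For each common element: [element, count in array1, count in array2, block result]
steps = (array1 & array2).map do |n|
  c1 = array1.count(n)
  c2 = array2.count(n)
  [n, c1, c2, [n] * [c1, c2].min]
end

steps.each { |n, c1, c2, out| puts "#{n}: min(#{c1}, #{c2}) -> #{out.inspect}" }
# 2: min(4, 3) -> [2, 2, 2]
# 3: min(2, 1) -> [3]
# 4: min(1, 4) -> [4]
# 8: min(1, 2) -> [8]
```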

Do you have trouble remembering how Enumerable#flat_map differs from Enumerable#map? Suppose I had used map rather than flat_map. Then

a.map { |n| [n]*[array1.count(n), array2.count(n)].min }
  #=> [[2, 2, 2], [3], [4], [8]]

flat_map does nothing more than put a splat in front of each of those arrays:

[*[2, 2, 2], *[3], *[4], *[8]]
  #=> [2, 2, 2, 3, 4, 8]
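As a quick check of that equivalence, the sketch below compares flat_map with map followed by flatten(1) on the same block. One caveat worth keeping in mind: flat_map flattens only one level, so arrays nested inside the block's result stay nested. The times hash is an illustrative stand-in for the counts computed above.

```ruby
a = [2, 3, 4, 8]
times = { 2 => 3, 3 => 1, 4 => 1, 8 => 1 }  # repeat counts from the example above

mapped = a.map { |n| [n] * times[n] }       # array of arrays
flat   = a.flat_map { |n| [n] * times[n] }  # one level flattened

puts mapped.inspect             # [[2, 2, 2], [3], [4], [8]]
puts flat.inspect               # [2, 2, 2, 3, 4, 8]
puts flat == mapped.flatten(1)  # true

# Only one level is flattened:
nested = [1, 2].flat_map { |n| [[n, n]] }
puts nested.inspect             # [[1, 1], [2, 2]]
```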

If the arrays array1 and array2 are large and efficiency is a concern, we could do a bit of O(N) pre-processing:

def cnt(arr)
  arr.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
end

cnt1 = cnt(array1)
  #=> {2=>4, 3=>2, 4=>1, 5=>1, 6=>1, 7=>1, 8=>1, 9=>1} 
cnt2 = cnt(array2)
  #=> {2=>3, 3=>1, 4=>4, 8=>2, 0=>3} 

(array1 & array2).flat_map { |n| [n]*[cnt1[n], cnt2[n]].min }
  #=> [2,2,2,3,4,8]
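On Ruby 2.7 or later, Enumerable#tally builds the same kind of counting hash as cnt. The following is a sketch of the pre-processing idea wrapped in a method; the name intersect_with_duplicates is made up for illustration and is not part of the answer.

```ruby
# Assumes Ruby >= 2.7 for Enumerable#tally.
def intersect_with_duplicates(first, second)
  counts2 = second.tally  # e.g. {2=>3, 3=>1, 4=>4, 8=>2, 0=>3}
  first.tally.flat_map do |value, count1|
    # Repeat each value as often as it occurs in BOTH arrays (0 times if absent).
    [value] * [count1, counts2.fetch(value, 0)].min
  end
end

array1 = [2,2,2,2,3,3,4,5,6,7,8,9]
array2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

p intersect_with_duplicates(array1, array2)
  #=> [2, 2, 2, 3, 4, 8]
```

Each array is scanned once to build its counts, so the work grows linearly with the array sizes rather than with their product.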

edited Jun 25 '16 at 00:26

answered Jun 24 '16 at 19:53

Cary Swoveland



The best answer here so far (on not-so-huge arrays, since mine is O(N) and this one is O(N²) in the worst case.) – Aleksei Matiushkin Jun 24 '16 at 20:12
Efficient answer but not very readable in my opinion. I was assuming the OP was new to programming and sought to give a more approachable answer. I guess you should never assume, though. – Jun 24 '16 at 22:03
@Tommyixi, you are right. I was pressed for time when I posted the answer, so omitted an explanation. I've now added one. btw, I don't feel that the need for an explanation depends on the OP's experience with Ruby, as one may be needed by other readers. – Cary Swoveland Jun 24 '16 at 23:09
Good point. Though "efficient programming" and most condensed may not always be the best solution. Perhaps that's a discussion for another question. I appreciate the explanation. – Jun 25 '16 at 00:23

Cam · Answer 2 · 2016-06-26T20:21:29.610

This is a fun one; Cary's flat_map solution is particularly clever. Here's an alternative one-liner using regular old map with some assistance from each_with_object:

array1.each_with_object(array2.dup).map{|v,t| v if (l = t.index v) && t.slice!(l) }.compact
 #=> [2,2,2,3,4,8]

Much of the complexity here involves inline gymnastics used to provide map with sufficient information to complete the task:

 #
 # we want to duplicate array2 since we'll be
 # mutating it to track duplicates       
 #                       \        array1     array2
 #                        \        value     copy  
 #                         \            \   /
array1.each_with_object(array2.dup).map{|v,t| ... }
 #         |                         /      
 # Enumerator for array1    Iterate over              
 # with a copy of array2    Enumerator with map

Calling each_with_object without a block gives us an Enumerator over array1 that also gives the method chain access to a copy of array2. map can then iterate over that Enumerator, loading each array1 value into block variable v and our array2 copy into block variable t. From there:

 #                map the value IF...
 #               /  it exists in     and we were able to
 #              / our array2 copy    remove it from our copy
 #            /          |              |
map{|v,t| v if (l = t.index v) && t.slice!(l) }.compact
 #   |  \         \                               |
 # array1 \        \                          dump nils
 # value   array2   \
 #         copy      load index position into temporary variable l

We iterate over each value of array1 and check whether it exists in array2 (via t). If it does, we remove the first occurrence of the value from our copy of array2 and map the value into the result array.

Note that the (l = t.index v) check runs before t.slice!(l) and acts as short-circuit protection in case the value does not exist in t, our copy of array2. Assigning the index to the local variable l inline lets us reuse it in the subsequent t.slice!(l) call instead of searching t a second time.

Finally, because this methodology will map nil values whenever an array1 value does not exist within array2, we compact the result to remove the nils.
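For readers who find the one-liner dense, the same idea can be spelled out step by step. This is a sketch and not Cam's code; the method name intersect_by_removal is hypothetical. It uses delete_at instead of testing the return value of slice!, so elements that are themselves nil or false are kept as well. In the one-liner those would be dropped, because slice! returns the removed element and compact strips nil.

```ruby
def intersect_by_removal(source, other)
  remaining = other.dup  # copy, so the caller's array is not mutated
  source.each_with_object([]) do |value, result|
    index = remaining.index(value)
    next if index.nil?          # value not (or no longer) available in other
    remaining.delete_at(index)  # consume one occurrence
    result << value
  end
end

array1 = [2,2,2,2,3,3,4,5,6,7,8,9]
array2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

p intersect_by_removal(array1, array2)
 #=> [2, 2, 2, 3, 4, 8]
```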

For those curious, here are some benchmark tests of the solutions presented thus far. First, here are the timings, in seconds, for performing the operation 100,000 times on the sample arrays:

                 user     system      total        real
Cary:        1.050000   0.010000   1.060000 (  1.061217)
Cary+:       1.580000   0.010000   1.590000 (  1.603645)
Cam:         0.550000   0.010000   0.560000 (  0.552062)
Mudasobwa:   2.540000   0.050000   2.590000 (  2.610395)
Sergii:      0.660000   0.000000   0.660000 (  0.665408)
Sahil:       1.750000   0.010000   1.760000 (  1.769624)
#Tommy:      0.290000   0.000000   0.290000 (  0.290114)

If we expand the test arrays to hold 10000 integers with a high degree of intersection...

array1, array2 = [], []  # two separate arrays; array1 = array2 = [] would make both names refer to one array
10000.times{ array1 << rand(10) }
10000.times{ array2 << rand(10) }

and loop 100 times, the simple loop solution (Sahil) begins to distinguish itself. Cary's solution also holds up well, especially with preprocessing:

                 user     system      total        real
Cary:        1.590000   0.020000   1.610000 (  1.615798)
Cary+:       0.870000   0.010000   0.880000 (  0.879331)
Cam:         6.680000   0.090000   6.770000 (  6.838829)
Mudasobwa:   6.740000   0.080000   6.820000 (  6.898394)
Sergii:      6.760000   0.100000   6.860000 (  6.962025)
Sahil:       0.740000   0.030000   0.770000 (  0.785975)
#Tommy:      0.430000   0.010000   0.440000 (  0.446482)

For arrays 1/10th the size with 1000 integers and a low degree of intersection, however...

array1, array2 = [], []  # two separate arrays; array1 = array2 = [] would make both names refer to one array
1000.times{ array1 << rand(10000) }
1000.times{ array2 << rand(10000) }

when we loop 10 times, the flat_map solution gets flattened... except if we use preprocessing (Cary+):

                 user     system      total        real
Cary:      135.400000   0.700000 136.100000 (137.123393)
Cary+:       0.270000   0.010000   0.280000 (  0.268255)
Cam:         0.670000   0.000000   0.670000 (  0.676438)
Mudasobwa:   0.670000   0.010000   0.680000 (  0.684088)
Sergii:      0.660000   0.010000   0.670000 (  0.673881)
Sahil:       1.970000   2.130000   4.100000 (  4.121759)
#Tommy:      0.050000   0.000000   0.050000 (  0.045970)

Here's a gist with the benchmarks: https://gist.github.com/camerican/139463b4bd9e0fd89424377931042ce4
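The gist itself is not reproduced here. As a rough sketch of how such a comparison could be set up with Ruby's standard Benchmark module, with a correctness check run before any timing, something like the following might do. The method names by_counting and by_removal, the seed, the array sizes, and the iteration counts are my own choices and are not taken from the gist.

```ruby
require 'benchmark'

# Counting approach: tally both arrays, then repeat each common value.
def by_counting(a, b)
  count = ->(arr) { arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 } }
  cnt_a = count.(a)
  cnt_b = count.(b)
  (a & b).flat_map { |n| [n] * [cnt_a[n], cnt_b[n]].min }
end

# Removal approach: consume matching elements from a copy of b.
def by_removal(a, b)
  rest = b.dup
  a.each_with_object([]) do |v, out|
    i = rest.index(v)
    next unless i
    rest.delete_at(i)
    out << v
  end
end

rng = Random.new(42)  # fixed seed so runs are repeatable
array1 = Array.new(1_000) { rng.rand(10) }
array2 = Array.new(1_000) { rng.rand(10) }

# Validate before timing: both must return the same elements with the same counts.
same = by_counting(array1, array2).sort == by_removal(array1, array2).sort
raise 'implementations disagree' unless same

Benchmark.bm(12) do |x|
  x.report('counting:') { 20.times { by_counting(array1, array2) } }
  x.report('removal:')  { 20.times { by_removal(array1, array2) } }
end
```

Sorting before comparing is deliberate: the two approaches may order the result differently while still agreeing on which elements appear and how often.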

Interesting solution and comparison, and excellent presentation. You are to be commended for all your work on this. Could you add the variant of my solution with the "pre-processing" to the benchmark? — Cary Swoveland, Jun 25 '16 at 07:03
The tests are now updated to include flat_map w/ preprocessing under `Cary+`. It helps a lot with the larger arrays, though the extra overhead is penalized on the 100,000x test. I've linked a gist with the tests. — Cam, Jun 25 '16 at 16:26
Upon closer inspection, Tommyixi's solution is fast b/c it's broken. It makes a big assumption about index alignment. My speed tests did not validate correctness of proposed solutions. — Cam, Jun 25 '16 at 20:37
@Cam, I noticed that in the benchmark tests, you were calling tommy's function instead of my function. — Sahil, Jun 26 '16 at 10:20
@Sahil Good catch! The benchmarks have been updated. Your solution performed well in the high-intersection condition. — Cam, Jun 26 '16 at 20:23

Aleksei Matiushkin · Answer 3 · 2016-06-24T19:34:48.567

1

array1 = [2,2,2,2,3,3,4,5,6,7,8,9]
array2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

a1, a2 = array1.dup, array2.dup # we’ll mutate them

loop.with_object([]) do |_, memo|
  break memo if a1.empty? || a2.empty?
  e = a2.delete_at(a2.index(a1.shift)) rescue nil
  memo << e if e
end
#⇒ [2,2,2,3,4,8]

edited Jun 24 '16 at 19:34

answered Jun 24 '16 at 19:11

Aleksei Matiushkin


Sahil · Answer 4 · 2016-06-26T09:35:54.570

1

    array1 = [2,2,2,2,3,3,4,5,6,7,8,9]
    array2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

Getting the frequency of each element in the sample arrays:

    a1_freq=Hash.new(0); a2_freq=Hash.new(0); dup_items=[];
    array1.each {|a| a1_freq[a]+=1 }
    array2.each {|b| a2_freq[b]+=1 }

Finally, for each element of array1, check whether it is also present in array2. If it is, take the smaller of its two counts and add the element to the result that many times.

    a1_freq.each {|k,v| a2_freq[k] ? dup_items+=[k]*[v,a2_freq[k]].min : nil}
    #dup_items=> [2, 2, 2, 3, 4, 8]
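One detail worth noting, offered as a sketch rather than a correction: in Ruby, 0 is truthy, and a lookup on a hash created with Hash.new(0) returns 0 for a missing key without storing it. So the a2_freq[k] ? ... : nil test always takes the first branch, and elements missing from array2 are left out only because [k] * 0 is an empty array. Under that reading the ternary can be omitted. The method name common_items below is hypothetical.

```ruby
def common_items(array1, array2)
  a1_freq = Hash.new(0)
  a2_freq = Hash.new(0)
  array1.each { |a| a1_freq[a] += 1 }
  array2.each { |b| a2_freq[b] += 1 }

  dup_items = []
  # a2_freq[k] is 0 for elements absent from array2, and [k] * 0 == []
  a1_freq.each { |k, v| dup_items.concat([k] * [v, a2_freq[k]].min) }
  dup_items
end

array1 = [2,2,2,2,3,3,4,5,6,7,8,9]
array2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

p common_items(array1, array2)
#=> [2, 2, 2, 3, 4, 8]
```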

edited Jun 26 '16 at 09:35

answered Jun 24 '16 at 19:22

Sahil



Consider using [`Hash#default_proc`](http://ruby-doc.org/core-2.1.5/Hash.html#method-c-new): `a1_freq = Hash.new { |h, k| h[k] = 0 }` (or default value `a1_freq = Hash.new(0)`) and avoid weird ternary: `array1.each {|a| a1_freq[a]+=1 }`. – Aleksei Matiushkin Jun 24 '16 at 19:39
Thanks did not realize I could have initialized it with 0. – Sahil Jun 24 '16 at 19:50

Very good, and very fast! I added your idea of first creating the counting hashes after you posted, but somehow had missed your solution until now. – Cary Swoveland Jun 25 '16 at 19:08
Thanks @Cary, at the time of writing this solution I did not think about how fast it could be and just answered it. But I will make it a point to do performance testing from now on, as it gives great insight into how well our code works. – Sahil Jun 25 '16 at 19:54

score 0 · Answer 5 · edited Jun 24 '16 at 20:16

This is a bit verbose, but assuming you mean where the values are at the same position:

def combine(array1, array2)
    longer_array = array1.length > array2.length ? array1 : array2

    intersection = []
    count = 0
    longer_array.each do |item|
        if array1 == longer_array
            looped_array = array2
        else
            looped_array = array1
        end
        if item == looped_array[count]
            intersection.push(item)
        end
        count +=1
    end
    print intersection
end


array_1 = [2,2,2,2,3,3,4,5,6,7,8,9]
array_2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]


combine(array_1, array_2)

I just wanted to point out that I have no clue how you got to array3, because the values at index 3 are not the same across the three arrays:

array_1[3] = 2

array_2[3] = 3

array_3[3] = 3
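For comparison, here is a compact sketch of the positional reading this answer assumes, where elements count only if they are equal at the same index. The method name positional_matches is hypothetical. On the question's arrays it yields [2, 2, 2, 4], which differs from the expected array3 and suggests the asker did not mean matching positions.

```ruby
def positional_matches(a, b)
  shared_length = [a.length, b.length].min  # indices beyond this cannot match
  (0...shared_length).select { |i| a[i] == b[i] }.map { |i| a[i] }
end

array_1 = [2,2,2,2,3,3,4,5,6,7,8,9]
array_2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

p positional_matches(array_1, array_2)
#=> [2, 2, 2, 4]
```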

Sergii K · Answer 6 · 2016-06-24T23:04:01.777

0

I would try to reach the expected result this way:

array1, array2 = [array1, array2].sort_by(&:size)
arr_copy = array2.dup

array1.each.with_object([]) do |el, arr|
    index = arr_copy.find_index(el)
    arr << arr_copy.slice!(index) if index
end
# => [2, 2, 2, 3, 4, 8]

edited Jun 24 '16 at 23:04

answered Jun 24 '16 at 22:58

Sergii K


Jörg W Mittag · Answer 7 · 2021-09-09T13:58:10.120

It looks like what you have there are not really arrays; they are multisets, also known as bags.

There is a general rule in programming: if you choose your data representation right, your algorithms become simpler.

So, if you use multisets instead of arrays, your problem will become trivial, since what you are looking for is literally just the intersection of two multisets.

Unfortunately, there is no multiset implementation in the core or standard libraries, but there are a couple of multiset gems available on the web. For example, there is the multimap gem, which also includes a multiset. It needs a little bit of love and care, though, since it uses a C extension that only works up to YARV 2.2. There is also the multiset gem. You can also find some multiset implementations on Stack Overflow or Code Review.SE.

require 'multiset'

multiset1 = Multiset.new(array1)
#=> #<Multiset:#4 2, #2 3, #1 4, #1 5, #1 6, #1 7, #1 8, #1 9>

multiset2 = Multiset.new(array2)
#=> #<Multiset:#3 2, #1 3, #4 4, #2 8, #3 0>

multiset3 = multiset1 & multiset2
#=> #<Multiset:#3 2, #1 3, #1 4, #1 8>

Personally, I am not too big a fan of the inspect output, but we can see what's going on and that the result is correct: multiset3 contains 3 × 2, 1 × 3, 1 × 4, and 1 × 8.

If you really need the result as an Array, you can use Multiset#to_a:

multiset3.to_a
#=> [2, 2, 2, 3, 4, 8]

multiset3.to_a == array3
#=> true
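If adding a gem is not an option, the multiset idea can be approximated with a small hash-backed class. The following is only a sketch under that assumption: the class name Bag and its methods are hypothetical and do not reproduce the API of the multiset gem or any other library.

```ruby
# Hypothetical minimal bag: stores each item with its number of occurrences.
class Bag
  attr_reader :counts

  def initialize(items = [])
    @counts = Hash.new(0)
    items.each { |item| @counts[item] += 1 }
  end

  # Multiset intersection: keep each item min(count here, count there) times.
  def &(other)
    result = Bag.new
    @counts.each do |item, n|
      shared = [n, other.counts[item]].min
      result.counts[item] = shared if shared > 0
    end
    result
  end

  def to_a
    @counts.flat_map { |item, n| [item] * n }
  end
end

array1 = [2,2,2,2,3,3,4,5,6,7,8,9]
array2 = [2,2,2,3,4,4,4,4,8,8,0,0,0]

bag3 = Bag.new(array1) & Bag.new(array2)
p bag3.to_a
#=> [2, 2, 2, 3, 4, 8]
```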
