I need to remove all duplicates from a set of arrays, but 'duplicate' is defined in a special way here: two 4-element arrays are 'dupes' if they share the first two elements in any order and the last two elements in any order. So my thought is to split each array into 2 halves, sort those 2-element halves, and put them back together again to form 4-element arrays. Then we will have some textbook duplicates we can remove.

Is this a good approach?
We start with a set of 6 4-element arrays, none of which is an exact duplicate of another.

[6, 4, 3, 2]
[4, 6, 2, 3]
[3, 4, 2, 6]
[4, 3, 6, 2]
[3, 6, 2, 4]
[6, 3, 4, 2]

Split each array in the middle.

[[6, 4], [3, 2]] 
[[4, 6], [2, 3]]
[[3, 4], [2, 6]]
[[4, 3], [6, 2]]
[[3, 6], [2, 4]]
[[6, 3], [4, 2]]

Here's the hard part in Neo4j! Sort each of the two inner arrays only.

[[4, 6], [2, 3]]
[[4, 6], [2, 3]]
[[3, 4], [2, 6]]
[[3, 4], [2, 6]]
[[3, 6], [2, 4]]
[[3, 6], [2, 4]]

Put them back together.

[4, 6, 2, 3]
[4, 6, 2, 3]
[3, 4, 2, 6]
[3, 4, 2, 6]
[3, 6, 2, 4]
[3, 6, 2, 4]

Dedupe by using DISTINCT.

[4, 6, 2, 3]
[3, 4, 2, 6]
[3, 6, 2, 4]
mojo2go

1 Answer

This very simple query (with your sample data) implements your approach, which seems reasonable:

WITH [
  [6, 4, 3, 2],
  [4, 6, 2, 3],
  [3, 4, 2, 6],
  [4, 3, 6, 2],
  [3, 6, 2, 4],
  [6, 3, 4, 2]
] AS data
UNWIND data AS d
RETURN DISTINCT
  CASE WHEN d[0] > d[1] THEN [d[1], d[0]] ELSE d[0..2] END +
  CASE WHEN d[2] > d[3] THEN [d[3], d[2]] ELSE d[2..] END AS res;

The result is:

+-----------+
| res       |
+-----------+
| [4,6,2,3] |
| [3,4,2,6] |
| [3,6,2,4] |
+-----------+
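
If the APOC plugin happens to be installed (an assumption; the query above needs only built-in Cypher), each half can be sorted with apoc.coll.sort instead of a CASE. This is only a sketch of that variant: `half` is a name introduced here for the split point, and the lists are assumed to be even-sized. Because it sorts whole halves, it should also cope with halves that hold more than two elements.

```cypher
WITH [
  [6, 4, 3, 2],
  [4, 6, 2, 3],
  [3, 4, 2, 6],
  [4, 3, 6, 2],
  [3, 6, 2, 4],
  [6, 3, 4, 2]
] AS data
UNWIND data AS d
// half is the index where the list is split (integer division).
WITH d, size(d) / 2 AS half
// Requires the APOC plugin: sort each half, then glue the halves back together.
RETURN DISTINCT
  apoc.coll.sort(d[0..half]) + apoc.coll.sort(d[half..]) AS res;
```

With the sample data this leaves the same three distinct lists as the CASE version; the order of the rows is not guaranteed.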

Handling arrays of any (even) size:

The following query accepts a collection of even-sized sub-collections (they do not have to hold 4 elements). It sorts each consecutive pair of elements within every sub-collection and returns the distinct results.

For example (notice that the sub-collections do not have to be the same size):

WITH [
  [6, 4, 3, 2, 3, 2],
  [3, 4, 2, 6, 7, 8],
  [4, 3, 6, 2, 8, 7],
  [3, 6, 2, 4],
  [6, 3, 4, 2],
  [4, 6, 2, 3, 2, 3]
] AS data
// List comprehension stands in for EXTRACT, which Neo4j 4.0 removed.
WITH [d IN data |
  REDUCE(s = [], i IN RANGE(0, SIZE(d)-1, 2) | s + CASE WHEN d[i] > d[i+1] THEN [d[i+1], d[i]] ELSE d[i..i+2] END)] AS sorted
UNWIND sorted AS res
RETURN DISTINCT res;

The output of the above is:

+---------------+
| res           |
+---------------+
| [4,6,2,3,2,3] |
| [3,4,2,6,7,8] |
| [3,6,2,4]     |
+---------------+
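
The two-at-a-time stepping inside the REDUCE comes from the third argument of RANGE, which is the step. A tiny query, using the first sample list purely as an illustration, shows the start index of each pair:

```cypher
// Start index of every pair in a 6-element list: 0, then 2, then 4.
RETURN range(0, size([6, 4, 3, 2, 3, 2]) - 1, 2) AS starts;
```

This returns [0, 2, 4]; each index i is then used to compare d[i] with d[i+1].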
cybersam
  • That was tricky @cybersam. That's a really succinct solution to that particular problem. Beyond the CASE itself you add two arrays and call DISTINCT on their results in one step! Okay, you might have anticipated this, but if a subarray were larger than just 2, is there a way to do this without a CASE, such as UNWIND again? Help me with etiquette here; should I make that a separate question on Stack Overflow? – mojo2go Nov 22 '16 at 20:03
  • I guess it is on the cusp of needing a new question, but I decided to update my answer with my solution. – cybersam Nov 22 '16 at 20:37
  • Thanks @cybersam. I'll have to play with this to see how you managed the sliding window to step forward two elements at a time. – mojo2go Nov 23 '16 at 21:40