Interleaving sparse sorted arrays

Question

I've got a set of lists of events. The events always happen in a given order, but not every event always happens. Here's an example input:

[[ do, re, fa, ti ],
 [ do, re, mi ],
 [ do, la, ti, za ],
 [ mi, fa ],
 [ re, so, za ]]

The input values don't have any inherent order. They're actually messages like "creating symlinks" and "reindexing search". They're sorted in the individual list, but there's no way to look at only 'fa' in the first list and 'mi' in the second and determine which comes before the other.

I'd like to be able to take that input and generate a sorted list of all events:

[ do, re, mi, fa, so, la, ti, za ]

or better yet, some information about each event, like a count:

[ [do, 3], [re, 3], [mi, 2],
  [fa, 2], [so, 1], [la, 1],
  [ti, 1], [za, 2] ]

Is there a name for what I'm doing? Are there accepted algorithms? I'm writing this in Perl, if that matters, but pseudocode will do.

I know that given my example input, I probably can't be guaranteed of the "right" order. But my real input has tons more datapoints, and I feel confident that with some cleverness it'll be 95% right (which is really all I need). I just don't want to re-invent the wheel if I don't have to.

Should be pretty easy in perl - use a hash to do the counting up, then sort the keys and pull out exactly what you want. — Cascabel, Jul 09 '10 at 18:36
@Jefromi, unless I misunderstand OP is saying there is no ordering defined on the keys... — tzaman, Jul 09 '10 at 18:46
As I say, the values aren't inherently sortable. I've changed them from A-H to "do", "re", etc, to make that more obvious. — Plutor, Jul 09 '10 at 18:48
I am not getting it: What determines the fact that `do` comes before `re` and `za` is last? — Sinan Ünür, Jul 09 '10 at 18:52
@Plutor - do you know the names of all possible events? Because if you have that, and they're guaranteed to occur in some specific order, then you have an ordering. — tzaman, Jul 09 '10 at 18:55
"do" comes before "re" in all input lists that contain both of them. "za" isn't necessarily last, but it always comes after "do", "la", "ti", "re", and "so", since it's after those ones in the input lists. — Plutor, Jul 09 '10 at 18:55
@tzaman - I know the names and order of _most_ of the events, but many of them are programmatically generated. It might be possible to do the in-advance ordering, but I was hoping a smart algorithm could save me the work. — Plutor, Jul 09 '10 at 18:58
I liked it better when the question was using A,B,C; not, do-re-mi — vol7ron, Jul 09 '10 at 20:35
@tzaman: Oh. The question implied that an ordering existed by saying "sorted list of all events". I didn't quite realize that the point of the question was to determine the ordering. — Cascabel, Jul 09 '10 at 22:47
You should have opened up a new question for this. The answers here all pertain to the old question and now this is cluttered with answers to different questions. — , Jul 10 '10 at 17:38
In order to sort, you need something to sort by. What it seems like you want is a relational rules algorithm to do the sort. **Example:** `do` always comes before `re`, `fa` always comes after `re`. So when we sort the list, apply the algorithm to the unordered list and hopefully have a set that looks like: *do*, *re*, mi, *fa*. -- Yes, you'll need 1 process to scan the unordered list for the positions/rules and the next to uniquely sort that list. — , Jul 10 '10 at 17:42
**Does the pattern/events ever repeat in the same array?** example: [ *do*, *re*, *mi*, ... *za*, *do*, *re* ],[ *do*, *mi*, *fa* ],[ *la*, *ti*, *za*, *do* ] — vol7ron, Jul 10 '10 at 23:50

score 3 · Answer 1 · edited Jul 09 '10 at 19:29

3

Theoretically speaking, let me suggest the following algorithm:

Build a directed graph.
For each input [ X, Y, Z ], create the edges X->Y and Y->Z if they're not already there.
Perform a topological sorting of the graph.
Voila!

PS
This is only assuming that all events occur in a specific order (always!). If that's not the case, the problem becomes NP-Complete.

PPS
And just so that you have something useful: Sort::Topological (don't know if it actually works but it seems right)

edited Jul 09 '10 at 19:29

Sinan Ünür

116,958
15
196
339

answered Jul 09 '10 at 18:48

adamk

45,184
7
50
57

+1 for first answer that actually works for unordered elements. – tzaman Jul 09 '10 at 18:50

score 3 · Accepted Answer · answered Jul 09 '10 at 20:22

You can use tsort to infer a reasonable—although not necessarily unique—sort order (known as a topological order) from the ordering you've observed. You may be interested in reading tsort's original use, which is similar in structure to your problem.

Note that tsort requires an acyclic graph. In terms of your example, this means you couldn't see do followed by re in one sequence and re followed by do in another.

#! /usr/bin/perl

use warnings;
use strict;

use IPC::Open2;

sub tsort {
  my($events) = @_;

  my $pid = open2 my $out, my $in, "tsort";

  foreach my $group (@$events) {
    foreach my $i (0 .. $#$group - 1) {
      print $in map "@$group[$i,$_]\n", $i+1 .. $#$group;
    }
  }

  close $in or warn "$0: close: $!";

  chomp(my @order = <$out>);
  my %order = map +(shift @order => $_), 0 .. $#order;
  wantarray ? %order : \%order;
}

Because you described the data as sparse, the code above provides tsort with as much information as possible about the events' adjacency matrix.

Having that information, computing a histogram and sorting its components is straightforward:

my $events = [ ... ];

my %order = tsort $events;

my %seen;
do { ++$seen{$_} for @$_ } for @$events;

my @counts;
foreach my $event (sort { $order{$a} <=> $order{$b} } keys %seen) {
  push @counts => [ $event, $seen{$event} ];
  print "[ $counts[-1][0], $counts[-1][1] ]\n";
}

For the input in your question you provided, the output is

[ do, 3 ]
[ la, 1 ]
[ re, 3 ]
[ so, 1 ]
[ mi, 2 ]
[ fa, 2 ]
[ ti, 2 ]
[ za, 2 ]

This looks funny because we know the order of solfège, but re and la are incomparable in the partial order defined by $events: we know only that they must both come after do.

This is the best solution because it doesn't require me to implement the actual algorithm. — Plutor, Jul 13 '10 at 15:15

score 2 · Answer 3 · answered Jul 09 '10 at 20:06

If you're not into writing to much code, you could use the unix command-line utility tsort:

$ tsort -
do re
re fa
fa ti
do re
re mi
do la
la ti
ti za
mi fa
re so
so za

Which is a list of all pairs in your sample input. This produces as output:

do
la
re
so
mi
fa
ti
za

which is basically what you want.

score 1 · Answer 4 · answered Apr 21 '11 at 15:54

Use a hash to aggregate.

my $notes= [[qw(do re fa ti)],
       [qw(do re mi)],
       [qw(do la ti za)],
       [qw(mi fa)],
       [qw(re so za)]];

my %out;
foreach my $list (@$notes)
{
  $out{$_}++ foreach @$list;
}

print "$_: $out{$_}\n" foreach sort keys %out;

Yields

do: 3
fa: 2
la: 1
mi: 2
re: 3
so: 1
ti: 2
za: 2

The %out hash is easily converted into a list if that is what you want.

my @newout;
push @newout,[$_,$out{$_}] foreach sort keys %out;

score 0 · Answer 5 · answered Jul 09 '10 at 18:42

0

perl -de 0
  DB<1> @a = ( ['a','b','c'], ['c','f'], ['h'] ) 
  DB<2> map { @m{@{$_}} = @$_ } @a
  DB<3> p keys %m
chabf

Quickiest shortcut I can think of. Either way, you have to iterate through things at least once...

answered Jul 09 '10 at 18:42

eruciform

7,680
1
35
47

This doesn't work because *order* is the most important bit. Your output has "c" before "a" and "b", despite the fact that your first input list has "c" after both of them. – Plutor Jul 09 '10 at 18:47
yeah, this was a simplistic solution. but to be fair, the question said that they didn't care about ordering. :-) – eruciform Jul 09 '10 at 21:48

score 0 · Answer 6 · answered Jul 09 '10 at 18:45

This is a perfect candidate for a Merge Sort. Go to the wikipedia page here for a pretty good representation of the algorithm http://en.wikipedia.org/wiki/Merge_sort

What you have described is actually a subset/small tweak of the merge sort. Instead of starting with an unsorted array, you have a set of sorted arrays that you want to merge together. Just call the "merge" function as described in the wikipedia page on pairs of your arrays and the results of the merge function until you have a single array (which will be sorted).

To tweak the output to the way you want, you'll need to define a comparison function that can return if one event is less than, equal to, or greater than a different event. Then, when your merge function finds two events that are equal, you can collapse them into a single event and keep a count for that event.

The values aren't inherently sortable. I've changed them from A-H to "do", "re", "mi", etc, to make that more clear. — Plutor, Jul 09 '10 at 19:03
Ah -- I get it. In your example, the arrays were sorted by the event -- you're saying that the events in the original arrays are sorted by a different key, and while that key has been "lost", you want to retain the original order of each part of the array, right? There's a way to do it fairly easily with the merge sort by re-writing the comparison function to return less than and greater than based on the original order of the inputs. I'll th ink about it after lunch :-) — J Teller, Jul 09 '10 at 19:11

score 0 · Answer 7 · answered Jul 09 '10 at 19:31

Roughly, the name I would give it is "hashing". You are putting things into name value pairs. If you want to keep some semblance of order, you have to supplement the hash with an array that keeps order. That order is "encounter order" for me.

use strict;
use warnings;

my $all 
    = [[ 'do', 're', 'fa', 'ti' ],
       [ 'do', 're', 'mi' ],
       [ 'do', 'la', 'ti', 'za' ],
       [ 'mi', 'fa' ],
       [ 're', 'so', 'za' ]
     ];

my ( @order, %counts );

foreach my $list ( @$all ) { 
    foreach my $item ( @$list ) { 
        my $ref = \$counts{$item}; # autovivs to an *assignable* scalar.
        push @order, $item unless $$ref;
        $$ref++;
    }
}

foreach my $key ( @order ) { 
    print "$key: $counts{$key}\n";
}

# do: 3
# re: 3
# fa: 2
# ti: 2
# mi: 2
# la: 1
# za: 2
# so: 1

There are other answers like this one, but mine contains this neat autovivification trick.

score 0 · Answer 8 · answered Jul 09 '10 at 20:42

I'm not really sure what this would be called either, but I figured out a way to find the order given the array of arrays as an input. Essentially the pseudo-code is:

10 Find earliest item in all arrays
20 Push that onto a list
30 Remove that item from all arrays
40 Goto 10 if there are any items left

Here's a working prototype:

#!/usr/bin/perl

use strict;

sub InList {
    my ($x, @list) = @_;
    for (@list) {
        return 1 if $x eq $_;
    }
    return 0;
}

sub Earliest {
    my @lists = @_;
    my $earliest;
    for (@lists) {
        if (@$_) {
            if (!$earliest
                || ($_->[0] ne $earliest && InList($earliest, @$_))) {

                $earliest = $_->[0];
            }
        }
    }
    return $earliest;
}

sub Remove {
    my ($x, @lists) = @_;

    for (@lists) {
        my $n = 0;
        while ($n < @$_) {
            if ($_->[$n] eq $x) {
                splice(@$_,$n,1);
            }
            else {
                $n++
            }
        }
    }
}

my $list = [
    [ 'do', 're', 'fa', 'ti' ],
    [ 'do', 're', 'mi' ],
    [ 'do', 'la', 'ti', 'za' ],
    [ 'mi', 'fa' ],
    [ 're', 'so', 'za' ]
];

my @items;

while (my $earliest = Earliest(@$list)) {
    push @items, $earliest;
    Remove($earliest, @$list);
}

print join(',', @items);

Output:

do,re,mi,fa,la,ti,so,za

score 0 · Answer 9 · edited Jun 20 '20 at 09:12

Solution:

This solves the original question before it was modified by the asker.

#!/usr/local/bin/perl -w
use strict; 

   main();
    
   sub main{
      # Changed your 3-dimensional array to a 2-dimensional array
      my @old = (
                   [ 'do', 're', 'fa', 'ti' ],
                   [ 'do', 're', 'mi' ],
                   [ 'do', 'la', 'ti', 'za' ],
                   [ 'mi', 'fa' ],
                   [ 're', 'so', 'za' ]
                );
      my %new;

      foreach my $row (0.. $#old ){                           # loop through each record (row)
         foreach my $col (0..$#{$old[$row]} ){                # loop through each element (col)                    
            $new{ ${$old[$row]}[$col] }{count}++;
            push @{ $new{${$old[$row]}[$col]}{position} } , [$row,$col];
         }
      }

      foreach my $key (sort keys %new){
         print "$key : $new{$key} " , "\n";                   # notice each value is a hash that we use for properties 
      }      
   }

How to Retrieve Info:

   local $" = ', ';                       # pretty print ($") of array in quotes
   print $new{za}{count} , "\n";          # 2    - how many there were
   print "@{$new{za}{position}[1]} \n";   # 4,2  - position of the second occurrence
                                          #        remember it starts at 0

Basically, we create a unique list of elements in the hash. For each of those elements we have a "property" hash, that contains a scalar count and an array for the position. The number of elements in the array should vary, based on how many occurrences of the element were in the original.

The scalar property isn't really necessary since you could always take the scalar of the position array to retrieve the same number. Note: if you ever add/remove elements from the array count and position will not be correlate in their meaning.

example: print scalar @{$new{za}{position}}; will give you the same as print $new{za}{count};

This would look a little cleaner if you just had a 2-dimensional array, rather than an unnecessary 3-dimensional array. — vol7ron, Jul 09 '10 at 21:23
Thanks for the example on how to get the data! I think the hash of a hash is what he'd be interested in — , Jul 10 '10 at 17:37

pdehaan · Answer 10 · 2010-07-10T15:39:49.183

0

Just realized your question said their is no predetermined order, so this may not be relevent.

Perl code:

$list = [
    ['do', 're', 'fa', 'ti' ],
    ['do', 're', 'mi' ],
    ['do', 'la', 'ti', 'za' ],
    ['mi', 'fa' ],
    ['re', 'so', 'za' ]
];
%sid = map{($_,$n++)}qw/do re mi fa so la ti za/;

map{map{$k{$_}++}@$_}@$list;
push @$result,[$_,$k{$_}] for sort{$sid{$a}<=>$sid{$b}}keys%k;

print "[@$_]\n" for(@$result);

output:

[do 3]
[re 3]
[mi 2]
[fa 2]
[so 1]
[la 1]
[ti 2]
[za 2]

edited Jul 10 '10 at 15:39

answered Jul 10 '10 at 15:32

pdehaan

332
2
13

this only works if you know and list the original order of all possible events. – vol7ron Jul 10 '10 at 18:38

Interleaving sparse sorted arrays

10 Answers10

Solution:

How to Retrieve Info: