0

In Perl, I would like to sort a collection of different length strings in a way that automatically lumps together similar strings.

Intuitively, I imagine I need some distance measure for each pair and then a clustering routine that groups by the distance.

My number of strings is always small and they are short, see an example below.

Is there a simple approach that will do what I need in sort_magic_here?

#!/usr/bin/perl
use strict;

my @list =
  ("JK_HJ_Lancaster", "SY4_TS_HJ_1000ng",
   "NB_E_200cc_caHJ_Rep1", "HB_E_100cc_caHJ_Rep1",
   "HB_E_200cc_caHJ_Rep1", "Normal_Lancaster",
   "NB15_OP_HJ_1000ng","Zoey_HJ_Slough",
   "NB_E_100cc_caHJ_Rep1","Normal_Slough",
   "JK_caHJ_Slough","Zoey_HJ_Lancaster");

print "# Straight sort\n";
foreach my $elem (sort @list) {
  print "$elem\n";
}

print "# Sort grouped by string distance\n";
foreach my $elem (sort { sort_magic_here() }  @list) {
  print "$elem\n";
}
719016
  • 9,922
  • 20
  • 85
  • 158
  • 1
    When you hand a block/subroutine to `sort`, it will be handed two (and only two) of the items in the list to be sorted at any given time. It then answers the question "Does a come before b". If you'd like to order items in a way than needs to examine more than two items at a time, then `sort` may not be the hammer for you... – tjd Jun 22 '15 at 15:54
  • 1
    How should the strings in your example be sorted? – ThisSuitIsBlackNot Jun 22 '15 at 16:24

1 Answers1

2

Custom sorts take two inputs, perform a 'comparison' and respond with -1, 0 or 1 depending on whether they are before, after or equal.

Sorting is designed for making a positional order, not really for 'grouping stuff that's vaguely similar'.

You do have the Text::Levenshtein module which quickly lets you compute that - but you have to do something altogether more complicated because you'd need to compare each word to each other word before being able to decide ordering. But frankly, you will have the same problem with any 'similar words' sort of comparison.

In this, you're starting to look at graph theory and grouping based on that. It's quite a complicated problem though - it's far from as trivial as 'just' sorting.

I'd be looking at something like:

#!/usr/bin/perl
use strict;
use warnings;

use Text::Levenshtein qw ( distance );
use Data::Dumper;

my @list = (
    "JK_HJ_Lancaster",      "SY4_TS_HJ_1000ng",
    "NB_E_200cc_caHJ_Rep1", "HB_E_100cc_caHJ_Rep1",
    "HB_E_200cc_caHJ_Rep1", "Normal_Lancaster",
    "NB15_OP_HJ_1000ng",    "Zoey_HJ_Slough",
    "NB_E_100cc_caHJ_Rep1", "Normal_Slough",
    "JK_caHJ_Slough",       "Zoey_HJ_Lancaster"
);

my %distances;

foreach my $elem (@list) {
    foreach my $compare (@list) {
        next if $elem eq $compare;
        my $distance = distance( $elem, $compare );
        $distances{$elem}{$compare} = $distance;
    }
}

print Dumper \%distances;

my %seen;
my ($cursor) = sort @list;

while ($cursor) {
    print "$cursor\n";
    $seen{$cursor}++;
    my @near_words_in_order =
        sort { $distances{$cursor}{$a} <=> $distances{$cursor}{$b} }
        keys %{ $distances{$cursor} };

    #      print @near_words_in_order;
    last unless @near_words_in_order;
    while ( $seen{$cursor} ) {
        $cursor = shift(@near_words_in_order) // 0;
    }
}

Which gives the result:

HB_E_100cc_caHJ_Rep1
HB_E_200cc_caHJ_Rep1
NB_E_200cc_caHJ_Rep1
NB_E_100cc_caHJ_Rep1
NB15_OP_HJ_1000ng
SY4_TS_HJ_1000ng
Zoey_HJ_Slough
JK_caHJ_Slough
Normal_Slough
Normal_Lancaster
JK_HJ_Lancaster
Zoey_HJ_Lancaster

Which at least approximately groups like you request. You can probably get this more efficient, because you don't need to compute all the distances which'll reduce the algorithm complexity. But you also will get different groups based on proximity and start point.

Sobrique
  • 52,974
  • 7
  • 60
  • 101
  • Thanks very much for this. There is an unanticipated case which is when the list is of size smaller than 3, where the loop does not print all values... – 719016 Jun 23 '15 at 09:18
  • 1
    Well spotted, yes. That's because the 'last' was firing on the empty array (having shifted out the last element for printing). Amended slightly, looks like it works for that scenario now. – Sobrique Jun 23 '15 at 09:49
  • This is a somewhat late comment, but would you consider adding a description of what you are doing here for people who don't know Perl and don't particularly want to learn it? :-) – Faheem Mitha Aug 05 '20 at 20:54