
I am looking for the fastest way to find every single-character mismatch between all of the words in a large file. If I have this:

AAAA
AAAB
AABA
BBBB
CCCC

I would like to get something like this:

AAAA - AAAB AABA
AAAB - AAAA
AABA - AAAA
BBBB
CCCC

Currently I am using agrep, but my file is millions of lines long and it is very slow. Each word is on its own line and they are all the same number of characters. I expect there is something elegant I have not been able to find. Thank you.

Edit: The words are made up of just 5 characters (A, T, C, G, or N) and are just under 100 characters long. The whole thing should fit in memory (<5 GB). There is one word per line and I want to compare each word to every other word.

Edit 2: The example was not correct; it is fixed now.

user1168246
  • tell us more about your actual data; are these real words of greatly varying length, or are they a fixed or limited range of length or using a subset of characters? – ysth Dec 07 '14 at 06:48
  • you have just one word per line and you want to compare each word to every other word in the file? – ysth Dec 07 '14 at 06:49
  • and "AABA" and "AAAB" don't fit my definition of a single character mismatch; is this an error? if not, what is your definition? – ysth Dec 07 '14 at 07:05
  • how long are your lines? – Patrick J. S. Dec 07 '14 at 07:18
  • It is critical to know how long each of your "words" is. It dictates whether the entire list can be kept in memory. – Borodin Dec 07 '14 at 07:47

2 Answers

4

If you are looking for words that have only a one-character difference, there are a couple of tricks you can use. First, to compare two words and count the number of characters that differ, you can use this:

( $word1 ^ $word2 ) =~ tr/\0//c

This does a stringwise exclusive-or on the two words; wherever the characters are the same, a "\0" results; where they differ, a non-"\0" results. tr, in its complemented counting mode (/c), counts the bytes that are not "\0", i.e. the differences.
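
For example, here is a minimal standalone check of that idiom on the sample data (the variable names are just illustrative):

use strict;
use warnings;

my $word1 = 'AAAA';
my $word2 = 'AAAB';

# XOR the strings, then count the bytes that are not "\0" (the differing positions)
my $diffs = ( $word1 ^ $word2 ) =~ tr/\0//c;
print "$diffs\n";    # prints 1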

Second, note that when only one character differs, either the first half or the last half of the word must match exactly. So partition the words into a hash keyed by their first and last halves; this reduces the number of other words that a given word has to be checked against.

This approach should take only two or three times the memory of all the strings (plus a little overhead); it could be reduced to one to two times that memory, at the cost of some speed, by pushing \$word instead and using `$$_` in the grep and `sort map $$_, @match` in the output.

If the words are all the same length, the top level of the hash can be removed and two different hashes used for the words' beginnings and endings.

use strict;
use warnings;
use autodie;

my %strings;

my $filename = shift or die "no filename provided\n";
open my $fh, '<', $filename;

# First pass: index every word under both its first half and its last half,
# keyed by length so words of different lengths are never compared
while (my $word = readline $fh) {
    chomp $word;
    push @{ $strings{ 'b' . length $word }{ substr($word, 0, length($word)/2) } }, $word;
    push @{ $strings{ 'e' . length $word }{ substr($word, length($word)/2) } }, $word;
}

# Second pass: a single-character difference leaves one half intact, so only
# words sharing this word's first half or last half need to be XOR-checked
seek $fh, 0, 0;
while (my $word = readline $fh) {
    chomp $word;
    my @match = grep 1 == ($word ^ $_) =~ tr/\0//c,
        @{ $strings{ 'b' . length $word }{ substr($word, 0, length($word)/2) } },
        @{ $strings{ 'e' . length $word }{ substr($word, length($word)/2) } };
    if (@match) {
        print "$word - " . join( ' ', sort @match ) . "\n";
    }
    else {
        print "$word\n";
    }
}

Note that this only looks for substitutions, not insertions, deletions, or transpositions.
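
As noted above, when all the words are the same length, the length-keyed top level can be dropped and the two halves kept in separate hashes. A minimal sketch of that variant (the %begin and %end names are just illustrative):

use strict;
use warnings;
use autodie;

my $filename = shift or die "no filename provided\n";
open my $fh, '<', $filename;

my ( %begin, %end );

# Index each word by its first half and by its last half
while ( my $word = readline $fh ) {
    chomp $word;
    my $half = int( length($word) / 2 );
    push @{ $begin{ substr( $word, 0, $half ) } }, $word;
    push @{ $end{ substr( $word, $half ) } }, $word;
}

# A single-character difference leaves one half intact, so only candidates
# sharing this word's first half or last half need to be XOR-checked
seek $fh, 0, 0;
while ( my $word = readline $fh ) {
    chomp $word;
    my $half  = int( length($word) / 2 );
    my @match = grep 1 == ( $word ^ $_ ) =~ tr/\0//c,
        @{ $begin{ substr( $word, 0, $half ) } },
        @{ $end{ substr( $word, $half ) } };
    print @match ? "$word - " . join( ' ', sort @match ) . "\n" : "$word\n";
}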

ysth
  • out of curiosity, how long did this take to run (and against how many words)? – ysth Dec 07 '14 at 20:35
  • With such a small input alphabet, you could convert to some sort of bitmap representation if you want to trade human readability for compactness of representation. You could squeeze this down to three bits per symbol but I would perhaps go for four just to keep byte alignment. Depending on distribution patterns, perhaps there could be a clever way to get it all the way down to two bits per symbol. – tripleee Dec 08 '14 at 07:00
2

It requires a large memory footprint, but the following can accomplish your task in two passes. The first pass indexes every word under each of its one-character-masked forms (each position in turn replaced with "\0"); in the second pass, any two words that differ in exactly one position share one of those masked forms and therefore land in the same bucket:

#!/usr/bin/env perl

use strict;
use warnings;

use Fcntl qw(:seek);

my $fh = \*DATA;

# Remember where the word list starts so it can be reread in the second pass
my $startpos = tell $fh;

my %group;

# First pass: file each word under every one-character-masked form of itself
while (<$fh>) {
    chomp;

    my $word = $_;

    for my $i ( 0 .. length($word) - 1 ) {
        # Copy the word, then overwrite the character at position $i with "\0"
        substr my $star = $word, $i, 1, "\0";
        push @{ $group{$star} }, \$word;
    }
}

seek $fh, $startpos, SEEK_SET;

# Second pass: words differing in exactly one position share a masked form
while (<$fh>) {
    chomp;

    my %uniq;

    my $word = $_;

    for my $i ( 0 .. length($word) - 1 ) {
        substr my $star = $word, $i, 1, "\0";
        $uniq{$_}++ for map $$_, @{ $group{$star} };
    }

    # Every word shares all of its masked forms with itself, so remove it
    delete $uniq{$word};

    print "$word - ", join(' ', sort keys %uniq), "\n";
}

__END__
AAAA
AAAB
AABA
BBBB
CCCC

Outputs:

AAAA - AAAB AABA
AAAB - AAAA
AABA - AAAA
BBBB - 
CCCC - 
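
If you want to run this against your real file rather than the DATA section, only the handle needs to change; for example (the filename here is just a placeholder):

open my $fh, '<', 'words.txt' or die "cannot open 'words.txt': $!";
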
Miller
  • well, I was going to wait for answers to my comments, but I'm going to post my answer, which is like your only much more efficient :) – ysth Dec 07 '14 at 07:14
  • Yeah, I think you've already clarified the inconsistency in his example. Look forward to your answer. – Miller Dec 07 '14 at 07:15
  • `++$uniq{$_}` is better for at least two reasons! It puts the "verb" first, and it doesn't imply that you need the pre-incremented value. There may also be language implementations around that don't optimise out the saving of the original value. – Borodin Dec 07 '14 at 08:11