Calculate Hamming distance in Perl

Question

I have a file, words.txt, containing a list of words written in IPA (International Phonetic Alphabet) characters; the list is shown below.

In a separate file (sounds.txt), also shown below, I have assigned each IPA character a binary code. I want to compare each pair of words in words.txt character by character, using the codes from sounds.txt for each "character" (for example "b" or "ŋ").

I want to print each pair of words and its numeric result to a separate file.

First desired output example: the output value for bʀɥi and fʀɥi will be 5 because the two binary strings for the characters "b" and "f" differ in 5 places.

"b":[10000100000000010000]
"f":[00100010000000000000]

Second example: the output value for bʀɥi and plɥi will be 6 because the characters "b" and "p" differ in 1 place and the characters "ʀ" and "l" differ in 5 places. The final value for each pair of words is the sum of the differences in the binary codes of the corresponding characters.

"b":[10000100000000010000]
"p":[10000100000000000000]

"ʁ":[00100000000001010000]
"l":[00011000100000010000]

I know the code for comparing two individual characters is going to look something like this, but I'm not sure how to incorporate the values from the sounds.txt file and then get the compared values for two whole words. I've been reading through a lot of Perl tutorials, but nothing I've seen so far seems similar to what I want to accomplish. Any advice would be great.

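open(my $f1, '<', 'words.txt') or die $!;
chomp(my $string1 = <$f1>);
chomp(my $string2 = <$f1>);
my $sum = 0;
for my $i (0 .. length($string1) - 1) {
    # count the positions at which the two strings differ
    $sum++ if substr($string1, $i, 1) ne substr($string2, $i, 1);
}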

words.txt:

bʀɥi
kʀwa
dʀwa
fʀwa
fʀɥi
ɡʀwɛ̃
plɥi
pʀwa
tʀɥi

sounds.txt:

"p":[10000100000000000000]
"b":[10000100000000010000]
"f":[00100010000000000000]
"v":[00100010000000010000]
"t":[10000001000000000000]
"d":[10000001000000010000]
"k":[10000000000010000000]
"g":[10000000000010010000]
"s":[00100000100000000000]
"z":[00100000100000010000]
"m":[01000100000000010000]
"n":[01000001000000010000]
"ɲ":[01000000001000010000]
"ŋ":[01000000000010010000]
"ʃ":[00100000010000000000]
"ʒ":[00100000010000010000]
"ʀ":[00100000000001010000]
"w":[00010000000000110000]
"j":[00010000001000010000]
"ɥ":[00010000000100010000]
"l":[00011000100000010000]
"a":[00000000001000011000]
"ɑ":[00000000000010011000]
"ɑ̃":[01000000000010011000]
"e":[00000000001000010010]
"ɛ":[00000000001000010100]
"ɛ̃":[01000000001000010100]
"ə":[00000000000000000000]
"i":[00000000001000010001]
"o":[00000000000000110010]
"ɔ":[00000000000000110100]
"ɔ̃":[01000000000000110100]
"œ":[00000000000100010100]
"œ̃":[01000000000100010100]
"ø":[00000000000100010010]
"u":[00000000000000110001]
"y":[00000000000100010001]

choroba · Answer 1 · 2015-07-17T11:29:31.830

Store the mapping from IPA characters to the binary codes in a hash. You can't simply break each word into characters and look them up in the hash, because some of the "characters" are not represented by a single code point in Unicode. Instead, I replaced each known sound (trying the longest ones first) with its code and then XORed the resulting strings: every position where the two strings agree becomes a \0 byte, and after deleting those bytes the length of what remains is the Hamming distance.

Some of the characters were missing from your sample sounds.txt, so I had to add them (ʀ and ɡ).
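A small sketch of why splitting a word into single code points does not work here, and how a longest-first alternation keeps multi-code-point sounds together. The three-entry inventory @known and the sample word are assumptions chosen for illustration; in the real script the inventory comes from sounds.txt.

```perl
use warnings;
use strict;

# The nasal vowel is two code points: U+025B (open e) followed by
# U+0303 (combining tilde).
my $nasal = "\x{25B}\x{303}";
my $word  = "w" . $nasal;

# A naive split into code points separates the tilde from its vowel.
my @naive = split //, $word;
print scalar(@naive), "\n";    # 3 pieces

# Hypothetical mini-inventory; longer keys are tried first so the
# nasal vowel is matched before the plain vowel.
my @known = ("\x{25B}", $nasal, "w");
my $chars = join '|', sort { length $b <=> length $a } @known;
my @units = $word =~ /($chars)/g;
print scalar(@units), "\n";    # 2 units
```

Without the sort, the plain vowel could match first and leave the combining tilde behind as an unknown character.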

#!/usr/bin/perl
use warnings;
use strict;

use open IO => 'encoding(utf-8)', ':std';

my @words;
open my $WORDS, '<:encoding(utf-8)', 'words.txt' or die $!;
chomp(@words = <$WORDS>);

my %sound;
open my $SOUNDS, '<:encoding(utf-8)', 'sounds.txt' or die $!;
while (<$SOUNDS>) {
    # Skip blank or malformed lines instead of storing an undefined key.
    my ($ipa, $features) = /"(.*?)":\[([01]+)\]/ or next;
    $sound{$ipa} = $features;
}

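# Try longer keys first so that a vowel with a combining tilde is
# matched before the plain vowel.
my $chars = join '|', sort { length $b <=> length $a } keys %sound;
my $regex = qr/($chars)/;

# Replace every known sound in each word with its binary code.
my @sounds;
for my $word (@words) {
    (my $wsound = $word) =~ s/$regex/$sound{$1},/g;
    push @sounds, $wsound;
}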

for my $i1 (0 .. $#words - 1) {
    for my $i2 ($i1 + 1 .. $#words) {
        warn "Different length: $words[$i1] - $words[$i2]"
            if length $sounds[$i1] != length $sounds[$i2];
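        # String XOR yields "\0" wherever the two strings agree.
        my $hamming = $sounds[$i1] ^ $sounds[$i2];
        # Delete the agreeing positions; what is left are the differences.
        $hamming =~ tr/\0//d;
        $hamming = length $hamming;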
        print "$words[$i1] - $words[$i2] : $hamming\n";
    }
}
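To see the XOR step in isolation, here is a minimal sketch that applies it to the two examples from the question. The helper name hamming and the hard-coded %code hash are assumptions made for illustration; they are not part of the script above. The codes are copied from the question's sounds.txt, and the ASCII key R stands in for ʀ to keep the sketch free of encoding concerns.

```perl
use warnings;
use strict;

# Hypothetical helper: number of positions at which two
# equal-length strings differ.
sub hamming {
    my ($x, $y) = @_;
    die "Different length\n" if length($x) != length($y);
    my $diff = $x ^ $y;      # "\0" wherever the characters agree
    $diff =~ tr/\0//d;       # drop the agreeing positions
    return length $diff;     # what remains are the differences
}

# Codes copied from the question (R is an ASCII stand-in for the
# uvular trill).
my %code = (
    b => '10000100000000010000',
    f => '00100010000000000000',
    p => '10000100000000000000',
    R => '00100000000001010000',
    l => '00011000100000010000',
);

print hamming($code{b}, $code{f}), "\n";                        # 5
print hamming($code{b} . $code{R}, $code{p} . $code{l}), "\n";  # 1 + 5 = 6
```

Because the remaining sounds of each word pair are identical, these two values match the totals 5 and 6 expected in the question.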

Thanks for your help. I ran the code, but the results came back very different from what I was expecting. For example, I got bʀɥi - fʀɥi : 1 where I was expecting bʀɥi - fʀɥi : 5. — Mck18, Jul 17 '15 at 11:20
@Mck18: Are you sure no characters are missing in sounds.txt? See the updated script, which now includes a check. I'm getting 5. — choroba, Jul 17 '15 at 11:27
Never mind, the code worked perfectly; the sounds file wasn't in Unicode for some reason! Thank you so much for your help! — Mck18, Jul 17 '15 at 11:47
@choroba IMHO this sort of comparison gives plausible results if the shorter word is padded with \0 (sound "ə"). Maybe the two words should first be aligned with the Levenshtein algorithm, if that makes sense phonetically. — Helmut Wollmersdorfer, Nov 12 '19 at 17:46
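The padding idea from the last comment could be sketched as follows. This is only one possible reading of that suggestion: the helper padded_hamming, the per-word arrays of codes, and the shorter word "pʀw" are all hypothetical. The shorter word is padded with the all-zero code of "ə", so each extra sound contributes the number of 1 bits in its own code.

```perl
use warnings;
use strict;

my $schwa = '0' x 20;    # code of "ə" in sounds.txt (all zeros)

# Hypothetical helper: takes two array refs of 20-bit codes, pads the
# shorter list with the schwa code, then counts differing positions.
sub padded_hamming {
    my @x = @{ $_[0] };
    my @y = @{ $_[1] };
    push @x, $schwa while @x < @y;
    push @y, $schwa while @y < @x;
    my $diff = join('', @x) ^ join('', @y);
    $diff =~ tr/\0//d;
    return length $diff;
}

# Codes for p, ʀ, w, a, copied from the question's sounds.txt.
my @prwa = qw(
    10000100000000000000
    00100000000001010000
    00010000000000110000
    00000000001000011000
);
my @prw = @prwa[0 .. 2];    # made-up shorter word without the final "a"

print padded_hamming(\@prwa, \@prw), "\n";    # 3, the 1 bits in "a"
```

Padding only appends at the end; whether an alignment step should come first, as the comment suggests, is left open here.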
