3

I am trying to create a list of strings from a hamming distance matrix. Each string must be 20 characters long with a 4 letter alphabet (A,B,C,D). For example, say I have the following hamming distance matrix:

   S1 S2 S3
S1  0  5 12
S2  5  0 14
S3 12 14  0

From this matrix I need to create 3 strings, for example:

S1 = "ABBBBAAAAAAAAAABBBBB"
S2 = "BAAAAAAAAAAAAAABBBBB"
S3 = "CBBBABBBBBBBBBBBBBBB"

I created these strings manually, but I need to do this for a hamming distance matrix representing 100 strings which is not practical to do manually. Can anyone suggest an algorithm that can do this?

Thanks, Chris

ChrisUofR
  • 73
  • 6

1 Answers1

1

That is a fun exercise. :-)

The following octave script randomly generates n strings of length len. Subsequently it calculates the hamming distance between all these strings.

What is done next is that strings are compared pairwise. If for example you search for [5 12 14], you will find the table N to contain strings that are 5 and 12 apart as well as strings that are 12 and 14 apart. The next challenge is of course to find the circuit in which the ones that are 5 and 12 apart can be put together with the ones that are 12 and 14 apart in such way that the circuit "closes".

% We generate n strings of length len
n=50;
len=20;

% We have a categorical variable of size 4 (ABCD)
cat=4;

% We want to generate strings that correspond with the following hamming distance matrix
search=[5 12 14];
%search=[10 12 14 14 14 16];
S=squareform(search);

% Note that we generate each string totally random. If you need small distances it makes sense to introduce 
% correlations across the strings
X=randi(cat-1,n,len);

% Calculate the hamming distances
t=pdist(X,'hamming')*len;

% The big matrix we have to find our little matrix S within
Y=squareform(t);

% All the following might be replaced by something like submatrix(Y,S) if that would exist
R=zeros(size(S),size(Y));
for j = 1:size(S)
    M=zeros(size(Y),size(S));
    for i = 1:size(Y)
        M(i,:)=ismember(S(j,:),Y(i,:));
    endfor
    R(j,:)=all(M');
endfor

[x,y]=find(R);

% A will be a set of cells that contains the indices of the columns/rows that will make up our submatrices
A = accumarray(x,y,[], @(v) {sort(v).'});

% If for example the distance 5 doesn't occur at all, we can already drop out
if (sum(cellfun(@isempty,A)) > 0)  
    printf("There are no matches\n");
    return
endif

% We are now gonna get all possible submatrices with the values in "search"
C = cell(1, numel(A));
[C{:}] = ndgrid( A{:} );

N = cell2mat( cellfun(@(v)v(:), C, 'UniformOutput',false) );
N = unique(sort(N,2), 'rows');

printf("Found %i potential matches (but contains duplicates)\n", size(N,1));

% We are now further filtering (remove duplicates)
[f,g]=mode(N,2);
h=g==1;
N=N(h,:);

printf("Found %i potential matches\n", size(N,1));

M=zeros(size(N),size(search,2));
for i = 1:size(N) 
    f=N(i,:);
    M(i,:)=squareform(Y(f,f))';
endfor

F=squareform(S)';

% For now we forget about wrong permutations, so for search > 3 you need to filter these out!
M = sort(M,2);
F = sort(F,2);

% Get the sorted search string out of the (large) table M
% We search for the ones that "close" the circuit
D=ismember(M,F,'rows');
mf=find(D);

if (mf) 
    matches=size(mf,1);
    printf("Found %i matches\n", matches);  
    for i = 1:matches
        r=mf(i);
        printf("We return match %i (only check permutations now)\n", r);
        t=N(r,:)';
        str=X(t,:);
        check=squareform(pdist(str,'hamming')*len);
        strings=char(str+64)
        check
    endfor
else
    printf("There are no matches\n");
endif

It will generate strings such as:

ABAACCBCACABBABBAABA
ABACCCBCACBABAABACBA
CABBCBBBABCBBACAAACC
Anne van Rossum
  • 3,091
  • 1
  • 35
  • 39
  • Thanks Anne! This method does work, but Octave is throwing an "out of memory" error when trying to do for more strings (11 strings). First time using Octave so may be my ignorance, but specifically using: search =[0 0 3 3 5 5 5 3 8 8 0 3 3 5 5 5 3 8 8 3 3 5 5 5 3 8 8 3 2 2 2 3 8 8 5 5 5 6 8 11 0 0 3 9 8 0 3 9 8 3 9 8 8 5 8] with n=10000 (i have used lower n, but returns "There are no matches"). – ChrisUofR Dec 15 '14 at 18:59
  • @ChrisUofR. This algorithm is useful if you want 100 strings with the Hamming distance matrix you specified (or only slightly bigger). You have to run this algorithm >100 times for that. It won't find a set of strings each time; you have to run it a few times until it does. With a matrix of 11x11 it will create an enormous search space. You will need to find an optimized algorithm for that. Moreover, distances such as 0, 2, and 3 are rarely encountered with current random generator. So, you also need to feed it with a different prior than `randi`. – Anne van Rossum Dec 16 '14 at 11:23
  • Moreover, you have to be careful where you get large matrices from. Already a matrix as small as `search=[10 0 0]` is unsolvable: `S_1` is different from `S_2` and the same as `S_3`. However, `S_2` is the same as `S_3`, which leads to a contradiction. – Anne van Rossum Dec 16 '14 at 11:34
  • I really appreciate the help. I see the confusion is with my wording. I need the 100 strings that satisfy a 100x100 hamming distance matrix. – ChrisUofR Dec 17 '14 at 18:46
  • @ChrisUofR I would first find a way to check if this 100x100 hamming distance matrix is actually solvable. I think that problem on itself already deserves a separate question. :-) – Anne van Rossum Dec 18 '14 at 10:38