That is a fun exercise. :-)
The following octave
script randomly generates n
strings of length len
. Subsequently it calculates the hamming distance between all these strings.
What is done next is that strings are compared pairwise. If for example you search for [5 12 14]
, you will find the table N
to contain strings that are 5
and 12
apart as well as strings that are 12
and 14
apart. The next challenge is of course to find the circuit in which the ones that are 5
and 12
apart can be put together with the ones that are 12
and 14
apart in such way that the circuit "closes".
% We generate n strings of length len
n=50;
len=20;
% We have a categorical variable of size 4 (ABCD)
cat=4;
% We want to generate strings that correspond with the following hamming distance matrix
search=[5 12 14];
%search=[10 12 14 14 14 16];
S=squareform(search);
% Note that we generate each string totally random. If you need small distances it makes sense to introduce
% correlations across the strings
X=randi(cat-1,n,len);
% Calculate the hamming distances
t=pdist(X,'hamming')*len;
% The big matrix we have to find our little matrix S within
Y=squareform(t);
% All the following might be replaced by something like submatrix(Y,S) if that would exist
R=zeros(size(S),size(Y));
for j = 1:size(S)
M=zeros(size(Y),size(S));
for i = 1:size(Y)
M(i,:)=ismember(S(j,:),Y(i,:));
endfor
R(j,:)=all(M');
endfor
[x,y]=find(R);
% A will be a set of cells that contains the indices of the columns/rows that will make up our submatrices
A = accumarray(x,y,[], @(v) {sort(v).'});
% If for example the distance 5 doesn't occur at all, we can already drop out
if (sum(cellfun(@isempty,A)) > 0)
printf("There are no matches\n");
return
endif
% We are now gonna get all possible submatrices with the values in "search"
C = cell(1, numel(A));
[C{:}] = ndgrid( A{:} );
N = cell2mat( cellfun(@(v)v(:), C, 'UniformOutput',false) );
N = unique(sort(N,2), 'rows');
printf("Found %i potential matches (but contains duplicates)\n", size(N,1));
% We are now further filtering (remove duplicates)
[f,g]=mode(N,2);
h=g==1;
N=N(h,:);
printf("Found %i potential matches\n", size(N,1));
M=zeros(size(N),size(search,2));
for i = 1:size(N)
f=N(i,:);
M(i,:)=squareform(Y(f,f))';
endfor
F=squareform(S)';
% For now we forget about wrong permutations, so for search > 3 you need to filter these out!
M = sort(M,2);
F = sort(F,2);
% Get the sorted search string out of the (large) table M
% We search for the ones that "close" the circuit
D=ismember(M,F,'rows');
mf=find(D);
if (mf)
matches=size(mf,1);
printf("Found %i matches\n", matches);
for i = 1:matches
r=mf(i);
printf("We return match %i (only check permutations now)\n", r);
t=N(r,:)';
str=X(t,:);
check=squareform(pdist(str,'hamming')*len);
strings=char(str+64)
check
endfor
else
printf("There are no matches\n");
endif
It will generate strings such as:
ABAACCBCACABBABBAABA
ABACCCBCACBABAABACBA
CABBCBBBABCBBACAAACC