Optimally selecting n datapoints from k

Question

Problem statement:

I have 32k strings that consist of 13 characters. Each character can take 3 values (a, b or c). I need to select n strings from the 32k that satisfy the following:

Select the minimal number of strings so that the selected strings do not differ from any other string within the 32k by more than 2 characters. This means that the number of strings to select is variable. Also, the strings are not randomly generated, so the average difference is less than 2/3*13, meaning that the eventual count of strings to select is not astronomical.
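To make the requirement concrete, here is a minimal sketch of a validity check. It assumes the requirement means that every string must be within Hamming distance 2 of at least one selected string. The names `hamming` and `uncovered` and the toy data are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def uncovered(strings: Sequence[str], selected: Iterable[str], radius: int = 2) -> list[str]:
    """Return the strings that are farther than `radius` from every selected string."""
    centers = list(selected)
    return [s for s in strings if all(hamming(s, c) > radius for c in centers)]


# Toy data with 4 characters instead of 13 (an assumption for brevity).
data = ["aaaa", "aaab", "aabb", "bbbb", "bbbc"]
print(uncovered(data, ["aaab"]))          # ['bbbb', 'bbbc']
print(uncovered(data, ["aaab", "bbbb"]))  # []
```

A selection is valid under this reading exactly when `uncovered` returns an empty list, so the check can be used to compare candidate algorithms.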

What I tried so far:

Clustering with k-means++ initialization followed by k-means using Hamming distance. This did not yield the desired outcome, although the problem resembles a clustering problem in the sense that we are effectively looking for cluster centers whose members lie within a radius of 2.

What I am thinking of is simply selecting the string that has the most other strings at a distance of 1 and then of 2, removing all of these from the 32k, and repeating the calculation until no strings are left. But this is likely to be suboptimal: I believe I would select more strings than the minimum required (selecting additional strings is a cost).
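A rough sketch of that greedy idea, written as a standard greedy set cover, might look like the following. It is a simplified variation rather than the exact procedure above: it counts all not-yet-covered strings within distance 2 together instead of ranking distance 1 before distance 2, and it allows an already covered string to be picked as a center. The names `greedy_cover` and `hamming` and the toy data are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cover(strings, radius=2):
    """Repeatedly pick the string that covers the most still-uncovered strings."""
    strings = list(dict.fromkeys(strings))  # drop duplicates, keep order
    n = len(strings)
    # neighbours[i] holds the indices within `radius` of strings[i], itself included.
    neighbours = [{i} for i in range(n)]
    for i, j in combinations(range(n), 2):
        if hamming(strings[i], strings[j]) <= radius:
            neighbours[i].add(j)
            neighbours[j].add(i)

    remaining = set(range(n))
    chosen = []
    while remaining:
        # Ties are broken in favour of the earliest string in the input.
        best = max(range(n), key=lambda i: len(neighbours[i] & remaining))
        chosen.append(strings[best])
        remaining -= neighbours[best]
    return chosen


# Toy data with 4 characters instead of 13 (an assumption for brevity).
data = ["aaaa", "aaab", "aabb", "bbbb", "bbbc", "cccc"]
print(greedy_cover(data))  # ['aabb', 'bbbb', 'cccc']
```

The all-pairs loop is quadratic, which is slow in pure Python for 32k strings. Since a radius-2 ball around a 13-character string over 3 letters contains only 1 + 26 + 312 = 339 strings, enumerating each ball and looking its members up in a set would likely be cheaper. The greedy result is an upper bound on the minimum, not a guaranteed optimum.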

Question:

What other algorithms should I consider or think of? Thanks!

score 1 · Answer 1 · answered Apr 23 '22 at 16:56

Here are examples of each method from my previous post. I always have trouble working code into my posts, so I did this separately. The first method computes the percentage that the strings are identical; the second method returns the number of differences.

string1 = ('abcbacaacbaab')
string2 = ('abcbacaacbbbb')

from difflib import SequenceMatcher
a=string1
b=string2
x = SequenceMatcher(a=a,b=b).ratio()
print(x)
#output: 0.8462

# OR (I used pip3 install jellyfish first)
import jellyfish
x = jellyfish.damerau_levenshtein_distance(a, b)
print(x)
# output: 2

Thanks for your help. So far I used the following code to calculate the distance: from operator import eq def distance(a,b): count= 13-sum(map(eq, a, b)) return count It completed 1 million calculations in 1.6s. Sequence Matcher is much slower, clocking over 10s, whereas jellyfish is 0.3 seconds, which is much faster than my original solution, thank you! — PeterP, Apr 27 '22 at 15:48
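For readability, the distance function quoted in the comment above can be laid out as a block. This is a lightly adapted sketch: using `len(a)` in place of the hard-coded 13 is an assumption made so that it works for any pair of equal-length strings.

```python
from operator import eq


def distance(a: str, b: str) -> int:
    """Hamming distance: the number of positions where the two strings differ."""
    return len(a) - sum(map(eq, a, b))


print(distance("abcbacaacbaab", "abcbacaacbbbb"))  # 2
print(distance("abc", "bac"))                      # 2
```

Note that Hamming distance and Damerau-Levenshtein distance are different measures. The second call returns 2 because two positions differ, whereas Damerau-Levenshtein treats swapping two adjacent characters as a single edit, so the two can disagree on some pairs.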

score 0 · Answer 2 · answered Apr 23 '22 at 13:38

You might be able to use one of the types of 'fuzzy string matching' explained at:

https://miguendes.me/python-compare-strings#how-to-compare-two-strings-for-similarity-fuzzy-string-matching

There's "difflib" which computes a ratio of the differences. (You're in luck, your strings are all the same length.) There's also something called "jellyfish" that returns a character count of the differences. It sounds like an interesting assignment, good luck!

joaopfg · Answer 3 · 2022-04-23T17:33:41.903

If I understood correctly, you want the minimum subset such that no element in the subset differs by more than two characters from the elements outside of the subset (please let me know if I misunderstood the problem).

If that is the problem, there is a simple ad hoc algorithm that solves it in O(m * max(n, k)), where n is the total number of elements in the set (32000 in this case), m is the number of characters of an element of the set (13 in this case) and k is the size of the alphabet (3 in this case).

You can precalculate the count of each character of the alphabet in each column in O(m * max(n, k)). It's O(m * k) to initialize the precalculation matrix and O(m * n) to actually fill it.

Each column can vote for the removal of a string from the set if the count of the string's character in that column is equal to the number of strings in the initial set. Notice that a column can vote in O(1) using the precalculation. For each string, iterate through its columns and let each column vote. If you get three votes, you are sure the string needs to be kicked out of the set, so there is no need to continue iterating through the columns; just go to the next string. Otherwise, the string needs to remain, so append it to the answer.

Python code is attached:

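def solve(s: list[str], n: int = 32000, m: int = 13, k: int = 3) -> list[str]:
    # pre_calc[j][c] = number of strings whose character in column j
    # is chr(ord('a') + c)
    pre_calc = [[0 for j in range(k)] for i in range(m)]
    ans = []

    # Fill the per-column character counts: O(m * n)
    for i in range(n):
        for j in range(m):
            pre_calc[j][ord(s[i][j]) - ord('a')] += 1

    for i in range(n):
        votes_cnt = 0
        remove = False

        for j in range(m):
            # Column j votes when all n strings share this character there
            if pre_calc[j][ord(s[i][j]) - ord('a')] == n:
                votes_cnt += 1

                # Three votes are enough to remove the string
                if votes_cnt == 3:
                    remove = True
                    break

        if not remove:
            ans.append(s[i])

    # Never return an empty selection
    if len(ans) == 0:
        ans.append(s[0])

    return ans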

Thanks for your comment - I am trying this out. It will take some time, but I will post if it works later - thanks again! — PeterP, Apr 27 '22 at 15:50
Ok, let me know if there is some problem ;) If I am solving the right problem, this solution should be very fast. — joaopfg, Apr 27 '22 at 18:11
So I managed to run your code. It kept all strings in the output. I tried it with 15943 strings (1% of 3 power 13). I checked the pre_calc list of lists and the total count of votes was exactly 13*15943, so I am not sure, it seems that your first loop always results in a +1. The characters I use are always a, b and c. Your problem statement is very smooth, and yes, that is what I am trying to achieve. Thank you! — PeterP, May 04 '22 at 13:55
