-2

I m looking for a sorting algorithm tolerant to errors.

Suppose I have a list A and a list B with some spelling errors. Here zanana instead of banana:

A = ["raspberry","apple", "banana", "peach"] 
B = ["raspberry","apple", "zanana", "peach"] 

I would like a Sort function, which returns the same new list C using A or B as input.

Sort(A) = C
Sort(B) = C 

For instance:

C = [3,2,1,0]

If I use a classical lexicographic order, I would not have the same result because zanana and banana would have different index.

I'm trying to think whether I can use a hamming distance or something like that, but without success.

Note: This is a toy example. In real example, instead of fruits, I have long string of differents sizes. Instead of banana and zanana, i have :

 - 4df7e8ea5e6ee731ec4877a99fffc8da
 - 5df7e8ea5e6ee731ec4877a99fffc8da
dridk
  • 180
  • 2
  • 13
  • 4
    can you explain in plain words how you would sort list `B` as a human? E.g. how would you assotiate 'zanana` with 'banana' and not, say, 'zagana' or 'zanava'? – Marat Jul 12 '23 at 17:05
  • 2
    This specific case seems "obvious" to a human but I've often had auto-correct software trying to "fix" less-common words by replacing them with more-common words. I think anything you designed would need to have domain knowledge, like (the incoming data is a list of fruits) and (this is a database of valid fruit names) to check for invalid names and then attempt to correct them. This would be a separate processing step before the sort step. A data cleaning step. – Dave S Jul 12 '23 at 17:08
  • Also consider "mre" -- was it supposed to be more, mere, mire, mare, or maybe MRE (acronym with multiple different meanings like Meal Ready to Eat and Minimal Reproducible Example). – Dave S Jul 12 '23 at 17:31
  • @Marat in my example, banana and zanana should have the same rank. Not matter which position. I just want 2 close lists to be ordered in the same way. – dridk Jul 12 '23 at 19:42
  • @DaveS In a real example, I have long sentence of letter with no meaning. For instance, instead of banana and zanana : - 4df7e8ea5e6ee731ec4877a99fffc8da - 4df8e8ea5e6ee731ec4877a99fffc8da – dridk Jul 12 '23 at 19:44
  • Sorry, I just realized that I wasn't clear. The sort function returns a rank list. For instance, C = [3,2,1,0] – dridk Jul 12 '23 at 19:50
  • How do you know which digits are incorrect, and what the correct digits should be? by rank, 5... > 4... but if the correct digit should be 9 the rank is reversed 9... > 5.... . For Hamming, what would you compare to what? – Dave S Jul 12 '23 at 19:57
  • 1
    @dridk a stupid way of doing that would be to sort by the number of positive bits in the string, e.g. `sorted(A, key=lambda s: sum(byte.bit_count() for byte in s.encode('utf8')))`. It is stupid because it is sensitive to string length and will get too many collisions on long lists – Marat Jul 12 '23 at 20:24
  • 1
    This might be closed for "lack of details or clarity" since the question doesn't tell us how the ranking should work in the presence of random invalid digits -- these are not words, so unlike the fruit names we have no idea how to spot the bad digits or how to compute an approximate ranking. – Dave S Jul 12 '23 at 21:27
  • @DaveS I don't know how rank should work. This is my question. Maybe the solution proposed above can help to understand my question. Sorry . – dridk Jul 12 '23 at 22:39
  • 1
    What aspect of "sorting" do you care about? Are you just trying to group strings that have many of the same characters? Should the ordering of those characters matter (should `cheap` show up closer to `cherry` or `peach`)? Does it actually matter that banana appears after apple and before peach, or does it only matter that banana and zanana tend to show up in roughly the same place when the other list elements are all the same? – StriplingWarrior Jul 12 '23 at 22:56
  • so sort-of the distance from 0 as the rank? would the sum of digits treated as unsigned int character codes work better than the sum of bits? so "129a" = 49 + 50 + 57 + 97 ? I'm trying to figure out what you want "rank" to mean and why. – Dave S Jul 12 '23 at 22:56

2 Answers2

0

One solution proposed above works pretty much .

def encode(x):
   return sum([b.bit_count() for b in x.encode()])


a = ['peach', 'banana', 'apple']
b = ['peach', 'zanana', 'apple']

c_a = sorted(a, key = encode)
c_b = sorted(n, key = encode)

print(c_a) # ['peach', 'apple', 'banana']
print(c_b) # ['peach', 'apple', 'zanana']
dridk
  • 180
  • 2
  • 13
0

I found a solution which works as expected :

  • vectorize each word of the list into n-gram space

  • Project each word to one space using Sparse PCA

  • Sort word according eigen value of this space

    a  = ["banana", "apple","kiwi", "orange", "table", "chair", "charly"]
    b  = ["banana", "apple","kiwi", "Zrange", "table", "Zhair", "charly"]
    
    
    def sort_ngram(array):
        # Vectorize
        model = CountVectorizer(ngram_range = (2,2), analyzer="char")
        matrix = model.fit_transform(array).toarray()
    
        df = pd.DataFrame(matrix)
        df.apply(lambda x: sorted(x))
    
        # Project to one space 
        model = SparsePCA(n_components=1)
        df = pd.DataFrame(model.fit_transform(df))
        return df.sort_values(0).index.to_list()
    
    print(sort_ngram(a))
    # ['apple', 'table', 'chair', 'kiwi', 'banana', 'charly', 'orange']
    
    print(sort_ngram(b))
    # ['apple', 'table', 'Zhair', 'kiwi', 'banana', 'charly', 'Zrange']
    
dridk
  • 180
  • 2
  • 13