
The following code uses SymSpell in Python; see the symspellpy guide on word_segmentation.

It uses "de-100k.txt" and "en-80k.txt" frequency dictionaries from a github repo, you need to save them in your working directory. As long as you do not want to use any SymSpell logic, you do not need to install and run this script to answer the question, take just the output of the two language's word segmentations and go on.

import pkg_resources
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# German:
# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "de-100k.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

# English:
# Reset the sym_spell object
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "en-80k.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

Out:

sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884

The aim is to find the most relevant words by some logic: most frequent n-gram neighbours and/or word frequency, longest word, and the like. The logic is a free choice.

In this example with two languages, the two outputs need to be compared so that only the best segments are kept and the rest is dropped, without overlapping parts of words. In the outcome, each letter is used exactly once.

If there are spaces between words in the input_term, these words should not be joined into a new segment. For example, 'cr eme' with a wrong space in it should still not be allowed to become 'creme'. An existing space is simply right more often than not, compared with the errors that joining neighbouring letters across it would introduce.
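To illustrate this constraint only (a sketch, not a required approach, assuming a sym_spell object that already has its dictionaries loaded), one could segment each whitespace-separated chunk on its own and re-join the results, so that letters are never merged across an existing space:

# Sketch only: segment each whitespace-separated chunk separately,
# so that letters are never joined across an existing space.
# Assumes sym_spell already has one or more dictionaries loaded.
def segment_keeping_spaces(sym_spell, text):
    return " ".join(
        sym_spell.word_segmentation(chunk).corrected_string
        for chunk in text.split(" ")
        if chunk
    )

With the example input, this keeps the original space between "...unoil" and "farb..." intact, regardless of what the segmentation does on either side of it.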

The desired outcome would be something like:

array(['sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme'])
array([['DE'], ['DE'], ['EN'], ['EN'], ['DE'], ['DE', 'EN'], ['EN'], ['DE', 'EN']])

The 'DE'/'EN' tag is just an optional idea to show in which languages a word exists; you could also prefer 'EN' over 'DE' in this example. The language tags are a bonus; you can also answer without them.

There is probably a fast solution that uses numpy arrays and/or dictionaries instead of lists or DataFrames, but choose whatever you like.

How can many languages be used in SymSpell word segmentation and then be combined into one chosen merger? The aim is a sentence of words built from all letters, using each letter exactly once and keeping all original spaces.


1 Answer

SymSpell way

This is the recommended way; I found it only after doing the manual way below. You can use the same frequency logic that is applied to one language for two or more languages: just load two or more language dictionaries into the same sym_spell object!

import pkg_resources
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "de-100k.txt"
)

# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

# DO NOT reset the sym_spell object at this line so that
# English is added to the German frequency dictionary
# NOT: #reset the sym_spell object
# NOT: #sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "en-80k.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

Out:

sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884
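If you also want the language tags from the question: symspellpy merges everything it loads into one internal frequency dictionary, so as far as I know it does not remember which file a term came from. A sketch of a workaround (assuming the two txt files are readable at the same paths used above, and reusing result from the combined segmentation) is to keep a plain set of terms per file and look each segment up in both sets:

def load_terms(path):
    # the first whitespace-separated column of each line is the term
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

de_terms = load_terms(pkg_resources.resource_filename("symspellpy", "de-100k.txt"))
en_terms = load_terms(pkg_resources.resource_filename("symspellpy", "en-80k.txt"))

segments = result.corrected_string.split()
tags = [
    [tag for tag, terms in (("DE", de_terms), ("EN", en_terms)) if word in terms]
    for word in segments
]
print(list(zip(segments, tags)))

Whether a word gets one or two tags then depends only on whether it occurs in the respective frequency file.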

Manual way

In this manual way, the logic is: the longer word of the two languages wins, and the winning language tag is logged. If both words have the same length, both languages are logged.

As in the question, segmenting input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme" with a freshly reset SymSpell object for each language leads to s1 for German and s2 for English.

import numpy as np

s1 = 'sonnen empfindlichkeit s uno i l farb palette sun creme'
s2 = 'son ne ne mp find li ch k e it sun oil far b palette sun creme'

num_letters = len(s1.replace(' ',''))
list_w1 = s1.split()
list_w2 = s2.split()
list_w1_len = [len(x) for x in list_w1]
list_w2_len = [len(x) for x in list_w2]

# Tuples: (word, length, word index, language tag,
#          start index counting spaces, start index ignoring spaces)
lst_de = [(x[0], x[1], x[2], 'de', x[3], x[4]) for x in zip(
    list_w1, list_w1_len, range(len(list_w1)),
    np.cumsum([0] + [len(w) + 1 for w in list_w1][:-1]),
    np.cumsum([0] + [len(w) for w in list_w1][:-1]))]
lst_en = [(x[0], x[1], x[2], 'en', x[3], x[4]) for x in zip(
    list_w2, list_w2_len, range(len(list_w2)),
    np.cumsum([0] + [len(w) + 1 for w in list_w2][:-1]),
    np.cumsum([0] + [len(w) for w in list_w2][:-1]))]

idx_word_de = 0
idx_word_en = 0
lst_words = []
idx_letter = 0

# stop at num_letters-1, else you check the last word 
# also on the last idx_letter and get it twice
while idx_letter <= num_letters - 1:
    # print(lst_de[idx_word_de][5], idx_letter)
    # advance each language to the first word that starts at or after
    # the current letter index (counted without spaces)
    while lst_de[idx_word_de][5] < idx_letter:
        idx_word_de += 1
    while lst_en[idx_word_en][5] < idx_letter:
        idx_word_en += 1

    if lst_de[idx_word_de][1] > lst_en[idx_word_en][1]:
        # the German word is longer, so it wins
        lst_word_stats = lst_de[idx_word_de]
        str_word = lst_word_stats[0]
        # print('de:', lst_de[idx_word_de])
        idx_letter += len(str_word)
    elif lst_de[idx_word_de][1] == lst_en[idx_word_en][1]:
        # same length: keep the word once and log both languages
        lst_word_stats = (
            lst_de[idx_word_de][0],
            lst_de[idx_word_de][1],
            (lst_de[idx_word_de][2], lst_en[idx_word_en][2]),
            (lst_de[idx_word_de][3], lst_en[idx_word_en][3]),
            (lst_de[idx_word_de][4], lst_en[idx_word_en][4]),
            lst_de[idx_word_de][5],
        )
        str_word = lst_word_stats[0]
        # print('de:', lst_de[idx_word_de], 'en:', lst_en[idx_word_en])
        idx_letter += len(str_word)
    else:
        # the English word is longer, so it wins
        lst_word_stats = lst_en[idx_word_en]
        str_word = lst_word_stats[0]
        # print('en:', lst_en[idx_word_en][0])
        idx_letter += len(str_word)
    lst_words.append(lst_word_stats)

Out lst_words:

[('sonnen', 6, 0, 'de', 0, 0),
 ('empfindlichkeit', 15, 1, 'de', 7, 6),
 ('sun', 3, 10, 'en', 31, 21),
 ('oil', 3, 11, 'en', 35, 24),
 ('farb', 4, 6, 'de', 33, 27),
 ('palette', 7, (7, 14), ('de', 'en'), (38, 45), 31),
 ('sun', 3, (8, 15), ('de', 'en'), (46, 53), 38),
 ('creme', 5, (9, 16), ('de', 'en'), (50, 57), 41)]

Legend of the output:

chosen word | len | word_idx_of_lang | lang | letter_idx_lang_with_spaces | letter_idx_no_spaces
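
To get the two arrays asked for in the question out of lst_words, a short follow-up (the uppercase tags are only a cosmetic choice):

words = np.array([entry[0] for entry in lst_words])
langs = [
    entry[3] if isinstance(entry[3], tuple) else (entry[3],)
    for entry in lst_words
]
langs = [tuple(tag.upper() for tag in lang) for lang in langs]
print(words)
print(langs)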
  • The best way would probably be to use language detection on whatever context your words come from (which will often be more than one word) and then select a dedicated SymSpell object accordingly. – Radio Controlled Jul 13 '22 at 09:20
  • @RadioControlled You are right: just choosing the languages from two txt files and using those chopped words as a language detector is not good, though it mostly works here, since SymSpell is itself already frequency-based and will therefore give you the most likely outcome. If you added higher n-grams, it would get even better. Best is to use deep learning for the language detection, while the statistical SymSpell (I guess it uses TF-IDF) is enough to find spelling mistakes. – questionto42 Jul 13 '22 at 17:43