The following code uses SymSpell in Python; see the symspellpy guide on word_segmentation.
It uses the "de-100k.txt" and "en-80k.txt" frequency dictionaries from a github repo; you need to save them in your working directory. Unless you want to apply SymSpell logic yourself, you do not need to install and run this script to answer the question; just take the output of the two languages' word segmentations below and go on.
from symspellpy.symspellpy import SymSpell
input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"
# German:
# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
# The frequency dictionary is expected in the working directory
dictionary_path = "de-100k.txt"
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")
# English:
# Reset the sym_spell object
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = "en-80k.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")
Out:
sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884
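Not required for the answer, but to show what I mean by "many languages": I imagine wrapping the two blocks above in a small helper along these lines. This is only a sketch; the helper name build_spellers and the file-naming scheme are my own assumptions, and the dictionaries are again expected in the working directory.

from symspellpy.symspellpy import SymSpell

def build_spellers(dict_files):
    # dict_files: mapping like {"DE": "de-100k.txt", "EN": "en-80k.txt"}
    spellers = {}
    for lang, path in dict_files.items():
        sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
        # encoding="utf-8" because the German dictionary contains umlauts
        sym_spell.load_dictionary(path, term_index=0, count_index=1, encoding="utf-8")
        spellers[lang] = sym_spell
    return spellers

spellers = build_spellers({"DE": "de-100k.txt", "EN": "en-80k.txt"})
for lang, sym_spell in spellers.items():
    result = sym_spell.word_segmentation("sonnenempfindlichkeitsunoil farbpalettesuncreme")
    print(lang, result.corrected_string, result.distance_sum, result.log_prob_sum)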
The aim is to find the most relevant words by some logic: most frequent n-gram neighbours and/or word frequency, longest word, and the like. The logic is free of choice.
In this example with two languages, the two outputs need to be compared so that only the best segments are kept and the rest is dropped, without overlapping parts of words. In the outcome, each letter of the input is used exactly once.
If there are spaces between words in the input_term, these words should not be joined into a new segment. For example, if you have 'cr eme' with a wrong space in it, it should still not be allowed to become 'creme'. It is simply more likely that an existing space is correct than that joining neighbouring letters across it produces the right word.
The desired outcome would be something like:
['sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme']
[['DE'], ['DE'], ['EN'], ['EN'], ['DE'], ['DE', 'EN'], ['EN'], ['DE', 'EN']]
The 'DE'/'EN' tag is just an optional idea to show that a word exists in German and/or English; you could also choose 'EN' over 'DE' in this example. The language tags are a bonus; you can also answer without them.
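To make the constraints concrete, here is a rough sketch of the kind of merge I have in mind, reusing the spellers mapping from the helper above. It is only a sketch: the helper name merge_segmentations, the choice of log_prob_sum as ranking criterion, and the use of sym_spell.words for the language tags are my own assumptions. It splits the input on the existing spaces first, so letters on different sides of a space can never be joined, and then keeps the best-scoring segmentation per chunk; it picks one language per chunk rather than splicing segments from different languages inside a chunk, which happens to be enough for this example.

def merge_segmentations(input_term, spellers):
    # spellers: mapping {"DE": SymSpell(...), "EN": SymSpell(...)}, as built above
    words, tags = [], []
    for chunk in input_term.split(" "):
        if not chunk:
            continue
        # Segment the same chunk with every language and keep the result with
        # the highest log probability (lower distance_sum as tie-break).
        best = max(
            (spellers[lang].word_segmentation(chunk) for lang in spellers),
            key=lambda r: (r.log_prob_sum, -r.distance_sum),
        )
        for word in best.corrected_string.split(" "):
            words.append(word)
            # Optional tags: in which of the loaded dictionaries does the word exist?
            tags.append([lang for lang, s in spellers.items() if word in s.words])
    return words, tags

words, tags = merge_segmentations(
    "sonnenempfindlichkeitsunoil farbpalettesuncreme", spellers
)
print(words)
print(tags)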
There is probably a fast solution that uses numpy arrays and/or dictionaries or DataFrames instead of lists, but choose as you like.
How can I use several languages in SymSpell word segmentation and combine the results into one chosen merger? The aim is a sentence of words built from all letters, using each letter exactly once and keeping all original spaces.