
I am working on a project that requires the separating of a superscript from its root word so that it can be tokenized as a separate token.

If I tokenize "This is a sentence about testString™" the result will be ["This", "is", "a", "sentence", "about", "testString™"], and I need to get to a solution where the output looks like this: ["This", "is", "a", "sentence", "about", "testString", "™"]

I am able to isolate the ™ by looking at characters whose Unicode code point is greater than 127, since 127 is the highest ASCII value and superscripts fall outside the ASCII range:

    text = "This is a sentence about testString™"
    "".join([i for i in text if ord(i) > 127])
    # result -> "™"

How can I join this back into the text so that the result will be: "This is a sentence about testString ™"?

It's really easy to just eliminate the superscripts using this logic, but I need to find a way to actually add them back to the string after the root word.

If there are any simpler ideas for how to tokenize these superscripts, I'm certainly open to other suggestions. I could not find any way other than looking at the Unicode values to separate them from the root word and then applying a tokenizer to the new string.
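One compact way to do the re-joining described above is with a regular expression: pad every non-ASCII character with spaces, then let `split()` do the tokenizing. This is a sketch that assumes the only thing distinguishing the superscripts is a code point above 127, exactly as in the question.

```python
import re

text = "This is a sentence about testString™"

# Pad each non-ASCII character (code point > 127) with spaces on both
# sides, then split on whitespace to tokenize.
spaced = re.sub(r"([^\x00-\x7F])", r" \1 ", text)
tokens = spaced.split()

print(tokens)  # ['This', 'is', 'a', 'sentence', 'about', 'testString', '™']
```

Because the pattern matches one character at a time, runs like "™™" come out as separate tokens too.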

alex
  • Q: Would [isalnum()](https://docs.python.org/3/library/stdtypes.html) adequately discriminate all the "superscript" characters you have in mind? – paulsm4 Nov 09 '22 at 21:23
  • @paulsm4 I'm more looking for a way to join the subscript back to the original string, just separate from the root word so I don't think isalnum would help? Do you have an example of it in mind? – alex Nov 09 '22 at 21:37

1 Answer


Here is a solution that handles every case I could think to throw at it.

The code loops over each character in a word and checks whether its Unicode code point is greater than 127. If it is, the code notes the character's position in the word and inserts the appropriate whitespace so that the split() function can later separate multiple special characters throughout the word.

In the first part of your question you want the output in list form, with each item as a token. Later you request the output in string form, with the special characters (superscripts) as separate tokens in the string. The solution below offers either: just comment/uncomment the lines that use temp_text_lst or temp_text_str, depending on whether you want a tokenized list or a string as output.

Solution

texts = [
    "Sentence testString™", "Sentence ™testString", "Sentence test™String",
    "Sentence testString™™", "Sentence ™™testString", "Sentence test™™String",
    "Sentence testString™™™", "Sentence ™™™testString", "Sentence test™™™String",
    "Sentence ™test™String", "Sentence test™String™", "Sentence ™test™String™",
    "Sentence ᵟteᶿstStr™™ingᵝ", "™ testString"
]

processed_texts = []

for text in texts:
    # Comment/uncomment occurrences of 'temp_text_lst' / 'temp_text_str'
    # throughout this loop to change output format
    temp_text_lst = []
    # temp_text_str = ""
    for word in text.split():
        temp_word = ""
        found_non_alpha_num = 0
        for i, char in enumerate(word):
            # Condition to detect & separate special chars attached to word
            if ord(char) > 127:
                if i == 0:
                    temp_word = char + " " + temp_word
                elif i == len(word) - 1:
                    temp_word = temp_word + " " + char
                else:
                    temp_word = temp_word + " " + char + " "
                found_non_alpha_num = 1
            else:
                temp_word += char
        if found_non_alpha_num:
            for token in temp_word.split():
                temp_text_lst.append(token)
                # temp_text_str = temp_text_str + token + " "
        else:
            temp_text_lst.append(temp_word)
            # temp_text_str = temp_text_str + temp_word + " "
    # Append processed text
    processed_texts.append(temp_text_lst)
    # Append processed text without the redundant white space added to the end of the last token
    # processed_texts.append(temp_text_str[:-1])

for text in processed_texts:
    print(text)

Output (list format)

['Sentence', 'testString', '™']
['Sentence', '™', 'testString']
['Sentence', 'test', '™', 'String']
['Sentence', 'testString', '™', '™']
['Sentence', '™', '™', 'testString']
['Sentence', 'test', '™', '™', 'String']
['Sentence', 'testString', '™', '™', '™']
['Sentence', '™', '™', '™', 'testString']
['Sentence', 'test', '™', '™', '™', 'String']
['Sentence', '™', 'test', '™', 'String']
['Sentence', 'test', '™', 'String', '™']
['Sentence', '™', 'test', '™', 'String', '™']
['Sentence', 'ᵟ', 'te', 'ᶿ', 'stStr', '™', '™', 'ing', 'ᵝ']
['™', 'testString']

Output (string format)

Sentence testString ™
Sentence ™ testString
Sentence test ™ String
Sentence testString ™ ™
Sentence ™ ™ testString
Sentence test ™ ™ String
Sentence testString ™ ™ ™
Sentence ™ ™ ™ testString
Sentence test ™ ™ ™ String
Sentence ™ test ™ String
Sentence test ™ String ™
Sentence ™ test ™ String ™
Sentence ᵟ te ᶿ stStr ™ ™ ing ᵝ
™ testString
Kyle F Hartzenberg
  • @alex You're welcome! Don't forget to upvote the answer if it was useful, or upvote and accept answers that answer your question. – Kyle F Hartzenberg Nov 19 '22 at 03:21