
I am facing a problem while using nltk.tokenize.word_tokenize in my code.

My code is as follows:

import nltk

def clean_str_and_tokenise(line):
    '''
        STEP 1:
            Remove punctuation marks from the input string and convert the entire string to lowercase
            chars_to_remove = [',', '.', '"', "'", '/', '*', ',', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', '&', ';', ":"]
        STEP 2:
            Tokenize (convert the clean string into a list with each word being a separate element)
        
        Arguments:
            line: The raw text string

        Returns:
            list of words in lowercase without punctuations
    '''
    # YOUR CODE HERE
    chars_to_remove = [',', '.', '"', "'", '/', '*', ',', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', '&', ';', ":"]
    
    text_clean = "".join([i.lower() for i in line if i not in chars_to_remove])
    print(text_clean)
    
    return nltk.tokenize.word_tokenize(text_clean)

Using test string 1

test_str1 = 'Never, GOING* tO give. you- up?'
clean_str_and_tokenise(test_str1)

I get output as:

never going to give you up
['never', 'going', 'to', 'give', 'you', 'up']

But when I use test string 2

test_str2 = 'Never, GONNA* give. you- up?'
clean_str_and_tokenise(test_str2)

I get the following output:

never gonna give you up
['never', 'gon', 'na', 'give', 'you', 'up']

The word 'gonna' gets split around the 'n'. I tried changing the strings and the problem persists. I have figured out that the issue is in the tokenisation, because the cleaning and conversion to lowercase work properly. Can someone please explain this?

I expect the word 'gonna' in test string 2 not to be split, i.e. to be tokenised as a single word, so the output should look like:

['never', 'gonna', 'give', 'you', 'up']
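
For what it's worth, the split seems to come from the tokenizer itself rather than from my cleaning: a minimal check (assuming nltk and its punkt data are installed) reproduces it on the bare word:

from nltk.tokenize import word_tokenize

# The bare word is split even without any cleaning step involved
print(word_tokenize("gonna"))  # ['gon', 'na'] on my setup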

2 Answers


There is a confusion here about word segmentation (tokenization): a tokenizer like the NLTK tokenizer is trained to detect the boundaries between lexical units, i.e. not only words but also punctuation signs, contractions, etc. Therefore:

  • In general there is no reason to remove punctuation manually before tokenizing: the NLTK tokenizer is able to properly separate it from the words, and it is potentially useful for the semantics of the text. Additionally, your method of stripping characters beforehand can itself introduce errors (for example, removing the apostrophe turns "isn't" into the single token "isnt").
  • The splitting of 'gonna' is not an error (and is not related to the two consecutive identical letters): 'gonna' is the shortened form of 'going to', which is made of two lexical units, so the tokenizer deliberately splits it into 'gon' and 'na'. For the same reason, "isn't" is usually decomposed into the two tokens 'is' and "n't". For example:
from nltk.tokenize import word_tokenize
text = "Never, GONNA* give. you- up? Additionally, it's neat, isn't it?"
print(word_tokenize(text))

Result:

['Never', ',', 'GON', 'NA', '*', 'give', '.', 'you-', 'up', '?', 'Additionally', ',', 'it', "'s", 'neat', ',', 'is', "n't", 'it', '?']
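
If you only want word-like tokens, here is a minimal sketch of two options (assuming NLTK and its punkt data are installed; the exact tokens may vary slightly between NLTK versions). Note that the first option keeps the tokenizer's behaviour, so 'gonna' is still split:

from nltk.tokenize import WhitespaceTokenizer, word_tokenize

text = "Never, GONNA* give. you- up?"

# Option 1: keep word_tokenize, then drop tokens containing no letters or digits
tokens = [tok for tok in word_tokenize(text) if any(ch.isalnum() for ch in tok)]
print(tokens)  # ['Never', 'GON', 'NA', 'give', 'you-', 'up']

# Option 2: remove the unwanted characters first, then split on whitespace only
chars_to_remove = {',', '.', '"', "'", '/', '*', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', ';', ':'}
clean = "".join(ch.lower() for ch in text if ch not in chars_to_remove)
print(WhitespaceTokenizer().tokenize(clean))  # ['never', 'gonna', 'give', 'you', 'up']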
Erwan
  • So, in order to obtain each word as a member of the list, what tokenizer should I use, if at all? I cannot use 'string' or 're' to split or tokenize. – avg-bitsian Oct 28 '22 at 11:54
  • @avg-bitsian it depends on what your goal is: if you want to process the text as accurately as possible by NLP standards, you should use the nltk tokenizer and keep the punctuation. If you want to obtain specifically the result you expected, simply don't use an advanced tokenizer: you can just split the words on whitespace. You could also use the tokenizer and remove the tokens made only of punctuation signs afterwards (I'm not sure why you would, but you could). – Erwan Oct 28 '22 at 16:17

Erwan's answer is spot on. To provide an alternative solution (given that you cannot use string.split() or re.split(), and the results from word_tokenize are unacceptable for your purpose), you can do it the old-fashioned way: iterate over the characters and append each chunk of characters that precedes a whitespace, skipping the special characters from your list as you go along. This approach also ends up being ~8x faster than your original function.

Solution

import nltk
import time


def clean_str_and_tokenise(line):
    chars_to_remove = [',', '.', '"', "'", '/', '*', ',', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', '&', ';', ":"]
    text_clean = "".join([i.lower() for i in line if i not in chars_to_remove])
    return nltk.tokenize.word_tokenize(text_clean)


def clean_str_and_tokenise_v2(line):
    chars_to_remove = [',', '.', '"', "'", '/', '*', ',', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', '&', ';', ":"]
    cleaned_words = []
    temp = ""  # characters of the word currently being built
    for char in line:
        if char == " ":
            # A space ends the current word; only keep it if it is non-empty
            # (this guards against leading or consecutive spaces).
            if temp:
                cleaned_words.append(temp)
            temp = ""
        elif char not in chars_to_remove:
            temp += char.lower()
    if temp:  # don't forget the final word, which has no trailing space
        cleaned_words.append(temp)
    return cleaned_words


test_str1 = 'Never, GOING* tO give. you- up?'
test_str2 = 'Never, GONNA* give. you- up?'


t0 = time.time()
for i in range(1000):
    clean_str1 = clean_str_and_tokenise(test_str1)
t1 = time.time()
print("Original function duration: {} ms".format((t1-t0)*10**3), clean_str1)
# Original function duration: 94.02060508728027 ms ['never', 'going', 'to', 'give', 'you', 'up']

t0 = time.time()
for i in range(1000):
    clean_str2 = clean_str_and_tokenise(test_str2)
t1 = time.time()
print("Original function duration: {} ms".format((t1-t0)*10**3), clean_str2)
# Original function duration: 83.01877975463867 ms ['never', 'gon', 'na', 'give', 'you', 'up']

t0 = time.time()
for i in range(1000):
    clean_str1_v2 = clean_str_and_tokenise_v2(test_str1)
t1 = time.time()
print("New function duration: {} ms".format((t1-t0)*10**3), clean_str1_v2)
# New function duration: 10.001897811889648 ms ['never', 'going', 'to', 'give', 'you', 'up']

t0 = time.time()
for i in range(1000):
    clean_str2_v2 = clean_str_and_tokenise_v2(test_str2)
t1 = time.time()
print("New function duration: {} ms".format((t1-t0)*10**3), clean_str2_v2)
# New function duration: 9.003162384033203 ms ['never', 'gonna', 'give', 'you', 'up']
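
As a side note, the standard-library timeit module gives more robust timings than hand-rolled time.time() loops; a rough sketch, reusing the two functions and test_str2 defined above:

import timeit

# Time 1000 calls of each function on test_str2 (both defined above)
for fn in (clean_str_and_tokenise, clean_str_and_tokenise_v2):
    duration_ms = timeit.timeit(lambda: fn(test_str2), number=1000) * 10**3
    print("{}: {:.2f} ms for 1000 calls".format(fn.__name__, duration_ms))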
Kyle F Hartzenberg