I am facing problems while using nltk.tokenize.word_tokenize in my code.
My code is as follows:
import nltk

def clean_str_and_tokenise(line):
    '''
    STEP 1:
    Remove punctuation marks from the input string and convert the entire string to lowercase
    chars_to_remove = [',', '.', '"', "'", '/', '*', ',', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', '&', ';', ":"]
    STEP 2:
    Tokenize (convert the clean string into a list with each word being a separate element)
    Arguments:
        line: The raw text string
    Returns:
        list of words in lowercase without punctuations
    '''
    # YOUR CODE HERE
    chars_to_remove = [',', '.', '"', "'", '/', '*', ',', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', '&', ';', ":"]
    # Drop the punctuation characters and lowercase everything in one pass
    text_clean = "".join([i.lower() for i in line if i not in chars_to_remove])
    print(text_clean)
    return nltk.tokenize.word_tokenize(text_clean)
Using test string 1:
test_str1 = 'Never, GOING* tO give. you- up?'
clean_str_and_tokenise(test_str1)
I get output as:
never going to give you up
['never', 'going', 'to', 'give', 'you', 'up']
But when I use test string 2:
test_str2 = 'Never, GONNA* give. you- up?'
clean_str_and_tokenise(test_str2)
I get the following output:
never gonna give you up
['never', 'gon', 'na', 'give', 'you', 'up']
The word 'gonna' gets split around the 'n'. I tried changing the strings and the problem persists. I have figured out that the issue is in the tokenisation, because the cleaning and lowercasing steps work properly. Can someone please help explain this?
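To confirm it is the tokeniser and not my cleaning step, here is a minimal check (assuming a default NLTK install with the punkt tokenizer data already downloaded) that feeds an already-clean string straight to the tokeniser and reproduces the split on my machine:

import nltk
# nltk.download('punkt')  # may be needed if the tokenizer data is not installed yet

# The input contains no punctuation, so the cleaning step cannot be the cause.
print(nltk.tokenize.word_tokenize('never gonna give you up'))
# On my setup this prints: ['never', 'gon', 'na', 'give', 'you', 'up']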
I expect the word 'gonna' in test string 2 not to be split and to be tokenised as a single word, i.e. the output should look like:
['never', 'gonna', 'give', 'you', 'up']