
I want to know how to word tokenize the following sentence (string):

"I am good. I e.g. wash the dishes."

Into the following list of words:

["I", "am", "good", ".", "I", "e.g.", "wash", "the", "dishes"]

Now, the problem is that abbreviations like "e.g." are tokenized by NLTK's word_tokenize as ["e.g", "."].
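
For reference, this is roughly what I am doing at the moment (a minimal sketch; the exact output can differ between NLTK versions, but in my case the abbreviation comes out split apart):

import nltk

sent = "I am good. I e.g. wash the dishes."
print(nltk.word_tokenize(sent))
# in my case "e.g." is broken apart into something like 'e.g' and '.'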

I tried using punkt trained with "e.g." as a known abbreviation to sentence-tokenize the string first, but I realised that after word-tokenizing each sentence I get the same result.
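
What I tried looks roughly like this (a sketch: 'e.g' is registered as an abbreviation so punkt does not split the sentence there, but the word-tokenization step still breaks it apart):

from nltk.tokenize import word_tokenize
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_param = PunktParameters()
punkt_param.abbrev_types.add('e.g')           # abbreviations are stored without the trailing period
sent_tokenizer = PunktSentenceTokenizer(punkt_param)

for s in sent_tokenizer.tokenize("I am good. I e.g. wash the dishes."):
    print(word_tokenize(s))                   # "e.g." still gets split here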

Any thoughts on how I could achieve my goal?

Note: I am restricted to using NLTK.

  • Possible duplicate of https://stackoverflow.com/questions/34805790/how-to-avoid-nltks-sentence-tokenizer-splitting-on-abbreviations – amanb Mar 16 '19 at 21:09
  • Not a duplicate, since the question you are referring to deals with sentence tokens; my question is about word tokens. I did look at that question previously and tried to apply some of its principles in my program, but it did not help, since the word-tokenization step brings me back to the problem described in my question. – tim Mar 16 '19 at 21:13

1 Answer


NLTK's regexp_tokenize splits a string into substrings using a regular expression. You can define a regex pattern whose alternatives describe the token shapes you want, and the tokenizer returns everything the pattern matches. For your particular use case, the pattern below looks for abbreviations (both upper and lower case), words with optional internal hyphens, and symbols like '.', ';' etc.

import nltk
sent = "I am good. I e.g. wash the dishes."
pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Za-z]\.)+        # abbreviations(both upper and lower case, like "e.g.", "U.S.A.")
        | \w+(?:-\w+)*        # words with optional internal hyphens 
        | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
nltk.regexp_tokenize(sent, pattern)
#Output:
['I', 'am', 'good', '.', 'I', 'e.g.', 'wash', 'the', 'dishes', '.']

The regex alternative for abbreviations is (?:[A-Za-z]\.)+, a non-capturing group that matches one or more occurrences of a single letter ([A-Za-z]) immediately followed by a literal period (\.), so "e.g." and "U.S.A." are kept as single tokens.
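
You can verify this sub-pattern on its own with plain re (a quick check, independent of NLTK):

import re
print(re.findall(r'(?:[A-Za-z]\.)+', "e.g. U.S.A."))
# ['e.g.', 'U.S.A.']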

On the other hand, a full stop at the end of an ordinary word such as "dishes." never reaches the abbreviation alternative: matching starts at the first letter of the word, where (?:[A-Za-z]\.)+ fails because that letter is not immediately followed by a period, so \w+ consumes the whole word and the trailing full stop is left to be matched as an independent symbol by the character class:

'[][.,;"'?():_`-]'
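
To see the difference concretely, run the tokenizer on each case in isolation, reusing the import and the pattern from the snippet above:

print(nltk.regexp_tokenize("e.g.", pattern))
# ['e.g.']        -- the abbreviation alternative matches starting at the first letter
print(nltk.regexp_tokenize("dishes.", pattern))
# ['dishes', '.'] -- \w+ consumes the word, so the trailing full stop falls to the symbol class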
  • If it's not too much trouble, could you please explain how your regex differentiates between a full stop at the end of a sentence and a full stop at the end of an abbreviation? Thank you. – tim Mar 16 '19 at 21:53
  • I've just added an explanation – amanb Mar 16 '19 at 22:11