
NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on each sentence. It does a pretty good job out of the box.

>>> nltk.word_tokenize("(Dr. Edwards is my friend.)")
['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')']

I'd like to use this same algorithm, but have it return tuples of offsets into the original string instead of string tokens.

By offsets I mean 2-tuples that can serve as indexes into the original string. For example, here I'd have

>>> s = "(Dr. Edwards is my friend.)"
>>> s.token_spans()
[(0,1), (1,4), (5,12), (13,15), (16,18), (19,25), (25,26), (26,27)]

because s[0:1] is "(", s[1:4] is "Dr." and so forth.

Is there a single NLTK call that does this, or do I have to write my own offset arithmetic?

W.P. McNeill

4 Answers


Yes, most tokenizers in NLTK have a method called span_tokenize, but unfortunately the tokenizer you are using doesn't.

By default the word_tokenize function uses a TreebankWordTokenizer. The TreebankWordTokenizer is fairly robust, but it currently lacks one important method: span_tokenize.

I see no implementation of span_tokenize for TreebankWordTokenizer, so I believe you will need to write your own. Subclassing TokenizerI can make this process a little less complex.

You might find the span_tokenize method of PunktWordTokenizer useful as a starting point.
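
For instance, here is a minimal sketch of such a subclass (the class name WordSpanTokenizer is my own illustration, not NLTK API; it assumes each token appears verbatim in the input, which the Treebank tokenizer's rewriting of quotes to `` and '' can violate):

from nltk.tokenize import word_tokenize
from nltk.tokenize.api import TokenizerI

class WordSpanTokenizer(TokenizerI):
    """Derive (start, end) spans by re-aligning word_tokenize's
    tokens against the original string."""

    def tokenize(self, s):
        return word_tokenize(s)

    def span_tokenize(self, s):
        offset = 0
        for token in self.tokenize(s):
            # Assumes the token text occurs verbatim in s at or
            # after the current offset; tokens the tokenizer has
            # rewritten would need special handling here.
            start = s.find(token, offset)
            if start == -1:
                raise ValueError('token %r not found in text' % token)
            yield start, start + len(token)
            offset = start + len(token)

>>> list(WordSpanTokenizer().span_tokenize("(Dr. Edwards is my friend.)"))
[(0, 1), (1, 4), (5, 12), (13, 15), (16, 18), (19, 25), (25, 26), (26, 27)]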

I hope this info helps.

erik-e
    I whipped up a quick version of one here: https://gist.github.com/ckoppelman/c93e4192d9f189fba590e095258f8f33. Any help or advice is appreciated. – Charles Apr 28 '17 at 17:43

At least since NLTK 3.4, TreebankWordTokenizer supports span_tokenize:

>>> from nltk.tokenize import TreebankWordTokenizer as twt
>>> list(twt().span_tokenize('What is the airspeed of an unladen swallow ?'))
[(0, 4),
 (5, 7),
 (8, 11),
 (12, 20),
 (21, 23),
 (24, 26),
 (27, 34),
 (35, 42),
 (43, 44)]
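
Each span slices back to the corresponding token in the original string:

>>> s = 'What is the airspeed of an unladen swallow ?'
>>> [s[start:end] for start, end in twt().span_tokenize(s)]
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']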
Fibo Kowalsky

The pytokenizations library has a useful function, get_original_spans, for getting the spans:

# $ pip install pytokenizations
import nltk
import tokenizations

text = "(Dr. Edwards is my friend.)"
tokens = nltk.word_tokenize(text)
tokenizations.get_original_spans(tokens, text)
# [(0, 1), (1, 4), (5, 12), (13, 15), (16, 18), (19, 25), (25, 26), (26, 27)]

See the documentation for other useful functions.

tamuhey

NLTK version 3.5's TreebankWordTokenizer supports the span_tokenize() method, so there is no need to write your own offset arithmetic anymore:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected = [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23),
... (24, 26), (27, 30), (31, 32), (32, 36), (36, 37), (37, 38),
... (40, 46), (47, 48), (48, 51), (51, 52), (53, 55), (56, 59),
... (60, 62), (63, 68), (69, 70), (70, 76), (76, 77), (77, 78)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')',
... 'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True
Joe
    Welcome to Stack Overflow. Before answering an already answered question (green check-mark) it's best to check whether your answer duplicates any of the existing answers. Your answer here seems to duplicate the [accepted answer](https://stackoverflow.com/a/55787917/5698098) from Fibo Kowalsky. Duplication is rarely something good; and may be seen as unwanted distraction on Stack Overflow (on which you may get down-votes). See also [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer); and take the [tour](https://stackoverflow.com/tour) to learn how Stack Overflow works. – Ivo Mori Aug 29 '20 at 14:36