I am segmenting sentences of a text in Python using nltk's PunktSentenceTokenizer(). However, many long sentences appear in an enumerated way, and I need to extract the sub-sentences in those cases.

Example:

The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX. 

The required output would be:

"The api allows the user to achieve following goals aXXXXX. ", "The api allows the user to achieve following goals bXXXXX." and "The api allows the user to achieve following goals cXXXXX. "

How can I achieve this goal?

Kristina

2 Answers


To get the sub-sequences you could use a RegExp Tokenizer.

An example of how to use it to split the sentence could look like this:

from nltk.tokenize.regexp import regexp_tokenize

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'

# Split on the enumeration markers "(a)", "(b)", ...
# With gaps=True, the tokenizer returns the text *between* matches
# rather than the matches themselves.
parts = regexp_tokenize(str1, r'\(\w\)\s*', gaps=True)

# The first part is the shared start of the sentence.
start_of_sentence = parts.pop(0)

for part in parts:
    print(" ".join((start_of_sentence, part)))
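If you also want to drop the colon and the stray commas so the result matches the required output exactly, the same split works with the standard library's re module plus a little clean-up. This is just a sketch assuming the markers always look like "(a)", "(b)", etc.:

```python
import re

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'

# Split on the enumeration markers, like regexp_tokenize with gaps=True.
parts = re.split(r'\(\w\)\s*', str1)

# Shared start of the sentence, with the trailing colon removed.
prefix = parts.pop(0).rstrip(': ')

# Strip leftover commas/periods from each item and rebuild full sentences.
sentences = [prefix + ' ' + p.strip(' ,.') + '.' for p in parts]

for s in sentences:
    print(s)
# → The api allows the user to achieve following goals aXXXXXX.
#   The api allows the user to achieve following goals bXXXX.
#   The api allows the user to achieve following goals cXXXXX.
```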
b3000

I'll just skip over the obvious question ("What have you tried so far?"). As you may have found out already, PunktSentenceTokenizer isn't really going to help you here, since it will leave your input sentence in one piece. The best solution depends heavily on the predictability of your input. The following works on your example, but as you can see it relies on there being a colon and some commas. If they're not there, it's not going to help you.

import re
from nltk import PunktSentenceTokenizer

s = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
#sents = PunktSentenceTokenizer().tokenize(s)  # leaves s in one piece

p = s.split(':')            # p[0] is the shared start of the sentence
for l in p[1:]:
    for j in l.split(','):
        # Drop the "(a)"-style marker and surrounding whitespace.
        j = re.sub(r'\([a-z]\)', '', j).strip()
        print("%s: %s" % (p[0], j))
Igor