I am segmenting sentences of a text in Python using nltk's PunktSentenceTokenizer(). However, many long sentences appear in an enumerated way, and I need to extract the sub-sentences in those cases.

Example:

The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX. 

The required output would be:

"The api allows the user to achieve following goals aXXXXX. ", "The api allows the user to achieve following goals bXXXXX." and "The api allows the user to achieve following goals cXXXXX. "

How can I achieve this goal?

Kristina

2 Answers


To get the sub-sequences you could use a RegExp Tokenizer.

An example of how to use it to split the sentence could look like this:

from nltk.tokenize.regexp import regexp_tokenize

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'

# Split on the enumeration markers "(a)", "(b)", ...
# With gaps=True, the tokenizer returns the text *between* matches
# rather than the matches themselves.
parts = regexp_tokenize(str1, r'\(\w\)\s*', gaps=True)

# The first part is the shared start of the sentence.
start_of_sentence = parts.pop(0)

for part in parts:
    print(" ".join((start_of_sentence, part)))
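If you also want to drop the colon and the stray commas so the result matches the required output exactly, the same split works with the standard library's re module plus a little clean-up. This is just a sketch assuming the markers always look like "(a)", "(b)", etc.:

```python
import re

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'

# Split on the enumeration markers, like regexp_tokenize with gaps=True.
parts = re.split(r'\(\w\)\s*', str1)

# Shared start of the sentence, with the trailing colon removed.
prefix = parts.pop(0).rstrip(': ')

# Strip leftover commas/periods from each item and rebuild full sentences.
sentences = [prefix + ' ' + p.strip(' ,.') + '.' for p in parts]

for s in sentences:
    print(s)
# → The api allows the user to achieve following goals aXXXXXX.
#   The api allows the user to achieve following goals bXXXX.
#   The api allows the user to achieve following goals cXXXXX.
```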
b3000

I'll just skip over the obvious question ("What have you tried so far?"). As you may have found out already, PunktSentenceTokenizer isn't really going to help you here, since it will leave your input sentence in one piece. The best solution depends heavily on the predictability of your input. The following works on your example, but as you can see it relies on there being a colon and some commas. If they're not there, it's not going to help you.

import re
from nltk import PunktSentenceTokenizer

s = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
#sents = PunktSentenceTokenizer().tokenize(s)  # leaves s in one piece

p = s.split(':')            # p[0] is the shared start of the sentence
for l in p[1:]:
    for j in l.split(','):
        # Drop the "(a)"-style marker and surrounding whitespace.
        j = re.sub(r'\([a-z]\)', '', j).strip()
        print("%s: %s" % (p[0], j))
Igor