
I am looking for a sentence segmenter that can split compound sentences into simple sentences.

Example:

Input: Andrea is beautiful but she is strict.
(expected) Output: Andrea is beautiful. she is strict.

Input: i am andrea and i work for google. 
(expected) Output: i am andrea. i work for google.

Input: Italy is my favorite country; i plan to spend two weeks there next year.
(expected) Output: Italy is my favorite country. i plan to spend two weeks there next year.

Any recommendations? I tried NLTK, spaCy, segtok, and nlp-compromise, but they don't work on these compound examples (I understand this is a difficult problem, so there may be no easy solution).

Amir
  • I guess it's not a simple tokenization task; you should try a dependency grammar parser (like SyntaxNet) that will identify where the simple sentences are in your compound sentence and which word connects them. Then you can just replace that word with a dot. – Amir Jun 19 '17 at 10:03
  • Can you provide more details about what you have tried already? – mbatchkarov Jun 19 '17 at 11:08
  • Please search for the term "paraphrase" on https://scholar.google.com/ . – hiropon Oct 28 '20 at 02:23

2 Answers


First of all, you need to better define what a "simple sentence" means to you from a linguistic (grammar) perspective. You could say, for example, that simple sentences are:

  • just text without punctuation in the middle (periods, commas, colons, etc.)
  • those with a single verb; in that case you will deal with a hierarchy in which one sentence is "completed" by reusing another
  • phrase-like text, where conjunctions can act as delimiters too

In short, you have many alternatives for defining this, and depending on your needs your "rule" should be more (or less) rigorous, because it will affect your algorithm design and (of course) your output.

I would suggest two basic steps:

  1. Split by punctuation, so you get "simpler sentences" (e.g. your input 3; a regex sketch follows this list).
  2. Feed each of those into a dependency parser such as spaCy, and use the dependency links as delimiters (see the sketch below the demo).
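
For step 1, a plain regex split on sentence-internal punctuation is often enough. A minimal sketch (the pattern only covers semicolons and colons; extend it to whatever punctuation your definition includes):

import re

text = "Italy is my favorite country; i plan to spend two weeks there next year."
# Split on semicolons/colons, then restore a terminating period on each piece.
pieces = [p.rstrip(".") + "." for p in re.split(r"\s*[;:]\s*", text) if p]
print(pieces)
# ['Italy is my favorite country.', 'i plan to spend two weeks there next year.']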

Demo using your provided examples:
spaCy produces dependency trees for input 1 and input 2 (you can visualize them with displaCy). You may notice that if you use the conj relation as a delimiter and merge the remaining subtrees, you get the output you expected. You can do the same for your input 3 after splitting by punctuation, as mentioned above.
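
Here is a minimal sketch of the dependency-based split, assuming the en_core_web_sm model is installed (python -m spacy download en_core_web_sm). The helper split_compound and the "conj token that carries its own subject" heuristic are my own illustration, not a spaCy API, and the result depends on parse quality; all-lowercase inputs like your input 2 may parse worse:

import spacy

nlp = spacy.load("en_core_web_sm")

def split_compound(text):
    """Cut each sentence where a coordinated clause starts: at a token
    labeled 'conj' that has its own subject among its children."""
    results = []
    for sent in nlp(text).sents:
        clause_heads = [t for t in sent
                        if t.dep_ == "conj"
                        and any(c.dep_ in ("nsubj", "nsubjpass") for c in t.children)]
        # Each coordinated clause begins at the leftmost token of its subtree.
        cuts = sorted(t.left_edge.i for t in clause_heads)
        bounds = [sent.start] + cuts + [sent.end]
        for start, end in zip(bounds, bounds[1:]):
            # Drop the coordinating conjunction ('but', 'and', ...) and punctuation.
            words = [t.text for t in sent.doc[start:end]
                     if t.dep_ != "cc" and not t.is_punct]
            if words:
                results.append(" ".join(words) + ".")
    return results

print(split_compound("Andrea is beautiful but she is strict."))
# Parse permitting: ['Andrea is beautiful.', 'she is strict.']

Joining tokens with plain spaces is a crude detokenization; for real use you may prefer rebuilding the spans with token.text_with_ws.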

Finally, this is not a straightforward task. You may be fine with these simple rules, but if you need better results, first refine your definition of what a "compound" or "simple" sentence means, and then have a look at more sophisticated algorithms using machine learning.

Although this is a very old question, it would be nice to know if this helps :)

Jason Angel

You can use a transformer model, such as flax-community/t5-base-wikisplit from Hugging Face.

import nltk
from nltk.tokenize import sent_tokenize
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Fetch the Punkt sentence tokenizer models if not already present.
nltk.download("punkt", quiet=True)

checkpoint = "flax-community/t5-base-wikisplit"
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

text = """
Andrea is beautiful but she is strict.
I am andrea and I work for google.
Italy is my favorite country; i plan to spend two weeks there next year.
"""

# Split the raw text into (possibly compound) sentences first.
complex_sentences = sent_tokenize(text)

# Encode the compound sentences as a single padded batch.
encoder_max_length = 256
decoder_max_length = 256
complex_tokenized = tokenizer(complex_sentences,
                              padding="max_length",
                              truncation=True,
                              max_length=encoder_max_length,
                              return_tensors="pt")

# Generate the simplified versions with beam search.
simple_tokenized = model.generate(complex_tokenized["input_ids"],
                                  attention_mask=complex_tokenized["attention_mask"],
                                  max_length=decoder_max_length,
                                  num_beams=5)
simple_sentences = tokenizer.batch_decode(simple_tokenized, skip_special_tokens=True)

print("\n\n".join(simple_sentences))

Output on the given examples:

Andrea is beautiful. But she is strict.
I am andrea. I work for google.
Italy is my favorite country. I plan to spend two weeks there next year.
unikei