Determine if a text extract from spacy is a complete sentence

Question

We are working on sentences extracted from a PDF. The problem is that the extracted text includes titles, footers, the table of contents, etc. Is there a way to determine whether a sentence we get when we pass the document to spaCy is a complete sentence? Is there a way to filter out sentence fragments such as titles?

can you give examples? You could check if certain Part-of-Speech tags are present (a title might not have a verb). You could also check upper/lower-case counts and if the "sentence" ends with a dot or not. — Jérôme Bau, May 23 '18 at 07:01
as an example I get things like "I. Introduction" in a table of contents. The title could be anything. — CrabbyPete, May 24 '18 at 20:06

tkr · Answer 1 · 2020-02-11T09:49:26.943

4

A complete sentence contains at least one subject, one predicate, one object, and closes with punctuation. Subject and object are almost always nouns, and the predicate is always a verb.

Thus you need to check if your sentence contains two nouns, one verb and closes with punctuation:

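import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I. Introduction\nAlfred likes apples! A car runs over a red light.")
for sent in doc.sents:
    # Only consider spans that start with a capitalized token and end with punctuation.
    if sent[0].is_title and sent[-1].is_punct:
        has_noun = 2  # nouns/pronouns still required
        has_verb = 1  # verbs still required
        for token in sent:
            if token.pos_ in ["NOUN", "PROPN", "PRON"]:
                has_noun -= 1
            elif token.pos_ == "VERB":
                has_verb -= 1
        # Both counters must be used up for the span to count as a sentence.
        if has_noun < 1 and has_verb < 1:
            # Use .text; Span.string is deprecated and was removed in spaCy v3.
            print(sent.text.strip())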

Update

I would also advise checking that the sentence starts with an upper-case letter; I added this check to the code. Furthermore, I would like to point out that what I wrote holds for English and German; I don't know how it works in other languages.

edited Feb 11 '20 at 09:49

answered Feb 10 '20 at 11:43

tkr


This will work with simple sentences, but a phrase like `'Apples and pears that taste nice.'` will be incorrectly identified as a complete sentence. – Garrett Apr 13 '21 at 02:30
Yes, but it is not about "simple" sentences, but about **correct** sentences. _Garbage in -> Garbage out_. Thus, it works as intended! It is not supposed to identify wrong grammar. – tkr Aug 06 '21 at 08:33
The assumption that "a complete sentence contains at least one subject, one predicate, one object" is wrong. This solution fails for many sentences, such as "The dog is sleeping." and "I understand." – Jerry K. Oct 28 '21 at 10:47
Indeed, it is a trade off. You can simply change `has_noun = 2` to `has_noun = 1` and your examples are covered. You will also get a lot of garbage though. – tkr Nov 11 '21 at 18:39
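The trade-off discussed in these comments can be made explicit by turning the thresholds into parameters. The following is only a sketch, not the answer's original code: the function name `is_sentence_like`, the parameters `min_nouns` and `min_verbs`, and the choice to count `AUX` tags as verbs are all assumptions made for illustration. It works on plain `(text, POS tag)` pairs so that the rule can be tried without loading a model.

```python
from typing import Iterable, Tuple

NOUN_TAGS = {"NOUN", "PROPN", "PRON"}
# Assumption: auxiliaries ("is", "has") are counted as verbs as well.
VERB_TAGS = {"VERB", "AUX"}


def is_sentence_like(
    tagged: Iterable[Tuple[str, str]],
    min_nouns: int = 2,
    min_verbs: int = 1,
) -> bool:
    """Heuristic check on (text, coarse POS tag) pairs for one candidate sentence."""
    tagged = list(tagged)
    if not tagged:
        return False
    first_text, _ = tagged[0]
    _, last_pos = tagged[-1]
    # Must start with an upper-case letter and end with a punctuation token.
    if not first_text[:1].isupper() or last_pos != "PUNCT":
        return False
    nouns = sum(pos in NOUN_TAGS for _, pos in tagged)
    verbs = sum(pos in VERB_TAGS for _, pos in tagged)
    return nouns >= min_nouns and verbs >= min_verbs
```

With spaCy, the pairs could be built as `[(t.text, t.pos_) for t in sent]` for each `sent` in `doc.sents`. Lowering `min_nouns` to 1 accepts sentences such as "I understand." at the cost of letting more fragments through.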

score 0 · Answer 2 · answered Nov 09 '18 at 15:42

0

Try looking for the first noun chunk in each sentence. That is usually (but not always) the title subject of the sentence.

sentence_title = [chunk.text for chunk in doc.noun_chunks][0]
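The one-liner above indexes into the noun chunks of the whole `doc` and raises an `IndexError` when there are none. A per-sentence variant might look like the sketch below; the helper names `first_noun_chunk` and `demo` are made up for this example, and `demo` assumes that spaCy and the `en_core_web_sm` model are installed.

```python
from typing import Optional


def first_noun_chunk(span) -> Optional[str]:
    """Return the text of the first noun chunk of a spaCy Doc or Span, or None."""
    # next() with a default avoids the IndexError raised by [...][0] on an empty list.
    first = next(iter(span.noun_chunks), None)
    return first.text if first is not None else None


def demo() -> None:
    # Assumption: spaCy and en_core_web_sm are available; noun_chunks needs the parser.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Alfred likes apples! A car runs over a red light.")
    for sent in doc.sents:
        print(first_noun_chunk(sent), "|", sent.text.strip())
```

A span for which `first_noun_chunk` returns `None` contains no noun phrase at all, which may itself be a useful signal that it is not a full sentence.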

answered Nov 09 '18 at 15:42

reka18


score 0 · Answer 3 · answered May 12 '21 at 07:20

You can perform sentence segmentation using a trainable pipeline component in spaCy, the SentenceRecognizer: https://spacy.io/api/sentencerecognizer

Additionally, if you can identify a pattern in the text strings you want to drop, you can filter them with Python's regex library, re: https://docs.python.org/3/library/re.html
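As a rough sketch of the regex idea, assume the unwanted fragments look like numbered headings such as "I. Introduction" (the example given in the question's comments). The pattern and the helper name `looks_like_heading` are illustrative assumptions and would need adjusting to the actual documents.

```python
import re

# Hypothetical pattern: a Roman or Arabic section number, a dot or closing
# parenthesis, then a short title that contains no sentence-final punctuation.
HEADING_RE = re.compile(
    r"""^\s*
        (?:[IVXLCDM]+|\d+(?:\.\d+)*)   # section number: I, IV, 3, 2.1
        [.)]\s+                        # separator after the number
        [^.!?]{1,80}                   # title text without ., ! or ?
        \s*$""",
    re.VERBOSE,
)


def looks_like_heading(text: str) -> bool:
    """Return True if the text matches the numbered-heading pattern."""
    return HEADING_RE.match(text) is not None


candidates = ["I. Introduction", "Alfred likes apples!", "2.1. Methods"]
kept = [text for text in candidates if not looks_like_heading(text)]
```

Here `kept` retains only the candidates that do not match the heading pattern. Single-letter initials such as "C." at the start of a fragment would also match, so this filter is best combined with other checks.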
