We are working on sentences extracted from a PDF. The problem is that the extracted text includes the title, footers, table of contents, etc. Is there a way to determine whether a sentence we get when we pass the document to spaCy is a complete sentence? Is there a way to filter out sentence fragments such as titles?
- can you give examples? You could check if certain Part-of-Speech tags are present (a title might not have a verb). You could also check upper/lower-case counts and if the "sentence" ends with a dot or not. – Jérôme Bau May 23 '18 at 07:01
- as an example I get things like "I. Introduction" in a table of contents. The title could be anything. – CrabbyPete May 24 '18 at 20:06
- Did you ever find an answer to your question? – sudo Sep 09 '18 at 01:24
3 Answers
A complete sentence contains at least one subject, one predicate, and one object, and closes with punctuation. Subject and object are almost always nouns, and the predicate is always a verb.
Thus you need to check that your sentence contains at least two nouns and at least one verb, and that it closes with punctuation:
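```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I. Introduction\nAlfred likes apples! A car runs over a red light.")
for sent in doc.sents:
    # Only consider spans that start title-cased and end with punctuation.
    if sent[0].is_title and sent[-1].is_punct:
        has_noun = 2  # nouns still required (subject and object)
        has_verb = 1  # verbs still required (predicate)
        for token in sent:
            if token.pos_ in ["NOUN", "PROPN", "PRON"]:
                has_noun -= 1
            elif token.pos_ == "VERB":
                has_verb -= 1
        if has_noun < 1 and has_verb < 1:
            # .text is the supported attribute; the older .string alias is deprecated.
            print(sent.text.strip())
```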
Update
I would also advise checking that the sentence starts with an uppercase letter; I added that check to the code. Note that what I wrote holds for English and German. I don't know how it works in other languages.
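The noun and verb thresholds in the loop above are a trade-off, so it can help to make them parameters. The following is a minimal sketch of the same counting rule in plain Python, not the answer's original code. It assumes each sentence has already been turned into `(text, pos)` pairs, for example with `[(t.text, t.pos_) for t in sent]`. The function name `looks_complete`, its parameters, and the hand-written tags in the usage lines are hypothetical, and testing the last tag against `"PUNCT"` only approximates spaCy's `is_punct`.

```python
from typing import List, Tuple

NOUN_TAGS = {"NOUN", "PROPN", "PRON"}


def looks_complete(tagged: List[Tuple[str, str]],
                   min_nouns: int = 2,
                   min_verbs: int = 1) -> bool:
    """Apply the noun/verb counting heuristic to one tagged sentence.

    `tagged` holds (token text, coarse POS tag) pairs in sentence order.
    """
    if not tagged:
        return False
    first_text, _ = tagged[0]
    _, last_pos = tagged[-1]
    # Same gate as the loop above: title-cased first token, final punctuation.
    if not first_text.istitle() or last_pos != "PUNCT":
        return False
    nouns = sum(1 for _, pos in tagged if pos in NOUN_TAGS)
    verbs = sum(1 for _, pos in tagged if pos == "VERB")
    return nouns >= min_nouns and verbs >= min_verbs


# Hand-tagged examples (assumed tags, not real tagger output).
heading = [("Introduction", "NOUN")]
full = [("Alfred", "PROPN"), ("likes", "VERB"), ("apples", "NOUN"), ("!", "PUNCT")]
short = [("I", "PRON"), ("understand", "VERB"), (".", "PUNCT")]

print(looks_complete(heading))             # heading has no final punctuation
print(looks_complete(full))                # two nouns and one verb
print(looks_complete(short))               # only one noun-like token
print(looks_complete(short, min_nouns=1))  # relaxed threshold
```

Lowering `min_nouns` to 1 accepts intransitive sentences such as "I understand.", at the cost of letting more fragments through.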

- This will work with simple sentences, but a phrase like `'Apples and pears that taste nice.'` will be incorrectly identified as a complete sentence. – Garrett Apr 13 '21 at 02:30
- Yes, but it is not about "simple" sentences, but about **correct** sentences. _Garbage in -> Garbage out_. Thus, it works as intended! It is not supposed to identify wrong grammar. – tkr Aug 06 '21 at 08:33
- The assumption that "a complete sentence contains at least one subject, one predicate, one object" is wrong. This solution fails for many sentences, such as "The dog is sleeping." and "I understand." – Jerry K. Oct 28 '21 at 10:47
- Indeed, it is a trade off. You can simply change `has_noun = 2` to `has_noun = 1` and your examples are covered. You will also get a lot of garbage though. – tkr Nov 11 '21 at 18:39
Try looking for the first noun chunk in each sentence. That is usually (but not always) the title subject of the sentence.

```python
sentence_title = [chunk.text for chunk in doc.noun_chunks][0]
```

You can perform sentence segmentation using the trainable SentenceRecognizer pipeline component in spaCy: https://spacy.io/api/sentencerecognizer
Additionally, if you can identify a pattern in the text, use Python's regex library `re`: https://docs.python.org/3/library/re.html
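As a rough illustration of the regex idea, the sketch below drops lines that look like numbered headings, table-of-contents entries, or page footers before the text is handed to spaCy. The patterns and the helper name `is_boilerplate` are hypothetical and would need tuning against the actual PDFs. They only cover layouts such as "I. Introduction", "Introduction ..... 3", and "Page 4 of 10".

```python
import re

# Hypothetical patterns; adjust them to the documents you actually have.
# "I. Introduction", "2.1 Methods": Roman numeral with a dot, or a section number.
NUMBERED_HEADING = re.compile(r"^(?:[IVXLCDM]+\.|\d+(?:\.\d+)*\.?)\s+\S")
# "Introduction ........ 3": dotted leader followed by a page number.
TOC_LEADER = re.compile(r"\.{3,}\s*\d+$")
# "12", "Page 4", "Page 4 of 10": bare page numbers and simple footers.
PAGE_FOOTER = re.compile(r"^(?:Page\s+)?\d+(?:\s+of\s+\d+)?$", re.IGNORECASE)
# Sentence-final punctuation, optionally followed by a closing quote or bracket.
ENDS_LIKE_SENTENCE = re.compile(r"[.!?][\"')\]]*$")


def is_boilerplate(line: str) -> bool:
    """Return True if the line looks like a heading, TOC entry, or footer."""
    text = line.strip()
    if not text:
        return True
    if PAGE_FOOTER.match(text) or TOC_LEADER.search(text):
        return True
    # A numbered line is dropped only if it also lacks final punctuation,
    # so "2 cats sat on the mat." is kept.
    if NUMBERED_HEADING.match(text) and not ENDS_LIKE_SENTENCE.search(text):
        return True
    return False


lines = [
    "I. Introduction",
    "Introduction ........ 3",
    "Page 4 of 10",
    "Alfred likes apples!",
    "I understand.",
]
kept = [line for line in lines if not is_boilerplate(line)]
print(kept)
```

Filtering line by line like this assumes the PDF extraction preserves line breaks around headings and footers. If it does not, the same patterns could be applied to each `sent.text` after segmentation instead.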
