
This sentence is from the Simplified Wikipedia:

There are three things in air, Nitrogen (79%), oxygen (20%), and other types of gases (1%).

The parenthetical percentages are not handled well in spaCy 2.0 and 2.1. What is the best way to handle this class of problem?

Here is the visualization: [dependency parse of the sample sentence]

Jack Parsons
  • Have you tried tokenising the sentence yourself before passing it to the parser, using the actual code/library and not only the visualisation demo? – David Batista Mar 19 '19 at 08:07

2 Answers

  • Use a regex and spaCy's merge/retokenize method to merge the content in parentheses into a single token (a retokenizer-based variant follows the example below).

    >>> import spacy
    >>> import re
    >>> my_str = "There are three things in air, Nitrogen (79%), oxygen (20%), and other types of gases (1%)."
    >>> nlp = spacy.load('en')
    >>> parsed = nlp(my_str)
    >>> [(x.text,x.pos_) for x in parsed]
    [('There', 'ADV'), ('are', 'VERB'), ('three', 'NUM'), ('things', 'NOUN'), ('in', 'ADP'), ('air', 'NOUN'), (',', 'PUNCT'), ('Nitrogen', 'PROPN'), ('(', 'PUNCT'), ('79', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), (',', 'PUNCT'), ('oxygen', 'NOUN'), ('(', 'PUNCT'), ('20', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), (',', 'PUNCT'), ('and', 'CCONJ'), ('other', 'ADJ'), ('types', 'NOUN'), ('of', 'ADP'), ('gases', 'NOUN'), ('(', 'PUNCT'), ('1', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), ('.', 'PUNCT')]
    
    >>> indexes = [m.span() for m in re.finditer(r'\([\w%]{0,5}\)', my_str)]
    >>> indexes
    [(40, 45), (54, 59), (86, 90)]
    >>> for start,end in indexes:
    ...     parsed.merge(start_idx=start,end_idx=end)
    ...
    (79%)
    (20%)
    (1%)
    >>> [(x.text,x.pos_) for x in parsed]
    [('There', 'ADV'), ('are', 'VERB'), ('three', 'NUM'), ('things', 'NOUN'), ('in', 'ADP'), ('air', 'NOUN'), (',', 'PUNCT'), ('Nitrogen', 'PROPN'), ('(79%)', 'PUNCT'), (',', 'PUNCT'), ('oxygen', 'NOUN'), ('(20%)', 'PUNCT'), (',', 'PUNCT'), ('and', 'CCONJ'), ('other', 'ADJ'), ('types', 'NOUN'), ('of', 'ADP'), ('gases', 'NOUN'), ('(1%)', 'PUNCT'), ('.', 'PUNCT')]
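
Note that Doc.merge is deprecated as of spaCy 2.1 in favour of the Doc.retokenize context manager. A minimal sketch of the same merge with that API (same regex as above) could look like this:

    import re
    import spacy

    nlp = spacy.load('en')
    doc = nlp("There are three things in air, Nitrogen (79%), oxygen "
              "(20%), and other types of gases (1%).")

    # Merge each parenthesised percentage into a single token (spaCy 2.1+).
    with doc.retokenize() as retokenizer:
        for m in re.finditer(r'\([\w%]{0,5}\)', doc.text):
            span = doc.char_span(*m.span())
            if span is not None:  # offsets must align with token boundaries
                retokenizer.merge(span)

    print([(t.text, t.pos_) for t in doc])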
    
DhruvPathak

I initially wrote an answer on the issue tracker here, but Stack Overflow is definitely a better place for this kind of question.

I just tested your example with the latest version, and the tokenization looks like this:

    ['There', 'are', 'three', 'things', 'in', 'air', ',', 'Nitrogen', '(', '79', '%', ')', ',',
    'oxygen', '(', '20', '%', ')', ',', 'and', 'other', 'types', 'of', 'gases', '(', '1', '%', ')', '.']

Here's the parse tree, which looks decent to me. (If you want to try this out yourself, note that I set options={'collapse_punct': False, 'compact': True} to show all punctuation tokens separately and make the big tree easier to read.)

[displaCy dependency parse visualization]
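
If you want to reproduce this yourself, here is a minimal sketch, assuming a pretrained model such as en_core_web_sm is installed:

    import spacy
    from spacy import displacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("There are three things in air, Nitrogen (79%), oxygen "
              "(20%), and other types of gases (1%).")

    # Print the token texts, as listed above.
    print([t.text for t in doc])

    # Serve the dependency visualization, keeping punctuation tokens
    # separate and using the compact layout described above.
    displacy.serve(doc, style='dep',
                   options={'collapse_punct': False, 'compact': True})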

That said, you can probably also find a lot of edge cases and examples of where the out-of-the-box tokenization rules can't generalise for all combinations of punctuation and parentheses, or where the pre-trained parser or tagger makes an incorrect prediction. So if you're dealing with longer inserts in parentheses and the parser struggles with those, you might want to fine-tune it with more examples like that.
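As a rough sketch of what such fine-tuning could look like with the spaCy 2.x training API (the sentence, heads, and dependency labels below are made-up placeholders, not real gold annotations):

    import random
    import spacy

    nlp = spacy.load('en_core_web_sm')

    # Hypothetical gold parse for the 8 tokens of
    # "Nitrogen (79%) is common." -- heads are absolute token indices.
    TRAIN_DATA = [
        ("Nitrogen (79%) is common.", {
            'heads': [5, 3, 3, 0, 3, 5, 5, 5],
            'deps': ['nsubj', 'punct', 'nummod', 'appos',
                     'punct', 'ROOT', 'acomp', 'punct'],
        }),
    ]

    # Update only the parser; keep the other pipeline components frozen.
    other_pipes = [p for p in nlp.pipe_names if p != 'parser']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.resume_training()  # keep pretrained weights (2.1+)
        for itn in range(10):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer,
                           losses=losses)
            print(itn, losses)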

Looking at a single sentence in isolation isn't very helpful, because it doesn't give you a good idea of the overall accuracy on your data and what to focus on. Even if you train a fancy state-of-the-art model that gets 90% accuracy on your data, it still means that every 10th prediction it makes is wrong.

Ines Montani