
This sentence is from the Simplified Wikipedia:

There are three things in air, Nitrogen (79%), oxygen (20%), and other types of gases (1%).

The parenthetical percentages are not handled well in spaCy 2.0 and 2.1. What is the best way to handle this class of problem?

Here is the visualization: [dependency parse of the sample sentence]

Jack Parsons
  • Have you tried tokenising the sentence yourself before passing it to the parser, using the actual code/library and not only the visualisation demo? – David Batista Mar 19 '19 at 08:07

2 Answers

  • Use a regex and spaCy's merge/retokenize method to merge the content in parentheses into a single token (a retokenizer-based variant follows the example below).

    >>> import spacy
    >>> import re
    >>> my_str = "There are three things in air, Nitrogen (79%), oxygen (20%), and other types of gases (1%)."
    >>> nlp = spacy.load('en')
    >>> parsed = nlp(my_str)
    >>> [(x.text,x.pos_) for x in parsed]
    [('There', 'ADV'), ('are', 'VERB'), ('three', 'NUM'), ('things', 'NOUN'), ('in', 'ADP'), ('air', 'NOUN'), (',', 'PUNCT'), ('Nitrogen', 'PROPN'), ('(', 'PUNCT'), ('79', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), (',', 'PUNCT'), ('oxygen', 'NOUN'), ('(', 'PUNCT'), ('20', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), (',', 'PUNCT'), ('and', 'CCONJ'), ('other', 'ADJ'), ('types', 'NOUN'), ('of', 'ADP'), ('gases', 'NOUN'), ('(', 'PUNCT'), ('1', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), ('.', 'PUNCT')]
    
    >>> indexes = [m.span() for m in re.finditer(r'\([\w%]{0,5}\)', my_str)]
    >>> indexes
    [(40, 45), (54, 59), (86, 90)]
    >>> for start,end in indexes:
    ...     parsed.merge(start_idx=start,end_idx=end)
    ...
    (79%)
    (20%)
    (1%)
    >>> [(x.text,x.pos_) for x in parsed]
    [('There', 'ADV'), ('are', 'VERB'), ('three', 'NUM'), ('things', 'NOUN'), ('in', 'ADP'), ('air', 'NOUN'), (',', 'PUNCT'), ('Nitrogen', 'PROPN'), ('(79%)', 'PUNCT'), (',', 'PUNCT'), ('oxygen', 'NOUN'), ('(20%)', 'PUNCT'), (',', 'PUNCT'), ('and', 'CCONJ'), ('other', 'ADJ'), ('types', 'NOUN'), ('of', 'ADP'), ('gases', 'NOUN'), ('(1%)', 'PUNCT'), ('.', 'PUNCT')]
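
Note that Doc.merge is deprecated as of spaCy 2.1 in favour of the Doc.retokenize context manager. A minimal sketch of the same merge with that API (same regex as above) could look like this:

    import re
    import spacy

    nlp = spacy.load('en')
    doc = nlp("There are three things in air, Nitrogen (79%), oxygen "
              "(20%), and other types of gases (1%).")

    # Merge each parenthesised percentage into a single token (spaCy 2.1+).
    with doc.retokenize() as retokenizer:
        for m in re.finditer(r'\([\w%]{0,5}\)', doc.text):
            span = doc.char_span(*m.span())
            if span is not None:  # offsets must align with token boundaries
                retokenizer.merge(span)

    print([(t.text, t.pos_) for t in doc])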
    
DhruvPathak

I initially wrote an answer on the issue tracker here, but Stack Overflow is definitely a better place for this kind of question.

I just tested your example with the latest version, and the tokenization looks like this:

    ['There', 'are', 'three', 'things', 'in', 'air', ',', 'Nitrogen', '(', '79', '%', ')', ',',
    'oxygen', '(', '20', '%', ')', ',', 'and', 'other', 'types', 'of', 'gases', '(', '1', '%', ')', '.']

Here's the parse tree, which looks decent to me. (If you want to try this out yourself, note that I set options={'collapse_punct': False, 'compact': True} to show all punctuation tokens separately and make the big tree easier to read.)

[displaCy dependency parse visualization]
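
If you want to reproduce this yourself, here is a minimal sketch, assuming a pretrained model such as en_core_web_sm is installed:

    import spacy
    from spacy import displacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("There are three things in air, Nitrogen (79%), oxygen "
              "(20%), and other types of gases (1%).")

    # Print the token texts, as listed above.
    print([t.text for t in doc])

    # Serve the dependency visualization, keeping punctuation tokens
    # separate and using the compact layout described above.
    displacy.serve(doc, style='dep',
                   options={'collapse_punct': False, 'compact': True})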

That said, you can probably also find a lot of edge cases and examples of where the out-of-the-box tokenization rules can't generalise for all combinations of punctuation and parentheses, or where the pre-trained parser or tagger makes an incorrect prediction. So if you're dealing with longer inserts in parentheses and the parser struggles with those, you might want to fine-tune it with more examples like that.
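As a rough sketch of what such fine-tuning could look like with the spaCy 2.x training API (the sentence, heads, and dependency labels below are made-up placeholders, not real gold annotations):

    import random
    import spacy

    nlp = spacy.load('en_core_web_sm')

    # Hypothetical gold parse for the 8 tokens of
    # "Nitrogen (79%) is common." -- heads are absolute token indices.
    TRAIN_DATA = [
        ("Nitrogen (79%) is common.", {
            'heads': [5, 3, 3, 0, 3, 5, 5, 5],
            'deps': ['nsubj', 'punct', 'nummod', 'appos',
                     'punct', 'ROOT', 'acomp', 'punct'],
        }),
    ]

    # Update only the parser; keep the other pipeline components frozen.
    other_pipes = [p for p in nlp.pipe_names if p != 'parser']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.resume_training()  # keep pretrained weights (2.1+)
        for itn in range(10):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer,
                           losses=losses)
            print(itn, losses)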

Looking at a single sentence in isolation isn't very helpful, because it doesn't give you a good idea of the overall accuracy on your data and what to focus on. Even if you train a fancy state-of-the-art model that gets 90% accuracy on your data, it still means that every 10th prediction it makes is wrong.

Ines Montani