5

I am using spaCy to lemmatize text, but in some special cases I need to keep the original text and only convert plural nouns to their singular forms. Is there a way to tell spaCy to convert only plural nouns to singular without lemmatizing the whole text (for example, without stripping "-ed", "-ing", etc.)? Or should I explicitly test each token to check whether it is a plural noun and, if so, take its lemma?

P.S. The input text is dynamic, so I don't know in advance whether a word is a noun or not.

Thanks

Nina
  • 2
    You'll have to do this somewhat manually. Look at the `tag_` field for each word/token and only lemmatize it if it's a `NNS` or `NNPS`. The full list of tags can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) – bivouac0 Feb 19 '20 at 20:11
  • Okay. Please post your comment as an answer so that I can mark it as correct. – Nina Feb 20 '20 at 17:34
  • But that also involves adjusting verb forms, adding determiners, like in `Apples were there` => `An apple was there`, doesn't it? – Wiktor Stribiżew Feb 20 '20 at 23:24
  • @WiktorStribiżew: In my case I am lemmatizing ontology concepts, so I just want to process nouns. For example, I want "inverted indices" to become "inverted index", not "invert index". – Nina Feb 21 '20 at 12:32

2 Answers

7

Thanks to bivouac0's comment, I checked the `tag_` field of each token and took the lemma only for tokens tagged `NNS` or `NNPS`:

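import spacy

# Assumption: any English pipeline with a tagger and lemmatizer works here
nlp = spacy.load("en_core_web_sm")
processed_text = nlp(original_text)
# Penn Treebank tags for plural common nouns and plural proper nouns
lemma_tags = {"NNS", "NNPS"}
for token in processed_text:
    # Keep the original word unless it is a plural noun
    lemma = token.text
    if token.tag_ in lemma_tags:
        lemma = token.lemma_
    ...
    # rest of code
    ...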
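
If the goal is to get the original text back with only the plural nouns changed, the same check can be wrapped in a function that also keeps each token's trailing whitespace. This is a minimal sketch, not a definitive implementation. It assumes token objects that expose the spaCy attributes `text`, `tag_`, `lemma_` and `whitespace_`. The function name `singularize_plural_nouns` and the hand-built stand-in tokens are hypothetical; the stand-ins only exist so that the snippet runs without a model installed.

```python
from types import SimpleNamespace

# Penn Treebank tags for plural common nouns and plural proper nouns
PLURAL_NOUN_TAGS = {"NNS", "NNPS"}


def singularize_plural_nouns(tokens):
    """Rebuild the text, replacing only plural nouns with their lemma."""
    pieces = []
    for token in tokens:
        word = token.lemma_ if token.tag_ in PLURAL_NOUN_TAGS else token.text
        # whitespace_ is the whitespace that followed the token in the input
        pieces.append(word + token.whitespace_)
    return "".join(pieces)


# Stand-in tokens (hypothetical values) mimicking what a pipeline might return
demo_tokens = [
    SimpleNamespace(text="inverted", tag_="VBN", lemma_="invert", whitespace_=" "),
    SimpleNamespace(text="indices", tag_="NNS", lemma_="index", whitespace_=""),
]
print(singularize_plural_nouns(demo_tokens))
# inverted index

# With spaCy itself (assumes the en_core_web_sm model is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(singularize_plural_nouns(nlp(original_text)))
```

Joining on `whitespace_` rather than a single space keeps the original spacing and punctuation layout intact, which matters when the rest of the text has to stay untouched.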
Nina
4

You cannot convert plural nouns to singular nouns using spaCy. You can, however, check whether a token is a plural noun or a singular noun.

If the token's tag is equal to 'NNS', check that token in a dictionary and get the singular form of that token.
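
A minimal sketch of that lookup, under stated assumptions: the table `IRREGULAR_SINGULARS`, the function name `singular_form`, and the stand-in tokens are all hypothetical and only for illustration. Falling back to the token's lemma when a word is missing from the table is also an assumption and not part of this answer. The tokens only need the spaCy attributes `text`, `tag_` and `lemma_`.

```python
from types import SimpleNamespace

# Penn Treebank tags for plural common nouns and plural proper nouns
PLURAL_NOUN_TAGS = {"NNS", "NNPS"}

# Hypothetical hand-maintained table of irregular plurals; extend it as needed
IRREGULAR_SINGULARS = {
    "radii": "radius",
    "bacteria": "bacterium",
}


def singular_form(token):
    """Return the singular for plural nouns; leave every other token unchanged."""
    if token.tag_ not in PLURAL_NOUN_TAGS:
        return token.text
    # Dictionary first; fall back to the lemma for words not in the table
    return IRREGULAR_SINGULARS.get(token.text.lower(), token.lemma_)


# Stand-in tokens (hypothetical values), including a lemma left as "radii"
radii = SimpleNamespace(text="radii", tag_="NNS", lemma_="radii")
apples = SimpleNamespace(text="apples", tag_="NNS", lemma_="apple")
running = SimpleNamespace(text="running", tag_="VBG", lemma_="run")
print([singular_form(t) for t in (radii, apples, running)])
# ['radius', 'apple', 'running']
```

Note that the lookup lowercases the word, so a capitalized plural comes back in lowercase; restore the casing afterwards if that matters for your text.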

  • Not true. The lemma of a noun is its singular form and SpaCy provides lemmatization. In addition, the `pos` field is the Universal Dependencies tag for the token and does not contain info on plural/singular state. The `tag` field gives the Penn Treebank tag, which does contain this information. – bivouac0 Feb 22 '20 at 14:28
  • 1
    Lemmatization is not the correct way to convert plural nouns to singular nouns. For example, the singular form of "radii" is "radius", but spaCy's lemmatizer returns "radii" as the lemma. Similarly, the singular form of "bacteria" is "bacterium", but spaCy's lemmatizer returns "bacteria" as the lemma. So it is better to use a dictionary. – Anisha Mohandass Feb 24 '20 at 04:50
  • And yes, you are right about the tag. The token's `tag` (NNS, NNPS) should be checked, not `pos`. – Anisha Mohandass Feb 24 '20 at 04:56
  • The lemma of a noun is, by definition, its singular form. Unfortunately the SpaCy lemmatizer doesn't work very well, hence the errors you mention above. – bivouac0 Feb 24 '20 at 12:24
  • Yes, the spaCy lemmatizer doesn't work very well, which is why I suggested that using a dictionary is better for avoiding such cases. If the post owner will not run into those cases, then using the lemmatizer is fine, as @bivouac0 said. – Anisha Mohandass Feb 24 '20 at 12:52