How to count the number of plurals and singulars given a corpus in Python

Question

I hope you can help me with a task.

I need to count the number of plurals and singulars in a corpus. I have a corpus whose lines have the following structure:

['4', 'lanzas', 'lanza', 'NCFP000']

The first position [0] is a number (4), the second [1] is the form (lanzas), the third [2] is the lemma (lanza), and the fourth [3] is the category (NCFP000), which identifies, for instance, a verb or a noun. Each word in the corpus is therefore annotated with its lemma and category, and the category tells us whether the word is singular or plural, masculine or feminine.

Here are some examples of lines from the corpus:

['1', 'Cargó', 'cargar', 'VMIS3S0']

['2', 'el', 'el', 'DA0MS0']

['3', 'camión', 'camión', 'NCMS000']

['4', 'con', 'con', 'SP']

['5', 'los', 'el', 'DA0MP0']

['6', 'trastos', 'trasto', 'NCMP000']

['7', 'más', 'más', 'RG']

['8', 'pesados', 'pesado', 'AQ0MP00']

['9', '.', '.', 'Fp']

So, as you can see, the last position [3] holds the category of the word, so AQ0MP00 means that the word is a plural adjective.

My question is: how can I count the number of plurals and singulars in this situation? Concretely, I need to count the following categories found in the whole corpus: NCFS000, NCFP000, NCMS000 and NCMP000, which are the noun tags for feminine singular, feminine plural, masculine singular and masculine plural, respectively.

So far I have tried this:

corpus=open('F:/python/corpus-morf.txt','r')
text=open('F:/python/deberes.txt','r')
lines=corpus.readlines()
for i in lines:
    lista=i.split()
    #print(lista)
    p=len(lista)
    if p >0:
        forma=lista[1].rstrip()
        lema=lista[2].rstrip()
        categoria=lista[3].rstrip()
        aa=[forma,lema,categoria]

and I'm stuck here.

Do you have any ideas? I sincerely appreciate your help.

Welcome to SO. What have you tried so far codewise? Where exactly are you stuck? — petezurich, Nov 01 '18 at 11:24

score 0 · Answer 1 · answered Nov 01 '18 at 11:34

Here's one approach - note this counts all categories, so you then need to filter on the resulting dictionary for only the ones you care about:

from collections import Counter

corpus = [
  ['1', 'Cargó', 'cargar', 'VMIS3S0'],
  ['2', 'el', 'el', 'DA0MS0'],
  ['3', 'camión', 'camión', 'NCMS000'],
  ['4', 'con', 'con', 'SP'],
  ['5', 'los', 'el', 'DA0MP0'],
  ['6', 'trastos', 'trasto', 'NCMP000'],
  ['7', 'más', 'más', 'RG'],
  ['8', 'pesados', 'pesado', 'AQ0MP00'],
]

print(Counter(x[3] for x in corpus))

Counter({'VMIS3S0': 1, 'DA0MS0': 1, 'NCMS000': 1, 'SP': 1, 'DA0MP0': 1, 'NCMP000': 1, 'RG': 1, 'AQ0MP00': 1})
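The filtering step could look roughly like the sketch below. The helper names (NOUN_TAGS, count_noun_tags, singular_plural_totals, read_rows) are made up for illustration. It assumes that the fourth character of each noun tag is S or P, as in the question's examples, and that the corpus file holds one whitespace-separated entry per line in UTF-8, which may not match the actual file.

```python
from collections import Counter

# The four noun tags named in the question.
NOUN_TAGS = ("NCFS000", "NCFP000", "NCMS000", "NCMP000")


def count_noun_tags(rows):
    """Count only the tags of interest in rows shaped like [number, form, lemma, category]."""
    counts = Counter(row[3] for row in rows if len(row) >= 4)
    # Indexing a Counter with a missing key returns 0, so absent tags are reported as 0.
    return {tag: counts[tag] for tag in NOUN_TAGS}


def singular_plural_totals(tag_counts):
    """Collapse per-tag counts into singular and plural totals.

    Assumption: the fourth character of the tag is 'S' (singular) or 'P' (plural).
    """
    totals = {"singular": 0, "plural": 0}
    for tag, n in tag_counts.items():
        key = "singular" if tag[3] == "S" else "plural"
        totals[key] += n
    return totals


def read_rows(path, encoding="utf-8"):
    """Yield one list of fields per non-empty line (assumed whitespace-separated)."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                yield fields


corpus = [
    ['3', 'camión', 'camión', 'NCMS000'],
    ['5', 'los', 'el', 'DA0MP0'],
    ['6', 'trastos', 'trasto', 'NCMP000'],
]

noun_counts = count_noun_tags(corpus)
print(noun_counts)
print(singular_plural_totals(noun_counts))
```

To run it on the real file, something like count_noun_tags(read_rows('F:/python/corpus-morf.txt')) should work, provided the file really has that layout and encoding.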
