I hope you can help me with a task.
I need to count the number of plurals and singulars in a corpus. I have a corpus whose lines have the following structure:
['4', 'lanzas', 'lanza', 'NCFP000']
The first position [0] holds a number (4), the second [1] holds the form (lanzas), the third [2] holds the lemma (lanza), and the fourth [3] holds the category (NCFP000), for instance verb, noun, etc. So in this corpus, each word is annotated with its lemma and category, and the category tells us whether the word is singular or plural, masculine or feminine.
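To make that layout concrete, here is a minimal sketch of how one line could be split into named fields. It assumes the file stores the four fields separated by whitespace (as my own split() call further down suggests); Token and parse_line are names I made up for illustration.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    # Field names are my own labels for the four positions described above.
    number: str
    form: str
    lemma: str
    category: str


def parse_line(line: str) -> Optional[Token]:
    """Split one corpus line into four fields; return None if it is incomplete."""
    fields = line.split()
    # Blank lines or lines with fewer than four fields cannot be indexed up to [3].
    if len(fields) < 4:
        return None
    return Token(*fields[:4])


token = parse_line("4 lanzas lanza NCFP000")
print(token.form, token.lemma, token.category)
```

Naming the fields means I can write token.category instead of remembering that the category sits at index 3.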
Here are some examples of lines from the corpus:
['1', 'Cargó', 'cargar', 'VMIS3S0']
['2', 'el', 'el', 'DA0MS0']
['3', 'camión', 'camión', 'NCMS000']
['4', 'con', 'con', 'SP']
['5', 'los', 'el', 'DA0MP0']
['6', 'trastos', 'trasto', 'NCMP000']
['7', 'más', 'más', 'RG']
['8', 'pesados', 'pesado', 'AQ0MP00']
['9', '.', '.', 'Fp']
So, as you can see, the last position [3] holds the category of the word, so AQ0MP00 means that the word is a plural adjective.
My question is: how can I count the number of plurals and singulars in this situation? Concretely, I need to count how often the following categories occur in the whole corpus: NCFS000, NCFP000, NCMS000 and NCMP000, which stand for feminine singular, feminine plural, masculine singular and masculine plural nouns, respectively.
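To show the kind of result I am after, here is a minimal sketch that counts the four tags on a few in-memory sample lines rather than on the real file. It assumes whitespace-separated fields, and count_tags, TARGET_TAGS and sample are hypothetical names used only for this illustration.

```python
from collections import Counter

# The four noun tags I want to count (taken from the question above).
TARGET_TAGS = ("NCFS000", "NCFP000", "NCMS000", "NCMP000")


def count_tags(lines, targets=TARGET_TAGS):
    """Count how often each target tag appears in the fourth field of a line."""
    # Start every target at zero so tags that never occur are still reported.
    counts = Counter({tag: 0 for tag in targets})
    for line in lines:
        fields = line.split()
        # Skip blank or incomplete lines (fewer than four fields).
        if len(fields) < 4:
            continue
        tag = fields[3]
        if tag in targets:
            counts[tag] += 1
    return counts


# Sample lines, assuming the file stores whitespace-separated fields.
sample = [
    "1 Cargó cargar VMIS3S0",
    "2 el el DA0MS0",
    "3 camión camión NCMS000",
    "4 con con SP",
    "5 los el DA0MP0",
    "6 trastos trasto NCMP000",
    "7 más más RG",
    "8 pesados pesado AQ0MP00",
    "9 . . Fp",
    "",
]

counts = count_tags(sample)
for tag in TARGET_TAGS:
    print(tag, counts[tag])

# Totals per number, summing the two genders for each.
singular = counts["NCFS000"] + counts["NCMS000"]
plural = counts["NCFP000"] + counts["NCMP000"]
print("singular:", singular, "plural:", plural)
```

On these sample lines I would expect one NCMS000 (camión) and one NCMP000 (trastos), and zero for the two feminine tags.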
So far I have tried this:
corpus = open('F:/python/corpus-morf.txt', 'r')
text = open('F:/python/deberes.txt', 'r')  # opened but not used yet
lines = corpus.readlines()
for i in lines:
    lista = i.split()
    # print(lista)
    p = len(lista)
    if p >= 4:  # need all four fields before indexing up to [3]
        forma = lista[1].rstrip()
        lema = lista[2].rstrip()
        categoria = lista[3].rstrip()
        aa = [forma, lema, categoria]
This is where I'm stuck.
Do you have any ideas? I sincerely appreciate your help.