I am trying to build a corpus and parse it with spaCy. The corpus consists of over 2000 individual text files, and each has document metadata such as filename, gender, nationality, etc., stored as a row in a data frame.
What I have done so far is create an Excel file consisting of the metadata plus a text_field column, where the actual texts are stored. I then imported it with pandas and parsed text_field with spaCy using the following code:
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm')
df = pd.read_excel('C:/Users/Desktop/Data.xlsx')
df['docs'] = list(nlp.pipe(df.text_field))
However, I would like to iterate over all the docs stored in the data frame and extract the outputs together with the metadata from the data frame.
For instance, typical spaCy output looks like this:
doc = nlp('this is a test sentence')
for token in doc:
    print(token.lemma_, token.text, token.pos_)
lemma / text / pos_
this this DET
be is AUX
a a DET
test test NOUN
sentence sentence NOUN
The expected output is like this:
file(text/doc metadata) / lemma / text / pos_
text1 this this DET
text1 be is AUX
text1 a a DET
text1 test test NOUN
text1 sentence sentence NOUN
text2 this this DET
text2 be is AUX
text2 another another DET
text2 sentence sentence NOUN
- When I apply df['docs'] = list(nlp.pipe(df.text_field)), the docs column appears to contain only text, not Doc objects.
- How should I proceed to get the expected output? This is possible in R with the quanteda package (creating a corpus, tokenizing, etc.), but is there a way to do the same in Python with spaCy?