
I am trying to build a corpus and parse it with spaCy. The corpus consists of over 2000 individual text files, each of which has document metadata such as filename, gender, nationality, etc., stored in a row of the data frame.

What I have done so far is create an Excel file consisting of the metadata and a text_field column, which is where the actual texts are stored. I then imported it with pandas and parsed text_field with spaCy using the following code:

import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm')
df = pd.read_excel('C:/Users/Desktop/Data.xlsx')
df['docs'] = list(nlp.pipe(df.text_field))

However, I would like to iterate over all the docs stored in the data frame and extract the token-level output together with the metadata provided in the data frame.

For instance, typical spaCy output looks like this:

doc = nlp('this is a test sentence')
for token in doc:
    print(token.lemma_, token.text, token.pos_)

lemma / text / pos_
this   this   DET
be     is     AUX
a      a      DET
test   test   NOUN
sentence   sentence   NOUN

The expected output looks like this:

 file(text/doc metadata) /  lemma / text / pos_
    text1                   this    this   DET
    text1                   be      is     AUX
    text1                   a       a      DET
    text1                   test    test   NOUN
    text1                   sentence sentence NOUN
    text2                   this     this     DET
    text2                   be       is       AUX
    text2                   another  another  DET
    text2                   sentence sentence NOUN
  1. When I run df['docs'] = list(nlp.pipe(df.text_field)), the docs column seems to contain only text, not Doc objects.
  2. How should I proceed to get the expected output? This is possible in R with the quanteda package (creating a corpus, tokenizing, etc.), but is there a way to do the same in Python with spaCy?

1 Answer

Following the tutorial here, I managed to create a corpus with the metadata integrated. For those who might need it:

import pandas as pd
import spacy
import textacy
nlp = spacy.load('en_core_web_sm')
df = pd.read_excel('E:/test.xlsx')
native_language = (df
               .sort_values(['Native_language', 
                             'Gender'])
               .groupby('Native_language')['text_field']
               .agg(lambda col: '\n'.join(col)))
docs = list(native_language.apply(lambda x: nlp(x)))
corpus = textacy.corpus.Corpus(nlp, data=docs)

In the original answer, corpus = textacy.corpus.Corpus(nlp, docs=docs) was given; however, textacy now expects the keyword data rather than docs, so the correct call is corpus = textacy.corpus.Corpus(nlp, data=docs)
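
For the token-level table with a metadata column that the question asks for, a rough sketch along these lines might also work. It relies on spaCy's nlp.pipe(..., as_tuples=True), which passes a context value through alongside each text. The helper name token_rows and the "filename" metadata column are assumptions made for illustration; they are not taken from the tutorial or the original data.

```python
def token_rows(docs_with_meta):
    """Flatten (doc, metadata) pairs into one dict per token.

    `docs_with_meta` is any iterable of (doc, meta) pairs, for example the
    output of nlp.pipe(..., as_tuples=True). Each doc only needs to be
    iterable and yield tokens with .lemma_, .text and .pos_ attributes.
    """
    rows = []
    for doc, meta in docs_with_meta:
        for token in doc:
            rows.append({
                "file": meta,            # document-level metadata, repeated per token
                "lemma": token.lemma_,
                "text": token.text,
                "pos": token.pos_,
            })
    return rows


# Usage sketch with the df and nlp objects from the question
# (assumes the data frame has a "filename" column; adjust to your metadata):
#   pairs = zip(df["text_field"], df["filename"])
#   table = pd.DataFrame(token_rows(nlp.pipe(pairs, as_tuples=True)))
#   print(table)
```

Keeping the flattening step separate from pandas means the same rows can be written to CSV or loaded into a data frame, and further metadata (gender, nationality) could be carried by passing a tuple or dict as the context value instead of a single string.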

  • Hey, I'm not the person who downvoted you, but your question is kind of vague/hard to understand, and additionally there's no reason to use screenshots. – polm23 Jun 28 '21 at 03:45
  • @polm23 I see, yet it is a question after all. I should edit the post for clarity. – Fatih Bozdağ Jun 28 '21 at 05:34