I'm just working through the textacy tutorials, using data from outside the datasets module for work. I get some text data from a DataFrame and store it as a string variable:
def mergeText(df):
    content = ''
    for i in df['textColumn']:
        content += (i + '. ')
    #print(content)
    return content

txt = mergeText(df)
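For what it's worth, an equivalent helper that avoids repeated string concatenation in a loop (assuming the column holds plain strings) would be:

```python
def merge_text(texts):
    # join once instead of concatenating in a loop; works on a
    # pandas column (Series) or any iterable of strings
    return '. '.join(texts) + '. '
```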
I've worked with spaCy a little, and I know this is the standard way to create a Doc object:
nlp = spacy.load('en')
doc1 = nlp(txt)
print(type(doc1))
which outputs
<class 'spacy.tokens.doc.Doc'>
So I should be able to generate a corpus from this Doc, as the documentation says:
corpus = textacy.corpus.Corpus('en', docs=doc1)
But I get this error, even though I'm passing the correct type to the function:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-c6f568014162> in <module>()
----> 1 corpus = textacy.corpus.Corpus('en', docs=doc1, metadatas=None)
~/anaconda3/envs/nlp/lib/python3.6/site-packages/textacy/corpus.py in __init__(self, lang, texts, docs, metadatas)
156 else:
157 for doc in docs:
--> 158 self.add_doc(doc)
159
160 def __repr__(self):
~/anaconda3/envs/nlp/lib/python3.6/site-packages/textacy/corpus.py in add_doc(self, doc, metadata)
337 msg = '`doc` must be {}, not "{}"'.format(
338 {Doc, SpacyDoc}, type(doc))
--> 339 raise ValueError(msg)
340
341 #################
ValueError: `doc` must be {<class 'textacy.doc.Doc'>, <class 'spacy.tokens.doc.Doc'>}, not "<class 'spacy.tokens.token.Token'>"
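If I read the error right, the problem is that iterating a spaCy Doc yields Tokens, and Corpus loops over whatever it is given as docs. A tiny stand-in class (FakeDoc here is hypothetical, purely to illustrate the iteration behaviour) shows why docs=doc1 would hand individual tokens to add_doc, while docs=[doc1] would hand over the Doc itself:

```python
class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc: iterating it yields tokens."""
    def __init__(self, tokens):
        self.tokens = tokens
    def __iter__(self):
        return iter(self.tokens)

doc = FakeDoc(["Dear", "Chris"])

# Corpus does roughly `for d in docs: self.add_doc(d)`, so:
seen_with_doc = [d for d in doc]       # docs=doc   -> individual tokens
seen_with_list = [d for d in [doc]]    # docs=[doc] -> the Doc object itself
```

So presumably corpus = textacy.corpus.Corpus('en', docs=[doc1]) would avoid the Token error, though I haven't confirmed that against this textacy version.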
I have tried to create a textacy Doc object in the same way, but with no luck:
doc = textacy.Doc(txt)
print(type(doc))
<class 'spacy.tokens.doc.Doc'>
I also tried to use the texts parameter for the corpus, passing it the raw text, but this outputs:
corpus[:10]
[Doc(1 tokens; "D"),
Doc(1 tokens; "e"),
Doc(1 tokens; "a"),
Doc(1 tokens; "r"),
Doc(1 tokens; " "),
Doc(1 tokens; "C"),
Doc(1 tokens; "h"),
Doc(1 tokens; "r"),
Doc(1 tokens; "i"),
Doc(1 tokens; "s")]
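That single-character output makes sense if texts is treated as an iterable of documents: a Python string iterates character by character, so each character becomes its own one-token Doc. A quick check of the plain-Python behaviour:

```python
txt = "Dear Chris"
# iterating the string directly yields one character at a time,
# which is why each character became its own one-token Doc
chars = list(txt)
# wrapping the string in a list yields the whole text as one item
wrapped = list([txt])
```

So presumably texts=[txt] (a list containing one string) would give one Doc for the whole text instead.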
Any ideas on how to fix this?
EDIT: In order to get docs from many rows and pass them to a corpus, here is the DataFrame I am working with for a thread:
chat1 = df[(df['chat_hash']=='121418-456986')]
So the text for each message is stored under a 'text' column, and each message could be tied to a speaker, if necessary, via a speaker column.
Currently I'm looking at the Capitol Words example, but it's not entirely clear how to do this split using a DataFrame:
records = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})
text_stream, metadata_stream = textacy.fileio.split_record_fields(records, 'text')
corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)
corpus
Would it be a case of setting records here to be the rows filtered by the chat hash?
thread = df[(df['chat_hash']=='121418-456986')]
text_stream, metadata_stream = textacy.fileio.split_record_fields(thread, 'text')
corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)
corpus
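To make the intent concrete, here is a plain-Python sketch of what I think split_record_fields does: peel the 'text' field off each row record and keep the remaining fields as metadata. The rows below are made-up sample data mirroring my columns; with a real DataFrame, thread.to_dict('records') should produce dicts of the same shape:

```python
# made-up rows mirroring the 'text', 'speaker' and 'chat_hash' columns
rows = [
    {"text": "Hi there", "speaker": "A", "chat_hash": "121418-456986"},
    {"text": "Hello",    "speaker": "B", "chat_hash": "121418-456986"},
]

def split_text_and_metadata(records, text_field="text"):
    """Split each record into (text, remaining-fields-as-metadata)."""
    texts, metadatas = [], []
    for record in records:
        record = dict(record)              # don't mutate the caller's dicts
        texts.append(record.pop(text_field))
        metadatas.append(record)
    return texts, metadatas

texts, metadatas = split_text_and_metadata(rows)
```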