
I'm just working through the text tutorials with data outside the datasets module for work. I get some text data from a dataframe and have this stored as a string variable:

def mergeText(df):
    content = ''
    for i in df['textColumn']:
        content += (i + '. ')
    # print(content)
    return content


txt = mergeText(df)

I've worked with spaCy a little and I know this is the standard way to create a doc object:

nlp = spacy.load('en')
doc1 = nlp(txt)
print(type(doc1))

which outputs

<class 'spacy.tokens.doc.Doc'>

So I should be able to generate a corpus from this doc object, as the documentation says:

corpus = textacy.corpus.Corpus('en', docs=doc1)

But I get this error, even though I'm passing the correct type to the function:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-c6f568014162> in <module>()
----> 1 corpus = textacy.corpus.Corpus('en', docs=doc1, metadatas=None)

~/anaconda3/envs/nlp/lib/python3.6/site-packages/textacy/corpus.py in __init__(self, lang, texts, docs, metadatas)
    156             else:
    157                 for doc in docs:
--> 158                     self.add_doc(doc)
    159 
    160     def __repr__(self):

~/anaconda3/envs/nlp/lib/python3.6/site-packages/textacy/corpus.py in add_doc(self, doc, metadata)
    337             msg = '`doc` must be {}, not "{}"'.format(
    338                 {Doc, SpacyDoc}, type(doc))
--> 339             raise ValueError(msg)
    340 
    341     #################

ValueError: `doc` must be {<class 'textacy.doc.Doc'>, <class 'spacy.tokens.doc.Doc'>}, not "<class 'spacy.tokens.token.Token'>"

I have tried to create a textacy object in the same way, but with no luck:

doc = textacy.Doc(txt)
print(type(doc))

<class 'spacy.tokens.doc.Doc'>

I also tried to use the texts parameter for the corpus, passing it the raw text, but this outputs:

corpus[:10]

[Doc(1 tokens; "D"),
 Doc(1 tokens; "e"),
 Doc(1 tokens; "a"),
 Doc(1 tokens; "r"),
 Doc(1 tokens; " "),
 Doc(1 tokens; "C"),
 Doc(1 tokens; "h"),
 Doc(1 tokens; "r"),
 Doc(1 tokens; "i"),
 Doc(1 tokens; "s")]

Any ideas on how to fix this?

EDIT: In order to get docs from many rows and pass these to a corpus, here is the dataframe I am working with for a thread:

chat1 = df[(df['chat_hash']=='121418-456986')]

The text for each message is stored under a 'text' column, and each of these could be bound to a speaker if necessary via a 'speaker' column.

Currently I'm looking at the capitol words example, but it's not entirely clear how to split this using a dataframe:

records = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})
text_stream, metadata_stream = textacy.fileio.split_record_fields(records, 'text')
corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)
corpus

Would it be a case of setting records in this case to be the filter for the chat hash?

thread = df[(df['chat_hash']=='121418-456986')]
text_stream, metadata_stream = textacy.fileio.split_record_fields(thread, 'text')
corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)
corpus
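
Presumably the rows would need to be dict records first, since split_record_fields in the docs iterates over dicts. A sketch of what I mean (the to_dict conversion is my own assumption):

# Hypothetical adaptation: convert the filtered frame's rows to dicts,
# the shape split_record_fields appears to expect in the docs.
thread = df[(df['chat_hash']=='121418-456986')]
records = thread.to_dict(orient='records')
text_stream, metadata_stream = textacy.fileio.split_record_fields(records, 'text')
corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)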

1 Answer


The docs parameter is expecting an iterable, with the items of the iterable being one of the various Doc types. You are passing a single document, which when iterated returns Tokens - hence the error. You can wrap your docs=doc1 argument as docs=[doc1], and that should allow you to create the corpus.
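
For example, a minimal sketch against the same Corpus signature shown in your traceback:

import spacy
import textacy

nlp = spacy.load('en')
doc1 = nlp('Some text from the dataframe, merged into one string.')

# Wrapping the single Doc in a list gives Corpus an iterable of Docs,
# so add_doc receives a Doc rather than the individual Tokens.
corpus = textacy.corpus.Corpus('en', docs=[doc1])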

This, though, is a corpus containing a single document, which is unlikely to be very useful. Do you mean to be creating a Doc for each row of your DataFrame, rather than concatenating them all together?
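
If one Doc per row is what you want, a minimal sketch (assuming df is your frame with a 'text' column, as in your edit):

import spacy
import textacy

nlp = spacy.load('en')

# One Doc per row; nlp.pipe streams the texts through the pipeline.
row_docs = list(nlp.pipe(df['text']))
corpus = textacy.corpus.Corpus('en', docs=row_docs)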

EDIT: Dealing with the DataFrame

If you want each chat to be a document, one way of doing that is to group the dataframe by the chat_hash and concatenate all the text together. Then create a document for each chat and a corpus from those:

import pandas as pd
import spacy
import textacy

nlp = spacy.load('en')

df = pd.DataFrame([['Ken', 'aaaa', 1, 'This is a thing I said'],
                   ['Peachy', 'aaaa', 2, 'This was a response'],
                   ['Ken', 'aaaa', 3, 'I agree!'],
                   ['Ken', 'bbbb', 1, 'This is a thing I said'],
                   ['Peachy', 'bbbb', 2, 'You fool!']],
                  columns=['speaker', 'chat_hash', 'sequence_number', 'text'])

chat_concat = (df
               .sort_values(['chat_hash', 
                             'sequence_number'])
               .groupby('chat_hash')['text']
               .agg(lambda col: '\n'.join(col)))

docs = list(chat_concat.apply(lambda x: nlp(x)))

corpus = textacy.corpus.Corpus(nlp, docs=docs)

corpus

So the steps in this are:

  • Load the model (and create dummy dataframe in this case)
  • Sort by hash and some sequence (so the chat is in the right order), then group by the chat hash and join all the text together (I'm using new lines between texts; you can use any delimiter)
  • Apply a function to each block of text to create a document out of it
  • Create the corpus as before.
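
If you also want the speaker information kept with each chat, a minimal sketch of one way to do it, reusing the texts=/metadatas= pattern from the capitol words example in the question (joining each chat's speakers into one string is my own choice here):

# Build parallel text and metadata streams from the same dummy frame.
grouped = (df
           .sort_values(['chat_hash', 'sequence_number'])
           .groupby('chat_hash')
           .agg({'text': lambda col: '\n'.join(col),
                 'speaker': lambda col: ', '.join(sorted(set(col)))}))

metadatas = [{'chat_hash': chat_hash, 'speakers': speakers}
             for chat_hash, speakers in zip(grouped.index, grouped['speaker'])]

corpus = textacy.corpus.Corpus(nlp, texts=grouped['text'], metadatas=metadatas)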
Ken Syme
  • Well, I'm working with sentences in a thread, which is multiple rows, yes. So I assume I need to split the data accordingly, similar to this example here with [capital words](http://textacy.readthedocs.io/en/stable/index.html) – PeachyDinosaur Dec 12 '17 at 10:29
  • So what do rows in your dataframe represent? Is each one a sentence from a thread, or is each a post from a thread? – Ken Syme Dec 12 '17 at 13:25
  • Each row has a conversation in the 'text' column. – PeachyDinosaur Dec 12 '17 at 14:57
  • I have added an edit for how you could take a sample dataframe and transform it into a corpus - let me know if that is the sort of thing you are needing. – Ken Syme Dec 13 '17 at 11:22