
I would like to analyse a large folder of texts for the presence of names, addresses, and telephone numbers in several languages.

These will usually be preceded by a word such as "Address", "telephone number", "name", "company", "hospital", or "deliverer". I will have a dictionary of these words.

I am wondering whether text mining tools would be a good fit for the job. I would like to create a corpus from all these documents and then find the text that matches specific criteria (I am thinking of regex) to the right of, or below, a given dictionary entry.

Is there such a syntax in text mining packages in R, i.e., a way to extract the strings to the right of or below a wordlist entry that match a specific pattern?
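Here is roughly what I mean, in base R; the "Telephone number" label and the digit pattern are just illustrative examples:

```r
# Base R only: capture what follows a dictionary word; the label and the
# phone-number-like pattern are examples, not my real data
txt <- "Company: Acme Ltd Telephone number: +48 123 456 789 Name: Jan Kowalski"

# regexec() returns the full match plus capture groups
m <- regmatches(txt, regexec("Telephone number:\\s*([+0-9][0-9 ]*[0-9])", txt))[[1]]
m[2]   # the captured number, "+48 123 456 789"
```

I would like to do something like this, but anchored to dictionary entries across a whole corpus.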

If not, what would be a more suitable tool in R for the job?

Jacek Kotowski

1 Answer


Two options with quanteda come to mind:

  1. Use kwic() with your list of target patterns, with a window large enough to capture the amount of text after the term that you want. This returns a data.frame whose keyword and post columns you can use for your analysis. You can also construct a corpus directly from this object (corpus(mykwic)) and then focus on the new post docvar, which will contain the text you want.

  2. Use corpus_segment(), where the target word list defines a "tag" type: everything following a tag, up to the next tag, is reshaped into a new document. This works well but is a bit trickier to configure, since you will need to get the regex for the tag right.
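Assuming quanteda is installed, both options might be sketched like this; the sample text, keyword list, window size, and tag regex are all illustrative and will need adapting to your dictionary:

```r
library(quanteda)

txt <- c(doc1 = "Name: John Smith Address: 12 Main St Telephone: 555-0100")
corp <- corpus(txt)

# Option 1: kwic() on tokens; the 'post' column holds the text
# in the window after each matched keyword
kw <- kwic(tokens(corp), pattern = c("name", "address", "telephone"),
           window = 5)
kw$post

# Option 2: corpus_segment() with a regex tag; each segment becomes a
# new document, with a docvar recording the tag that preceded it
seg <- corpus_segment(corp, pattern = "\\b(Name|Address|Telephone):",
                      valuetype = "regex")
as.character(seg)          # the text between tags
docvars(seg, "pattern")    # which tag each segment followed
```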

Ken Benoit