Is it possible to use the `kwic` function to find words near each other?

Question

I found this reference: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch05s07.html. Is it possible to use it with the `kwic` function in the quanteda package to find documents in a corpus containing words that are not directly adjacent but close to each other, with maybe a few other words between them?

For example, if I give two words to the function, I would like to find the documents in a corpus where these two words occur, possibly with some words between them. If the words are "engine" and "electrical", I should also get the reports where "electrical synchronous engine" appears, but not the ones in which "engine" and "electrical" appear in completely different contexts.

`kwic()` by definition finds the words surrounding a target word or phrase. Can you elaborate on what sort of search you want to execute, and what sort of end result you want? — Ken Benoit, Apr 18 '18 at 20:44
@KenBenoit, for example, if I give two words in the function, I would like to find the documents in a corpus where these two words occur but maybe with some words between. For example, you tell me "engine" and "electrical", I will also get the reports where "electrical synchronous engine" appears but not the ones in which "engine" and "electrical" appear in completely different contexts. I hope it is clearer... — MysteryGuy, Apr 19 '18 at 11:22
See https://stackoverflow.com/questions/49872839/logical-combinations-in-quanteda-dictionaries/49877254#comment86791171_49877254, same issue. Maybe Kohei can think up a workaround. — Ken Benoit, Apr 19 '18 at 15:48

score 1 · Answer 1 · answered Apr 20 '18 at 11:02

quanteda does not have a NEAR operator, but you can do the same thing using the `window` argument of `tokens_select()`. In this example, I am searching for words within five tokens of "america*" using `kwic()`:

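```r
library(quanteda)

# Tokenize the inaugural address corpus that ships with quanteda
toks <- tokens(data_corpus_inaugural)
# Keep only tokens within five positions of any match for "america*"
toks_america <- tokens_select(toks, "america*", window = 5)

# Any keyword found now is necessarily near "america*"
kwic(toks_america, "econom*")
# [2013-Obama, 45] has been tested by crises | economic | recovery has begun. America's

kwic(toks_america, "power")
# [1997-Clinton, 85] it can give Americans the | power | to make a government is
```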
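
To connect this back to the "engine" / "electrical" case in the question, here is a minimal sketch of the same idea. The three documents and the window size of 3 are made up for illustration and are not from the original data; only `tokens()`, `tokens_select()` and `kwic()` are actual quanteda functions.

```r
library(quanteda)

# Hypothetical toy corpus, invented for this illustration
txt <- c(
  d1 = "The electrical synchronous engine was replaced during maintenance.",
  d2 = "The engine failed. Much later, after several unrelated checks on the cabin and the landing gear, an electrical fault was logged.",
  d3 = "No relevant terms appear in this report."
)
toks <- tokens(txt, remove_punct = TRUE)

# Keep only "engine" and up to 3 tokens on either side of it (assumed window)
toks_near <- tokens_select(toks, "engine", window = 3)

# "electrical" can only match where it fell inside that window
hits <- kwic(toks_near, "electrical")

# Names of the documents in which the two words occur close together
near_docs <- unique(as.character(hits$docname))
print(near_docs)
```

With these toy texts, only the document where "electrical" sits within three tokens of "engine" should be reported. The window size controls how many intervening words you tolerate, so it probably needs tuning for real reports.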

Kohei Watanabe

It works fine when the pattern in `tokens_select` is a single word. But what should I do if I have `america great again`, for example? In that case, it does not seem to work. How should I deal with it? – MysteryGuy Apr 20 '18 at 14:20
I am not sure what you mean by "unique", but if the pattern is a phrase you can just use `tokens_select(toks, phrase("america great again"), window = 5)`. – Kohei Watanabe Apr 20 '18 at 20:46
Yeah, something like that should meet my needs, I will test it soon – MysteryGuy Apr 21 '18 at 06:42
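
For the multi-word case raised in the comments, a small sketch of the `phrase()` suggestion might look like the following. The sentence and the window size of 2 are invented for illustration; the assumption here is that the window is counted from the edges of the matched phrase.

```r
library(quanteda)

# Hypothetical one-document example, invented for this illustration
txt <- c(speech = "We will make America great again by investing in clean energy and jobs")
toks <- tokens(txt)

# phrase() makes the pattern match as a sequence of tokens, not as one token
toks_near <- tokens_select(toks, phrase("america great again"), window = 2)

# "by" directly follows the phrase, "energy" is several tokens away
hits_by <- kwic(toks_near, "by")
hits_energy <- kwic(toks_near, "energy")

print(nrow(hits_by))
print(nrow(hits_energy))
```

Without `phrase()`, the pattern would be treated as one token containing spaces, which never matches tokenized text; that is probably why the plain string did not seem to work.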