I found this reference: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch05s07.html Is it possible to use it with the `kwic()` function in the quanteda package to find documents in a corpus containing words that are not adjacent but close to each other, possibly with a few other words between them?

For example, if I give two words to the function, I would like to find the documents in the corpus where both words occur, possibly with some words between them. Given "engine" and "electrical", I would also get the reports where "electrical synchronous engine" appears, but not the ones in which "engine" and "electrical" appear in completely different contexts.

MysteryGuy
  • `kwic()` by definition finds the words surrounding a target word or phrase. Can you elaborate on what sort of search you want to execute, and what sort of end result you want? – Ken Benoit Apr 18 '18 at 20:44
  • @KenBenoit, for example, if I give two words in the function, I would like to find the documents in a corpus where these two words occur but maybe with some words between. For example, you tell me "engine" and "electrical", I will also get the reports where "electrical synchronous engine" appears but not the ones in which "engine" and "electrical" appear in completely different contexts. I hope it is clearer... – MysteryGuy Apr 19 '18 at 11:22
  • @KenBenoit I've just edited my question by the way – MysteryGuy Apr 19 '18 at 14:05
  • See https://stackoverflow.com/questions/49872839/logical-combinations-in-quanteda-dictionaries/49877254#comment86791171_49877254, same issue. Maybe Kohei can think up a workaround. – Ken Benoit Apr 19 '18 at 15:48

1 Answer

quanteda does not have a NEAR operator, but you can do the same thing using the `window` argument of `tokens_select()`. In this example, I first keep only the tokens within five words of "america*", then search those tokens using `kwic()`:

library(quanteda)
toks <- tokens(data_corpus_inaugural)
# keep only the tokens within five positions of a match for "america*"
toks_america <- tokens_select(toks, "america*", window = 5)

kwic(toks_america, "econom*")
# [2013-Obama, 45] has been tested by crises | economic | recovery has begun. America's

kwic(toks_america, "power")
# [1997-Clinton, 85] it can give Americans the | power | to make a government is
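
Applied to the "engine" / "electrical" case from the question, the same idea might look like the sketch below. The two toy documents and the window size of 3 are made up for illustration, and `ntoken()` is used here instead of `kwic()` only to list the matching document names.

```r
library(quanteda)

# Hypothetical toy documents, invented for this illustration
txt <- c(
  d1 = "The electrical synchronous engine failed during the test.",
  d2 = "The engine was replaced last week. Several months later, after a long inspection of the whole building, an electrical fault was found."
)
toks <- tokens(txt, remove_punct = TRUE)

# Step 1: keep only the tokens within 3 positions of "engine"
toks_engine <- tokens_select(toks, "engine", window = 3)

# Step 2: within those windows, keep only "electrical"
toks_both <- tokens_select(toks_engine, "electrical")

# Documents with at least one token left contain both words close together
docs <- names(which(ntoken(toks_both) > 0))
print(docs)
```

Running `kwic(toks_engine, "electrical")` instead of step 2 shows the surrounding context rather than just the document names. Widening `window` makes the proximity condition looser.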
Kohei Watanabe
  • It is fine when the word in `tokens_select` is unique. But what should I do if I have `america great again`, for example? In that case, it does not seem to work... How can I deal with it, please? – MysteryGuy Apr 20 '18 at 14:20
  • I am not sure what you mean by "unique", but if the pattern is a phrase you can just use `tokens_select(toks, phrase("america great again"), window = 5)`. – Kohei Watanabe Apr 20 '18 at 20:46
  • Yeah, something like that should meet my needs, I will test it soon – MysteryGuy Apr 21 '18 at 06:42
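
A minimal sketch of the `phrase()` suggestion from the comments, assuming two made-up sentences and a window of 2 chosen only to keep the output short:

```r
library(quanteda)

# Hypothetical toy documents, invented for this illustration
txt <- c(
  d1 = "we will make america great again for all of us",
  d2 = "america is a great country"
)
toks <- tokens(txt)

# phrase() makes the pattern match as a sequence of three tokens;
# window = 2 also keeps up to two tokens on each side of the match
toks_phrase <- tokens_select(toks, phrase("america great again"), window = 2)

kept <- as.list(toks_phrase)
print(kept)
```

Without `phrase()`, a pattern containing spaces is compared against single tokens and would not be expected to match anything here. In this sketch `d2` ends up empty because its words never occur as the full sequence.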