
I am trying to match regex patterns in text, but kwic() does not match regex patterns that span more than one word. I tried phrase(), but that did not work either.

For example:

mycorpus <- corpus(bla$TEXT)
foo <- kwic(mycorpus, pattern = "\\bno\\b", window = 10, valuetype = "regex") # gives 1959 obs.
foo <- kwic(mycorpus, pattern = "\\bno\\b\\s{0,5}\\w+", window = 10, valuetype = "regex") # gives 0 obs.
foo <- kwic(mycorpus, pattern = "no\\sother", window = 10, valuetype = "regex") # gives 0 obs., even though it should find 3 phrases

The last two calls return nothing, even though the text contains several phrases that should match.

Thanks for the help!

Sherls

1 Answer


That's because kwic() matches patterns against individual tokens, and tokens contain no whitespace once the text has been tokenized. To search for a sequence of tokens, which quanteda calls a "phrase", wrap the pattern in phrase(). (See also ?phrase.)

library("quanteda")
## Package version: 2.0.0

txt <- "one two three four five"

# no match
kwic(txt, "one\\stwo", valuetype = "regex", window = 1)
## kwic object with 0 rows

# match
kwic(txt, phrase("one two"), valuetype = "regex", window = 1)
##                                 
##  [text1, 1:2]  | one two | three
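
To see why the first pattern cannot match, it may help to look at the tokens directly. The following is a minimal sketch, assuming the same `txt` as above and the default `tokens()` settings; the variable names `toks` and `tok_vec` are only illustrative.

```r
library("quanteda")

txt <- "one two three four five"

# Tokenize explicitly so the units that kwic() searches are visible
toks <- tokens(txt)

# Flatten the tokens object into a plain character vector
tok_vec <- unlist(as.list(toks), use.names = FALSE)
print(tok_vec)

# No token contains whitespace, so "\\s" has nothing to match
print(any(grepl("\\s", tok_vec)))

# Tested one token at a time, the two-word regex never matches
print(any(grepl("one\\stwo", tok_vec)))
```

Since each token is tested on its own, a pattern has to be split into one piece per token, which is what phrase() does with the whitespace in its argument.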
Ken Benoit
  • Thank you @Ken Benoit! That almost solves my problem. Why does "\\s" in the regex pattern not work when a literal space does? Match: `kwic(mycorpus, pattern = phrase("others \\."), window = 10, valuetype = "regex")` No match: `kwic(mycorpus, pattern = phrase("others\\s\\."), window = 10, valuetype = "regex")` Also, do you know how I can retrieve patterns that use repetition operators? For example, `kwic(mycorpus, pattern = phrase("no \\w+"))` only gives me phrases with one word after "no", even though I expected it to match one or more words. – Sherls Mar 17 '20 at 13:51
  • `phrase()` considers sequences of tokens, so if you want to match two consecutive tokens, and your second is ".", then you need to specify `phrase("others .")` (for `valuetype = "glob"` here). – Ken Benoit Mar 18 '20 at 11:02
  • How then can I search for regex patterns like this: `phrase("hello (friends|dear audience)"), valuetype = "regex"`? – Sherls Mar 29 '20 at 16:21
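
One possible approach to the last two comments, offered as a sketch and not as the answerer's own solution: because every whitespace-separated piece of a phrase() pattern matches exactly one token, an alternation that spans a different number of tokens can be written as several phrases passed together. The example text and the names `toks`, `pats`, and `kw` are made up for illustration.

```r
library("quanteda")

# Hypothetical example text containing both alternatives
txt <- c(d1 = "hello friends and hello dear audience")
toks <- tokens(txt)

# One phrase per alternative; each piece is a regex for a single token,
# anchored with ^ and $ so that it matches the whole token
pats <- phrase(c("^hello$ ^friends$",
                 "^hello$ ^dear$ ^audience$"))

kw <- kwic(toks, pattern = pats, valuetype = "regex", window = 1)
print(kw)
```

The same reasoning applies to `phrase("no \\w+")`: `\\w+` repeats characters within one token, not tokens, so phrases of different lengths would each need their own pattern, for example `"no \\w+"` and `"no \\w+ \\w+"`.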