I have text in which numbers are followed by symbols, and I want to extract each number together with its symbol, for example numbers followed by a percent sign. The kwic() function from the quanteda package works when the pattern is a regular expression for the number alone ("\\d{1,}", for example), but I can't find a way to make it match a number followed by a percent sign. The following text serves as an example:

Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU. Eight (32%) developed diarrhoea attributable only to C. difficile and/ or toxin, and the remaining 17 (68%) were asymptomat- ic: none had pseudomembranous colitis.

Ken Benoit
panchtox
  • What function are you using? Unsure about the `quanteda` package but this regex should do the job: `\\d+%` – Amar Apr 11 '18 at 01:04
  • I know, but it doesn't work in the kwic function from quanteda. I'm using quanteda because it gives better phrase extraction around a regex match, so this doesn't answer my question. Thanks anyway – panchtox Apr 11 '18 at 03:43

2 Answers

The quanteda package handles regex patterns in a way that looks odd at first. I'm not sure why this solution works, but I think it has to do with how kwic treats the specified pattern. Wrapping the pattern in the phrase function and adding a space between the digits and the percent sign returns the correct results:

library(quanteda)

s <- c("Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU. Eight (32%) developed diarrhoea attributable only to C. difficile and/ or toxin, and the remaining 17 (68%) were asymptomat- ic: none had pseudomembranous colitis.")

kwic(s, phrase("\\d+ %"), valuetype = "regex")

I would suggest contacting the package maintainers to point this out, as the behaviour seems counter-intuitive.

Amar

The reason is that when you call kwic() on a corpus or character object directly, it passes some arguments to tokens() that affect how the tokenization occurs, prior to the keywords-in-context analysis. (This is documented in the ... parameter in ?kwic.)

The default tokenisation in quanteda uses the stringi word boundary definitions, so that:

tokens("Thirteen (7%) of 187")
# tokens from 1 document.
# text1 :
# [1] "Thirteen" "("        "7"        "%"        ")"        "of"       "187" 

If you want to use a simpler, whitespace tokeniser, this can be accomplished using:

tokens("Thirteen (7%) of 187", what = "fasterword")
# tokens from 1 document.
# text1 :
# [1] "Thirteen" "(7%)"     "of"       "187" 
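This difference explains why the original pattern finds nothing: a regex pattern in kwic() is compared against individual tokens, so "\\d+%" can only match when a single token contains both the digits and the percent sign. The sketch below uses base R's grepl() as a stand-in for that per-token matching (an assumption for illustration, not quanteda's internal code), applied to the two token vectors printed above, typed in by hand.

```r
# Token vectors copied by hand from the two tokens() outputs above.
default_tokens    <- c("Thirteen", "(", "7", "%", ")", "of", "187")
whitespace_tokens <- c("Thirteen", "(7%)", "of", "187")

pattern <- "\\d+%"  # digits immediately followed by a percent sign

# Under the default tokenisation no single token holds both parts.
default_hits <- default_tokens[grepl(pattern, default_tokens)]

# Under whitespace tokenisation "(7%)" is one token, so it matches.
whitespace_hits <- whitespace_tokens[grepl(pattern, whitespace_tokens)]

print(default_hits)     # character(0)
print(whitespace_hits)  # [1] "(7%)"
```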

So to get the matches you want from kwic(), pass the same argument through to the tokeniser:

kwic(s, "\\d+%", valuetype = "regex", what = "fasterword")

#  [text1, 2]                    Thirteen |  (7%)  | of 187 patients acquired C.             
# [text1, 12]    C. difficile in ICU-1, 9 | (36%)  | of 25 on ICU-2 and                      
# [text1, 19]           25 on ICU-2 and 3 | (5.9%) | of 51 patients in BU.                   
# [text1, 26]    51 patients in BU. Eight | (32%)  | developed diarrhoea attributable only to
# [text1, 41] toxin, and the remaining 17 | (68%)  | were asymptomat- ic: none had  

Otherwise, if you keep the default tokenisation, the number and the percent sign are separate tokens, so you need to wrap the regex in phrase() and separate the per-token elements by whitespace:

kwic(s, phrase("\\d+ %"), valuetype = "regex")

#   [text1, 3:4]             Thirteen( |  7 %  | ) of 187 patients acquired             
# [text1, 18:19]          in ICU-1, 9( | 36 %  | ) of 25 on ICU-2                       
# [text1, 28:29]       on ICU-2 and 3( | 5.9 % | ) of 51 patients in                    
# [text1, 39:40]         in BU. Eight( | 32 %  | ) developed diarrhoea attributable only
# [text1, 60:61] and the remaining 17( | 68 %  | ) were asymptomat- ic  
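The effect of splitting the pattern on whitespace can be sketched in base R. This is only an illustration of the idea: the variable names are hypothetical and this is not how quanteda implements phrase(). Each element of the split pattern is matched against one token, and a hit requires consecutive tokens to match in order.

```r
# Tokens copied by hand from the default tokens() output shown earlier.
toks <- c("Thirteen", "(", "7", "%", ")", "of", "187")

# Split the phrase pattern on whitespace: one regex per token.
parts <- strsplit("\\d+ %", " ", fixed = TRUE)[[1]]

# Position i is a hit when token i matches the first regex
# and token i + 1 matches the second.
first  <- grepl(parts[1], head(toks, -1))
second <- grepl(parts[2], tail(toks, -1))
starts <- which(first & second)

matches <- paste(toks[starts], toks[starts + 1])
print(starts)   # start position of each two-token match
print(matches)  # matched tokens joined by a space
```

For this short input the only hit starts at token 3, which lines up with the 3:4 span in the first row of the kwic() output above.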

This behaviour can take some getting used to, but it gives the user complete control over multi-token searches. The alternative would be for quanteda to impose a single rule for deciding what the elements of a multi-token sequence are before the input has even been tokenised.
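Both approaches can also be written with the tokenisation step made explicit, which may be needed on recent quanteda releases. The sketch below assumes a version in which calling kwic() directly on a character vector or corpus is deprecated in favour of passing a tokens object, and in which what = "fasterword" is still available. The example string is a shortened copy of the text from the question, and the final clean-up regex is an addition for illustration, not part of quanteda.

```r
library(quanteda)

# Shortened version of the example text from the question.
s <- "Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU."

# Approach 1: whitespace tokenisation, single-token regex.
toks_ws <- tokens(s, what = "fasterword")
kw_ws <- kwic(toks_ws, pattern = "\\d+%", valuetype = "regex")

# Approach 2: default tokenisation, multi-token pattern via phrase().
toks_default <- tokens(s)
kw_phrase <- kwic(toks_default, pattern = phrase("\\d+ %"), valuetype = "regex")

# Strip everything except digits and the decimal point from the
# matched keywords, so that "(5.9%)" becomes the number 5.9.
percentages <- as.numeric(gsub("[^0-9.]", "", kw_ws$keyword))

print(kw_ws)
print(kw_phrase)
print(percentages)
```

Tokenising once and reusing the tokens object also makes it obvious which tokenisation a given pattern is being matched against.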

Ken Benoit