
Brief question: I'm trying to match quotation marks in a sentence token using quanteda's tokens_lookup() function with valuetype="regex". Based on the information provided here on the regex flavor quanteda uses, I thought the way to go would be \Q ... \E, but that didn't do the trick.

library(quanteda) 
# package version: 1.5.2

text <- c("text „some quoted text“ more text", "text « some quoted text » more text")

dict <- dictionary(list(MY_KEY = c("\Q*\E")))
# Error: '\Q' is an unrecognized escape in character string starting ""\Q"

I also tried to match the quotation mark directly ("“"), which at least seems to be a legal regex pattern, but that didn't work either. Nor did variations of \Q...\E with double backslashes, as they are used for word boundaries, for instance (\\b).

So the more general question, I believe, is whether the regular expressions mentioned here are compatible with what quanteda understands as valuetype="regex".

EDIT:

This worked for the first string, yet not for the second.

dict <- dictionary(list(MY_KEY = c(".\".")))
Wiktor Stribiżew
Dr. Fabian Habersack
  • What do you mean by "match" quotation marks? Do you want to select all tokens that are a form of quotation mark? (The default quanteda tokenizer will make your quotation variants into separate tokens.) – Ken Benoit Jun 19 '20 at 14:50

2 Answers


Regular expressions in quanteda are built on the stringi package, which supports Unicode character categories. You can retrieve all of your quotes by using these categories in a search pattern:

  • Ps, Pe - punctuation, open and close
  • Pi, Pf - punctuation, initial and final quote

I included all four since, for example, „ is in Ps but not Pi, and « is in Pi but not Ps.

Further details are here.

library("quanteda")
## Package version: 2.0.1

text <- c(
  "text „some quoted text“ more text",
  "text « some quoted text » more text"
)
toks <- tokens(text)

tokens_select(toks, "[\\p{Pf}\\p{Pi}\\p{Ps}\\p{Pe}]", valuetype = "regex")
## Tokens consisting of 2 documents.
## text1 :
## [1] "„"
## 
## text2 :
## [1] "«" "»"
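
For sentence-level tokens (`what = "sentence"`), the same Unicode categories can be applied to the sentence strings themselves. Below is a minimal sketch, assuming stringi is called directly rather than through quanteda; `quote_class`, `has_quote`, and the third test string are made up for illustration. Note that Ps and Pe also cover brackets such as ( and ), and that the ASCII " is in category Po, so this class does not match it.

```r
library(stringi)

# \u escapes keep the example independent of the file encoding:
# \u201E = „   \u201C = “   \u00AB = «   \u00BB = »
text <- c(
  "text \u201Esome quoted text\u201C more text",
  "text \u00AB some quoted text \u00BB more text",
  "text without any quotes"
)

# Character class covering initial/final quotes plus open/close punctuation
quote_class <- "[\\p{Pi}\\p{Pf}\\p{Ps}\\p{Pe}]"

# One logical value per sentence: does it contain any such character?
has_quote <- stri_detect_regex(text, quote_class)
print(has_quote)
```

The first two strings should come back as TRUE and the third as FALSE.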
Ken Benoit
  • Thanks Ken, that's incredibly helpful! ... I am using `what="sentence"` which makes this a bit trickier and also my idea was to define a pattern that allows me to match any sentence containing a specific word on the condition that this word has NO quotation marks around it. Say I'd like to match "I like apples a lot" but not "I like "apples" a lot". Thought that might work with `\\P{}` instead of `\\p{}` but so far it didn't. – Dr. Fabian Habersack Jun 19 '20 at 18:41
  • To provide some context: `text<-c("xxx apple xxx","xxx „apple“ xxx")` ... `tokens(text, what="sentence") %>% ("apple[\\P{Pf}\\P{Pi}\\P{Ps}\\P{Pe}]", valuetype = "regex")` Expectation: match / no match. Reality: match / match. – Dr. Fabian Habersack Jun 19 '20 at 18:59
  • That's really not part of the original question, but I could demonstrate how to answer that if you specify a clear new question with the details you are looking for and an expected result. Be sure to specify if you are looking for a set of sentences with the match/no match, or whether you want to identify quoted or non-quoted target terms (as terms, not sentences). – Ken Benoit Jun 20 '20 at 08:43
  • Right, thanks. I was already thinking about asking a new question. My idea behind this one right here was that it would easily allow me to solve the problem if I knew how to identify characters like quotation marks. Seems that this is not the case, so I'll formulate a new one right away. On the match sentences vs. identify target terms: I'm trying to "match" any sentence containing a specific target term, but not if the target term is put into quotation marks. So that would be the condition for the match / no match. – Dr. Fabian Habersack Jun 20 '20 at 10:18
  • Oh never mind, I figured it out. Do you think I should still ask the question and answer it myself? I doubt that that would be relevant to others, since it's mostly just about the use of `[^...]`. Still, I'm wondering why some of the `stringi` RegExes don't work the way they should such as `[\\P{...}\\P{...}]` with capital letter `P` which should equal `[^\\p{...}\\p{...}]` if I'm not mistaken. – Dr. Fabian Habersack Jun 20 '20 at 10:56
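
On the `\\P{}` point in the last comment: inside a character class, listing several `\\P{...}` items gives the union of the complements, not the complement of the union, so the two forms are not equivalent. Since no character belongs to all of the listed categories at once, the `\\P` version accepts every character. Below is a minimal sketch with stringi, reusing the strings from the comment above; the variable names are made up for illustration, and the pattern only inspects the character right after "apple".

```r
library(stringi)

# \u201E = „   \u201C = “
text <- c("xxx apple xxx", "xxx \u201Eapple\u201C xxx")

# Union of complements: "not Pf OR not Pi OR not Ps OR not Pe".
# The closing quote \u201C is Pi but not Pf, so it still qualifies.
union_of_complements <- "apple[\\P{Pf}\\P{Pi}\\P{Ps}\\P{Pe}]"

# Complement of the union: "none of Pf, Pi, Ps, Pe".
complement_of_union <- "apple[^\\p{Pf}\\p{Pi}\\p{Ps}\\p{Pe}]"

res_union      <- stri_detect_regex(text, union_of_complements)
res_complement <- stri_detect_regex(text, complement_of_union)

print(res_union)       # match / match
print(res_complement)  # match / no match
```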

Is it possible that this is a language or locale issue? Your "quotation marks" don't look like quotation marks on my screen, and when I change the pattern I can find them.

library(quanteda) 
#> Package version: 2.0.1

text <- c("text „some quoted text“ more text", "text « some quoted text » more text")

dict <- dictionary(list(found_it = c("„"), found_other = c("«")))

toks2 <- tokens(text)
tokens_lookup(toks2, dict)

#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "found_it"
#> 
#> text2 :
#> [1] "found_other"
Chuck P
  • I just found out it has something to do with the `what="sentence"` which doesn't allow me to match the quotation mark but which I would really like to keep. Is there a workaround? – Dr. Fabian Habersack Jun 18 '20 at 14:24
  • Second, I am still wondering if there is a universal regex for matching any quotation mark, no matter the style of the quotation mark, like `\Q...\E` (which, as I said, doesn't work). – Dr. Fabian Habersack Jun 18 '20 at 14:26
  • It's very unlikely there is a "universal" regex for all "styles" of quotation marks. Almost by definition if you are tokenizing at the sentence level `what = "sentence"` you shouldn't expect `tokens_lookup` to find quotation marks as separate tokens. If `toks2` is tokenized at sentence level then you can use `grepl("«", toks2)` to find all the sentences that have that character in it. – Chuck P Jun 18 '20 at 14:36
  • Well it does work if I search for `c(".\".")`, even with `what="sentence"`, at least for the first string that is. And on the universal thing: I read the description of `\Q...\E` to be a universal pattern for quoted strings. The catch is just that I get an error message saying that this is not a legal regex but other than that... ;-) – Dr. Fabian Habersack Jun 18 '20 at 14:44
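
A note on `\Q...\E`, as far as the ICU regex flavor used by stringi goes: it "quotes" in the sense of escaping, meaning everything between `\Q` and `\E` is treated as literal text. It is not a pattern for quotation marks. The "unrecognized escape" error in the question comes from R's string parser, so the backslashes have to be doubled in R source. Below is a minimal sketch, assuming stringi is called directly; the test strings are made up for illustration.

```r
library(stringi)

# Doubled backslashes in R source; the regex engine receives \Q*\E
literal_star <- "\\Q*\\E"

# Matches a literal asterisk, nothing else
res <- stri_detect_regex(c("a*b", "ab"), literal_star)
print(res)
```

Only the first string contains a literal `*`, so it is the only one expected to match.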