I know that grep accepts ignore.case = TRUE, but strsplit has no such argument. It takes a regular expression as its second argument, but I'm not sure how to make that regular expression case-insensitive.

Currently, this is what my strsplit looks like, but I want to make the search case insensitive. How would I do so?

strsplit(df$sentence, paste0(" ", df$node, "( |[!\",.:;?})\\]])"))

Example:

sentence <- "De A-bom, Sint..."
node <- "a-bom"

contexts <- strsplit(sentence, paste0("(?i) ", node, "( |[!\",.:;?})\\]])"))
(leftContext <- sapply(contexts, `[`, 1))

Expected return:

[1] "De"

Actual return:

[1] "De A-bom, Sint..."

Note, however, that the regex itself does work in an online regex tester.

Bram Vanroy
  • None of the characters in the string `"( |[!\",.:;?})\\]])"` depends on case. – Sven Hohenstein Jul 28 '15 at 07:13
  • @SvenHohenstein No, but the contents of df$node does. – Bram Vanroy Jul 28 '15 at 07:15
  • @akrun I tried that, doesn't work (returns whole sentence). – Bram Vanroy Jul 28 '15 at 07:16
  • A caveman solution would be to use `tolower` on `sentence` before applying the strsplit. – Roman Luštrik Jul 28 '15 at 07:23
  • The example posted does not work due to the rest of the regex. Try it with the last argument of `paste0` removed, for example. – A. Webb Jul 28 '15 at 07:27
  • @A.Webb What do you mean, "because of the rest of the regex"? Why is that a problem? – Bram Vanroy Jul 28 '15 at 07:28
  • If I've understood your problem, a similar idea to @RomanLuštrik would be to replace sentence with `gsub(node, node, sentence, ignore.case=TRUE)` which would remove the case problem for the relevant text without changing the case of the rest of the sentence. – ping Jul 28 '15 at 07:35
  • I think you have to add the `perl=TRUE` argument. This, together with the `tolower` approach, gets you to your desired result: `strsplit(tolower(sentence), paste0("(?i) ", node, "( |[!\",.:;?})\\]])"),perl=TRUE)` – nicola Jul 28 '15 at 07:39
  • I might be wrong, but aren't you trying to match at word boundaries? Like [`contexts <- strsplit(sentence, paste0("(?i)[[:blank:]]*\\b", node, "\\b"))`](http://ideone.com/KiAjdz) ? – Wiktor Stribiżew Jul 28 '15 at 07:47
  • @stribizhev You're partially right. I can allow a word boundary before the node, but not after. In Dutch (the language I'm investigating) we often make compounds connected by a hyphen. I want to distinguish between my node and a compound starting with that node. E.g. the regex should match `aids` and not `aids-virus`. If I'd use word boundaries, both would be matched. – Bram Vanroy Jul 28 '15 at 07:52
  • I think you can use this: [`contexts <- strsplit(sentence, "(?i) aids([] !\",.:;?})])")`](http://ideone.com/jYiBHq). The square bracket needs "smart" placement in the character class. Or use `perl=T` as A. Webb suggests. – Wiktor Stribiżew Jul 28 '15 at 08:09
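
Two of the workarounds suggested in the comments above can be sketched as follows. This is only an illustration using the question's example data; the variable names `normalized`, `pattern_tre`, and `left_context` are made up here, and the second part assumes the default engine follows the usual POSIX rule that a `]` placed first in a bracket expression is literal.

```r
sentence <- "De A-bom, Sint..."
node <- "a-bom"

# ping's suggestion: rewrite every case variant of the node to the node's
# own spelling, leaving the case of the rest of the sentence untouched.
normalized <- gsub(node, node, sentence, ignore.case = TRUE)
print(normalized)

# Wiktor Stribiżew's suggestion: keep the default engine, but put "]"
# first in the character class so that no backslash escape is needed.
pattern_tre <- paste0("(?i) ", node, "([] !\",.:;?})])")
left_context <- strsplit(sentence, pattern_tre)[[1]][1]
print(left_context)
```

Note that both variants still treat `node` as a regular expression, so a node containing regex metacharacters would need escaping first.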

1 Answer

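The "(?i)" mode modifier does make PCRE-based regexes case-insensitive.

The problem with your example is not case sensitivity but the grouping expression, specifically the escaped ] in the character class. R's default regex engine (TRE) appears to treat a backslash inside a bracket expression as a literal character, so \\] does not escape the bracket: the class closes at that ], and the final ] must then match literally. Use perl=TRUE for the escaping behavior you expected.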

sentence <- "De A-bom, Sint..."
node <- "a-bom"

contexts <- strsplit(sentence, paste0("(?i) ", node, 
    "( |[!\",.:;?})\\]])"), perl = TRUE)
(leftContext <- sapply(contexts, `[`, 1))

This produces the expected result:

[1] "De"
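
To see the difference between the two engines side by side, here is a minimal sketch that runs the same pattern with and without perl = TRUE. It reuses the question's example; `pattern`, `default_split`, and `perl_split` are names introduced only for this illustration.

```r
sentence <- "De A-bom, Sint..."
node <- "a-bom"
pattern <- paste0("(?i) ", node, "( |[!\",.:;?})\\]])")

# Default engine: the question reports that no split happens here,
# so the whole sentence comes back as a single element.
default_split <- strsplit(sentence, pattern)[[1]]

# PCRE: "\\]" is read as an escaped bracket inside the character class,
# so the comma after the node is accepted and the sentence is split.
perl_split <- strsplit(sentence, pattern, perl = TRUE)[[1]]

print(default_split)
print(perl_split[1])
```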
A. Webb
  • Ah, that's what you meant! But isn't the problem now that the "array" (what's the right name again?) of punctuation isn't closed? After `|` we open with `[` and we're supposed to close with `]`, but we never do? – Bram Vanroy Jul 28 '15 at 07:44
  • I still don't understand why my online RegEx does work, but it doesn't in R. https://regex101.com/r/zU1fE7/2 – Bram Vanroy Jul 28 '15 at 08:01
  • See edits, you need `perl=TRUE` for the escaping behavior you wanted. – A. Webb Jul 28 '15 at 08:07