Ignore case in strsplit in R

Question

I am aware that in grep you can simply use ignore.case = TRUE. However, what about strsplit? You can pass a regular expression as the second argument, but I'm not sure how to make this regular expression case-insensitive.

Currently, this is what my strsplit looks like, but I want to make the search case insensitive. How would I do so?

strsplit(df$sentence, paste0(" ", df$node, "( |[!\",.:;?})\\]])"))

Example:

sentence <- "De A-bom, Sint..."
node <- "a-bom"

contexts <- strsplit(sentence, paste0("(?i) ", node, "( |[!\",.:;?})\\]])"))
(leftContext <- sapply(contexts, `[`, 1))

Expected return:

[1] "De"

Actual return:

[1] "De A-bom, Sint..."

Note, however, that the regex itself does work online.

None of the characters in the string `"( |[!\",.:;?})\\]])"` depends on case. — Sven Hohenstein, Jul 28 '15 at 07:13
A caveman solution would be to use `tolower` on `sentence` before applying the strsplit. — Roman Luštrik, Jul 28 '15 at 07:23
The example posted does not work due to the rest of the regex. Try it with the last argument of `paste0` removed, for example. — A. Webb, Jul 28 '15 at 07:27
@A.Webb What do you mean, "because of the rest of the regex"? Why is that a problem? — Bram Vanroy, Jul 28 '15 at 07:28
If I've understood your problem, a similar idea to @RomanLuštrik would be to replace sentence with `gsub(node, node, sentence, ignore.case=TRUE)` which would remove the case problem for the relevant text without changing the case of the rest of the sentence. — ping, Jul 28 '15 at 07:35
I think you have to add the `perl=TRUE` argument. Combined with the `tolower` approach, this gives your desired result: `strsplit(tolower(sentence), paste0("(?i) ", node, "( |[!\",.:;?})\\]])"),perl=TRUE)` — nicola, Jul 28 '15 at 07:39
I might be wrong, but aren't you trying to match at word boundaries? Like [`contexts <- strsplit(sentence, paste0("(?i)[[:blank:]]*\\b", node, "\\b"))`](http://ideone.com/KiAjdz) ? — Wiktor Stribiżew, Jul 28 '15 at 07:47
@stribizhev You're partially right. I can allow a word boundary before the node, but not after. In Dutch (the language I'm investigating) we often make compounds connected by a hyphen. I want to distinguish between my node and a compound starting with that node. E.g. the regex should match `aids` and not `aids-virus`. If I'd use word boundaries, both would be matched. — Bram Vanroy, Jul 28 '15 at 07:52
I think you can use this: [`contexts <- strsplit(sentence, "(?i) aids([] !\",.:;?})])")`](http://ideone.com/jYiBHq). The square bracket needs "smart" placement in the character class. Or use `perl=T` as A. Webb suggests. — Wiktor Stribiżew, Jul 28 '15 at 08:09

A. Webb · Accepted Answer · 2015-07-28T08:07:29.857

3

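The "(?i)" mode modifier does make PCRE-based regexes case-insensitive.

The problem with your example is not case but the grouping expression: with the default engine, the backslash in "\\]" is not treated as an escape inside the bracket expression, so the character class closes earlier than you intended. Use perl=TRUE for the escaping behavior you expected.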

sentence <- "De A-bom, Sint..."
node <- "a-bom"

contexts <- strsplit(sentence, paste0("(?i) ", node,
    "( |[!\",.:;?})\\]])"), perl = TRUE)
(leftContext <- sapply(contexts, `[`, 1))

This produces the expected result:

[1] "De"
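
To see the difference between the two engines side by side, here is a minimal sketch that reuses the sentence and node from the question. The variable names `pattern`, `default_split` and `perl_split` are made up for illustration. As far as I understand it, the default (TRE) engine takes the backslash inside `[...]` literally, so the class ends at the first `]` and the pattern then demands a literal `]` after the punctuation character, whereas PCRE reads `\]` as an escaped bracket.

```r
sentence <- "De A-bom, Sint..."
node <- "a-bom"

# Same pattern as in the question; only the engine differs below.
pattern <- paste0("(?i) ", node, "( |[!\",.:;?})\\]])")

# Default engine: no match, so the sentence comes back unsplit.
default_split <- strsplit(sentence, pattern)[[1]]
print(default_split)
# [1] "De A-bom, Sint..."

# PCRE engine: " A-bom," matches, so the left context is "De".
perl_split <- strsplit(sentence, pattern, perl = TRUE)[[1]]
print(perl_split[1])
# [1] "De"
```

Note that `node` is pasted into the pattern unescaped, so this assumes it contains no regex metacharacters.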

edited Jul 28 '15 at 08:07

answered Jul 28 '15 at 07:42

A. Webb


Ah, that's what you meant! But isn't the problem now that the "array" (what's the right name again?) of punctuation isn't closed? After `|` we open with `[` and we're supposed to close with `]`, but we never do? – Bram Vanroy Jul 28 '15 at 07:44
I still don't understand why my online RegEx does work, but it doesn't in R. https://regex101.com/r/zU1fE7/2 – Bram Vanroy Jul 28 '15 at 08:01
See edits, you need `perl=TRUE` for the escaping behavior you wanted. – A. Webb Jul 28 '15 at 08:07
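
If you would rather stay with the default engine, the comment thread on the question suggests placing `]` first inside the bracket expression, where it is taken literally without a backslash. The following is a rough sketch of that idea, not a tested recommendation; the helper name `left_context` and the second example sentence are invented for illustration, and `node` is assumed to contain no regex metacharacters.

```r
# Hypothetical helper, not from the original thread.
left_context <- function(sentence, node) {
  # "]" is the first character inside [...], so no backslash is needed
  # and the pattern behaves the same without perl = TRUE.
  pattern <- paste0("(?i) ", node, "([] !\",.:;?})])")
  sapply(strsplit(sentence, pattern), `[`, 1)
}

# The example from the question.
print(left_context("De A-bom, Sint...", "a-bom"))

# A hyphenated compound: "-" is not in the class, so " AIDS-virus"
# is skipped and only the standalone " aids." acts as the split point.
print(left_context("Het AIDS-virus en aids.", "aids"))
```

Because the hyphen is deliberately left out of the character class, this keeps the distinction between a node and a compound that starts with it, which a plain `\\b` word boundary would not.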
