Keeping the apostrophe when using the textcnt function from the tau package in R

Question

The textcnt function in R's tau package has a split argument whose default value is split = "[[:space:][:punct:][:digit:]]+". This default also splits words on the apostrophe ', and I don't want that. How can I modify the argument so that it doesn't use the apostrophe to split words?

This code:

`library(tau)
text <- "I don't want the function to use the ' to split"
textcnt(text, split = "[[:space:][:punct:][:digit:]]+", method = "string", n = 1L)`

produces this output:

     don function        i    split        t      the       to      use     want
       1        1        1        1        1        2        2        1        1

Instead of getting don 1 and t 1, I would like to keep don't as one word.

I have tried using str_replace_all from stringr to remove the punctuation beforehand and then omitting the punct part of the argument in textcnt, but then it no longer splits on all kinds of symbols such as &, > or ". I have also tried to modify the split argument directly, but then it either doesn't split the sentence at all or it keeps the symbols.

Thank you

You mean you want `"[[:space:][:punct:][:digit:]]+"` to stop matching `'` chars? – Wiktor Stribiżew Feb 23 '23 at 13:56
I don't want the apostrophe ' to match. I want the function to remove and split by everything as it does by default, except for the ' – Alberto Llamas Feb 23 '23 at 13:57
Please try to use correct upper case letters, e.g. in the beginning of your title, sentences or the word "I". This would be gentle to your readers. – buhtz Feb 23 '23 at 14:55

Wiktor Stribiżew · Accepted Answer · 2023-02-23T14:37:50.833


With PCRE-based functions you need to use

split = "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

Here,

(?: - start of a non-capturing group:
(?!') - fail the match if the next char is a ' char
[[:space:][:punct:][:digit:]] - match a whitespace, punctuation or digit char
)+ - end of the group, matched one or more times (consecutively)
| - or
'\B - a ' char that is followed by either the end of string or a non-word char
| - or
\B' - a ' char that is preceded by either the start of string or a non-word char.
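
To see what the pattern does without going through textcnt, here is a minimal sketch that applies it with base R's strsplit(perl = TRUE). This is only a stand-in for the splitting step, not textcnt's own code. The sample sentence and the variable names are made up for illustration.

```r
# Sketch only: base R's strsplit(perl = TRUE) stands in for the splitting step.
split_pattern <- "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

# Hypothetical sample sentence: an inner apostrophe, quoted words and symbols
text <- "I don't want 'quoted' words & \"symbols\" here"

tokens <- strsplit(tolower(text), split_pattern, perl = TRUE)[[1]]

# Two adjacent separators (e.g. a space followed by a leading ')
# leave an empty string between them, so drop the empty tokens
tokens <- tokens[nzchar(tokens)]

print(tokens)
```

On this input, don't should stay in one piece, while the quotes around quoted and symbols and the & are treated as separators.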

With stringr functions, you can use

split = "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"

Here, [[:space:][:punct:][:digit:]--[']] matches all characters matched by [[:space:][:punct:][:digit:]] except the ' chars.

The ICU regex flavor used by stringr supports character class subtraction with this notation.

edited Feb 23 '23 at 14:37

answered Feb 23 '23 at 13:58

Wiktor Stribiżew


That shows the following error: Error in FUN(X[[i]], ...) : invalid regular expression '[[:space:][:punct:][:digit:]--[']]+' In addition: Warning message: In FUN(X[[i]], ...) : PCRE pattern compilation error 'invalid range in character class' at '--[']]+' – Alberto Llamas Feb 23 '23 at 14:00
@AlbertoLlamas Then use a PCRE version. – Wiktor Stribiżew Feb 23 '23 at 14:03
This worked, thank you: split = "(?:(?!')[[:space:][:punct:][:digit:]])+" – Alberto Llamas Feb 23 '23 at 14:03
In more complex situations it keeps every ' and " next to a word. Do you think there's a way to keep only the ' that are between letters? For instance, it should keep I've but not 'I or "no" and so on. Thank you – Alberto Llamas Feb 23 '23 at 14:08
@AlbertoLlamas Try `"(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"` – Wiktor Stribiżew Feb 23 '23 at 14:24
