
The textcnt function in R's tau package has a split argument whose default value is split = "[[:space:][:punct:][:digit:]]+". This argument also splits words at the apostrophe ', which I don't want. How can I modify the argument so that it doesn't use the apostrophe to split words?

This code:

    library(tau)
    text <- "I don't want the function to use the ' to split"
    textcnt(text, split = "[[:space:][:punct:][:digit:]]+", method = "string", n = 1L)

produces this output:

 don function        i    split        t      the       to      use     want 
   1        1        1        1        1        2        2        1        1 

Instead of getting don 1 and t 1, I would like to keep don't as one word.

I have tried using str_replace_all from stringr to remove the punctuation beforehand and then omitting the [:punct:] part of the split argument in textcnt, but then symbols such as &, > or " are no longer used to split. I have also tried modifying the split argument itself, but then it either doesn't split the sentence at all or it keeps the symbols.

Thank you

  • You mean you want `"[[:space:][:punct:][:digit:]]+"` to stop matching `'` chars? – Wiktor Stribiżew Feb 23 '23 at 13:56
  • I don't want the apostrophe ' to match. I want the function to remove and split by everything as it does by default, except for the ' – Alberto Llamas Feb 23 '23 at 13:57
  • Please try to use correct upper case letters, e.g. in the beginning of your title, sentences or the word "I". This would be gentle to your readers. – buhtz Feb 23 '23 at 14:55

1 Answer


With PCRE-based functions you need to use

split = "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

Here,

  • (?: - start of a non-capturing group that serves as a container:
  • (?!') - fail the match if the next char is a ' char
  • [[:space:][:punct:][:digit:]] - match a whitespace, punctuation or digit char
  • )+ - end of the group, matched one or more times (consecutively)
  • | - or
  • '\B - a ' char that is followed by either end of string or a non-word char (written '\\B in an R string literal)
  • | - or
  • \B' - a ' char that is preceded by either start of string or a non-word char.
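To see what the pattern does without involving textcnt at all, here is a minimal sketch using base R's strsplit with perl = TRUE. The helper name tokenize and the sample sentence are made up for illustration, and this only demonstrates the regex itself; it assumes textcnt applies the split pattern through PCRE in the same way.

```r
# Simpler pattern: split on whitespace, punctuation and digits, but never on '
split_simple <- "(?:(?!')[[:space:][:punct:][:digit:]])+"

# Full pattern: additionally split on a ' that is not between two word chars
split_full <- "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

# Hypothetical helper (not part of tau): split with PCRE and drop the empty
# strings that strsplit() returns when two separators touch each other
tokenize <- function(x, pattern) {
  tokens <- strsplit(x, pattern, perl = TRUE)[[1]]
  tokens[nzchar(tokens)]
}

text <- "I don't want 'quoted' words & symbols"

tokenize(text, split_simple)
# "I" "don't" "want" "'quoted'" "words" "symbols"

tokenize(text, split_full)
# "I" "don't" "want" "quoted" "words" "symbols"
```

Both patterns keep don't in one piece; only the full pattern also strips the quotes around 'quoted'. The nzchar() filter is needed because strsplit returns "" wherever one separator match immediately follows another.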

With stringr functions, you can use

split = "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"

Here, [[:space:][:punct:][:digit:]--[']] matches all characters matched by [[:space:][:punct:][:digit:]] except the ' chars.

The ICU regex flavor used by stringr supports character class subtraction using this notation.

Wiktor Stribiżew
  • That shows the following error: Error in FUN(X[[i]], ...) : invalid regular expression '[[:space:][:punct:][:digit:]--[']]+' In addition: Warning message: In FUN(X[[i]], ...) : PCRE pattern compilation error 'invalid range in character class' at '--[']]+' – Alberto Llamas Feb 23 '23 at 14:00
  • @AlbertoLlamas Then use a PCRE version. – Wiktor Stribiżew Feb 23 '23 at 14:03
  • This worked, thank you: split = "(?:(?!')[[:space:][:punct:][:digit:]])+" – Alberto Llamas Feb 23 '23 at 14:03
  • In more complex situations it keeps every ' and " next to a word. Do you think there's a way to keep only the ' that are between letters? For instance, it should keep I've but not 'I or "no" and so on. Thank you – Alberto Llamas Feb 23 '23 at 14:08
  • @AlbertoLlamas Try `"(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"` – Wiktor Stribiżew Feb 23 '23 at 14:24