8

I have some strings that can contain letters, numbers and the '#' symbol.

I would like to remove all digits except those in words that start with '#'.

Here is an example:

"table9 dolv5e #10n #dec10 #nov8e 23 hello"

And the expected output is:

"table dolve #10n #dec10 #nov8e  hello"

How can I do this with regex, stringr or gsub?

castaa95

5 Answers

5

You could split the string on spaces, remove digits from tokens if they don't start with '#' and paste back:

x <- "table9 dolv5e #10n #dec10 #nov8e 23 hello"
y <- unlist(strsplit(x, ' '))
# gsub (not sub) so that every run of digits in a token is removed
paste(ifelse(startsWith(y, '#'), y, gsub('\\d+', '', y)), collapse = ' ')
# output
[1] "table dolve #10n #dec10 #nov8e  hello"
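If the same clean-up has to run over a whole character vector, the split-and-paste idea can be wrapped in a small function. This is only a sketch: the name strip_digits_outside_hash is made up for illustration, and it assumes tokens are separated by single spaces, as in the question.

```r
# Hypothetical helper (not part of the answer above): remove digits from
# every space-separated token unless the token starts with '#'.
strip_digits_outside_hash <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    keep <- startsWith(tokens, "#")            # tokens to leave untouched
    tokens[!keep] <- gsub("[0-9]+", "", tokens[!keep])
    paste(tokens, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

res <- strip_digits_outside_hash(c("table9 dolv5e #10n #dec10 #nov8e 23 hello",
                                   "a1b2 #c3"))
print(res)
# "table dolve #10n #dec10 #nov8e  hello" "ab #c3"
```

The second input shows why gsub is used inside the helper: a token such as a1b2 has two separate runs of digits, and both need to go.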
user2474226
5

How about capturing the wanted parts and replacing the unwanted ones, which are not captured, with an empty string?

gsub("(#\\S+)|\\d+","\\1",x)

See the demo at regex101 or the R demo at tio.run (I have no experience with R).

My answer assumes that there is always whitespace between tokens, as in #foo bar #baz2. If you have something like #foo1,bar2:#baz3 4, use \w (word character) instead of \S (non-whitespace).
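To make the difference between \S and \w concrete, here is a small sketch that runs both variants on the second example string. The variable names x2, with_S and with_w are only for illustration.

```r
# Compare the two patterns on a string without whitespace between tokens.
x2 <- "#foo1,bar2:#baz3 4"

# \S: everything up to the next whitespace belongs to the '#' word,
# so "bar2" is swallowed by the capture group and keeps its digit.
with_S <- gsub("(#\\S+)|\\d+", "\\1", x2)

# \w: the '#' word ends at the first non-word character,
# so "bar2" is seen on its own and loses its digit.
with_w <- gsub("(#\\w+)|\\d+", "\\1", x2)

print(with_S)  # "#foo1,bar2:#baz3 "
print(with_w)  # "#foo1,bar:#baz3 "
```

In both cases the trailing 4 is removed, which leaves the trailing space in the result.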

bobble bubble
1

You can use gsub to remove digits, for example:

gsub("[0-9]","","table9")
"table"

And we can split your string using strsplit:

STRING = "table9 dolv5e #10n #dec10 #nov8e 23 hello"
strsplit(STRING," ")
[[1]]
[1] "table9" "dolv5e" "#10n"   "#dec10" "#nov8e" "23"     "hello"

We just need to apply gsub only to the elements of STRING that do not start with "#":

STRING = unlist(strsplit(STRING," "))
# TRUE for tokens that do not start with '#'
no_hash = !grepl("^#",STRING)
STRING[no_hash] = gsub("[0-9]","",STRING[no_hash])
paste(STRING,collapse=" ")
[1] "table dolve #10n #dec10 #nov8e  hello"
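Which tokens are protected depends on how the '#' test is written. The sketch below contrasts a "contains '#'" test with a "starts with '#'" test; the token ab#12 is a made-up case that does not occur in the question's data.

```r
tokens <- c("table9", "#dec10", "ab#12")

contains_hash <- grepl("#", tokens, fixed = TRUE)  # '#' anywhere in the token
starts_hash   <- startsWith(tokens, "#")           # '#' as the first character

print(contains_hash)  # FALSE  TRUE  TRUE
print(starts_hash)    # FALSE  TRUE FALSE
```

With the "contains" test, ab#12 would be left intact; with the "starts with" test, its digits would be removed. For the example string in the question both tests select the same tokens.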
StupidWolf
0

Base R solution:

unlisted_strings <- unlist(strsplit(X, "\\s+"))

# Keep tokens that start with '#' unchanged; strip digits from the rest
Y <- paste0(na.omit(ifelse(grepl("^#", unlisted_strings),
                           unlisted_strings,
                           gsub("\\d+", "", unlisted_strings))), collapse = " ")

Y

Data:

X <- as.character("table9 dolv5e #10n #dec10 #nov8e 23 hello")
hello_friend
0
INPUT = "table9 dolv5e #10n #dec10 #nov8e 23 hello";
OUTPUT = INPUT.match(/#\S+|[^#\d]+/g).join('');

The i flag is not needed, because the pattern contains no letters.

Use this pattern: #\S+|[^#\d]+

#\S+ = a '#' followed by any non-whitespace characters, so a word that starts with '#' is matched whole, digits included.
[^#\d]+ = a run of characters that are neither '#' nor digits.

Digits outside the '#' words match neither alternative, so they are dropped when the matches are joined back together.

This code is JavaScript, not R. On the example input it gives "table dolve #10n #dec10 #nov8e  hello".

Transamunos