6

I'm trying to strip the enclosing square brackets, the inner quotes and the backslashes from a character vector in R, preferably using dplyr.

Sample data:

df <- c("['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]")

Expected result:

"Mamie Smith", "Screamin' Jay Hawkins"

What I have tried:

gsub("[[]]", "", df) # Throws error
df %>%
  str_replace("[[]]", "") # Also throws error
Laurel

6 Answers

3

In base R we can use the trimws function (its whitespace argument requires R 3.6.0 or later).

If we want to drop every leading and trailing non-word character:

trimws(df, whitespace = "\\W+")
[1] "Mamie Smith"           "Screamin' Jay Hawkins"

But if we only want to delete the square brackets and quotes while leaving other punctuation, spaces, etc. intact, then:

trimws(df, whitespace = "[\\]\\[\"']+")
[1] "Mamie Smith"           "Screamin' Jay Hawkins"
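
To make the difference between the two patterns concrete, here is a small sketch on a made-up input (the string "['Hello!']" is my own example, not taken from the question). It assumes R 3.6.0 or later, where trimws accepts the whitespace argument.

```r
# Hypothetical input with punctuation that we may want to keep
x <- "['Hello!']"

# \W strips every leading/trailing non-word character, including the "!"
loose <- trimws(x, whitespace = "\\W")

# An explicit character class strips only square brackets and quotes
strict <- trimws(x, whitespace = "[\\]\\[\"']")

print(c(loose = loose, strict = strict))
```

Here loose comes out as "Hello" while strict keeps the exclamation mark and gives "Hello!", so the explicit class is the safer choice when the names themselves may end in punctuation.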
Onyambu
2

Base R:

sapply(regmatches(df, regexec('(\\w.*)(.*\\w)', df)), "[", 1)

[1] "Mamie Smith"           "Screamin' Jay Hawkins"

OR

We could use str_extract from the stringr package with the same regex:

library(stringr)

str_extract(df, '(\\w.*)(.*\\w)')

[1] "Mamie Smith"           "Screamin' Jay Hawkins"
TarJae
2

To pair up the square brackets with the accompanying type of quote, you can use:

\[(["'])(.*?)\1]

Explanation

  • \[ Match [
  • (["']) Capture group 1, capture either " or '
  • (.*?) Capture group 2, match as few characters as possible
  • \1 Backreference to group 1 to match the same type of quote
  • ] Match ]

In the replacement, use \\2 to insert the value of capture group 2.


df <- c("['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]")
gsub("\\[([\"'])(.*?)\\1]", "\\2", df)

Output

[1] "Mamie Smith"           "Screamin' Jay Hawkins"
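
As a quick check of what the backreference buys you, the sketch below runs the same idea on a made-up string whose quotes do not pair up. The second input and the use of perl = TRUE are my own assumptions for illustration; the call above uses the default regex engine.

```r
# Hypothetical inputs: the second one opens with ' but closes with "
x <- c("[\"Screamin' Jay Hawkins\"]", "['abc\"]")

# \\1 must repeat whichever quote group 1 captured, so mismatched
# quotes produce no match and the string is left untouched
res <- gsub("\\[([\"'])(.*?)\\1\\]", "\\2", x, perl = TRUE)

print(res)
```

The first element is unwrapped to Screamin' Jay Hawkins with its apostrophe intact, while the mismatched second element is returned unchanged.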
The fourth bird
2

Another, relatively easy, regex solution is this:

data.frame(df) %>%
  mutate(df = gsub("\\[\\W+|\\W+\\]", "", df))
                     df
1           Mamie Smith
2 Screamin' Jay Hawkins

Here we remove a run of one or more non-word characters (\\W+) on the condition that it immediately follows the opening square bracket OR (|) immediately precedes the closing one; the bracket itself is removed along with it.

Alternatively, to borrow from @TarJae's answer, but greatly simplified:

library(stringr)
data.frame(df) %>%
  mutate(df = str_extract(df, '\\w.*\\w'))

Here we simply anchor on the word characters (\\w) at either end of the string, while allowing any characters (.*) to occur in between them, thus capturing, for example, the apostrophe in Screamin' and the whitespace.
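
One caveat worth checking before relying on '\\w.*\\w': the pattern needs at least two word characters, and it drops any punctuation that trails the last word character. A minimal base R sketch follows; the inputs are made up, and extract_or_na is a hypothetical helper of mine that mimics how str_extract returns NA when nothing matches.

```r
# Hypothetical helper: first match of `pattern` in each string, NA otherwise
extract_or_na <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- as.integer(m) != -1L      # regexpr() returns -1 for no match
  out[hit] <- regmatches(x, m)     # regmatches() returns the matches only
  out
}

# Made-up inputs: a normal name, a single letter, trailing punctuation
x <- c("['Mamie Smith']", "['A']", "['Hello!']")
res <- extract_or_na(x, "\\w.*\\w")

print(res)
```

The single-letter entry yields NA and "Hello!" loses its exclamation mark, so this pattern suits data where every value has at least two word characters and ends in one.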

Chris Ruehlemann
1

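Since [ and ] are regex metacharacters, you need to escape them with a double backslash (\\) inside an R string. The " is not special in a regex; it only needs a backslash when the R string itself is delimited by double quotes.

Here's some alternative code:

gsub('\\"|\\[|\\]', "", df) # Removes brackets and double quotes only; single quotes remain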
CourtesyBus
0

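To match ] inside a bracket expression [...], it must either come first, as in []], or be escaped when it appears anywhere else. A quote character that also delimits the R string must be escaped inside it, as in "[\"]"; alternatively, delimit the string with the other quote type, as in '["]'. The example strings contain no slashes; the backslashes shown in the source only escape the ".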

gsub("[]['\"]", "", df)
#[1] "Mamie Smith"          "Screamin Jay Hawkins"
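
The two ways of getting a literal ] into a bracket expression can be compared side by side. This is only a sketch: the class here holds just the two brackets so the quotes stay visible, and perl = TRUE on the escaped variant is my own choice.

```r
df <- c("['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]")

# "]" placed first inside the class, so it is read as a literal character
first_place <- gsub("[][]", "", df)

# Same class with both brackets escaped instead
escaped <- gsub("[\\[\\]]", "", df, perl = TRUE)

print(first_place)
print(identical(first_place, escaped))
```

Both calls remove only the square brackets and leave the quotes in place, which shows the two spellings describe the same character class.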

Another option, which avoids escaping " or ', is to use a raw character constant r"(...)" (available from R 4.0.0).

gsub(r"([]["'])", "", df)
#[1] "Mamie Smith"          "Screamin Jay Hawkins"
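
To confirm that a raw constant and its escaped counterpart are the same string, a short check can help. It assumes R 4.0.0 or later, and the variable names are mine.

```r
# Raw constant: everything between r"( and )" is taken literally
raw_pat <- r"([]['"])"

# Ordinary constant: the double quote has to be escaped
esc_pat <- "[]['\"]"

same <- identical(raw_pat, esc_pat)
print(same)
```

Only the source spelling differs, so either form can be passed to gsub interchangeably.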

To limit the match to the ends of the string, anchor the pattern with ^ (start) and $ (end).

gsub("^[]['\"]*|[]['\"]*$", "", df)
#[1] "Mamie Smith"           "Screamin' Jay Hawkins"

Alternatively, trimws could be used.

trimws(df, "both", "[]['\"]")
#[1] "Mamie Smith"           "Screamin' Jay Hawkins"
GKi