R - gsub paste combination returns gibberish

Question

I am performing text analysis and trying to do some cleaning.

I need to remove the following terms (if they occur) from a single-element character vector in R. The sample text and replacement I developed work well:

library(stringr)

original.v <- c("Every dog has its day")
remove.v <- c("Every", "has", "day")

gsub(x = original.v, pattern = paste(remove.v, collapse = "|"), replacement = "")

[1] " dog  its "

However, when I apply the same approach to my data (the actual text is much longer, but this sample yields the same result):

D2020 <- c("BIDEN: . . . he knows what the outcome will be. BIDEN: . . . If you knew anything about. . . BIDEN: . . . we’re going to be in a position where we can create hard, hard, good jobs by making sure the")

with:

library(stringr)

remove <- c("WALLACE:", "BIDEN:", "TRUMP:", "[crosstalk]", "[crosstalk]-", 
            "[to Biden]", "[laughing]", "[unintelligible]", "WELKER:")

gsub(D2020, pattern = paste(remove, collapse = "|"), replacement = "")

I get:

"...wwmw....Ifywy......w’pww,,jym"

Instead of:

. . . he knows what the outcome will be. . . . If you knew anything about. . . . . . we’re going to be in a position where we can create hard, hard, good jobs by making sure the

I've searched online but could not find anyone who has run into this behaviour.

Note that I originally loaded the data as follows:

sample <- read.table("C:/Users/jesse/Desktop/Election/Sample.txt", header=FALSE, 
                     encoding = "UTF-8")

You need to escape your `remove` items first, see [`regex.escape` function](https://stackoverflow.com/a/60966689/3832970) here. E.g., `[to Biden]` matches `t`, `o`, space, `B`, `i`, `d`, `e` and `n`, not `[to Biden]` string. — Wiktor Stribiżew, Mar 30 '21 at 20:48
@WiktorStribiżew Does this apply to the text I am attempting to run the function on too? — Englishman Bob, Mar 30 '21 at 20:50
It applies to any literal text you want to use as part of a regex pattern. Escape it first. — Wiktor Stribiżew, Mar 30 '21 at 20:51
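
The point made in the comments can be sketched as follows. This is a minimal illustration, not the linked `regex.escape` function itself: `escape_regex`, `remove_terms`, and the sample string `x` are hypothetical names and data introduced here. It first shows that the unescaped `[to Biden]` behaves as a character class, then escapes each term before building the alternation.

```r
# Unescaped, "[to Biden]" is a character class: it matches any ONE of the
# characters t, o, space, B, i, d, e, n, so every such character is deleted.
gsub("[to Biden]", "", "to Biden")

# Hypothetical helper (an assumption, not a base R function): put a backslash
# in front of each regex metacharacter so the term is matched literally.
escape_regex <- function(x) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", x)
}

# Same terms as in the question, renamed to avoid masking base::remove().
remove_terms <- c("WALLACE:", "BIDEN:", "TRUMP:", "[crosstalk]", "[crosstalk]-",
                  "[to Biden]", "[laughing]", "[unintelligible]", "WELKER:")

# Put longer terms first so "[crosstalk]-" is tried before "[crosstalk]";
# this matters when the engine takes the first alternative that matches.
remove_terms <- remove_terms[order(nchar(remove_terms), decreasing = TRUE)]

pattern <- paste(escape_regex(remove_terms), collapse = "|")

# Made-up sample text standing in for the real transcript.
x <- "BIDEN: [laughing] hard, good jobs [crosstalk]- TRUMP: no"

cleaned <- gsub(pattern, "", x)
# Optional tidy-up: collapse the runs of whitespace left behind by the removals.
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

If none of the terms needs real regex behaviour, looping over them with `gsub(term, "", text, fixed = TRUE)` is an alternative that avoids escaping altogether.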
