I am performing text analysis and trying to do some cleaning.
I need to remove the following terms (if they occur) from a single-element character vector in R. The sample text and replacement I developed work well:
library(stringr)
original.v <- c("Every dog has its day")
remove.v <- c("Every", "has", "day")
gsub(x = original.v, pattern = paste(remove.v, collapse = "|"), replacement = "")
[1] " dog  its "
However, when I apply the same approach to my data (the actual text is much longer, but this sample yields the same result):
D2020 <- c("BIDEN: . . . he knows what the outcome will be. BIDEN: . . . If you knew anything about. . . BIDEN: . . . we’re going to be in a position where we can create hard, hard, good jobs by making sure the")
with:
library(stringr)
remove <- c("WALLACE:", "BIDEN:", "TRUMP:", "[crosstalk]", "[crosstalk]-",
            "[to Biden]", "[laughing]", "[unintelligible]", "WELKER:")
gsub(D2020, pattern = paste(remove, collapse = "|"), replacement = "")
I get:
"...wwmw....Ifywy......w’pww,,jym"
Instead of:
". . . he knows what the outcome will be. . . . If you knew anything about. . . . . . we’re going to be in a position where we can create hard, hard, good jobs by making sure the"
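A likely explanation for the stripped letters: in a regular expression, square brackets define a character class, so a pattern such as "[laughing]" matches any single one of the letters l, a, u, g, h, i, n rather than the literal tag. The sketch below shows that effect on a short string, then removes each term as a literal string with fixed = TRUE. It uses a shortened copy of the sample text; strip_terms is a hypothetical helper name, and the whitespace clean-up at the end is an assumption about the desired result.

```r
# Shortened version of the sample text from the question.
D2020 <- "BIDEN: . . . he knows what the outcome will be. BIDEN: . . . If you knew anything about. . ."

# "[crosstalk]-" is listed before "[crosstalk]" so the trailing hyphen goes too.
remove <- c("WALLACE:", "BIDEN:", "TRUMP:", "[crosstalk]-", "[crosstalk]",
            "[to Biden]", "[laughing]", "[unintelligible]", "WELKER:")

# As a regex, "[laughing]" is a character class: it deletes single letters.
class_demo <- gsub("[laughing]", "", "he knows")

# strip_terms is a hypothetical helper: it removes each term as a literal
# string (fixed = TRUE), so brackets have no special meaning.
strip_terms <- function(text, terms) {
  for (term in terms) {
    text <- gsub(term, "", text, fixed = TRUE)
  }
  # Collapse the whitespace left behind and trim both ends.
  trimws(gsub("\\s+", " ", text))
}

cleaned <- strip_terms(D2020, remove)
print(class_demo)
print(cleaned)
```

Looping with fixed = TRUE avoids having to escape the brackets by hand. The alternative is to keep the single paste(..., collapse = "|") pattern and escape each bracket as "\\[" and "\\]".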
I have searched online, but I could not find anyone who has encountered this problem.
Note that I originally loaded the data as follows:
sample <- read.table("C:/Users/jesse/Desktop/Election/Sample.txt", header=FALSE,
                     encoding = "UTF-8")
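As a side note on loading: read.table() is meant for tabular data and, by default, splits each line into fields on whitespace, so a transcript may not arrive as one string. A minimal sketch of an alternative, assuming the file is plain UTF-8 text, is to read it with readLines() and join the lines. The temporary file and the name sample_text are stand-ins for illustration only.

```r
# Write a small stand-in file so the sketch runs anywhere; in practice
# the path would point at the real transcript file.
path <- tempfile(fileext = ".txt")
writeLines(c("BIDEN: . . . he knows what the outcome will be.",
             "BIDEN: . . . If you knew anything about. . ."), path)

# Read the file line by line as UTF-8 text, then join into one string.
lines <- readLines(path, encoding = "UTF-8")
sample_text <- paste(lines, collapse = " ")
print(sample_text)
```

This yields a single-element character vector, which is the shape the gsub() call above expects.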