5

I'm trying to use str_replace_all to replace many different values (i.e. "Mod", "M2", "M3", "Interviewer") with one the consistent string (i.e. "Moderator:"). I'm doing this with multiple different categories, and I want avoid having to write each unique value out as there are a lot.

So I made a tibble consisting of all the unique values that I want to make standardized and read it in and then pulled out each column (there are 5 but only 2 shown for simplicity) to make them into vectors:

speak_names <- read_csv("speak_names.csv")
speak_namesMisc <- dplyr::pull(speak_names, Misc)
speak_namesMod <- dplyr::pull(speak_names, Moderator)

For the replacement value, I made a character vector of equal length to those above vectors because I know that the replacement and pattern must be equal lengths:

Misc <- rep("Misc:", 2)
Mod <- rep("Moderator:", 28)

When I run Misc through with this code, it works just fine:

atas_clean$speaker <- str_replace_all(atas_clean$speaker, speak_namesMisc, Misc)

But when I try the identical Moderator version (even if I attempt to run it before Misc), I get an error message:

atas_clean$speaker <- str_replace_all(atas_clean$speaker, speak_namesMod, 
Mod)

Warning message:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement),  :
longer object length is not a multiple of shorter object length

I don't know why I'm getting this error because this identical function yields TRUE:

identical(length(speak_namesMod), length(Mod))

The dataframe that I'm working with is 16,244 lines long if that makes any difference to the pattern or replacement. I'm stuck and trying to find out why this isn't working and/or another solution that does not involve typing out each character element in the vectors.

Thank you!

J.Sabree
  • 2,280
  • 19
  • 48
  • 1
    I don't understand why the length of `Misc` is 2 and the length of `Mod` is 28. Shouldn't they be the same? Also can you post what `dput(head(speak_names))` returns? I don't have a sense for what the data looks like. – johnckane Jun 13 '18 at 17:38
  • they can be of unequal lengths but repeated.. – Onyambu Jun 13 '18 at 18:18

1 Answers1

11
library('dplyr') # load the dplyr package
library('stringr') # load the stringr package

#Here is a sample of my own dataset to answer your question dput() of my data gives

abc<-as.data.frame(
structure(list(Name = c("ME-9_ 005", "ME-9_ 004", "ME-9_ 003", 
                        "ME-9_ 002", "ME-9_ 001", "ME-9_ 000", "ME-8_ 005", "ME-8_ 004", 
                        "ME-8_ 003", "ME-8_ 002", "ME-8_ 001", "ME-8_ 000", "ME-7_ 005", 
                        "ME-7_ 004", "ME-7_ 003", "ME-7_ 002", "ME-7_ 001", "ME-7_ 000"
), Mg = c(0.411058647473409, 0.361611969040526, 0.435757145931429, 
          0.36656632349025, 0.312782034685408, 0.357913661160629, 0.414639893651842, 
          0.460992875568015, 0.554803107534663, 0.418743792959099, 0.499114614445091, 
          0.475374442706501, 0.564660334010035, 0.502678818989733, 0.417617035801997, 
          0.488463005872639, 0.484776757286094, 0.424850010858818),
Al = c(0.575667101719941,  0.586351493923602, 0.574053324307634, 0.628497798862674, 0.552234153060378, 
       0.580547408629286, 1.05746950789483, 1.07094531357244, 1.11340157804305, 
       1.03043684466386, 1.02899468191215, 1.07222457991059, 1.5276908007952, 
       1.66549994904359, 1.43287302441973, 1.37434198093964, 1.55835986529032, 
       1.66902429579112), 
Si = c(0.495188340689301, 0.513374456164654, 
       0.51809643007659, 0.569128515813393, 0.542590350648068, 0.516673370168739, 
       1.72437228079744, 1.59076392020817, 1.77327433861292, 1.76671780355934, 
       1.60625706442694, 1.92449284567535, 3.27248599245035, 3.23739024834759, 
       2.84115179036218, 2.51112086010829, 2.98829002803169, 2.93347114563903
), 
P = c(0.222881184902066, 0.258237982165306, 0.230235867213535, 
      0.262379290809071, 0.230438623604524, 0.238615393939999, 0.260241811918024, 
      0.238785817517132, 0.248589968755681, 0.248270048794532, 0.272489046130942, 
      0.266707140244041, 0.25935282543278, 0.258801008935983, 0.250692297246152, 
      0.246890941447243, 0.277698144829677, 0.274197618349091)), 
row.names = c(NA, 
              -18L), class = c("tbl_df", "tbl", "data.frame")))

#here is how my data looked before cleaning

head(abc,10)

Before

But for your specific question, you should do

abc$Name <- str_replace_all(
  abc$Name, # column we want to search
  c("001" = "","002" = "","003" = "","004" = "","005" = "","000" = "",
    "-" = " ","_" = "") # each string schould be matched with a replacement
)

#here is how my data looked after cleaning

head(abc,10)

After

I hope this helps

Hammao
  • 801
  • 1
  • 9
  • 28