Fast way for String matching and replacement from another dataframe in R

Question

I have two dataframes that look like this (although the first one is over 90 million rows long and the second dataframe is a little over 14 million rows) Also the second dataframe is randomly ordered

df1 <- data.frame(
  datalist = c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
               "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
               "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
               "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
               "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
               "the first known use of this word was in 1539"),
  words = c("anarchist_schools_of_thought  individualism  collectivism", "social_anarchism  individualist_anarchism",
            "anarchy  -ism", "privative  privative_alpha", "infinitive", ""),

  stringsAsFactors=FALSE)

df2 <- data.frame(
  vocabword = c("anarchist_schools_of_thought", "individualism","collectivism" , "1965-66_nhl_season_by_team","social_anarchism","individualist_anarchism",                
                 "anarchy","-ism","privative","privative_alpha", "1310_the_ticket",  "infinitive"),
  token = c("Anarchist_schools_of_thought" ,"Individualism", "Collectivism",  "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism" ,"Anarchy",
            "-ism", "Privative" ,"Alpha_privative", "KTCK_(AM)" ,"Infinitive"), 
  stringsAsFactors = F)

I was able to extract all the words that come after the phrase "wiki/" into another column. Those words need to be replaced by the token column which matches to vocabword in the second dataframe. So for example I would look at the work "anarchist_schools_of_thought" which comes after wiki/ in the first row of the 1st dataframe, and then find the term "anarchist_schools_of_thought" in the second dataframe under vocab word and I want to replace it with the corresponding token which is "Anarchist_schools_of_thought".

So it should eventually come to look like this:

1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein
6 the first known use of this word was in 1539

I realize that a lot of them just capitalize the first letter of the words but some of them are significantly different. I could do a for loop but I think that would take way too much time and I'd prefer to do this either a data.table way or possibly a stringi or stringr way. And I normally would just do a merge but since there's multiple words needing replaced in a single row, it complicates things.

Thanks in advance for any help.

How is this different from your question from yesterday? https://stackoverflow.com/q/50241313/5325862 — camille, May 09 '18 at 16:59
i need the text replaced. i thought i could figure it out if i separated out some of the text but i've been working on it and have gotten nowhere. — Kayla, May 09 '18 at 17:00
Can you include the code from the previous post + what you've done since? That way we know where you left off and what you need for the next step — camille, May 09 '18 at 17:02
I used this: data$words = trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE)) and that's literally it. like i said, i could do a for loop but it would be extremely slow. i haven't really gotten anywhere since yesterday — Kayla, May 09 '18 at 17:04
That's fine, but that code is important to include in the body of the question. A loop would definitely be slower than vector operations, so I think the previous post has you on the right track — camille, May 09 '18 at 17:06

acylam · Answer 1 · 2018-05-09T17:09:55.163

2

You can do this with str_replace_all from stringr:

library(stringr)

str_replace_all(df1$datalist, setNames(df2$vocabword, df2$token))

Basically, str_replace_all allows you to supply a named vector with original strings being the names and the replacement being the elements of the vector. You did all the hard work by creating a "dictionary" of strings and replacements. str_replace_all simply took that and do the replacement automatically.

Result:

[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Privative_alpha an- i.e"                                              
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
[6] "the first known use of this word was in 1539"

edited May 09 '18 at 17:09

answered May 09 '18 at 17:05

acylam

18,231
5
36
45

is there a stringi variant? I've been running it on 1000 rows for a minute and it's still going – Kayla May 09 '18 at 17:09
@Kayla I notice that "1965-66_nhl_season_by_team" was never in your `datalist`, did you intentionally add that in as a non-match? – acylam May 09 '18 at 17:48
yeah i added that as a non match – Kayla May 09 '18 at 18:00
is there a way to do this if all of df2 is in a totally random order? it's like 14 million rows too – Kayla May 09 '18 at 18:37
@Kayla is it ok for the logic to automatically exclude the non-matches if it doesn't appear in `df1$words`? – acylam May 09 '18 at 18:43
i'm not sure what you mean by that. can you give me an example? – Kayla May 09 '18 at 18:54
@Kayla for example since "1965-66_nhl_season_by_team" didn't appear in `df1$words`, I know that it is a non-match. Is it ok if I come up with a solution that doesn't actually search for a match with "1965-66_nhl_season_by_team" since I know that it is a non-match by definition of `df1$words`? – acylam May 09 '18 at 18:58
yeah that's fine – Kayla May 09 '18 at 19:18

score 0 · Answer 2 · answered May 09 '18 at 17:19

This question has a solution that seems to work well with your data: R: replacing multiple regex with sub

install.packages('qdap')
qdap::mgsub(df2[,1], df2[,2], df1[,1])

[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_Privative an- i.e"                                              
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
[6] "the first known use of this word was in 1539"

Matt Summersgill · Answer 3 · 2018-05-10T16:00:24.040

The way I've typically done this using straight stringi is as follows:

library(stringi)

Old <- df2[["vocabword"]]
New <- df2[["token"]]

stringi::stri_replace_all_regex(df1[["datalist"]],
                                "\\b"%s+%Old%s+%"\\b",
                                New,
                                vectorize_all = FALSE)

#[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
#[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
#[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
#[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e"                                              
#[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
#[6] "the first known use of this word was in 1539"

Theoretically, you should be able to get a reasonable improvement by parallelizing this the right way, but you won't be able to get any better than a Nx speed-up (Where N = # of cores available). -- My hunch is that cutting run-time from something like 8 months down to 15 days still doesn't really help you in a practical sense.

However, if you have 14 million potential replacements to make over 90 million rows, it seems like a fundamentally different approach might be necessary. What's the max number of words in any sentence?

Update: Adding some example code to benchmark potential solutions:

Adding to your additional sentences using stringi::stri_rand_lipsum() and adding additional replacement pairs using stringi::stri_rand_strings() makes it easier to see the effects of an increasing corpus size and vocabulary size on run-time.

With 1,000 sentences:

1,000 replacement pairs: 3.9 seconds.
10,000 replacement pairs: 36.5 seconds.
100,000 replacement pairs: 365.4 seconds.

I'm not going to try 14 million, but this should help you evaluate if alternative methods will scale.

library(stringi)

ExtraSentenceCount <- 1e3
ExtraVocabCount <- 1e4

Sentences <- c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
               "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
               "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
               "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
               "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
               "the first known use of this word was in 1539",
               stringi::stri_rand_lipsum(ExtraSentenceCount))

vocabword <- c("anarchist_schools_of_thought", "individualism","collectivism" , "1965-66_nhl_season_by_team","social_anarchism","individualist_anarchism",                
           "anarchy","-ism","privative","privative_alpha", "1310_the_ticket",  "infinitive",
           "a",
           stringi::stri_rand_strings(ExtraVocabCount,
                                      length = sample.int(8, ExtraVocabCount, replace = TRUE),
                                      pattern = "[a-z]"))

token <- c("Anarchist_schools_of_thought" ,"Individualism", "Collectivism",  "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism" ,"Anarchy",
           "-ism", "Privative" ,"Alpha_privative", "KTCK_(AM)" ,"Infinitive",
           "XXXX",
           stringi::stri_rand_strings(ExtraVocabCount,
                                      length = 3,
                                      pattern = "[0-9]"))

system.time({
  Cleaned <- stringi::stri_replace_all_regex(Sentences, "\\b"%s+%vocabword%s+%"\\b", token, vectorize_all = FALSE)
})

#   user  system elapsed 
# 36.652   0.070  36.768 

head(Cleaned)

# [1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism 749 complete wiki/Collectivism"                
# [2] "strains 454 anarchism have often been divided into the categories 454 wiki/Social_anarchism and wiki/Individualist_anarchism 094 similar dual classifications"
# [3] "the word 412 composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek 190.546"                             
# [4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative 358- 190.546"                                            
# [5] "authority sovereignty realm magistracy and the suffix 094 -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                         
# [6] "the first known use 454 this word was 201 1539"

Update 2: The method below doesn't account for the possibility that you have tags that are sub-strings of another-- i.e. wiki/Individualist and wiki/Individualist_anarchism could give you erroneous results. The only way that I really know to avoid that is using regex/word replacement of full words preceded and followed by word-boundaries (\\b), which can't be based on a fixed string.

One option that might give you hope relies on the fact that you essentially have all of your desired replacements marked with the prefix wiki/. If this is the case for your actual use, then we can take advantage of that and use fixed replacements instead of regex replacement of full words preceded and followed by word-boundaries (\\b). (this is necessary to avoid a vocab word like "ism" being replaced when it occurs as part of a longer word)

Using the same list as above:

prefixed_vocabword <- paste0("wiki/",vocabword)
prefixed_token <- paste0("wiki/",token)

system.time({
  Cleaned <- stringi::stri_replace_all_fixed(Sentences, prefixed_vocabword, prefixed_token, vectorize_all = FALSE)
})

~~That cuts run-time down to 10.4 seconds with 1,000 sentences and 10,000 replacements, but since the run-time is still grows linearly this would still take hours for your data size.~~

score 0 · Answer 4 · answered May 10 '18 at 15:44

Because each of your terms begins with "wiki/" it is possible to re-arrange your dataset to make it MUCH easier to create matches. The method I'm putting forward is to move each "wiki/term" to its own row of the dataframe, use a join to match up words which is efficient, and then reverse the steps to put the strings back together the way they were but with the new terms in them.

library(tidyverse)
df1a <- df1 %>%
  # Create a separator character to identify where to split
  mutate(datalist = str_replace_all(datalist,"wiki/","|wiki/")) %>% 
  mutate(datalist = str_remove(datalist,"^\\|"))

  # Split so that each instance gets its own column
df1a <- 
  str_split(df1a$datalist,"\\|",simplify = TRUE) %>% 
  as.tibble() %>% 
  # Add a rownum column to keep track where to put back together for later
  mutate(rownum = 1:n()) %>% 
  # Gather the dataframe into a tidy form to prepare for joining
  gather("instance","text",-rownum,na.rm = TRUE) %>% 
  # Create a column for joining to the data lookup table
  mutate(keyword = text %>% str_extract("wiki/[^ ]+") %>% str_remove("wiki/")) %>% 
  # Join the keywords efficiently using left_bind
  left_join(df2,by = c("keyword" = "vocabword")) %>% 
  # Put the results back into the text string
  mutate(text = str_replace(text,"wiki/[^ ]+",paste0("wiki/",token))) %>%
  select(-token,-keyword) %>% 
  # Spread the data back out to the original number of rows
  spread(instance,text) %>% 
  # Re-combine the sentences/strings to their original form
  unite("datalist",starts_with("V"),sep="") %>%
  select("datalist")

Results:

# A tibble: 6 x 1
  datalist                                                                                                 
  <chr>                                                                                                    
1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individ~
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Indiv~
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively~
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an-~
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive su~
6 the first known use of this word was in 1539

1,020,000 observations in df1 (6*170k) and 12 observations in df2 (no change) results in a run time of 3.29 minutes on my laptop. Obviously, adding terms to df2 will increase the amount of time this uses. But the only increase in time will be for the left_join statement which is fairly efficient. The other lines of code won't change time required if df2 is increased in size. — Adam Sampson, May 10 '18 at 15:58

Fast way for String matching and replacement from another dataframe in R

4 Answers4