I have two data frames that look like this, although the real ones are much larger: the first has over 90 million rows and the second a little over 14 million. The second data frame is also in random order.
df1 <- data.frame(
  datalist = c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
               "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
               "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
               "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
               "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
               "the first known use of this word was in 1539"),
  words = c("anarchist_schools_of_thought individualism collectivism", "social_anarchism individualist_anarchism",
            "anarchy -ism", "privative privative_alpha", "infinitive", ""),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  vocabword = c("anarchist_schools_of_thought", "individualism", "collectivism", "1965-66_nhl_season_by_team", "social_anarchism", "individualist_anarchism",
                "anarchy", "-ism", "privative", "privative_alpha", "1310_the_ticket", "infinitive"),
  token = c("Anarchist_schools_of_thought", "Individualism", "Collectivism", "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism", "Anarchy",
            "-ism", "Privative", "Alpha_privative", "KTCK_(AM)", "Infinitive"),
  stringsAsFactors = FALSE)
I was able to extract all the words that follow "wiki/" into another column (words). Each of those words needs to be replaced by the value in the token column whose vocabword matches it in the second data frame. For example, the word "anarchist_schools_of_thought" follows wiki/ in the first row of the first data frame. I would look up "anarchist_schools_of_thought" under vocabword in the second data frame and replace it with the corresponding token, which is "Anarchist_schools_of_thought".
So the datalist column should end up looking like this:
1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein
6 the first known use of this word was in 1539
I realize that many of the tokens just capitalize the first letter of the word, but some are significantly different. I could use a for loop, but I think that would take far too long, so I'd prefer a data.table approach or possibly a stringi or stringr one. Normally I would just do a merge, but since multiple words need to be replaced in a single row, that complicates things.
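To make the intended replacement concrete, here is a minimal base R sketch of the lookup described above, run on a cut-down copy of the data. The names lookup, swap_tokens, hits and result are made up for illustration. It assumes that every vocabword is unique and that a wiki/ term runs up to the next whitespace. It has not been tried at the 90-million-row scale, so it is only a statement of the logic, not a tuned solution:

```r
# Cut-down copies of the data above, so the sketch runs on its own.
df1 <- data.frame(
  datalist = c(
    "wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
    "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
    "the first known use of this word was in 1539"),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  vocabword = c("anarchist_schools_of_thought", "individualism", "collectivism",
                "privative", "privative_alpha"),
  token = c("Anarchist_schools_of_thought", "Individualism", "Collectivism",
            "Privative", "Alpha_privative"),
  stringsAsFactors = FALSE)

# Named vector used as a dictionary: names are vocabwords, values are tokens.
# Assumption: vocabword values are unique (otherwise the first match wins).
lookup <- setNames(df2$token, df2$vocabword)

# Hypothetical helper: map a vector of terms to their tokens, leaving any
# term without a dictionary entry unchanged.
swap_tokens <- function(terms) {
  tokens <- unname(lookup[terms])
  unknown <- is.na(tokens)
  tokens[unknown] <- terms[unknown]
  tokens
}

# Locate every run of non-space characters that directly follows "wiki/".
# Assumption: a term ends at the next whitespace.
hits <- gregexpr("(?<=wiki/)\\S+", df1$datalist, perl = TRUE)

# Replace the matched terms in place; rows without a match are left as they are.
result <- df1$datalist
regmatches(result, hits) <- lapply(regmatches(result, hits), swap_tokens)

print(result)
```

Because the terms are found in datalist directly, this sketch does not use the words column at all. The per-row lapply is the part most likely to be slow on the full data.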
Thanks in advance for any help.