I have a DF with 800k+ rows with repeated (random) values. For each row I need to take a value and find an index of a new row(s) with same value. E.g. "asd" - where else do I see it? The index of the current row is NOT needed.
My current solution: subset a DF and create a temp frame/table by removing current row. Problem - it takes a minute per 1000 iterations. So 800+k rows will take me 13 hours to run. Any ideas? Thanks!
Running on original DF (not subsetted) is < 1 second, but as you can imagine it gives me the index of the current row.
Edit: My real-life DF is more than 1 column. Example below is simplified. I need to take V1[1]
and get row numbers of other V1
with value of V1[1]
, then repeat for V1[2]
and so on for each row
library(fastmatch)
library(stringi)
set.seed(12345)
V1 = stringi::stri_rand_strings(800000, 3)
df0 = as.data.table(V1)
mapped = matrix("",nrow=800000)
print(Sys.time())
for (i in 1:1000) {
tmp_df = df0[-i,] #This takes very long time!!!
mapped[i] = fmatch(df0$V1[i],tmp_df$V1)
}
print(Sys.time())
View(mapped)