Quick search in data.table or quick subset

Question

I have a DF with 800k+ rows with repeated (random) values. For each row I need to take a value and find an index of a new row(s) with same value. E.g. "asd" - where else do I see it? The index of the current row is NOT needed.

My current solution: subset a DF and create a temp frame/table by removing current row. Problem - it takes a minute per 1000 iterations. So 800+k rows will take me 13 hours to run. Any ideas? Thanks!

Running on original DF (not subsetted) is < 1 second, but as you can imagine it gives me the index of the current row.

Edit: My real-life DF is more than 1 column. Example below is simplified. I need to take V1[1] and get row numbers of other V1 with value of V1[1], then repeat for V1[2] and so on for each row

library(fastmatch)
library(stringi)
set.seed(12345)
V1 = stringi::stri_rand_strings(800000, 3)
df0 = as.data.table(V1)
mapped = matrix("",nrow=800000)

print(Sys.time())
for (i in 1:1000) {
  tmp_df = df0[-i,] #This takes very long time!!!
  mapped[i] = fmatch(df0$V1[i],tmp_df$V1)
}
print(Sys.time())

View(mapped)

It is really unclear to understand what you're trying to do: _"For each row I need to take a value"_ what is "a value"? You essentially have a dataframe with only one column, so there isn't "a value", rather you only have one entry per row and only one. _"where else do I see it?"_ where else do you see what? The same string? — gented, Mar 14 '18 at 01:39
I need to take `V1[1]` and get row numbers of other `V1` with value of `V1[1]`, and so on for each row — Alexey Ferapontov, Mar 14 '18 at 01:41

Sathish · Accepted Answer · 2018-03-18T20:30:58.937

2

Data:

library("data.table")
set.seed(12345)
V1 = stringi::stri_rand_strings(80, 3)
df0 <- data.table( sample(V1, 100, replace = TRUE ))

Code:

df0[, id := list(list(.I)), by = V1]  # integer id

Output:

head(df0, 10)
#     V1          id
# 1: iuR      1,2,21
# 2: iuR      1,2,21
# 3: KXc           3
# 4: LwA           4
# 5: pYn           5
# 6: qoN        6,66
# 7: 5Xt           7
# 8: wBH        8,77
# 9: V9r     9,39,54
# 10: 9ks 10,28,42,48

EDIT - Removed Current Index:

df0[, id2 := 1:.N ]
df0[, id := list(list(unlist(id)[ unlist(id) != .I  ] )), by = id2 ]
df0[, id2 := NULL ]
df0[ lengths(id) > 0, ]
head( df0, 10 )
#     V1       id
# 1: iuR     2,21
# 2: iuR     1,21
# 3: KXc         
# 4: LwA         
# 5: pYn         
# 6: qoN       66
# 7: 5Xt         
# 8: wBH       77
# 9: V9r    39,54
# 10: 9ks 28,42,48

edited Mar 18 '18 at 20:30

answered Mar 14 '18 at 02:03

Sathish

12,453
3
41
59

Almost. Current index is not needed, i.e. 3rd row doesn't need "3" in `rowID` – Alexey Ferapontov Mar 14 '18 at 02:07
My seed was messed up. I mean - look at row 1, `iuR`. The code found two more rows in DF with `iuR` AND the original row 1. But it didn't happen for rows 5 or 7.. hmm – Alexey Ferapontov Mar 14 '18 at 02:11
Oh, almost almost, Sathish! Row 97: `97 TbJ 12,73,86,97`. That last 97 is not needed as it repeats the current row – Alexey Ferapontov Mar 14 '18 at 02:19
Yes. I guess it will be more efficient if I unlist and remove current index after that step. Thank you!!! – Alexey Ferapontov Mar 14 '18 at 02:57

Quick search in data.table or quick subset

1 Answers1