Binary search for integer64 in data.table

Question

I have a integer64 indexed data.table object:

library(data.table)
library(bit64)

some_data = as.integer64(c(1514772184120000026, 1514772184120000068, 1514772184120000042, 1514772184120000078,1514772184120000011, 1514772184120000043, 1514772184120000094, 1514772184120000085,
1514772184120000083, 1514772184120000017, 1514772184120000013, 1514772184120000060, 1514772184120000032, 1514772184120000059, 1514772184120000029))

#
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)

Then I have a bunch of other integer64 that I need to binary-search for in the indexes of my original data.table object (x):

search_values <- some_data[(n+1):length(some_data)]

If these where native integers I could use findInterval() to solve the problem:

values_index  <- findInterval(search_values, x$new_col)

but when the arguments to findInterval are integer64, I get:

Warning messages:
1: In as.double.integer64(vec) :
  integer precision lost while converting to double
2: In as.double.integer64(x) :
  integer precision lost while converting to double

and wrong indexes:

> values_index
[1] 10 10 10 10 10

e.g. it is not true that the entries of search_values are all larger than all entries of x$new_col.

Edit:

Desired output:

print(values_index)
9 10  6 10  1

Why?:

value_index has as many entries as search_values. For each entries of search_values, the corresponding entry in value_index gives the rank that entry of search_values would have if it where inserted inside x$new_col. So the first entry of value_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.

can you edit your example with fixed `now` variable and show your desired output? — minem, Aug 08 '18 at 11:04

score 1 · Answer 1 · answered Aug 08 '18 at 13:42

If I get what you want, then a quick workaround could be:

toadd <- search_values[!(search_values %in% x$new_col)] # search_values that is not in data
x[, i := .I] # mark the original data set
x <- rbindlist(list(x, data.table(new_col = toadd)),
               use.names = T, fill = T) # add missing search_values
setkey(x, new_col) # order
x[, index := new_col %in% search_values] # mark where the values are
x[, index := cumsum(index)] # get indexes
x <- x[!is.na(i)] # remove added rows
x$index # should contain your desired output

Sorry I had to run and could just see your comment now. Saddly `x$index` is not what I am after. I edited the question. — user189035, Aug 08 '18 at 15:51

minem · Accepted Answer · 2018-08-09T11:45:11.487

Maybe you want something like this:

findInterval2 <- function(y, x) {
  toadd <- y[!(y %in% x$new_col)] # search_values that is not in data
  x2 <- copy(x)
  x2[, i := .I] # mark the original data set
  x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
                  use.names = T, fill = T) # add missing search_values
  setkey(x2, new_col) # order
  x2[, index := cumsum(!is.na(i))]
  x2[match(y, new_col), index]
}
# x2 is:
#              a             new_col  i index
#  1: 0.56602278 1514772184120000011  1     1
#  2:         NA 1514772184120000013 NA     1
#  3: 0.29408237 1514772184120000017  2     2
#  4: 0.28532378 1514772184120000026  3     3
#  5:         NA 1514772184120000029 NA     3
#  6:         NA 1514772184120000032 NA     3
#  7: 0.66844754 1514772184120000042  4     4
#  8: 0.83008829 1514772184120000043  5     5
#  9:         NA 1514772184120000059 NA     5
# 10:         NA 1514772184120000060 NA     5
# 11: 0.76992760 1514772184120000068  6     6
# 12: 0.57049677 1514772184120000078  7     7
# 13: 0.14406169 1514772184120000083  8     8
# 14: 0.02044602 1514772184120000085  9     9
# 15: 0.68016024 1514772184120000094 10    10
findInterval2(search_values, x)
# [1] 1 5 3 5 3

If not, then maybe you could change the code as needed.

update

look at this integer example to see that this function gives the same result as base findInterval

now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]

r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)

Thanks, but as you can see the output is not the expected one. — user189035, Aug 09 '18 at 11:35
@user189035 look at your example data then once more! there is not 1514772184120000045 value. — minem, Aug 09 '18 at 11:39
to see the problem do the following changes in your test code: `now<- as.integer64(1514772184120000026)`; `some_data = as.integer64(now + sample.int(n + n2, n + n2))`. In other words: the problem is when the data is in `integer64` format. I'm sure `findInterval/findInterval2` work with the native R data formats — user189035, Aug 09 '18 at 12:13
@ no, you do not understand what I am saying. Currently findInterval2 should work for integer64 and integer. Test it out. the example of integers is only to show that the function does what findInterval do. but maybe you all along did want some kind of different results? — minem, Aug 09 '18 at 13:13
sorry. Please gimme a bit of time to evaluate your answer (I moved to a different sub-task: I will come back to this asap) — user189035, Aug 09 '18 at 13:53

Binary search for integer64 in data.table

Edit:

2 Answers2

update