
I have to find the indices of 1MM numeric values within a vector of roughly 10MM values. I found the package fastmatch, but when I use the function fmatch(), it returns only the index of the first match.

Can someone help me use this function to find all matches, not just the first? I realize this is a basic question, but the online documentation is pretty sparse, and fmatch has cut the computing time down considerably.

Thanks so much!


Here is some sample data - for the purposes of this exercise, let's call this data frame A:

              DateTime     Address       Type     ID
1  2014-03-04 20:21:03   982076970          1  2752394
2  2014-03-04 20:21:07 98174238211          1  2752394
3  2014-03-04 20:21:08 76126162197          1  2752394
4  2014-03-04 20:21:16  6718053253          1  2752394
5  2014-03-04 20:21:17 98210219176          1  2752510
6  2014-03-04 20:21:20  7622877100          1  2752510
7  2014-03-04 20:21:23  2425126157          1  2752510
8  2014-03-04 20:21:23  2425126157          1  2752510
9  2014-03-04 20:21:25   701838650          1  2752394
10 2014-03-04 20:21:27 98210219176          1  2752394

What I wish to do is find the number of unique Type values for each Address. There are several million rows of data with roughly 1MM unique Address values; on average, each Address appears about 6 times in the data set. Though the Type values listed above are all 1, they can take any value from 0:5. I also realize the Address values are quite long, which adds to the time required for the matching.
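
For the sample above, the result I am after is one row per unique Address together with its count of distinct Type values; since every Type shown here happens to be 1, each count in this small sample would be 1, something like:

     Address  NoOfTypes
   982076970          1
 98174238211          1
 76126162197          1
  6718053253          1
 98210219176          1
  7622877100          1
  2425126157          1
   701838650          1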

I have tried the following:

uvals  <- unique(A$Address)              ## the ~1MM unique addresses
utypes <- matrix(0, length(uvals), 2)    ## col 1: address, col 2: count of unique types
utypes[, 1] <- uvals

for (i in 1:length(uvals)) {
    b <- which(A$Address %in% uvals[i])  ## rows belonging to this address
    c <- length(unique(A$Type[b]))       ## number of distinct Type values there
    utypes[i, 2] <- c
}

However, the code above is not very efficient - if I am looping over 1MM values, I estimate this will take 10-15 hours.

I have tried this, as well, within the loop... but it is not considerably faster.

b <- which(A$Address == uvals[i])  

I know there is a more elegant/faster way; I am fairly new to R and would appreciate any help.

Dan_Alexander

4 Answers


This can be done using the unique function from data.table, followed by an aggregation. I'll illustrate it using more or less the sample data generated by @Chinmay:

Create sample data:

set.seed(100L)
dat = data.frame(
         address = sample(1e6L, 1e7L, TRUE), 
           value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
      )

data.table solution:

require(data.table) ## >= 1.9.2
dat.u = unique(setDT(dat), by=c("address", "value"))
ans   = dat.u[, .N, by=address]

Explanation:

  • The setDT function converts a data.frame to a data.table by reference (which is very fast).
  • Calling unique on a data.table dispatches to the unique.data.table method, which is very fast compared to base::unique. After this step we are left with only the unique (address, value) pairs.
  • All that's left is to aggregate, or group by, address and count the observations in each group. The by=address part groups by address, and .N is a built-in data.table variable that gives the number of observations in that group (illustrated with a tiny example below).
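
To make the mechanics concrete, here is a tiny toy run of the same two steps (made-up data, just an illustration and not part of the benchmark below):

require(data.table)
small <- data.frame(address = c(1L, 1L, 2L, 2L, 2L),
                    value   = c(1L, 1L, 3L, 3L, 4L))
u <- unique(setDT(small), by = c("address", "value"))  ## keep unique (address, value) pairs
u[, .N, by = address]                                  ## count them per address
##    address N
## 1:       1 1
## 2:       2 2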

Benchmarks:

I'll create functions to generate the data as a data.table and as a data.frame, to benchmark the data.table answer against the dplyr solution (a) proposed by @beginneR, although I don't see the need for arrange(.) there and will therefore skip that part.

## function to create data
foo <- function(type = "df") {
    set.seed(100L)
    dat = data.frame(
             address = sample(1e6L, 1e7L, TRUE), 
               value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
          )
    if (type == "dt") setDT(dat)
    dat
} 

## DT function
dt_sol <- function(x) {
    unique(x, by=c("address", "value"))[, .N, by=address]
}

## dplyr function
dplyr_sol <- function(x) {
    distinct(x) %>% group_by(address) %>% summarise(N = n_distinct(value))
}

The timings reported here are three consecutive runs of system.time(.) on each function.

## benchmark timings in seconds
##        pkg   run-01   run-02   run-03                                 command
## data.table      2.4      2.3      2.4  system.time(ans1 <- dt_sol(foo("dt")))
##      dplyr     15.3     16.3     15.7   system.time(ans2 <- dplyr_sol(foo()))

For some reason, dplyr automatically orders the result by the grouping variable. So in order to compare the results, I'll also order them in the result from data.table:

system.time(setkey(ans1, address)) ## 0.102 seconds
identical(as.data.frame(ans1), as.data.frame(ans2)) ## TRUE

So, data.table is ~6x faster here.

Note that bit64::integer64 is also supported in data.table. Since you mention the address values are quite long, you could also store them as integer64.
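
A minimal sketch of that idea (assuming the bit64 package is installed; the addresses are taken from the question's sample, the Type values are made up):

require(bit64)
require(data.table)
B <- data.table(Address = as.integer64(c("982076970", "98174238211", "98174238211")),
                Type    = c(1L, 2L, 3L))
unique(B, by = c("Address", "Type"))[, .N, by = Address]
## Address 982076970 has 1 distinct Type, 98174238211 has 2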

Arun
  • @BrodieG, it's the new operator from `magrittr`: github.com/hadley/dplyr/blob/master/NEWS.md – Arun May 15 '14 at 16:58
  • I get the difference down to 6x using `function(x) .Internal(unique(x, FALSE, FALSE, NA))` instead of `unique` in the `dplyr` solution, but still a substantial difference. – BrodieG May 15 '14 at 17:17
  • I cannot find anything about the setDT function nor could I get it to run in R - any documentation you could provide would be appreciated. Thanks Arun. – Dan_Alexander May 15 '14 at 20:20
  • Other than the questioner specifically asking for a "fastmatch" solution, it is good to know that the data.table indexing is so fast. One thing that is less clear with fastmatch is the life cycle of the hashes. DT, I'm guessing, follows the usual R scoping/GC rules. The good thing about fastmatch is that it can in many cases be a drop-in replacement, where DT brings in a completely new syntax, as if R isn't already quirky enough. I know this doesn't answer the question, but this whole discussion highlights a major area of inefficiency in core R that I am struggling with. – Jack Wasey Aug 22 '16 at 17:36

You can try creating an index of your 10MM values and sorting it. Looking up your 1MM values in that indexed vector should then be faster.

For example, using the data.table package you can do that with the setkey function, which sorts and indexes the given column of a data.table.

require(data.table)

set.seed(100)

dat <- sample(1:1e+07, size = 1e+07, replace = T)
searchval <- sample(dat, size = 1e+06)

DT <- data.table(dat, index = seq_along(dat))
setkey(DT, dat)
DT
##                dat   index
##        1:        1  169458
##        2:        1 4604823
##        3:        1 7793446
##        4:        2 5372388
##        5:        3 2036622
##       ---                 
##  9999996:  9999996 1271426
##  9999997:  9999998  530029
##  9999998: 10000000  556672
##  9999999: 10000000 6776063
## 10000000: 10000000 6949665


lookup <- data.table(val = searchval)
setkey(lookup, val)
lookup
##              val
##       1:       2
##       2:      16
##       3:      24
##       4:      33
##       5:      36
##      ---        
##  999996: 9999970
##  999997: 9999973
##  999998: 9999988
##  999999: 9999996
## 1000000: 9999998

Now you can look up all the values from lookup in DT simply by using

DT[lookup]
##              dat   index
##       1:       2 5372388
##       2:      16  537927
##       3:      16 1721233
##       4:      24 7286522
##       5:      33 7448516
##      ---                
## 2000298: 9999973 8008610
## 2000299: 9999988 3099060
## 2000300: 9999988 7996302
## 2000301: 9999996 1271426
## 2000302: 9999998  530029
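
If what you ultimately want is just a plain vector of the matching row positions in dat (my reading of the original question), you can pull the index column out of that join:

idx <- DT[lookup]$index   ## every row position in dat whose value occurs in searchval
length(idx)               ## 2000302 here: duplicated values in dat match more than once
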
CHP
  • This was quite helpful - thank you very much for the reply. However, I am not quite sure of the speed gains - I presume I still have to loop through the unique Address values and look them up through DT[lookup]. Running a loop over 1% of my data took 5 minutes... extrapolating linearly, we are still talking ~ 8 hours. – Dan_Alexander May 15 '14 at 15:15

The fmatch documentation seems to state clearly that it only finds the first match. And given that it uses an underlying hashing strategy, I imagine it's unlikely that it stores multiple items per key, which is one of the ways it stays so fast (and it's the same way match works).

Do you have many duplicate values? Perhaps you could store those in a separate place/table and create a fast index pointing to a list of possible matches. It would be more helpful if you provided sample data representative of what you're trying to do, along with the code you tried, so we could see whether it would be easy to extend.
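
One base-R way to build such an index of all positions per value up front (a rough sketch of the idea above, not something fastmatch itself provides) is split():

x    <- c(5, 3, 5, 8, 3, 5)      ## toy stand-in for the 10MM-value vector
vals <- c(5, 3)                  ## toy stand-in for the 1MM values to locate

pos <- split(seq_along(x), x)    ## named list: value -> all positions where it occurs
pos[as.character(vals)]
## $`5`
## [1] 1 3 6
## $`3`
## [1] 2 5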

MrFlick

If I understand your question correctly, you can also do this with dplyr:

I will include two different ways, since I am not entirely sure which is your desired output.

First create some sample data:

Address <- rep(letters, 5)
Type <- sample(1:5, size=5*26, replace=T)
A <- data.frame(Address, Type)

Then install and load dplyr

require(dplyr)

a) To find the number of different Type values for each Address value:

A %.% arrange(Address, Type) %.% group_by(Address) %.% summarize(NoOfTypes = length(unique(Type)))

b) To find all unique combinations of Address and Type:

A %.% arrange(Address, Type) %.% group_by(Address, Type) %.% filter( 1:n() == 1)
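
As a small aside, and assuming a dplyr version that already ships n_distinct(), option (a) can also be written without length(unique(...)):

A %.% group_by(Address) %.% summarise(NoOfTypes = n_distinct(Type))
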
talat
  • The question was "Can someone help me use this function to find all values, not just the first?" Where "this function" is "fmatch." – Jack Wasey Aug 22 '16 at 17:32