I have to find indices for 1MM numeric values within a vector of roughly 10MM values. I found the package fastmatch, but when I use the function fmatch(), it only returns the index of the first match.
Can someone help me use this function to find all matching indices, not just the first? I realize this is a basic question, but the online documentation is pretty sparse and fmatch has cut down the computing time considerably.
Thanks so much!
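For illustration, here is a minimal sketch of the first-match behavior and one possible way to recover every position. It uses base R's match() in place of fmatch() (fastmatch describes fmatch() as a faster replacement for match(), so the same first-match semantics are assumed here). The vectors haystack and needles are made-up toy data, not the real values.

```r
# Toy data standing in for the real vectors (values are made up).
haystack <- c(10, 20, 10, 30, 20, 10)
needles  <- c(10, 20)

# match() reports only the first position of each needle.
first_pos <- match(needles, haystack)   # 1 2

# Grouping all positions by value keeps every occurrence.
pos_by_value <- split(seq_along(haystack), haystack)

# Look up the groups for the values of interest (list names are character).
all_pos <- pos_by_value[as.character(needles)]
# all_pos[["10"]] is 1 3 6, all_pos[["20"]] is 2 5
```

The split() call passes over the vector once, so the per-value lookup afterwards does not rescan the 10MM values for each needle. How this scales on the full data has not been measured.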
Here is some sample data - for the purposes of this exercise, let's call this data frame A:
              DateTime     Address Type      ID
1  2014-03-04 20:21:03   982076970    1 2752394
2  2014-03-04 20:21:07 98174238211    1 2752394
3  2014-03-04 20:21:08 76126162197    1 2752394
4  2014-03-04 20:21:16  6718053253    1 2752394
5  2014-03-04 20:21:17 98210219176    1 2752510
6  2014-03-04 20:21:20  7622877100    1 2752510
7  2014-03-04 20:21:23  2425126157    1 2752510
8  2014-03-04 20:21:23  2425126157    1 2752510
9  2014-03-04 20:21:25   701838650    1 2752394
10 2014-03-04 20:21:27 98210219176    1 2752394
What I wish to do is find the number of unique Type values for each Address. There are several million rows of data with roughly 1MM unique Address values; on average, each Address appears about 6 times in the data set. Although the Type values listed above are all 1, they can take any value from 0:5. I also realize the Address values are quite long, which adds to the time required for the matching.
I have tried the following:
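uvals <- unique(A$Address)
# Column 1 holds each unique Address, column 2 its number of distinct Type values
utypes <- matrix(0, length(uvals), 2)
utypes[, 1] <- uvals
for (i in seq_along(uvals)) {
  # Rows of A whose Address equals the i-th unique value
  b <- which(A$Address %in% uvals[i])
  utypes[i, 2] <- length(unique(A$Type[b]))
}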
However, the code above is not very efficient - if I am looping over 1MM values, I estimate this will take 10-15 hours.
I have also tried this within the loop, but it is not considerably faster:
b <- which(A$Address == uvals[i])
I know there must be a more elegant and faster way. I am fairly new to R and would appreciate any help.
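One loop-free direction, sketched here under assumptions rather than offered as a tested solution: reduce the data to distinct (Address, Type) pairs, then count how many pairs remain per Address. The small data frame below is made-up stand-in data with the same column names as A, and nTypes is a hypothetical column name chosen for this example.

```r
# Small stand-in for data frame A (values are made up for illustration).
A <- data.frame(
  Address = c(982076970, 98174238211, 2425126157, 2425126157, 982076970),
  Type    = c(1, 1, 0, 3, 1)
)

# Keep one row per distinct (Address, Type) pair.
pairs <- unique(A[, c("Address", "Type")])

# Each remaining row is one distinct Type for its Address, so counting
# rows per Address gives the number of unique Type values.
counts <- table(pairs$Address)

# Assemble the result; table() stores the Address values as character names.
utypes <- data.frame(
  Address = as.numeric(names(counts)),
  nTypes  = as.vector(counts)
)
```

Both unique() and table() are vectorized, so the work is done in a few passes over the data instead of one scan of A per unique Address. Timings on several million rows have not been checked here.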