A C++ version of the %in% operator in R

Question

Is there any function in C++ equivalent to %in% operator in R? Consider the command below in R:

which(y %in% x)

I tried to find something equivalent in C++ (specifically in Armadillo) and I couldn't find anything. I then wrote my own function which is very slow compared to the R command above.

Here is what I wrote:

#include <RcppArmadillo.h>
// [[Rcpp::depends("RcppArmadillo")]]

// [[Rcpp::export]]
arma::uvec myInOperator(arma::vec myBigVec, arma::vec mySmallVec ){
 arma::uvec rslt = find(myBigVec == mySmallVec[0]);
 for (int i = 1; i < mySmallVec.size(); i++){
   arma::uvec rslt_tmp = find(myBigVec == mySmallVec[i]);
   rslt = arma::unique(join_cols( rslt, rslt_tmp ));
 }
 return rslt;
}

Now after sourcing in the code above, we have:

x <- 1:4
y <- 1:10
res <- benchmark(myInOperator(y, x), which(y %in% x), columns = c("test",
      "replications", "elapsed", "relative", "user.self", "sys.self"), 
       order = "relative")

And here are the results:

                 test replications elapsed relative user.self sys.self
 2    which(y %in% x)          100   0.001        1     0.001        0
 1 myInOperator(y, x)          100   0.002        2     0.001        0

Could anyone guide me either on finding a C++ code corresponding to which(y %in% x) or on making my code more efficient? The elapsed time is already very small for both functions. I guess what I meant by efficiency is more from programming perspective and on whether the way I thought about the problem and the commands I used are efficient.

I appreciate your help.

It will be hard to beat `which(y %in% x)`, since `%in%` (calling `match`) and `which` are already `.Internal` functions, and hence implemented in C and likely optimized. It is possible that one could improve performance by avoiding the temporary logical vector generated by `%in%` though. — Kevin Ushey, Jan 26 '14 at 04:09
It would probably help a little if you pass stuff by reference instead of making copies, `(const arma::vec & myBigVec, const arma::vec & mySmallVec )` — OneOfOne, Jan 26 '14 at 04:20
I dont't know enough about RCpp (and nothing of Armidillo) so I can't answer the question. However, if I were doing this in C++, I would look to `std::set_intersection`. — Matthew Lundberg, Jan 26 '14 at 04:21
@KevinUshey, would you happen to know a way to use %in% in C++ by using Rcpp? Is there a way to call which and %in%. While I can easily call other R functions, I was not successful to use which and I have no idea how to use %in%. I really appreciate your help. — Sam, Jan 26 '14 at 04:21
For a question like this (how to duplicate/imitate feature X of language Y in language Z) it's *extremely* helpful if you describe exactly what the feature does--or at least the subset of its capabilities you care about. — Jerry Coffin, Jan 26 '14 at 04:36
@JerryCoffin `y %in% x` returns a logical vector the same length as `y`, value `TRUE` if the element of `y` is in `x`, else `FALSE`. `which` returns the numeric indices (1-based) of the `TRUE` values in a vector. — Matthew Lundberg, Jan 26 '14 at 04:39
@matthewlundberg what is a 'logical vector'? How is it distinct from a `std::vector`? Is it lazy? Do you want `%in%` and `which`, or just `which(a %in% b)`? — Yakk - Adam Nevraumont, Jan 26 '14 at 05:01
@Yakk The question includes the R tag. These of course are constructs of R. A "logical vector" in R only relates to a `std::vector` when RCpp is involved. Perhaps this question should not have the C++ tag applied, as R and RCpp are a bit different. But the message here is, if a question has tags about a language with which you are unfamiliar, you might refrain from answering or commenting. That's what I do with answers with tags in languages that I do not know. — Matthew Lundberg, Jan 26 '14 at 05:06
@matthewlundberg except the OP wants C++ code: I can produce that code if the specs are clear. The set of people competent at two random languages is going to be much smaller: if, as the below answer (now deleted) implies, a pure C++ algorithm on `std::vector` suffices, the `R` tag becomes redundant! — Yakk - Adam Nevraumont, Jan 26 '14 at 05:12
@Yakk But it's not pure! It relies on RCpp and it must be callable from the R interpreter. If you can do that, great. If not, well... — Matthew Lundberg, Jan 26 '14 at 05:13
For the purposes of this question, a pure C++ solution could easily be made callable using `Rcpp`, but it would not necessarily be the fastest due to data copies forced in going from R containers to C++ containers. There are ways around this, but it is not the default behavior. — Kevin Ushey, Jan 26 '14 at 05:34
@MatthewLundberg: We tend to spell it Rcpp (lower-case c) and have done since 2008. — Dirk Eddelbuettel, Jan 26 '14 at 17:07

Kevin Ushey · Accepted Answer · 2014-01-27T20:41:21.390

EDIT: Thanks to @MatthewLundberg and @Yakk for catching my silly errors.

If what you really want is just faster matching, you should check out Simon Urbanek's fastmatch package. However, Rcpp does in fact have a sugar in function which can be used here. in uses some of the ideas from the fastmatch package and incorporates them into Rcpp. I also compare @hadley's solution here.

// [[Rcpp::plugins("cpp11")]]
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<int> sugar_in(IntegerVector x, IntegerVector y) {
  LogicalVector ind = in(x, y);
  int n = ind.size();
  std::vector<int> output;
  output.reserve(n);
  for (int i=0; i < n; ++i) {
    if (ind[i]) output.push_back(i+1);
  }
  return output;
}

// [[Rcpp::export]]
std::vector<int> which_in(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  std::unordered_set<int> z(y.begin(), y.end());
  std::vector<int> output;
  output.reserve(nx);
  for (int i=0; i < nx; ++i) {
    if (z.find( x[i] ) != z.end() ) {
      output.push_back(i+1);
    }
  }
  return output;
}


// [[Rcpp::export]]
std::vector<int> which_in2(IntegerVector x, IntegerVector y) {
  std::vector<int> y_sort(y.size());
  std::partial_sort_copy (y.begin(), y.end(), y_sort.begin(), y_sort.end());

  int nx = x.size();
  std::vector<int> out;

  for (int i = 0; i < nx; ++i) {
    std::vector<int>::iterator found =
      lower_bound(y_sort.begin(), y_sort.end(), x[i]);
    if (found != y_sort.end()) {
      out.push_back(i + 1);
    }
  }
  return out;
}

/*** R
set.seed(123)
library(microbenchmark)
x <- sample(1:100)
y <- sample(1:10000, 1000)
identical( sugar_in(y, x), which(y %in% x) )
identical( which_in(y, x), which(y %in% x) )
identical( which_in2(y, x), which(y %in% x) )
microbenchmark(
  sugar_in(y, x),
  which_in(y, x),
  which_in2(y, x),
  which(y %in% x)
)
*/

Calling sourceCpp on this gives me, from the benchmark,

Unit: microseconds
            expr    min      lq  median      uq    max neval
  sugar_in(y, x)  7.590 10.0795 11.4825 14.3630 32.753   100
  which_in(y, x) 40.757 42.4460 43.4400 46.8240 63.690   100
 which_in2(y, x) 14.325 15.2365 16.7005 17.2620 30.580   100
 which(y %in% x) 17.070 21.6145 23.7070 29.0105 78.009   100

The sort of lhs makes the return values order wrong. Try using an unordered set copied from rhs, and transform lhs without duplicating it first? Faster by a log factor too. For extra credit produce the output lazily (mayhap using `boost`). Oh and note that any iterable range is a valid lhs and rhs: dunno if R makes that useless. — Yakk - Adam Nevraumont, Jan 26 '14 at 04:57
What about Armadillo's `find()` as per [this Rcpp Gallery post](http://gallery.rcpp.org/articles/armadillo-subsetting/) ? — Dirk Eddelbuettel, Jan 26 '14 at 17:12
FWIW, Rcpp has an `in` sugar function. https://github.com/RcppCore/Rcpp/blob/master/inst/include/Rcpp/sugar/functions/unique.h#L77 Under the hood it will use something similar to what Kevin shows here. — Romain Francois, Jan 26 '14 at 21:27
You can do a bit better using a binary search. See my answer below. — hadley, Jan 27 '14 at 14:12

hadley · Answer 2 · 2014-01-27T16:58:24.277

For this set of inputs we can eke out a little more performance by using an approach that technically has a higher algorithmic complexity (O(ln n) vs O(1) for each lookup) but has lower constants: a binary search.

// [[Rcpp::plugins("cpp11")]]
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<int> which_in(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  std::unordered_set<int> z(y.begin(), y.end());
  std::vector<int> output;
  output.reserve(nx);
  for (int i=0; i < nx; ++i) {
    if (z.find( x[i] ) != z.end() ) {
      output.push_back(i+1);
    }
  }
  return output;
}

// [[Rcpp::export]]
std::vector<int> which_in2(IntegerVector x, IntegerVector y) {
  std::vector<int> y_sort(y.size());
  std::partial_sort_copy (y.begin(), y.end(), y_sort.begin(), y_sort.end());

  int nx = x.size();
  std::vector<int> out;

  for (int i = 0; i < nx; ++i) {
    std::vector<int>::iterator found =
      lower_bound(y_sort.begin(), y_sort.end(), x[i]);
    if (found != y_sort.end()) {
      out.push_back(i + 1);
    }
  }
  return out;
}

/*** R
set.seed(123)
library(microbenchmark)
x <- sample(1:100)
y <- sample(1:10000, 1000)
identical( which_in(y, x), which(y %in% x) )
identical( which_in2(y, x), which(y %in% x) )
microbenchmark(
  which_in(y, x),
  which_in2(y, x),
  which(y %in% x)
)
*/

On my computer that yields

Unit: microseconds
            expr  min   lq median   uq  max neval
  which_in(y, x) 39.3 41.0   42.7 44.0 81.5   100
 which_in2(y, x) 12.8 13.6   14.4 15.0 23.8   100
 which(y %in% x) 16.8 20.2   21.0 21.9 31.1   100

so about 30% better than base R.

Do you mean to say that `which_in2` has complexity O(n log n)? — Matthew Lundberg, Jan 27 '14 at 14:23
@MatthewLundberg oops, I got confused. I meant O(1), referring to individual lookups. — hadley, Jan 27 '14 at 16:59

A C++ version of the %in% operator in R

2 Answers2

Linked