17

Is there any function in C++ equivalent to %in% operator in R? Consider the command below in R:

which(y %in% x)

I tried to find something equivalent in C++ (specifically in Armadillo) and I couldn't find anything. I then wrote my own function which is very slow compared to the R command above.

Here is what I wrote:

#include <RcppArmadillo.h>
// [[Rcpp::depends("RcppArmadillo")]]

// [[Rcpp::export]]
arma::uvec myInOperator(arma::vec myBigVec, arma::vec mySmallVec ){
 arma::uvec rslt = find(myBigVec == mySmallVec[0]);
 for (int i = 1; i < mySmallVec.size(); i++){
   arma::uvec rslt_tmp = find(myBigVec == mySmallVec[i]);
   rslt = arma::unique(join_cols( rslt, rslt_tmp ));
 }
 return rslt;
}

Now after sourcing in the code above, we have:

x <- 1:4
y <- 1:10
res <- benchmark(myInOperator(y, x), which(y %in% x), columns = c("test",
      "replications", "elapsed", "relative", "user.self", "sys.self"), 
       order = "relative")

And here are the results:

                 test replications elapsed relative user.self sys.self
 2    which(y %in% x)          100   0.001        1     0.001        0
 1 myInOperator(y, x)          100   0.002        2     0.001        0

Could anyone guide me either on finding a C++ code corresponding to which(y %in% x) or on making my code more efficient? The elapsed time is already very small for both functions. I guess what I meant by efficiency is more from programming perspective and on whether the way I thought about the problem and the commands I used are efficient.

I appreciate your help.

Neels
  • 2,547
  • 6
  • 33
  • 40
Sam
  • 4,357
  • 6
  • 36
  • 60
  • 1
    It will be hard to beat `which(y %in% x)`, since `%in%` (calling `match`) and `which` are already `.Internal` functions, and hence implemented in C and likely optimized. It is possible that one could improve performance by avoiding the temporary logical vector generated by `%in%` though. – Kevin Ushey Jan 26 '14 at 04:09
  • 1
    It would probably help a little if you pass stuff by reference instead of making copies, `(const arma::vec & myBigVec, const arma::vec & mySmallVec )` – OneOfOne Jan 26 '14 at 04:20
  • 2
    I dont't know enough about RCpp (and nothing of Armidillo) so I can't answer the question. However, if I were doing this in C++, I would look to `std::set_intersection`. – Matthew Lundberg Jan 26 '14 at 04:21
  • @KevinUshey, would you happen to know a way to use %in% in C++ by using Rcpp? Is there a way to call which and %in%. While I can easily call other R functions, I was not successful to use which and I have no idea how to use %in%. I really appreciate your help. – Sam Jan 26 '14 at 04:21
  • 1
    For a question like this (how to duplicate/imitate feature X of language Y in language Z) it's *extremely* helpful if you describe exactly what the feature does--or at least the subset of its capabilities you care about. – Jerry Coffin Jan 26 '14 at 04:36
  • 1
    @JerryCoffin `y %in% x` returns a logical vector the same length as `y`, value `TRUE` if the element of `y` is in `x`, else `FALSE`. `which` returns the numeric indices (1-based) of the `TRUE` values in a vector. – Matthew Lundberg Jan 26 '14 at 04:39
  • @matthewlundberg what is a 'logical vector'? How is it distinct from a `std::vector`? Is it lazy? Do you want `%in%` and `which`, or just `which(a %in% b)`? – Yakk - Adam Nevraumont Jan 26 '14 at 05:01
  • 2
    @Yakk The question includes the R tag. These of course are constructs of R. A "logical vector" in R only relates to a `std::vector` when RCpp is involved. Perhaps this question should not have the C++ tag applied, as R and RCpp are a bit different. But the message here is, if a question has tags about a language with which you are unfamiliar, you might refrain from answering or commenting. That's what I do with answers with tags in languages that I do not know. – Matthew Lundberg Jan 26 '14 at 05:06
  • @matthewlundberg except the OP wants C++ code: I can produce that code if the specs are clear. The set of people competent at two random languages is going to be much smaller: if, as the below answer (now deleted) implies, a pure C++ algorithm on `std::vector` suffices, the `R` tag becomes redundant! – Yakk - Adam Nevraumont Jan 26 '14 at 05:12
  • @Yakk But it's not pure! It relies on RCpp and it must be callable from the R interpreter. If you can do that, great. If not, well... – Matthew Lundberg Jan 26 '14 at 05:13
  • For the purposes of this question, a pure C++ solution could easily be made callable using `Rcpp`, but it would not necessarily be the fastest due to data copies forced in going from R containers to C++ containers. There are ways around this, but it is not the default behavior. – Kevin Ushey Jan 26 '14 at 05:34
  • @MatthewLundberg: We tend to spell it Rcpp (lower-case c) and have done since 2008. – Dirk Eddelbuettel Jan 26 '14 at 17:07
  • @DirkEddelbuettel I will mind that in the future. Thanks. – Matthew Lundberg Jan 26 '14 at 18:23

2 Answers2

15

EDIT: Thanks to @MatthewLundberg and @Yakk for catching my silly errors.

If what you really want is just faster matching, you should check out Simon Urbanek's fastmatch package. However, Rcpp does in fact have a sugar in function which can be used here. in uses some of the ideas from the fastmatch package and incorporates them into Rcpp. I also compare @hadley's solution here.

// [[Rcpp::plugins("cpp11")]]
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<int> sugar_in(IntegerVector x, IntegerVector y) {
  LogicalVector ind = in(x, y);
  int n = ind.size();
  std::vector<int> output;
  output.reserve(n);
  for (int i=0; i < n; ++i) {
    if (ind[i]) output.push_back(i+1);
  }
  return output;
}

// [[Rcpp::export]]
std::vector<int> which_in(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  std::unordered_set<int> z(y.begin(), y.end());
  std::vector<int> output;
  output.reserve(nx);
  for (int i=0; i < nx; ++i) {
    if (z.find( x[i] ) != z.end() ) {
      output.push_back(i+1);
    }
  }
  return output;
}


// [[Rcpp::export]]
std::vector<int> which_in2(IntegerVector x, IntegerVector y) {
  std::vector<int> y_sort(y.size());
  std::partial_sort_copy (y.begin(), y.end(), y_sort.begin(), y_sort.end());

  int nx = x.size();
  std::vector<int> out;

  for (int i = 0; i < nx; ++i) {
    std::vector<int>::iterator found =
      lower_bound(y_sort.begin(), y_sort.end(), x[i]);
    if (found != y_sort.end()) {
      out.push_back(i + 1);
    }
  }
  return out;
}

/*** R
set.seed(123)
library(microbenchmark)
x <- sample(1:100)
y <- sample(1:10000, 1000)
identical( sugar_in(y, x), which(y %in% x) )
identical( which_in(y, x), which(y %in% x) )
identical( which_in2(y, x), which(y %in% x) )
microbenchmark(
  sugar_in(y, x),
  which_in(y, x),
  which_in2(y, x),
  which(y %in% x)
)
*/

Calling sourceCpp on this gives me, from the benchmark,

Unit: microseconds
            expr    min      lq  median      uq    max neval
  sugar_in(y, x)  7.590 10.0795 11.4825 14.3630 32.753   100
  which_in(y, x) 40.757 42.4460 43.4400 46.8240 63.690   100
 which_in2(y, x) 14.325 15.2365 16.7005 17.2620 30.580   100
 which(y %in% x) 17.070 21.6145 23.7070 29.0105 78.009   100
Kevin Ushey
  • 20,530
  • 5
  • 56
  • 88
  • 1
    Must you copy the vectors when you call `which_in`? – Matthew Lundberg Jan 26 '14 at 04:48
  • 1
    The sort of lhs makes the return values order wrong. Try using an unordered set copied from rhs, and transform lhs without duplicating it first? Faster by a log factor too. For extra credit produce the output lazily (mayhap using `boost`). Oh and note that any iterable range is a valid lhs and rhs: dunno if R makes that useless. – Yakk - Adam Nevraumont Jan 26 '14 at 04:57
  • 1
    What about Armadillo's `find()` as per [this Rcpp Gallery post](http://gallery.rcpp.org/articles/armadillo-subsetting/) ? – Dirk Eddelbuettel Jan 26 '14 at 17:12
  • 3
    FWIW, Rcpp has an `in` sugar function. https://github.com/RcppCore/Rcpp/blob/master/inst/include/Rcpp/sugar/functions/unique.h#L77 Under the hood it will use something similar to what Kevin shows here. – Romain Francois Jan 26 '14 at 21:27
  • You can do a bit better using a binary search. See my answer below. – hadley Jan 27 '14 at 14:12
  • 1
    Looks like Rcpp's sugar can still win the battle :) – Kevin Ushey Jan 27 '14 at 18:18
9

For this set of inputs we can eke out a little more performance by using an approach that technically has a higher algorithmic complexity (O(ln n) vs O(1) for each lookup) but has lower constants: a binary search.

// [[Rcpp::plugins("cpp11")]]
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<int> which_in(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  std::unordered_set<int> z(y.begin(), y.end());
  std::vector<int> output;
  output.reserve(nx);
  for (int i=0; i < nx; ++i) {
    if (z.find( x[i] ) != z.end() ) {
      output.push_back(i+1);
    }
  }
  return output;
}

// [[Rcpp::export]]
std::vector<int> which_in2(IntegerVector x, IntegerVector y) {
  std::vector<int> y_sort(y.size());
  std::partial_sort_copy (y.begin(), y.end(), y_sort.begin(), y_sort.end());

  int nx = x.size();
  std::vector<int> out;

  for (int i = 0; i < nx; ++i) {
    std::vector<int>::iterator found =
      lower_bound(y_sort.begin(), y_sort.end(), x[i]);
    if (found != y_sort.end()) {
      out.push_back(i + 1);
    }
  }
  return out;
}

/*** R
set.seed(123)
library(microbenchmark)
x <- sample(1:100)
y <- sample(1:10000, 1000)
identical( which_in(y, x), which(y %in% x) )
identical( which_in2(y, x), which(y %in% x) )
microbenchmark(
  which_in(y, x),
  which_in2(y, x),
  which(y %in% x)
)
*/

On my computer that yields

Unit: microseconds
            expr  min   lq median   uq  max neval
  which_in(y, x) 39.3 41.0   42.7 44.0 81.5   100
 which_in2(y, x) 12.8 13.6   14.4 15.0 23.8   100
 which(y %in% x) 16.8 20.2   21.0 21.9 31.1   100

so about 30% better than base R.

hadley
  • 102,019
  • 32
  • 183
  • 245