The way to do this efficiently, in a way that scales, is to use PDL::VectorValued::Utils, with two ndarrays (the "haystack" being an ndarray, not a Perl array of ndarrays). The little function vv_in is shown on its own, rather than as pasted into the perldl CLI, so that it is easier to copy from this answer:
sub vv_in {
  require PDL::VectorValued::Utils;
  my ($needle, $haystack) = @_;
  die "needle must have 1 dim less than haystack"
    if $needle->ndims != $haystack->ndims - 1;
  my $ign = $needle->dummy(1)->zeroes;
  PDL::_vv_intersect_int($needle->dummy(1), $haystack, $ign, my $nc=PDL->null);
  $nc;
}
pdl> p $titi = pdl(1,2,3)
[1 2 3]
pdl> p $toto = pdl([1,2,3], [4,5,6])
[
 [1 2 3]
 [4 5 6]
]
pdl> p $notin = pdl(7,8,9)
[7 8 9]
pdl> p vv_in($titi, $toto)
[1]
pdl> p vv_in($notin, $toto)
[0]
Note that for efficiency, the $haystack is required to be sorted already (use qsortvec). The dummy "inflates" the $needle to be a vector-set with one vector, then vv_intersect returns two ndarrays:
- either the intersecting vector-set (which would always be a single vector here), or a set of zeroes (probably a shortcoming of the routine, it should instead be vectorlength,0 - an empty ndarray)
- the quantity of vectors found (here, either 0 or 1)
The "internal" (_vv_intersect_int) version is used because, as of PDL::VectorValued 1.0.15, the public vv_intersect has some wrapping Perl code that does not permit broadcasting (an issue has been filed).
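If the vectors start out as a Perl array of ndarrays, they first need to become a single sorted haystack. A minimal sketch, assuming every vector has the same length; the @vectors data is made up for illustration, and cat and qsortvec are the standard PDL functions:

```perl
use strict;
use warnings;
use PDL;

# Hypothetical input: a Perl array of equal-length 1-D ndarrays, in no particular order.
my @vectors = (pdl(4,5,6), pdl(1,2,3));

# cat stacks the vectors along a new last dimension, giving a (3,2) ndarray;
# qsortvec then sorts the vectors lexicographically, as the search expects.
my $haystack = cat(@vectors)->qsortvec;

print $haystack;
```

The resulting $haystack can then be passed as the second argument of vv_in. Sorting once up front and reusing the result avoids repeating the sort for every lookup.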
Note vv_in will "broadcast" (formerly known, confusingly, as "threading") over multiple sets of input-vectors and input-haystacks. This could be used to search for several vectors:
sub vv_in_multi {
  my ($needles, $haystack) = @_;
  die "needles must have same number of dims as haystack"
    if $needles->ndims != $haystack->ndims;
  vv_in($needles, $haystack->dummy(-1));
}
pdl> p vv_in_multi(pdl($titi,$notin), $toto)
[1 0]