I have a process that iteratively and randomly prunes a huge vector of integers, and I want to find which elements are removed at each iteration. The vector contains many repeated values, so ismember() and setdiff() haven't helped much.

As an illustration, with X = [1,10,8,5,10,3,5,2]:

step 0: X = 1,10,8,5,10,3,5,2
step 1: X = 1,10,8,10,3,5,2 (5 is removed)
step 2: X = 1,10,8,3,2 (10 and 5 are removed)
step 3: X = 10,8,3,2 (1 is removed)
step 4: X = 2 (10, 8 and 3 are removed)
step 5: X = [] (2 is finally removed)

I want to find the elements removed at each step (i.e. 5, then 10 and 5, and so on). I could probably come up with an overly complicated solution using hist(X, unique(X)) between steps, but I assume there is a much more elegant (and cheaper!) solution in MATLAB.

Grasshoper
  • How large is `X` and how many unique elements does it have, typically? – Luis Mendo Mar 12 '19 at 18:28
  • Wouldn't it be simpler to have this process (a function I presume) also return the removed element? – Cris Luengo Mar 12 '19 at 18:36
  • Also, are the values always positive integers? – Luis Mendo Mar 12 '19 at 18:41
  • @LuisMendo X typically contains hundreds of elements, at most thousands, and yes, the values are always positive. – Grasshoper Mar 12 '19 at 21:59
  • @CrisLuengo Sure it would. The issue is that the process generating the X values takes a very long time to compute, and I am processing the results right now. – Grasshoper Mar 12 '19 at 21:59
  • @Grasshoper I don't understand your last comment about the process generating the X values. Can you clarify? – beaker Mar 12 '19 at 22:01
  • And if `X` only contains a few thousand elements, wouldn't `find` be quick enough? – beaker Mar 12 '19 at 22:17
  • @beaker The process that generates the X values prunes the elements of a huge structure according to some strategy and constraints. I'm not sure how to use find() in this context. Are you suggesting a loop that counts the frequency of each element? That amounts to building a histogram. – Grasshoper Mar 13 '19 at 08:09
  • @Grasshoper No, I wasn't suggesting a histogram approach. Given two arrays `X` and `Y`, find the first element where `X ~= Y`. This is the first removed element. Now apply to the remaining arrays *after* the mismatch. The solution is *O(kn)*, where `k` is the number of removed elements and `n` is the length of `X`. However that's probably slower and more coding than the `histc(unique)` solution suggested [here](https://stackoverflow.com/questions/51829635/finding-multiset-difference-between-two-arrays). – beaker Mar 13 '19 at 16:26
  • @beaker yes, got it. – Grasshoper Mar 14 '19 at 08:46
  • @Grasshoper Marking as duplicate then – Luis Mendo Mar 14 '19 at 08:47

2 Answers

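My idea is to recover the input from the output: subtract the zero-padded output from the input, take the first nonzero difference as the index of a removed element, reinsert that element into the output at that index, and repeat until the input is restored. This relies on the remaining elements keeping their relative order and on all values being nonzero.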

% Input.
X = [1, 10, 8, 5, 10, 3, 5, 2];

% Remove indices for the given example.
y = { [4], [4 6], [1], [1 2 3], [1] };

% Simulate removing.
for k = 1:numel(y)

  % Remove elements.
  temp = X;
  temp(y{k}) = [];

  % Determine number of removed elements.
  nRemoved = numel(X) - numel(temp);

  % Find removed elements by recovering input from output.
  recover = temp;
  removed = zeros(1, nRemoved);
  for l = 1:nRemoved
    tempdiff = X - [recover zeros(1, nRemoved - l + 1)];
    idx = find(tempdiff, 1);
    removed(l) = X(idx);
    recover = [recover(1:idx-1) X(idx) recover(idx:end)];
  end

  % Simple, stupid output.
  disp('Input:');
  disp(X);
  disp('');
  disp('Output:');
  disp(temp);
  disp('');
  disp('Removed elements:');
  disp(removed);
  disp('');
  disp('------------------------------');

  % Reset input.
  X = temp;

end

Output for the given example:

Input:
    1   10    8    5   10    3    5    2

Output:
    1   10    8   10    3    5    2

Removed elements:
 5

------------------------------
Input:
    1   10    8   10    3    5    2

Output:
    1   10    8    3    2

Removed elements:
   10    5

------------------------------
Input:
    1   10    8    3    2

Output:
   10    8    3    2

Removed elements:
 1

------------------------------
Input:
   10    8    3    2

Output:
 2

Removed elements:
   10    8    3

------------------------------
Input:
 2

Output:
[](1x0)

Removed elements:
 2

------------------------------

Is that an appropriate solution, or am I missing some (obvious) inefficiencies?
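
One possible inefficiency is that every recovered element triggers a fresh subtraction over the whole vector. As a rough sketch only, under the same assumption that the surviving elements keep their relative order, a single pass with two indices should give the same removed values. The names `Y`, `i`, `j` and `r` are introduced here for illustration:

```matlab
X = [1, 10, 8, 10, 3, 5, 2];   % vector before pruning
Y = [1, 10, 8, 3, 2];          % vector after pruning, same relative order

removed = zeros(1, numel(X) - numel(Y));
j = 1;   % next unmatched position in Y
r = 0;   % number of removed elements found so far
for i = 1:numel(X)
  if j <= numel(Y) && X(i) == Y(j)
    j = j + 1;            % X(i) survived, move on in Y
  else
    r = r + 1;
    removed(r) = X(i);    % X(i) has no counterpart in Y, so it was removed
  end
end

disp(removed);
```

For the second step of the example this reports 10 and 5. If the pruning can also reorder the elements, this scan no longer applies and a count-based comparison is needed instead.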

HansHirse
  1. This approach is memory-intensive. It computes an intermediate matrix of size NxM where N is the number of elements of X and M is the number of unique elements of X, using implicit expansion. This may be feasible or not depending on your typical N and M.

    X = [1,10,8,5,10,3,5,2];
    Y = [8,10,2,1]; % removed 10, 5, 5, 3. Order in Y is arbitrary
    u = unique(X(:).');
    removed = repelem(u, sum(X(:)==u,1)-sum(Y(:)==u,1));
    

    gives

    removed =
         3     5     5    10
    

    For Matlab versions before R2016b, you need bsxfun instead of implicit expansion:

    removed = repelem(u, sum(bsxfun(@eq,X(:),u),1)-sum(bsxfun(@eq,Y(:),u),1));
    
  2. If the values in X are always positive integers, a more efficient approach can be used, employing sparse to compute the number of times each element appears:

    X = [1,10,8,5,10,3,5,2];
    Y = [8,10,2,1]; % removed 10, 5, 5, 3. Order in Y is arbitrary
    m = max(X); % give both count vectors the same length, even if max(Y) < max(X)
    removed = repelem(1:m, full(sparse(1,X,1,1,m) - sparse(1,Y,1,1,m)));
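
To apply the counting idea to every step of the question's example, the snapshots can be kept in a cell array and compared pairwise. The following is only a sketch: the names `steps`, `count` and `removed` are made up for illustration, it assumes positive integer values, and it uses implicit expansion (R2016b or later) so that an empty snapshot is handled without special cases:

```matlab
% Snapshots of X after each pruning step, taken from the question.
steps = { [1,10,8,5,10,3,5,2], [1,10,8,10,3,5,2], [1,10,8,3,2], ...
          [10,8,3,2], [2], [] };

m = max(steps{1});                     % largest value that can occur
count = @(v) sum(v(:) == (1:m), 1);    % 1-by-m vector of occurrence counts

removed = cell(1, numel(steps) - 1);
for k = 1:numel(steps) - 1
  % Difference of counts = how many copies of each value disappeared.
  d = count(steps{k}) - count(steps{k+1});
  removed{k} = repelem(1:m, d);
  disp(removed{k});
end
```

Each `removed{k}` lists the removed values in ascending order, not in the order in which they appeared in X. For large value ranges the sparse-based counts above may be the cheaper choice.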
    
Luis Mendo