1

I am looking for the most computationally efficient way to filter out arrays from one array, based on segment overlap in a second array, and that array has a different segmentation type.

This is my first array

iob = np.array(
    [0, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 2, 0, 0, 1, 2, 1, 2, 2, 0]
)

The number 1 is the start of each segment, the number 2 is the rest of the segment, and the number 0 indicates no segment. So for this array, the segments are 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, and 1, 2, 2.

This is my second array

output = np.array(
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
)

The segments are only defined by the number 1, and 0 indicates no segment. So the segments for this array are 1, 1, 1, 1, 1, 1, and 1, 1, 1.

In the first array, I want filter out segments whose contents to not overlap by at least 50% by a segment in the 2nd array. In other words, if at least half of the contents in a segment in the first array overlaps with a segment in the 2nd array, I want to keep that segment.

So this is the desired result

array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0])

I am looking for the most computationally efficient method to calculate this solution.

Current solution

I can get the segment indices using the technique described here https://stackoverflow.com/a/71297663/3259896

output_zero = np.concatenate((output, [0]))
zero_output = np.concatenate(([0], output))
iob_zero = np.concatenate((iob, [0]))

iob_starts = np.where(iob_zero == 1)[0]
iob_ends = np.where((iob_zero[:-1] != 0) & (iob_zero[1:] != 2))[0]
iob_pairs = np.column_stack((iob_starts, iob_ends))

output_starts = np.where((zero_output[:-1] == 0) & (zero_output[1:] == 1))[0]
output_ends = np.where((output_zero[:-1] == 1) & (output_zero[1:] == 0))[0]
output_pairs = np.column_stack((output_starts, output_ends))

Next, I directly compare all possible combination of segments to see which ones have at least a 50% overlap, and only keep those segments.

valid_pairs = []
for o_p in output_pairs:
    for i_p in iob_pairs:
        overlap = 1 + min(o_p[1], i_p[1]) - max(o_p[0], i_p[0])
        if overlap > np.diff(i_p)[0]/2:
            valid_pairs.append(i_p)
valid_pairs = np.array(valid_pairs)

Finally, I used the filtered indices to create the desired array

final = np.zeros_like(output)
for block in valid_pairs:
   final[block[0]:block[1]+1] = 1
final
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0])

I suspect that this may not be the most computationally efficient solution. It uses quite a few lines of code, and also used a nested for loop to do all possible comparisons, and uses another loop to create the desired array. I don't have a mastery on use of all of numpy's functions, and I am wondering if there is a more computationally efficient way to calculate this.

SantoshGupta7
  • 5,607
  • 14
  • 58
  • 116

1 Answers1

1

Here's a 4-liner:

output = np.concatenate((output, [0]))
iob = np.concatenate((iob, [0]))
idx = np.dstack(np.where(iob == 1) + tuple(np.array(np.where(np.diff(iob) < 0)) + 1)).flatten()
desired_array = np.concatenate([np.full(len(x), 1 if ((np.sum(x==2) >= np.sum(x==1)) and x.sum()>0) else 0) for x in np.split(np.where(iob>0, 1, 0) + output, idx)])[0:-1]

Output:

>>> desired_array
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0])

And here is a more verbose version


output_0 = np.concatenate((output, [0]))
iob_0 = np.concatenate((iob, [0]))
overlap = np.where(iob_0>0, 1, 0) + output_0

iob_segment_starts = np.where(iob_0 == 1)
iob_segment_ends = tuple(np.array(np.where(np.diff(iob_0) < 0)) + 1)
iob_segment_indxs = np.dstack(iob_segment_starts + iob_segment_ends).flatten()
overlap_segments = np.split(overlap, iob_segment_indxs)
filtered_segment_params = (
    (
        len(x), 
        np.sum(x == 2) >= np.sum(x == 1) and x.sum() > 0,
    )
    for x in overlap_segments
)
filtered_segment_pieces = [
    np.full(length, value, dtype=int)
    for length, value in filtered_segment_params
]
filtered_array = np.concatenate(filtered_segment_pieces)[:-1]
filtered_array

Pytorch Version

import torch
output = torch.tensor(
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
)
iob = torch.tensor(
    [0, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 2, 0, 0, 1, 2, 1, 2, 2, 1]
)
output_0 = torch.cat((output, torch.tensor([0])))
iob_0 = torch.cat((iob, torch.tensor([0])))
overlap = torch.where(iob_0>0, 1, 0) + output_0

iob_segment_starts = torch.where(iob_0 == 1)[0]
iob_segment_ends = torch.where(torch.diff(iob_0) < 0)[0] + 1
iob_segment_cuts = torch.cat([
    iob_segment_starts, 
    iob_segment_ends,
    torch.tensor([0, overlap.shape[0]])
]).sort()[0]
iob_segment_sizes = torch.diff(iob_segment_cuts)
overlap_segments = torch.split(overlap, iob_segment_sizes.tolist(), dim=0)

filtered_segment_params = (
    (
        len(x), 
        torch.sum(x == 2) >= torch.sum(x == 1) and x.sum() > 0,
    )
    for x in overlap_segments
)
filtered_segment_pieces = [
    torch.full((1, length), value, dtype=int).flatten(0)
    for length, value in filtered_segment_params
]
filtered_array = torch.cat(filtered_segment_pieces)[:-1]
filtered_array
SantoshGupta7
  • 5,607
  • 14
  • 58
  • 116
  • Solution doesn't quite work for all cases, but I think `desired_array = np.concatenate([np.full(len(x), 1 if ((np.sum(x==3) >= np.sum(x==2)) and x.sum()>0) else 0) for x in np.split(np.where(iob>0, 2, 0) + output, idx)])` covers it. What do you think? – SantoshGupta7 Mar 20 '22 at 05:57
  • Ah yes, I see. Yes, that looks pretty good to me! –  Mar 20 '22 at 14:14
  • 1
    one more update, it seems to miss edge cases where there are segments at the very end. Using `output = np.concatenate((output, [0]))` and `iob = np.concatenate((iob, [0]))' and `desired_array[0:-1]` seems to fix this. Updating your answer to include these. Still looking for other edge cases. – SantoshGupta7 Mar 20 '22 at 22:55