I am looking for the most computationally efficient way to filter segments out of one array based on how much they overlap with segments in a second array, where the second array uses a different segmentation scheme.
This is my first array
iob = np.array(
[0, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 2, 0, 0, 1, 2, 1, 2, 2, 0]
)
The number 1 is the start of each segment, the number 2 is the rest of the segment, and the number 0 indicates no segment. So for this array, the segments are [1, 2, 2, 2, 2, 2, 2, 2], [1, 2, 2, 2, 2], [1, 2], and [1, 2, 2].
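For clarity, these segments correspond to the following slices of iob (a quick check, assuming iob is defined as above):
iob[1:9]    # array([1, 2, 2, 2, 2, 2, 2, 2])
iob[11:16]  # array([1, 2, 2, 2, 2])
iob[18:20]  # array([1, 2])
iob[20:23]  # array([1, 2, 2])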
This is my second array
output = np.array(
[0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
)
The segments are only defined by the number 1, and 0 indicates no segment. So the segments for this array are [1, 1, 1, 1], [1, 1], and [1, 1, 1].
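These sit at the following slices of output (again assuming the array above):
output[1:5]    # array([1, 1, 1, 1])
output[11:13]  # array([1, 1])
output[18:21]  # array([1, 1, 1])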
In the first array, I want to filter out segments whose contents do not overlap by at least 50% with a segment in the 2nd array. In other words, if at least half of the contents of a segment in the first array overlap with a segment in the 2nd array, I want to keep that segment.
So this is the desired result
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0])
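To spell out how the 50% rule plays out here (a hand tally from the arrays above; in this example each iob segment touches at most one output segment, so counting the positions covered by output gives the per-segment overlap):
for seg in (slice(1, 9), slice(11, 16), slice(18, 20), slice(20, 23)):
    # fraction of the iob segment's positions that fall inside an output segment
    print(seg, output[seg].sum() / (seg.stop - seg.start))
# slice(1, 9, None) 0.5       -> keep (4 of 8 positions)
# slice(11, 16, None) 0.4     -> drop (2 of 5 positions)
# slice(18, 20, None) 1.0     -> keep (2 of 2 positions)
# slice(20, 23, None) ~0.33   -> drop (1 of 3 positions)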
I am looking for the most computationally efficient method to compute this result.
Current solution
I can get the segment indices using the technique described here https://stackoverflow.com/a/71297663/3259896
# zero-padded copies so segments touching the array edges are detected
output_zero = np.concatenate((output, [0]))
zero_output = np.concatenate(([0], output))
iob_zero = np.concatenate((iob, [0]))

# start/end indices (inclusive) of every segment in iob
iob_starts = np.where(iob_zero == 1)[0]
iob_ends = np.where((iob_zero[:-1] != 0) & (iob_zero[1:] != 2))[0]
iob_pairs = np.column_stack((iob_starts, iob_ends))

# start/end indices (inclusive) of every segment in output
output_starts = np.where((zero_output[:-1] == 0) & (zero_output[1:] == 1))[0]
output_ends = np.where((output_zero[:-1] == 1) & (output_zero[1:] == 0))[0]
output_pairs = np.column_stack((output_starts, output_ends))
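For the example arrays, these come out as (start, end) index pairs with inclusive ends; the values below are what I get:
iob_pairs
# array([[ 1,  8],
#        [11, 15],
#        [18, 19],
#        [20, 22]])
output_pairs
# array([[ 1,  4],
#        [11, 12],
#        [18, 20]])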
Next, I directly compare all possible combinations of segments to see which ones have at least a 50% overlap, and I keep only those segments.
valid_pairs = []
for o_p in output_pairs:
    for i_p in iob_pairs:
        # length of the overlap between the two (inclusive) index ranges
        overlap = 1 + min(o_p[1], i_p[1]) - max(o_p[0], i_p[0])
        # keep the iob segment if at least half of it is covered
        if overlap > np.diff(i_p)[0]/2:
            valid_pairs.append(i_p)
valid_pairs = np.array(valid_pairs)
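With the example arrays, only two iob segments survive the test:
valid_pairs
# array([[ 1,  8],
#        [18, 19]])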
Finally, I use the filtered indices to create the desired array
final = np.zeros_like(output)
for block in valid_pairs:
    final[block[0]:block[1]+1] = 1
final
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0])
I suspect that this may not be the most computationally efficient solution. It uses quite a few lines of code, a nested for loop to do all of the pairwise comparisons, and another loop to build the final array. I don't have a mastery of all of numpy's functions, and I am wondering if there is a more computationally efficient way to calculate this.