Extracting boundaries of dense regions of 1s in a huge list of 1s and 0s

Question

I'm not sure how to word my problem. But here it is...

I have a huge list of 1s and 0s [Total length = 53820].

Example of what the list looks like: [0,1,1,1,1,1,1,1,1,0,0,0,1,1,0,0,0,0,0,0,1,1...........]

The visualization is given below.

x-axis: index of the element (from 0 to 53820)

y-axis: value at that index (i.e. 1 or 0)

Input Plot-->

The plot clearly shows 3 dense areas where 1s occur more frequently. I have drawn on top of the plot to mark the visually dense areas (the ugly black lines). I want to find the x-axis index numbers of the start and end boundaries of each dense area.

I extract the chunks of 1s and save the start index of each chunk in a new list named 'starts'. The function that does this returns a list of dictionaries like this:

{'start': 0, 'count': 15, 'end': 16}, {'start': 2138, 'count': 3, 'end': 2142}, {'start': 2142, 'count': 3, 'end': 2146}, {'start': 2461, 'count': 1, 'end': 2463}, {'start': 2479, 'count': 45, 'end': 2525}, {'start': 2540, 'count': 2, 'end': 2543}

Then, after setting a threshold, I compare adjacent elements of 'starts'. This returns the apparent boundaries of the dense areas.

THR = 2000
results = []
cues = {'start': 0, 'stop': 0}
result, starts = densest(preds)  # Function that returns the list of dictionaries shown above
cuestart = False  # Flag to check if looking for start or stop of dense boundary
for i, j in zip(range(0, len(starts)), range(1, len(starts))):
    now = starts[i]
    nextf = starts[j]

    if nextf - now > THR:
        if not cuestart:
            cues['start'] = nextf
            cues['stop'] = nextf
            cuestart = True
        else:  # cuestart is already set
            cues['stop'] = now
            cuestart = False
            results.append(cues)
            cues = {'start': 0, 'stop': 0}

print('\n', results)

The output and the corresponding plot look like this.

[{'start': 2138, 'stop': 6654}, {'start': 23785, 'stop': 31553}, {'start': 38765, 'stop': 38765}]

Output Plot -->

This method fails to find the last dense region seen in the plot, and it also fails on other, similar data.

P.S. I have also tried 'KDE' on this data and 'distplot' using seaborn, but those give me plots directly and I am unable to extract the boundary values from them. The link for that question is here (Getting dense region boundary values from output of KDE plot)

So do the ones have to be consecutive? Or it might just be a region along the x-axis with a higher concentration of ones? — yatu, May 14 '19 at 12:36
The ones aren't exactly in a pattern. But they do appear consecutively, or you could say in chunks. The size of those chunks isn't fixed. @yatu — Darpan, May 14 '19 at 12:40
Welcome to stackoverflow! Interesting question. I took the opportunity to help out by adding images directly. — vidstige, May 14 '19 at 12:48
I don't feel like writing an answer, but ... In the hypothesis that your vector of `0` and `1` is a Numpy array, try `plt.plot((y-0.5).cumsum())` --- as you can see your intervals start (approximately) at a global minimum and end at a global maximum... in `scipy.optimize` you possibly could find a function to identify those extrema. — gboffi, May 14 '19 at 14:06
Forgot to mention, you'll find plenty of local extrema in the cumulative sum, probably you want to smooth (windowing, running mean etc) the sum to remove them — gboffi, May 14 '19 at 14:13
@gboffi Thanks for the suggestion! I don't think I fully understand why you used `mylist-0.5` as input to the cumsum function. What is 0.5 used for? — Darpan, May 15 '19 at 06:51
I have used the 'cumsum()' function on my list and obtained a plot of a line with negative slope and irregularities near the dense regions. But the problem of getting the boundaries from that plot persists. I faced the same issue with KDE: I have the plot but am unable to extract the boundaries from it. — Darpan, May 15 '19 at 06:51
I have suggested `(y-0.5).cumsum()` because `1`s turn to `+0.5` and `0`s turn to `-0.5`, so that you have a change (an inversion) of the mean slope of the cumulative sum when you have a change of density of ones. To get the main features (your regions' boundaries) you may have to use a low-pass filter, possibly a simple running mean. — gboffi, May 15 '19 at 07:45
@gboffi Oh, okay I get it! And how will the mean of the `cumsum()` function output give me 6 values (3 starts and 3 stops)? I'm not very clear on how to implement the low-pass filter. It would be nice if you point me to some documentation. Thanks! — Darpan, May 15 '19 at 07:56

score 1 · Answer 1 · answered May 15 '19 at 08:43

OK, you need an answer...

First, the imports (we are going to use LineCollections)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

Next, definition of constants

N = 1001
np.random.seed(20190515)

and generation of fake data

x = np.linspace(0, 1, N)
# probability of a 1: 0.95 for 0.4 <= x < 0.7, 0.02 elsewhere
prob = np.where(x < 0.4, 0.02, np.where(x < 0.7, 0.95, 0.02))
y = np.where(np.random.rand(N) < prob, 1, 0)

Here we create the line collection; sticks is an N×2×2 array containing the start and end points of our vertical lines

sticks = np.array(list(zip(zip(x, np.zeros(N)), zip(x, y))))                                  
lc = LineCollection(sticks)

Finally, the cumulative sum, here normalized to have the same scale as the vertical lines

cs = (y - 0.5).cumsum()  # each 1 adds +0.5, each 0 adds -0.5
csmin, csmax = cs.min(), cs.max()
cs = (cs - csmin) / (csmax - csmin)  # normalized to the range 0 to 1
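
To see why 0.5 is subtracted before summing, here is a minimal sketch on a tiny hand-made sequence (the array below is illustrative, not taken from the question's data). The sum falls while 0s dominate and climbs while 1s dominate, so in this single-region example the dense stretch begins right after the minimum and ends at the maximum.

```python
import numpy as np

# Illustrative data: sparse, then a dense stretch at indices 6..12, then sparse.
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Each 1 contributes +0.5 and each 0 contributes -0.5.
cs = (y - 0.5).cumsum()

lo = int(np.argmin(cs))  # last index before the sum starts climbing
hi = int(np.argmax(cs))  # last index of the dense stretch

print(cs)
print("dense stretch:", lo + 1, "to", hi)  # dense stretch: 6 to 12
```

With several dense regions the same reasoning applies to local minima and maxima rather than the global ones.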

Now we just have to plot our results

f, a = plt.subplots()                                                            
a.add_collection(lc)                                                             
a.plot(x, cs, color='red')                                                       
a.grid()                                                                         
a.autoscale()

Here is the plot

and here a detail of the stop zone.

You can smooth the cs data and use something from scipy.optimize to spot the positions of the extrema. Should you have a problem with this last step, please ask another question.
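
For that last step, here is one possible sketch. It does not use scipy.optimize; it swaps in a NumPy-only technique instead. The slope of the running-mean-smoothed cumulative sum equals the running mean of (y - 0.5), so the extrema of the smoothed curve sit where that running mean changes sign. The function name `dense_regions` and the `window` parameter are hypothetical, and the window length is an assumption that has to be tuned to the data (it must not exceed the length of y).

```python
import numpy as np

def dense_regions(y, window=51):
    """Return (start, stop) index pairs where 1s outnumber 0s in a sliding window.

    Hypothetical helper: `window` is an assumed, odd smoothing length.
    """
    y = np.asarray(y, dtype=float)
    kernel = np.ones(window) / window
    # Running mean of (y - 0.5): the slope of the smoothed cumulative sum.
    slope = np.convolve(y - 0.5, kernel, mode="same")
    rising = slope > 0
    # Pad with False so a region touching either end of the data is closed too.
    padded = np.concatenate(([False], rising, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    # Edges alternate: region start, then one past the region stop.
    starts, stops = edges[::2], edges[1::2] - 1
    return list(zip(starts.tolist(), stops.tolist()))

# Same kind of fake data as in the answer above.
np.random.seed(20190515)
N = 1001
x = np.linspace(0, 1, N)
prob = np.where(x < 0.4, 0.02, np.where(x < 0.7, 0.95, 0.02))
y = np.where(np.random.rand(N) < prob, 1, 0)

regions = dense_regions(y, window=51)
print(regions)
```

On the fake data this should report a single region starting near index 400 and ending near index 700. A window that is too short splits one dense region into several pieces, while one that is too long blurs the boundaries and can miss short regions, so for a list of 53820 elements a considerably larger window is probably needed.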

Thank you for taking time out for this! I will try this out on my data and see what results I get! — Darpan, May 16 '19 at 06:30
