
I'm not sure how to word my problem. But here it is...

I have a huge list of 1s and 0s [Total length = 53820].

Example of what the list looks like: [0,1,1,1,1,1,1,1,1,0,0,0,1,1,0,0,0,0,0,0,1,1...........]

The visualization is given below.

x-axis: index of the element (from 0 to 53819)

y-axis: value at that index (i.e. 1 or 0)

Input Plot-->

The plots

The plot clearly shows 3 dense areas where 1s occur more frequently. I have drawn ugly black lines on top of the plot to mark the visually dense areas. I want to find the x-axis indexes of the start and end boundaries of those dense areas.

I tried extracting the chunks of 1s and saving the start index of each chunk in a new list named 'starts'. The function that does this also returns a list of dictionaries like this:

{'start': 0, 'count': 15, 'end': 16}, {'start': 2138, 'count': 3, 'end': 2142}, {'start': 2142, 'count': 3, 'end': 2146}, {'start': 2461, 'count': 1, 'end': 2463}, {'start': 2479, 'count': 45, 'end': 2525}, {'start': 2540, 'count': 2, 'end': 2543}
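The extraction function itself is not shown in the question. As a minimal sketch of how such chunks could be found with NumPy: the name `find_runs` is hypothetical, and `end` is assumed here to be the index just past the last 1 of a run, which may differ from the convention used in the output above.

```python
import numpy as np

def find_runs(bits):
    """Return one dict per run of consecutive 1s (hypothetical helper)."""
    bits = np.asarray(bits, dtype=int)
    # Pad with zeros so runs touching either end of the list are detected.
    padded = np.concatenate(([0], bits, [0]))
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)   # a 0 -> 1 step marks a run start
    ends = np.flatnonzero(change == -1)    # a 1 -> 0 step marks one past the run end
    return [
        {"start": int(s), "count": int(e - s), "end": int(e)}
        for s, e in zip(starts, ends)
    ]

runs = find_runs([0, 1, 1, 1, 0, 0, 1, 1])
starts = [run["start"] for run in runs]  # the list of start indexes
```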

Then, after setting a threshold, I compared adjacent elements of starts. A gap larger than the threshold is treated as an apparent boundary of a dense area.

THR = 2000
results = []
cues = {'start': 0, 'stop': 0}
# densest() returns the list of dictionaries shown above and the list of start indexes
result, starts = densest(preds)
cuestart = False  # Flag to check if looking for start or stop of dense boundary
for i, j in zip(range(0, len(starts)), range(1, len(starts))):
    now = starts[i]
    nextf = starts[j]

    if nextf - now > THR:
        if not cuestart:
            cues['start'] = nextf
            cues['stop'] = nextf
            cuestart = True

        else:  # cuestart is already set
            cues['stop'] = now
            cuestart = False
            results.append(cues)
            cues = {'start': 0, 'stop': 0}

print('\n', results)

The output and the corresponding plot look like this.

[{'start': 2138, 'stop': 6654}, {'start': 23785, 'stop': 31553}, {'start': 38765, 'stop': 38765}]

Output Plot -->

Output plot

This method fails to capture the last dense region, as seen in the plot, and it fails in the same way on other similar data.

P.S. I have also tried 'KDE' on this data and 'distplot' from seaborn, but those give me plots directly and I am unable to extract the boundary values from them. I asked about that in a separate question (Getting dense region boundary values from output of KDE plot).

Jan Doggen
Darpan
  • So do the ones have to be consecutive? Or it might just be a region along the x-axis with a higher concentration of ones? – yatu May 14 '19 at 12:36
  • The ones aren't exactly in a pattern. But they do appear consecutively, or you could say in chunks. The size of those chunks isn't fixed. @yatu – Darpan May 14 '19 at 12:40
  • Welcome to stackoverflow! Interesting question. I took the opportunity to help out by adding images directly. – vidstige May 14 '19 at 12:48
  • I don't feel like writing an answer, but ... In the hypothesis that your vector of `0` and `1` is a Numpy array, try `plt.plot((y-0.5).cumsum())` --- as you can see your intervals start (approximately) at a global minimum and end at a global maximum... in `scipy.optimize` you possibly could find a function to identify those extrema. – gboffi May 14 '19 at 14:06
  • Forgot to mention, you'll find plenty of local extrema in the cumulative sum, probably you want to smooth (windowing, running mean etc) the sum to remove them – gboffi May 14 '19 at 14:13
  • @gboffi Thanks for the suggestion! I don't think I fully understand why you used `mylist-0.5` as input to the cumsum function. What is 0.5 used for? – Darpan May 15 '19 at 06:51
  • I have used the 'cumsum()' function on my list and obtained a plot of a line with negative slope and irregularities near the dense regions. But the problem of getting the boundaries from that plot persists. I have faced the same issue with KDE: I have the plot but am unable to extract the boundaries from it. – Darpan May 15 '19 at 06:51
  • I have suggested `(y-0.5).cumsum()` because `1`s turn to `+0.5` and `0`s turn to `-0.5`, so that you have a change (an inversion) of the mean slope of the cumulative sum when you have a change of density of ones. To get the main features (your regions' boundaries) you may have to use a low-pass filter, possibly a simple running mean. – gboffi May 15 '19 at 07:45
  • @gboffi Oh, okay I get it! And how will the mean of the `cumsum()` function output give me 6 values (3 starts and 3 stops)? I'm not very clear on how to implement the low-pass filter. It would be nice if you point me to some documentation. Thanks! – Darpan May 15 '19 at 07:56

1 Answer


OK, you need an answer...

First, the imports (we are going to use LineCollections)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

Next, definition of constants

N = 1001
np.random.seed(20190515)

and generation of fake data

x = np.linspace(0, 1, N)
prob = np.where(x<0.4, 0.02, np.where(x<0.7, 0.95, 0.02))
y = np.where(np.random.rand(N)<prob, 1, 0)

Here we create the line collection; sticks is an N×2×2 array containing the start and end points of our vertical lines

sticks = np.array(list(zip(zip(x, np.zeros(N)), zip(x, y))))                                  
lc = LineCollection(sticks)                                                      

Finally, the cumulative sum, here normalized to have the same scale as the vertical lines

cs = (y-0.5).cumsum()                                                            
csmin, csmax = min(cs), max(cs)                                                  
cs = (cs-csmin)/(csmax-csmin) # normalized to the range 0 to 1
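For fake data like this, with a single dense region that has sparse stretches on both sides, the boundaries can be read off the cumulative sum directly, since it bottoms out just before the region and peaks at the region's end. A minimal sketch under that assumption (the helper name `single_region` is hypothetical, and with noisy data the result can be off by a few samples):

```python
import numpy as np

def single_region(y):
    """Return (start, stop) of the one dense region; stop is exclusive.

    Assumes exactly one dense region with sparse data before and after it.
    """
    cs = (np.asarray(y) - 0.5).cumsum()
    start = int(np.argmin(cs)) + 1  # the minimum sits on the last 0 before the region
    stop = int(np.argmax(cs)) + 1   # the maximum sits on the last 1 of the region
    return start, stop

print(single_region([0, 0, 1, 1, 1, 0, 0]))  # (2, 5)
```

With several dense regions, as in the question, the global minimum and maximum are no longer enough and the local extrema have to be found instead.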

We just have to plot our results

f, a = plt.subplots()                                                            
a.add_collection(lc)                                                             
a.plot(x, cs, color='red')                                                       
a.grid()                                                                         
a.autoscale()                                                                    

Here is the plot

[Figure: the data drawn as vertical lines, with the normalized cumulative sum in red]

and here a detail of the stop zone.

[Figure: detail of the same plot around the stop zone]

You can smooth the cs data and use something from scipy.optimize to locate the extrema. If you run into a problem with this last step, please ask another question.
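One possible way to carry out that last step, sketched here with plain NumPy rather than `scipy.optimize`: the smoothed cumulative sum rises exactly where, within a running window, the ones outnumber the zeros, so its local minima and maxima can be found as sign changes of the windowed sum of `y-0.5`. The function name `dense_regions` and the `window` parameter are assumptions for illustration, the data is assumed to be at least `window` samples long, and the window length has to be tuned to the data.

```python
import numpy as np

def dense_regions(y, window=51):
    """Return (start, stop) index pairs of stretches where ones dominate.

    stop is exclusive; window is the odd length of the running window.
    """
    y = np.asarray(y, dtype=float)
    # Windowed sum of the increments of (y - 0.5).cumsum(): positive where
    # the smoothed cumulative sum rises, negative where it falls.
    slope = np.convolve(y - 0.5, np.ones(window), mode="same")
    rising = slope > 0
    # Pad with False so regions touching either end of the data are closed.
    padded = np.concatenate(([False], rising, [False])).astype(int)
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)   # local minima of the smoothed sum
    stops = np.flatnonzero(change == -1)   # local maxima of the smoothed sum
    return [(int(s), int(e)) for s, e in zip(starts, stops)]

# Noise-free toy data with three dense regions.
y = np.zeros(3000, dtype=int)
for lo, hi in [(200, 500), (1200, 1500), (2500, 2800)]:
    y[lo:hi] = 1
print(dense_regions(y))  # [(200, 500), (1200, 1500), (2500, 2800)]
```

Because the mask is padded at both ends, a dense region that runs up to the end of the list is still reported. On noisy data a short window can split one region into several pieces, so a larger window or a minimum-length filter may be needed.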

gboffi
  • Thank you for taking time out for this! I will try this out on my data and see what results I get! – Darpan May 16 '19 at 06:30