
Peer summary: HMGHaly wants to find the locations of equally spaced percentiles on a data stream. The bins HMGHaly is after should therefore contain roughly the same number of data points, and are consequently not expected to have the same distance between bin boundaries. The size, as HMGHaly uses it, refers to the number of data points in the bin, not the width of the bin.

I have an iterable of numbers which I cannot fully load into memory, and I want to split these numbers into bins of equal size. In other words, if I were to sort all these numbers and split them into, for example, 10 groups/bins, I want to know the lowest value and highest value of each bin.

It is quite easy to compute the running mean by counting and summing the numbers seen so far. It is also quite easy to track the minimum and maximum value so far, but this kind of splitting seems challenging.

I have a few ideas:

  • If I were not restricted by memory, I could load all the numbers into a list, sort the list, split it into equal-sized smaller lists, and easily read off the boundary values of each small list, but this is not applicable here.

  • I can try to sort the huge iterable somehow and then deal with it as a sorted list, but the issue is that I would have to do this for many different values that I have to process simultaneously (the numbers under each column).

  • I can track the running mean and standard deviation, similar to this answer (a sketch of that running computation follows this list). Then I can split the range into bins defined by how many standard deviations, or fractions of a standard deviation, they lie from the mean. However, I tried implementing this, and when I subtracted the standard deviation from the mean, the value was less than the minimum value, so I think there might be an issue with the data distribution, maybe skewed towards higher values rather than lower ones; at the end of the day, using the standard deviation didn't help.
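
For reference, a minimal sketch of the kind of running mean/standard deviation computation that bullet refers to (Welford's online algorithm); the class and method names are illustrative, not taken from the linked answer:

import math

class RunningStats:
  def __init__(self):
    self.count = 0
    self.mean = 0.0
    self.m2 = 0.0  # running sum of squared differences from the current mean

  def update(self, x):
    self.count += 1
    delta = x - self.mean
    self.mean += delta / self.count
    self.m2 += delta * (x - self.mean)

  def std(self):
    return math.sqrt(self.m2 / self.count) if self.count > 1 else 0.0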

So, the question here is as follows:

  • Given an iterable of tens of millions of numbers that we want to split into N bins (e.g. 10 bins) of equal size, how can we identify the upper-bound value and lower-bound value of each bin without loading all these numbers into memory?

Edit: The bin-splitting process is as follows, for simple in-memory list sorting/splitting/binning:

import random
list1=[random.randint(0,20) for i in range(100)]
list1.sort()
print("full list:",list1)
n_intervals=10
interval_size=int(len(list1)/n_intervals)
for i0 in range(n_intervals):
  small_list1=list1[interval_size*i0:interval_size*(i0+1)]
  bounds=(small_list1[0],small_list1[-1])
  print("small_list # %s"%i0, small_list1, "size:", len(small_list1), "- bounds:", bounds)

Output

full list: [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20]
small_list # 0 [0, 0, 0, 1, 1, 1, 1, 2, 2, 2] size: 10 - bounds: (0, 2)
small_list # 1 [2, 2, 2, 2, 3, 3, 3, 3, 4, 4] size: 10 - bounds: (2, 4)
small_list # 2 [4, 5, 5, 5, 5, 5, 5, 5, 5, 6] size: 10 - bounds: (4, 6)
small_list # 3 [6, 6, 6, 6, 7, 7, 7, 7, 7, 7] size: 10 - bounds: (6, 7)
small_list # 4 [7, 8, 8, 8, 8, 8, 8, 8, 8, 9] size: 10 - bounds: (7, 9)
small_list # 5 [9, 9, 9, 10, 10, 10, 10, 11, 11, 11] size: 10 - bounds: (9, 11)
small_list # 6 [11, 12, 12, 12, 12, 12, 12, 13, 13, 14] size: 10 - bounds: (11, 14)
small_list # 7 [14, 14, 14, 14, 14, 15, 15, 16, 16, 16] size: 10 - bounds: (14, 16)
small_list # 8 [16, 16, 16, 16, 17, 17, 17, 18, 18, 18] size: 10 - bounds: (16, 18)
small_list # 9 [19, 19, 19, 19, 19, 19, 19, 20, 20, 20] size: 10 - bounds: (19, 20)

Further edit: To be fully clear, I need something like the following. It is very easy to get the running mean, min and max, but the question is how to define the boundary values that split all the values into bins of equal size, calculated over a stream of running values without having to store those values in memory.

import random
random.seed(0)
count0=0
sum0=0
running_min0=None
running_max0=None

def get_bin_boundaries(n_bins=5): #The function I need, it can take any arguments
  return #and return a list of boundary values corresponding to n_bins+1 e.g. [0,3,7,9,11,15]

for i in range(100000000):
  cur_number=random.randint(0,20)
  count0+=1
  sum0+=cur_number
  running_mean0=sum0/count0
  if running_min0 is None or running_min0>cur_number:running_min0=cur_number
  if running_max0 is None or running_max0<cur_number:running_max0=cur_number
  running_bin_boundaries=get_bin_boundaries() #This is what I need
  #print("cur_number",cur_number,"running_mean0",running_mean0,"running_min0",running_min0,"running_max0",running_max0)



hmghaly
  • Have you tried pyspark? Its main purpose is specifically to parallelize calculations on data that are too big to be loaded in memory and that need to be chunked. – Stef Sep 14 '22 at 15:16
  • Actually, calculating a running max and a running min is even easier than calculating a running mean and a running standard deviation, so I'm not sure what your question is. – Stef Sep 14 '22 at 15:23
  • @Stef I would prefer something pure Python, if possible. And it's always possible to do chunking anyway, but the challenge here is to calculate it on the running data/stream of numbers, as it has been done for mean, max, min, and stdev – hmghaly Sep 14 '22 at 15:24
  • The standard way to calculate a max is a running way: you have a variable `m` initialised as the first element, and every time a new element `x` arrives, you update `m = max(m, x)`. – Stef Sep 14 '22 at 15:24
  • I don't understand the question. Has the iterable been split into bins yet? If yes, what's stopping you from computing the max and min of every bin? If not, what is the splitting method? – Stef Sep 14 '22 at 15:25
  • I agree, min, max, mean and stdev can all be applied easily on a running stream of data – hmghaly Sep 14 '22 at 15:25
  • No, it's not split into bins yet. I can get the max and mins but they can be outliers. So if I split according to max and min, some of the bins would have 1-2 values, and others can have millions – hmghaly Sep 14 '22 at 15:27
  • Please clarify the question. *"how can we identify the upper-bound value and lower-bound value of each bin"* makes no sense if you don't tell us more about the bin-splitting process. – Stef Sep 14 '22 at 15:28
  • @Stef sure, please see my edit – hmghaly Sep 14 '22 at 15:49
  • Oooooooh okay. So it looks like you want to do a distributed sorting algorithm. – Stef Sep 14 '22 at 15:54
  • Sorting is one of the options, but it gets complicated with many numbers we need to process at the same time, so that's why I wanted to kind of running algorithm, similar to min/max/mean/stdev – hmghaly Sep 14 '22 at 15:56
  • This question is not clear, and that is the reason it lost traction. Please clarify: 1. do you know how many bins you want, or how big they can be? 2. what happens if the total stream size is not divisible by the number of bins? – Rodrigo Rodrigues Sep 18 '22 at 09:38
  • You know what, nevermind. If you don't know the sequence in advance, it is impossible to decide beforehand which bin a value should be part of. Unless, of course, you can go storing all bins in memory and manipulate them on every new value, which totally defeats the purpose of having a stream. This question, as stated, doesn't make sense. – Rodrigo Rodrigues Sep 18 '22 at 09:39
  • How small is your memory? Tens of millions doesn't sound like it should be a problem to put it all into memory. Also, how large are your numbers? (Surely not really ints from 0 to 20, right?) – Kelly Bundy Sep 19 '22 at 02:39
  • You say "equal" multiple times but then approved an edit introducing "roughly the same". Which is it? – Kelly Bundy Sep 20 '22 at 12:43
  • @KellyBundy "roughly the same" means the number of data points in two bins can differ by 1. It is not a given that the number of data points is divisible by the number of bins. Therefore we cannot guarantee that each bin contains the same amount of data points. – Ronald van Elburg Sep 21 '22 at 12:33
  • @RonaldvanElburg Where can I see that definition? I think for example 735 and 738 are roughly the same. – Kelly Bundy Sep 21 '22 at 13:34
  • @KellyBundy The moderators did not accept me adding this definition. But as the author of the Peer summary I feel I can explain my own words without reference to an external source. Are you willing to accept that exactly the same is not achievable without extra requirements on the input. – Ronald van Elburg Sep 21 '22 at 14:04
  • @RonaldvanElburg So it's just what *you* meant. I don't think that's how it's generally understood. Does your answer's solution achieve it? Making the bin sizes differ by at most 1? From your description it doesn't sound like it, and some testing I tried also didn't make it look that way, not even remotely close, but I might've tested wrongly. No, I don't agree that it's not achievable. But I'm still waiting for their answers to my questions which they seem to have ignored. – Kelly Bundy Sep 21 '22 at 14:13
  • @KellyBundy You are right my solution is an approximation, but I am convinced you need to resort to some kind of approximation if you don't want to keep all the data in memory. The solution presented by Mustafa Gamal offers you a more precise answer, but at huge computational costs. So in the end it depends on how you want to use the bin boundaries. On very general grounds I think that percentiles outside the interval [1/N_data_points, 1-1/N_data_points] yield unreliable values. – Ronald van Elburg Sep 21 '22 at 16:22
  • @KellyBundy I updated my answer after your questions forced me to rethink it ;-) I am convinced that the error intervals that are reported now contain the answer obtained using an implementation of Mustafa Gamal's answer. For the boundary cases it will now print a warning message. – Ronald van Elburg Sep 21 '22 at 18:52
  • Thank you @RodrigoRodrigues, I don't think it is impossible to decide beforehand, which bin a value should be part of. Other answers have provided solutions for that and I posted a solution for that and it's actually very simple – hmghaly Sep 23 '22 at 13:19
  • @KellyBundy I think the whole idea is to make bin sizes comparable, probably equal was a stretch. So I think some of the answers have given some directions in this regard, and it helped me get to a solution that I posted as an answer. – hmghaly Sep 23 '22 at 13:21
  • @RonaldvanElburg thanks for the summary and your answer, these are very helpful. Yes, it's always an approximation when you're talking about this magnitude and stream of data. – hmghaly Sep 23 '22 at 13:23

7 Answers

1

I think you will need to sort the stream. You can achieve this (assuming here that you know the number of items in the stream and that your memory can handle at least two bins at a time) by doing the following; a rough sketch follows the list:

  1. Store each bin on disk [bin_size = number_of_items_in_stream / number_of_bins].

  2. After the end of the stream, load each bin into memory, sort it, and store it again on disk, while saving the name of the bin and its min and max values in a data structure that contains these values in addition to the name of each bin.

  3. In the data structure, sort the bin names according to their min value.

  4. From step 3 you can identify which bins intersect with each other.

  5. Loop over the data structure, load every two intersecting bins into memory, and interchange their values so that the two bins won't have any intersecting values at the end.

  6. After step 5, update the min and max values of the two bins in the data structure to the new min and max values.

  7. Your stream is now sorted.
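
A minimal sketch of steps 1-2 under the same assumptions (known item count, chunks that fit in memory); the file layout, chunk size and function names are illustrative, and instead of the value exchange in steps 4-6 it uses a lazy k-way merge (heapq.merge) over the sorted chunk files to read off the bin boundaries:

import heapq
import os
import tempfile

def sort_chunks_to_disk(stream, chunk_size=1_000_000):
    # Steps 1-2: cut the stream into fixed-size chunks, sort each chunk and
    # write it to its own file. Returns the chunk file paths and the total count.
    tmp_dir = tempfile.mkdtemp()
    paths, chunk, total = [], [], 0

    def flush(numbers):
        numbers.sort()
        path = os.path.join(tmp_dir, "chunk_%s.txt" % len(paths))
        with open(path, "w") as f:
            f.writelines("%s\n" % x for x in numbers)
        paths.append(path)

    for x in stream:
        chunk.append(x)
        total += 1
        if len(chunk) >= chunk_size:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)
    return paths, total

def read_chunk(path):
    # Lazily yield the numbers of one sorted chunk file.
    with open(path) as f:
        for line in f:
            yield float(line)

def bin_boundaries(paths, total, n_bins=10):
    # Merge the sorted chunk files lazily and pick every (total // n_bins)-th
    # value as a boundary, plus the overall maximum as the last boundary.
    step = max(total // n_bins, 1)
    boundaries = []
    for i, value in enumerate(heapq.merge(*(read_chunk(p) for p in paths))):
        if i % step == 0 or i == total - 1:
            boundaries.append(value)
    return boundaries

The merge keeps only one value per chunk file in memory at a time, so the memory footprint is bounded by the chunk size used for sorting.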

1

If I understood your question correctly, couldn't you use a defaultdict to count the appearances of each value? Afterwards you have a huge dict, but you can iterate over the sorted keys and create your bins from it. You could even recalculate the bins for every new number in your stream, but I would deem this unnecessary.

To put it into code:

import random
from math import ceil
from collections import defaultdict
random.seed(0)
count0=0
sum0=0
running_min0=None
running_max0=None

appearances = defaultdict(int)

def get_bin_boundaries(appearances, count, min_val, max_val, n_bins=5):
    # appearances maps each seen value to its count; we walk the sorted values
    # and close a bin each time roughly count/n_bins data points have been consumed.
    if count < n_bins:
        n_bins = count
    keys = sorted(appearances.keys())
    bin_size = int(ceil(count / n_bins))
    boundaries = [min_val]
    key_index = 0
    removed = 0   # occurrences of the current key already assigned to earlier bins
    taken = 0     # data points already placed in the currently open bin
    while len(boundaries) <= n_bins:
        cur_val = appearances[keys[key_index]]
        if cur_val - removed > bin_size - taken:
            # the remaining occurrences of this key overfill the open bin; close it here
            removed += (bin_size - taken)
            boundaries.append(keys[key_index])
            taken = 0
            continue
        if cur_val - removed == bin_size - taken:
            # the remaining occurrences of this key exactly fill the open bin
            removed = 0
            taken = 0
            boundaries.append(keys[key_index])
            key_index += 1
            continue
        # not enough occurrences to fill the bin yet; move on to the next key
        taken += (cur_val - removed)
        removed = 0
        key_index += 1
        if key_index >= len(keys):
            boundaries.append(keys[key_index - 1])
    boundaries.append(max_val)
    return boundaries


for i in range(1000000):
    cur_number=random.randint(0,20)
    count0+=1
    sum0+=cur_number
    running_mean0=sum0/count0
    if running_min0 is None or running_min0>cur_number:running_min0=cur_number
    if running_max0 is None or running_max0<cur_number:running_max0=cur_number
    appearances[cur_number]+=1
    if i % 100000 == 0:
        print(i, get_bin_boundaries(appearances, count0, running_min0, running_max0))

Keep in mind that this is slow to compute.

pandayo
  • Precisely! That's actually the direction I want to take. I made some modifications to it and posted it as an answer below, thank you so much! – hmghaly Sep 23 '22 at 13:11
0

If you know the expected length of input beforehand, it would be pretty easy if I understand you correctly:

import random
random.seed(0)
count0=0
sum0=0
running_min0=None
running_max0=None
stream_len=100000000

def get_bin_boundaries(n_bins=5): #The function I need, it can take any arguments
  res = []
  i = 0
  while i < stream_len:
    res.append(i)
    i += int(stream_len/n_bins)
  res.append(stream_len-1)
  return res #and return a list of boundary values corresponding to n_bins+1 e.g. [0,3,7,9,11,15]

for i in range(stream_len):
  cur_number=random.randint(0,20)
  count0+=1
  sum0+=cur_number
  running_mean0=sum0/count0
  if running_min0 is None or running_min0>cur_number:running_min0=cur_number
  if running_max0 is None or running_max0<cur_number:running_max0=cur_number
  running_bin_boundaries=get_bin_boundaries() #This is what I need
  • No I don't know the expected length in advance, but thank you for the ideas and directions, they helped me get a solution that I posted as an answer here – hmghaly Sep 23 '22 at 13:13
0

You should use Python with Apache Spark; doing this operation with pure Python will take a lot of time and will not be efficient.

https://spark.apache.org/

Another option is to try pandas if you need to work with Python only: https://pandas.pydata.org/
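
For example, Spark's DataFrame API exposes approxQuantile, which computes approximate percentiles without collecting the data to the driver. A minimal sketch, where the file name numbers.csv and the column name value are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("equal-count-bins").getOrCreate()

# Hypothetical input: a CSV file with one numeric column named "value".
df = spark.read.csv("numbers.csv", header=True, inferSchema=True)

# 9 cut points give 10 equal-count bins; the last argument is the relative error.
probabilities = [i / 10 for i in range(1, 10)]
boundaries = df.approxQuantile("value", probabilities, 0.001)
print(boundaries)

On the pandas side, pandas.Series.quantile or pandas.qcut give equal-count bin edges, but they require the column to fit in memory.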

Anas Naguib
  • Thanks Anas, but I'm trying to avoid any solution that requires loading the entire data at once – hmghaly Sep 23 '22 at 13:12
0

What you need is a spatial data structure. You can use a quadtree: it subdivides the space into quads based on the distance between the points, so you can "tile" all the bins. You can use any grouping criterion, such as Euclidean distance, Hamming distance, or min/max, to divide the space. After tiling each bin you can find the intersections in N log N by walking down the tree, and you can also query for any point, among many other features. https://en.wikipedia.org/wiki/Quadtree or an easier approach: https://en.wikipedia.org/wiki/K-d_tree


andre_lamothe
  • That's an interesting approach, thanks Ahmed. I think it can also be helpful in other situations, but for this I have found a solution that I posted as an answer below – hmghaly Sep 23 '22 at 13:14
0

The problem cannot be solved exactly, but given some constraints we can solve it to a good approximation.

It is important to know beforehand in which range the data can almost certainly be found, so an order-of-magnitude estimate of the data values should be known.

Suppose we have data and we know that the majority of the data points is almost certainly in the range [a_min, a_max]; then we can:

- bin the data into very narrow bins, creating a histogram in the process
- subsequently calculate the cumulative distribution function
- find the points where the cumulative distribution function reaches the percentiles of interest

In code:

import numpy as np

# Function to check if x is power of 2
# https://www.geeksforgeeks.org/python-program-to-find-whether-a-no-is-power-of-two/
def isPowerOfTwo(n):
    if (n == 0):
        return False
    while (n != 1):
            if (n % 2 != 0):
                return False
            n = n // 2
             
    return True


class percentileEstimator():
    ''' This algorithm assumes there are 2^N bins separated by 2^N-1 bin
        boundaries (where N is a natural number).

        We start from the middle bin boundary and search from there for a bin
        boundary neighbouring the correct bin. Then we do a last check on
        the value to decide whether the lower neighbouring bin or the higher
        neighbouring bin is the correct bin for the value.

        A bin boundary is included in the neighbouring bin at higher values.
        The first and last bin contain values below the first, respectively
        above the last, specified bin boundary.
    '''


    def __init__(self, bin_boundaries):
        
        if not isPowerOfTwo(len(bin_boundaries)+1):
            raise ValueError('percentileEstimator: Number of bins is not a power of 2')
            
        self.bin_boundaries = bin_boundaries
        self.bin_count = len(bin_boundaries) + 1
        self.histogram = np.zeros((self.bin_count,), dtype=np.int64)
        self.datapoint_count = 0   
       
    def getBinIndex(self, value):
       
        position = int(self.bin_count/2)  # For mathematical reasons we start positions at 1
        step = int(self.bin_count/4)

        while (step > 0):
            
            if(value < self.bin_boundaries[position-1]):
                position -= step
            else:
                position += step
            
            step = step // 2
            
        # Are we lower or higher than the last bin boundary
        if(value < self.bin_boundaries[position-1]):
            index = position-1
        else:
            index = position
        
        return index

    def update(self, data):
        for datapoint in data:
            index = self.getBinIndex(datapoint)
            self.histogram[index] +=1
        print(self.histogram)

    def getPercentiles(self, percentile_list):
        '''
        Calculate approximate percentile location:
        
            In: 
                percentile_list: list percentiles
            
            Out:
                percentiles: estimated value associated with the percentile
                error_intervals: interval in which we are certain the percentile value can be found
        '''
        
        cumulative_distribution = np.cumsum(self.histogram)
        percentile_locations = list()
                
        if cumulative_distribution[0] > 0:
            print(f'There are {cumulative_distribution[0]} data points below the specified minimum')
            
        if cumulative_distribution[-2] != cumulative_distribution[-1]:
            print(f'There are {cumulative_distribution[-1] - cumulative_distribution[-2]} data points above the specified maximum')
            
        for percentile in percentile_list:
            if percentile <= 1/cumulative_distribution[-1]:
                print(f'percentile requested {percentile} is too small for the data set provided, percentile value should be larger than {1/cumulative_distribution[-1]} ')
            elif percentile >= 1-1/cumulative_distribution[-1]:
                print(f'percentile requested {percentile} is too large for the data set provided, percentile value should be smaller than {1-1/cumulative_distribution[-1]}')
        
        for percentile in percentile_list:
            percentile_loc_left = np.searchsorted(cumulative_distribution[1:-2], percentile*cumulative_distribution[-1], side ='left')
            percentile_locations.append(percentile_loc_left)                             
        
        percentiles = np.array([(self.bin_boundaries[location]+self.bin_boundaries[location+1])/2 for location in percentile_locations])
        
        error_intervals = np.array([[self.bin_boundaries[location], self.bin_boundaries[location+1]] for location in percentile_locations])
                
        return percentiles,  error_intervals
    

# test the class a bit

def test_index_function():
    pE = percentileEstimator(np.array([0, 13, 27], 'float'))
    values = [-1.0, 0, 0.5, 13, 13.5, 27, 27.5, 12.9]
    desired_outcomes = np.array([0, 1, 1, 2, 2, 3, 3, 1])
    actual_outcomes = np.array([pE.getBinIndex(value) for value in values])
    np.testing.assert_equal(actual_outcomes, desired_outcomes)

test_index_function()

def test1_getpercentile_function():
    pE = percentileEstimator(np.array([1,2,3,4,5,6,7], 'float'))
    
    values = np.array([2.4,]*40 + [5.1]*50 + [6.5]*10)
    
    pE.update(values)
    
    percentiles = [0.3,0.4,0.5,0.9]
    percentile_values, error_intervals = pE.getPercentiles(percentiles)
    
    print(f'{percentile_values=}')
    
    percentile_values_expected = np.array([2.5, 2.5, 5.5, 5.5])
    
    error_intervals_expected = np.array([[2., 3.],
       [2., 3.],
       [5., 6.],
       [5., 6.]])
    
    np.testing.assert_equal(percentile_values, percentile_values_expected)
    np.testing.assert_equal(error_intervals, error_intervals_expected)
                   
test1_getpercentile_function() 


a_min = 0
a_max = 10
step = (a_max-a_min)/510

bin_boundaries = np.arange(a_min, a_max+step,step)

pE = percentileEstimator(bin_boundaries)
test_data_mean = 7
test_data_sigma = 0.5
test_data = np.random.randn(1000)*test_data_sigma + test_data_mean


pE.update(test_data)

percentiles, error_intervals = pE.getPercentiles([0.1, 0.5, 0.9,])

print(f'{percentiles=}\n{ error_intervals=}')

For me this produces something like:

percentiles=array([6.30392157, 6.99019608, 7.6372549 ])
error_intervals=array([[6.29411765, 6.31372549],
                       [6.98039216, 7.        ],
                       [7.62745098, 7.64705882]])

This code shows the principle. It can probably be sped up but it is reasonably efficient as is.

Calling pE.getPercentiles with equally spaced percentiles returns the bin boundaries of interest:

N=10
equally_spaced_percentiles = (np.arange(0,N-1)+1)/N

print(f'{equally_spaced_percentiles=}')
percentiles, error_intervals = pE.getPercentiles(equally_spaced_percentiles)

print(f'bin boundaries: {percentiles}')

For the Gaussian example I get (in 1 of the runs):

bin boundaries: [6.30392157 6.53921569 6.69607843 6.85294118 6.99019608 7.14705882
 7.30392157 7.46078431 7.6372549 ]

The internal bins are equally spaced in the example calls. This is good for the example, where we apply the code to data drawn from a Gaussian distribution. If, however, we deal with a ratio scale, for example the energy in sound, it is possible to take the logarithm and then use equally spaced bin boundaries. Or alternatively, and in my opinion more efficiently, it is possible to choose the bin boundaries log-linearly and avoid the expensive log function; a short sketch of this follows.
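
A minimal sketch of that last option, using numpy's geomspace to place the bin boundaries log-linearly; the range values here are illustrative, and a_min must be positive:

import numpy as np

# Illustrative range taken from the order-of-magnitude estimate of the data.
a_min, a_max = 1e-3, 1e3

# 511 boundaries -> 512 bins, a power of 2 as percentileEstimator requires,
# with a constant ratio between consecutive boundaries.
bin_boundaries = np.geomspace(a_min, a_max, num=511)

pE = percentileEstimator(bin_boundaries)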

  • Comment: If the percentile value is in a region with very few data points it can vary wildly. The 0.5 value is therefore closer to the value of the 50th percentile of the distribution from which the data points are drawn than the 0.9 value. – Ronald van Elburg Sep 20 '22 at 08:31
  • The answer above is calculated over the whole history of the stream. It is not hard to make it windowed so it becomes a running percentile estimator. – Ronald van Elburg Sep 23 '22 at 07:07
  • I like this direction, and we get a histogram for free! It helped me think of a solution that I posted as an answer here. Thank you so much! – hmghaly Sep 23 '22 at 13:17
0

I have received a considerable number of answers here and elsewhere with many great suggestions. I eventually figured out a solution that is informed by many of these suggestions, so I'm posting it here. The idea is as follows:

  • Iterate over the numbers, and the first number that is non-zero is the reference value
  • For each subsequent number, identify the relative difference between it and the reference value (number - ref) / ref
  • Get the relative difference as a percentage by multiplying with 100
  • Normalize the percentage value around a certain step (e.g. 5% - 10%), so that for example the percentage differences of 3,4,5,6,7 correspond to 5%, while 8,9,10,11,12 would correspond to 10%
  • Now count the number of occurrences of numbers that correspond to each normalized percentage value
  • Based on the original reference value, and the normalized percentage, and the counts of numbers corresponding to the normalized percentage, we can identify raw bins (high and low values with the corresponding count of occurrences)
  • Then we can sort these numbers and combine the bins as needed to make them comparable-size bins

Here is the code I used:


import random
random.seed(0)
count0=0
sum0=0
running_min0=None
running_max0=None

ref_value=None
size0=5 #to normalize the percentage values
diff_counter={}
for i in range(100000):
  cur_number1=random.randint(0,20)
  cur_number2=random.random()
  cur_number=cur_number1*cur_number2
  #cur_number=cur_number2+2
  if cur_number!=0 and ref_value is None: ref_value=cur_number #we identify the first non-zero value as our reference value
  rel_diff=(cur_number-ref_value)/ref_value #we get the relative difference between the current number and reference value
  diff_percent=round(rel_diff*100) #we turn it into a percentage
  normalized_diff_percent=round(diff_percent/size0)*size0 #we normalize the percentage as needed, so that e.g. 5,6,7,8,9,10,11,12,13,14 would correspond to 10
  diff_counter[normalized_diff_percent]=diff_counter.get(normalized_diff_percent,0)+1 #we increment the count of items with this normalized percentage from the reference value

  count0+=1
  sum0+=cur_number
  running_mean0=sum0/count0 #we get the running mean value
  if running_min0 is None or running_min0>cur_number:running_min0=cur_number #and the running minimum value
  if running_max0 is None or running_max0<cur_number:running_max0=cur_number #and the running maximum value

counted_percent_items=list(diff_counter.items()) #now we list the normalized percentage values and their corresponding item count
counted_percent_items.sort()

for a,b in counted_percent_items:
  #cur_ratio=0.01*a
  cur_low_percent0,cur_high_percent0=round(a-0.5*size0),round(a+0.5*size0)
  cur_low_val=(0.01*cur_low_percent0*ref_value)+ref_value
  cur_high_val=(0.01*cur_high_percent0*ref_value)+ref_value
  print("Bin around %s percent - count: %s - boundaries: %s <= X < %s"%(a,b,round(cur_low_val,4),round(cur_high_val,4)))
  #print(a,b,"boundary0,boundary1",cur_low_percent0,cur_high_percent0,"cur_low_val",round(cur_low_val,4),"cur_high_val",round(cur_high_val,4))
print("ref_value",ref_value)
print("running_mean0",round(running_mean0,4),"running_min0",round( running_min0,4),"running_max0",round(running_max0,4))

and here is the output:

Bin around -100 percent - count: 8671 - boundaries: -0.1819 <= X < 0.1819
Bin around -95 percent - count: 7682 - boundaries: 0.1819 <= X < 0.7276
Bin around -90 percent - count: 7251 - boundaries: 0.7276 <= X < 1.0915
Bin around -85 percent - count: 5634 - boundaries: 1.0915 <= X < 1.6372
Bin around -80 percent - count: 5485 - boundaries: 1.6372 <= X < 2.001
Bin around -75 percent - count: 4599 - boundaries: 2.001 <= X < 2.5467
Bin around -70 percent - count: 4560 - boundaries: 2.5467 <= X < 2.9105
Bin around -65 percent - count: 3913 - boundaries: 2.9105 <= X < 3.4563
Bin around -60 percent - count: 3913 - boundaries: 3.4563 <= X < 3.8201
Bin around -55 percent - count: 3527 - boundaries: 3.8201 <= X < 4.3658
Bin around -50 percent - count: 3224 - boundaries: 4.3658 <= X < 4.7296
Bin around -45 percent - count: 3013 - boundaries: 4.7296 <= X < 5.2754
Bin around -40 percent - count: 2798 - boundaries: 5.2754 <= X < 5.6392
Bin around -35 percent - count: 2797 - boundaries: 5.6392 <= X < 6.1849
Bin around -30 percent - count: 2491 - boundaries: 6.1849 <= X < 6.5487
Bin around -25 percent - count: 2530 - boundaries: 6.5487 <= X < 7.0945
Bin around -20 percent - count: 2072 - boundaries: 7.0945 <= X < 7.4583
Bin around -15 percent - count: 2192 - boundaries: 7.4583 <= X < 8.004
Bin around -10 percent - count: 1898 - boundaries: 8.004 <= X < 8.3678
Bin around -5 percent - count: 1905 - boundaries: 8.3678 <= X < 8.9135
Bin around 0 percent - count: 1755 - boundaries: 8.9135 <= X < 9.2774
Bin around 5 percent - count: 1643 - boundaries: 9.2774 <= X < 9.8231
Bin around 10 percent - count: 1491 - boundaries: 9.8231 <= X < 10.1869
Bin around 15 percent - count: 1514 - boundaries: 10.1869 <= X < 10.7326
Bin around 20 percent - count: 1426 - boundaries: 10.7326 <= X < 11.0965
Bin around 25 percent - count: 1278 - boundaries: 11.0965 <= X < 11.6422
Bin around 30 percent - count: 1203 - boundaries: 11.6422 <= X < 12.006
Bin around 35 percent - count: 1081 - boundaries: 12.006 <= X < 12.5517
Bin around 40 percent - count: 1076 - boundaries: 12.5517 <= X < 12.9155
Bin around 45 percent - count: 907 - boundaries: 12.9155 <= X < 13.4613
Bin around 50 percent - count: 865 - boundaries: 13.4613 <= X < 13.8251
Bin around 55 percent - count: 803 - boundaries: 13.8251 <= X < 14.3708
Bin around 60 percent - count: 683 - boundaries: 14.3708 <= X < 14.7346
Bin around 65 percent - count: 683 - boundaries: 14.7346 <= X < 15.2804
Bin around 70 percent - count: 545 - boundaries: 15.2804 <= X < 15.6442
Bin around 75 percent - count: 534 - boundaries: 15.6442 <= X < 16.1899
Bin around 80 percent - count: 465 - boundaries: 16.1899 <= X < 16.5537
Bin around 85 percent - count: 447 - boundaries: 16.5537 <= X < 17.0995
Bin around 90 percent - count: 357 - boundaries: 17.0995 <= X < 17.4633
Bin around 95 percent - count: 358 - boundaries: 17.4633 <= X < 18.009
Bin around 100 percent - count: 198 - boundaries: 18.009 <= X < 18.3728
Bin around 105 percent - count: 224 - boundaries: 18.3728 <= X < 18.9185
Bin around 110 percent - count: 151 - boundaries: 18.9185 <= X < 19.2824
Bin around 115 percent - count: 102 - boundaries: 19.2824 <= X < 19.8281
Bin around 120 percent - count: 56 - boundaries: 19.8281 <= X < 20.1919
ref_value 9.09545283528363
running_mean0 4.9789 running_min0 0.0 running_max0 19.9924

This approach needs very little memory, basically a dictionary with some keys for counting relative percentages, and it can give us a real-time picture of the distribution of values around our reference value. To do:

  • Need to account for the case where the first value(s) are zero, by counting them relative to the reference value once it is found.
  • Need to process the relative percentage bins to get comparable-size bins eventually (see the sketch after this list).
  • Need to identify a reasonable way of choosing a good normalizing step for the percentages.
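
A rough sketch of the second to-do item: merging the sorted (normalized percentage, count) pairs into roughly equal-count bins. It reuses counted_percent_items, ref_value and size0 from the code above; the function name and the exact merging rule are my own illustration:

def merge_to_equal_count_bins(counted_percent_items, ref_value, size0, n_bins=10):
  #counted_percent_items: sorted list of (normalized_percent, count) pairs
  total=sum(count for _,count in counted_percent_items)
  target=total/n_bins #desired number of data points per merged bin
  boundaries=[] #lower edges of the merged bins, in original value units
  accumulated=0
  for percent,count in counted_percent_items:
    low_val=(0.01*(percent-0.5*size0)*ref_value)+ref_value #lower edge of this raw bin
    if not boundaries:
      boundaries.append(low_val) #lower edge of the very first merged bin
    elif accumulated>=target:
      boundaries.append(low_val) #close the current merged bin here
      accumulated=0
    accumulated+=count
  return boundaries

Appending running_max0 to the returned list gives the final upper edge, so the result has roughly n_bins+1 boundary values.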
hmghaly
  • "_Iterate over the numbers, and the first number that is non-zero is the reference value_". You _may_ need in fact __infinite__ memory until you get the first non-zero (without assuming anything else about your data). – Rodrigo Rodrigues Sep 23 '22 at 13:25
  • why do I need infinite memory? I will just be incrementing the counter with every new zero value, then once I get a non-zero, that would be my reference – hmghaly Sep 23 '22 at 13:59
  • So are your values really just `floats`? Then I really don't believe that you can't load tens of millions into memory. Sadly you never answered my [questions](https://stackoverflow.com/questions/73719193/how-to-estimate-percentiles-on-streaming-data-identifying-equally-sized-bins-o/73758652?noredirect=1#comment130259832_73719193)... – Kelly Bundy Sep 23 '22 at 15:11
  • @KellyBundy the requirement is explicitly not to load the data into memory, we have dozens of columns and each column has tens and maybe hundreds of millions of numbers. We can't and shouldn't load this into memory for practical and scalability reasons. – hmghaly Sep 23 '22 at 15:17
  • Ah, ok. Still, I'd maybe do one pass to store each column in its own file, then process the columns separately. With external sorting if even the single column's data doesn't fit into memory. – Kelly Bundy Sep 23 '22 at 15:30
  • Yes, that would be the last resort if nothing else works, but I think that from the answers and suggestions here we have a couple of solutions that would yield reasonable results without having to put anything in memory or on disk – hmghaly Sep 23 '22 at 15:40
  • "_I will just be incrementing the counter with every new zero value_". And what are you doing with those non-zero values? Losing them by not storing anywhere? – Rodrigo Rodrigues Sep 23 '22 at 16:25
  • @RodrigoRodrigues, no, I will start getting their normalized relative difference percentages and counting these in the diff_counter dictionary in the code above (based on the reference value). I am only interested in the first non-zero value, because this is the reference value that I'm comparing every subsequent value to it – hmghaly Sep 23 '22 at 16:39
  • I am talking about the values streamed __before__ you get your 0. You cannot calculate any relative on them yet. Are you ignoring/losing them? – Rodrigo Rodrigues Sep 23 '22 at 18:28
  • @RodrigoRodrigues I think there are 2 scenarios here: 1) the first value in the steam is non-zero, and in this case, it will simply be the reference value, 2) the first value(s) in the stream are zeros, then we wait till we get a non-zero value, and make it our reference value – hmghaly Sep 23 '22 at 18:42
  • That is what I am trying to bring attention to: "_wait till we get a non-zero value_". What do you do with the streamed 0 values prior to the first non-zero? Just discard them? – Rodrigo Rodrigues Sep 23 '22 at 19:12
  • @RodrigoRodrigues, no, just count them, and when you get the reference value, calculate the normalized percent difference (between zero and the reference value), and add the count of it to the diff_counter – hmghaly Sep 23 '22 at 21:16