
I wanted to program bucket sort for a normal distribution (instead of the uniform distribution generally assumed by bucket sort). When I looked into the math, I came up with the following:

import math

def bucket_sort(numbers):
    mean = sum(numbers) / len(numbers)
    variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)
    std_deviation = math.sqrt(variance)

    num_buckets = int( math.sqrt(len(numbers)) )
    buckets = [[] for _ in range(num_buckets)]

    for num in numbers:
        index = int( ( (num - mean) / std_deviation) * num_buckets)
        buckets[index].append(num)

    for bucket in buckets:
        bucket.sort()

    sorted_numbers = [num for bucket in buckets for num in bucket]

    return sorted_numbers

For simplicity I assume that numbers consists only of positive integers. Is this adaptation of bucket sort correct? I was not able to find good literature on it. Additionally, I assume that this version can be sped up by using a smaller sample to compute the mean and variance. Is that assumption also correct?
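To illustrate the smaller-sample idea, this is roughly what I have in mind (random.sample and the sample size of 100 are just placeholders, not something I have benchmarked):

import math
import random

def estimate_mean_std(numbers, sample_size=100):
    # Estimate mean and standard deviation from a random subsample
    # instead of scanning the whole list
    sample = random.sample(numbers, min(sample_size, len(numbers)))
    mean = sum(sample) / len(sample)
    variance = sum((x - mean) ** 2 for x in sample) / len(sample)
    return mean, math.sqrt(variance)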

Titanlord
  • Try with [1,2,4,3,2,1] and note the IndexError exception – DarkKnight May 23 '23 at 12:26
  • Hmm, but how do I fix that? It seems that the math does not work as intended, but I don't understand it... – Titanlord May 23 '23 at 12:29
  • 1
    Bucket sort actually cuts the [cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) at equidistant levels. Just in case of uniform distribution the effect looks exactly like if the sample space would have been cut into equidistant ranges (the first and last buckets still stretch to infinity, but there're no such samples). So what you have to do is finding out how many buckets you want (happened already), divide the range 0...1 into that many levels, and then find where the distribution function crosses those levels, to find bucket boundaries. – tevemadar May 24 '23 at 11:58
  • Using the cumulative distribution function is the correct approach! The math behind it is a bit tricky, though – Titanlord May 24 '23 at 12:55
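A minimal sketch of the boundary-based approach tevemadar describes (the helper name and the use of scipy.stats.norm.ppf and bisect are choices made here, not taken from the comment):

import bisect
import scipy.stats

def normal_bucket_boundaries(mean, std, num_buckets):
    # Cut the probability range (0, 1) into num_buckets equal levels and map
    # them back through the inverse CDF (ppf) to get value boundaries
    levels = [(i + 1) / num_buckets for i in range(num_buckets - 1)]
    return [scipy.stats.norm.ppf(p, mean, std) for p in levels]

# A value's bucket is found by locating it between the boundaries
boundaries = normal_bucket_boundaries(50, 20, 10)
index = bisect.bisect_right(boundaries, 42)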

1 Answer


I found an answer. The problem is that the original attempt merely normalises the values without doing anything further.

On a mathematical / stochastic level we need the cumulative distribution function (CDF). The Gaussian distribution is defined by a probability density function (PDF), and we can map the samples to a uniform distribution using the CDF, which is the integral over the PDF; the output is uniformly distributed because of a result called the probability integral transform. Here is the resulting code:

import math
import scipy.stats

def bucket_sort(numbers, mean, deviation):
    # Just recompute here to see what happens
    mean = sum(numbers) / len(numbers)
    variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)
    deviation = math.sqrt(variance)

    num_buckets = int( math.sqrt(len(numbers)) ) 
    buckets = [[] for _ in range(num_buckets)]

    for num in numbers:
        # CDF of the normal distribution via the error function;
        # equivalently: scipy.stats.norm.cdf(num, mean, deviation)
        tmp = (num - mean) / (deviation * math.sqrt(2))
        cdf = 0.5 * (1 + math.erf(tmp))
        # Rescale the CDF value to a bucket index; the min() guards against
        # math.erf rounding to exactly 1.0 for extreme outliers
        index = min(int(cdf * num_buckets), num_buckets - 1)

        # Append the original value to the bucket chosen by its CDF position
        buckets[index].append(num)

    for bucket in buckets:
        bucket.sort()

    sorted_numbers = [num for bucket in buckets for num in bucket]

    return sorted_numbers

# example call
mean = 50
std = 20
samples = scipy.stats.norm.rvs(mean, std, size=100)
samples = [int(x) for x in samples]
print(bucket_sort(samples, mean, std))

Note: The CDF of the normal distribution is related to the error function, which can be approximated or, as I do here, computed with a math package. The CDF outputs values between 0 and 1 (because it is the integral over a probability density). We therefore need to rescale to bucket indices, which is done by multiplying by the number of buckets.
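As a quick sanity check (the sample values here are arbitrary), the hand-written erf formula and scipy's built-in CDF give the same result:

import math
import scipy.stats

mean, std = 50, 20
for x in (10, 50, 90):
    by_erf = 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))
    by_scipy = scipy.stats.norm.cdf(x, mean, std)
    print(x, by_erf, by_scipy)  # both columns agree up to floating-point precision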

Titanlord