First, you seem to have an indentation error - for val in data.values():
should not be nested within for bins in range(0, max_salary+1, bin_width):
- that's why you are getting a longer list of values.
Second, your logic is a bit off in various ways - you keep a count variable that is only set to zero once, at the beginning of the function. for n in bin_list:
loops through the values in bin_list
, but you then multiply n
by bin_width
, which doesn't make sense. You could alter this using range(n_bins)
to go through the indices of bin_lists
, like this:
def wealth_distribution(data, n_bins, max_salary):
sal_list = [0] * n_bins
bin_list = []
bin_width = int(max_salary/n_bins)
for bins in range(0, max_salary+1, bin_width):
bin_list.append(bins)
for val in data.values():
if val['salary'] == None:
continue
for i in range(n_bins):
if math.floor(i*bin_width)<=float(val['salary'])<math.floor((i+1)*bin_width):
sal_list[i] += 1
return sal_list
But on closer inspection, bin_list
is actually serving no purpose here. The function can be reduced to:
def wealth_distribution(data, n_bins, max_salary):
sal_list = [0] * n_bins
bin_width = max_salary/n_bins
for val in data.values():
if val['salary'] == None:
continue
bin_index = int(float(val["salary"]) / bin_width)
if bin_index < n_bins:
sal_list[bin_index] += 1
else: # salary = max_salary
sal_list[n_bins-1] += 1
return sal_list
The function above calculate the bin index rather than looping through the bins or indices. I also removed the math.floor
s as these seem unnecessary and could lead to some situations where a small rounding error would leave some salaries uncategorised.
You could simplify further using collections.Counter
:
from collections import Counter
def wealth_distribution(data, n_bins, max_salary):
bin_width = max_salary / n_bins
bins = Counter(min(int(float(val["salary"]) // bin_width), n_bins-1)
for val in data.values())
return [bins[i] for i in range(n_bins)]
There is a histogram
function in numpy
that also does what you want, and as a bonus provides an array of the bin boundaries.
import numpy as np
salaries = [float(val["salary"]) for val in data.values()]
sal_list, bin_list = np.histogram(salaries, bins=5, range=(0, 100))
And if you want to use pandas
... (might be useful for other operations on the same data)
import pandas as pd
def wealth_distribution(data, n_bins, max_salary):
df = pd.DataFrame(data).transpose()
bin_width = max_salary / n_bins
df["salary_bin"] = (pd.to_numeric(df["salary"]) // bin_width).clip(upper=n_bins-1)
counts = df["salary_bin"].value_counts()
return counts.reindex(range(n_bins), fill_value=0).values