Python: how to split one variable according to bins from another variable?

Question

I want to create an histogram of variable V based on the bins of variable X. To do this, I read an Excel file which looks like this:

Column X   Column V
99.9       0
100.0      3
25.17      2
39.45      1
66.52      1
17.17      6
9.25       2
86.11      3
84.09      3

For each bin of the variable X I want to compute the average of the V values associated with it. For example:

X bin: 0-30 -> avg(V)=(2+6+2)/3=3.33
X bin: 31-80 -> avg(V)=(1+1)/2=1.00
X bin: 81-100 -> avg(V)=(3+3+0+3)/4=2.25

So that I come up with:

X bin    avg(V)
0-30     3.33
31-80    1.00
81-100   2.25

To do this I have written the following code block, in which I use some lists to gather all the V values which fall within each X bins (binwidth=10).

EDIT

I have a problem with the lengths of my lists. With an excel file of 1000 rows, for instance, where there is only 1 V value belonging to the bin 41-50. However, if I enter len(islands_4150)=999. Where does the code get the other 998 values?

from openpyxl import load_workbook
wb = load_workbook(filename = 'myfile.xlsx')
ws=wb.active
cell_range_1 = ws['X2':'X1001']
cell_range_2 = ws['V2':'V1001']
cf_list=[] #List with X values
island_list=[] #List with V values
for row in range(2,1001): 
    for column in 'X':
        cell_name_1="{}{}".format(column, row) #X
        cf_list.append(ws[cell_name_1].value)
        x=map(lambda x: int(x) if x%1==0 else x, cf_list)
        for column in 'V':
            cell_name_2="{}{}".format(column, row) #V
            island_list.append(ws[cell_name_2].value)
            v=map(lambda x: int(x) if x%1==0 else x, island_list)
islands_010=[] #List with values from column V which corresponding values from column X are 0<=value<=10
islands_1120=[]
islands_2130=[]
islands_3140=[]
islands_4150=[]
islands_5160=[]
islands_6170=[]
islands_7180=[]
islands_8190=[]
islands_91100=[]
for i, val in enumerate(x):
    for j, elem in enumerate(v):
        if x[i]>=0 and x[i]<=10:
            islands_010.append(v[i])
        elif x[i]>=11 and x[i]<=20:
            islands_1120.append(v[i])
        elif x[i]>=21 and x[i]<=30:
            islands_2130.append(v[i])
        elif x[i]>=31 and x[i]<=40:
            islands_3140.append(v[i])
        elif x[i]>=41 and x[i]<=50:
            islands_4150.append(v[i])
        elif x[i]>=51 and x[i]<=60:
            islands_5160.append(v[i])
        elif x[i]>=61 and x[i]<=70:
            islands_6170.append(v[i])
        elif x[i]>=71 and x[i]<=80:
            islands_7180.append(v[i])
        elif x[i]>=81 and x[i]<=90:
            islands_8190.append(v[i])
        elif x[i]>=91 and x[i]<=100:
            islands_91100.append(v[i])

if len(islands_010)==0:
    print ('Avg islands 0-10: 0') 
else:
    avg010=round(reduce(lambda x, y: x + y, islands_010) / len(islands_010),3)
    print ('Avg islands 0-10: '+str(avg010))

if len(islands_1120)==0:
    print ('Avg islands 11-20: 0') 
else:
    avg1120=round(reduce(lambda x, y: x + y, islands_1120) / len(islands_1120),3)
    print ('Avg islands 11-20: '+str(avg1120))

if len(islands_2130)==0:
    print ('Avg islands 21-30: 0')
else:
    avg2130=round(reduce(lambda x, y: x + y, islands_2130) / len(islands_2130),3)
    print ('Avg islands 21-30: '+str(avg2130))

if len(islands_3140)==0:
    print ('Avg islands 31-40: 0')
else:
    avg3140=round(reduce(lambda x, y: x + y, islands_3140) / len(islands_3140),3)
    print ('Avg islands 31-40: '+str(avg3140))

if len(islands_4150)==0:
    print ('Avg islands 41-50: 0')
else:
    avg4150=round(reduce(lambda x, y: x + y, islands_4150) / len(islands_4150),3)
    print ('Avg islands 41-50: '+str(avg4150))

if len(islands_5160)==0:
    print ('Avg islands 51-60: 0')
else:
    avg5160=round(reduce(lambda x, y: x + y, islands_5160) / len(islands_5160),3)
    print ('Avg islands 51-60: '+str(avg5160))

if len(islands_6170)==0:
    print ('Avg islands 61-70: 0')
else:
    avg6170=round(reduce(lambda x, y: x + y, islands_6170) / len(islands_6170),3)
    print ('Avg islands 61-70: '+str(avg6170))

if len(islands_7180)==0:
    print ('Avg islands 71-80: 0')
else:
    avg7180=round(reduce(lambda x, y: x + y, islands_7180) / len(islands_7180),3)
    print ('Avg islands 71-80: '+str(avg7180))

if len(islands_8190)==0:
    print ('Avg islands 81-90: 0')
else:
    avg8190=round(reduce(lambda x, y: x + y, islands_8190) / len(islands_8190),3)
    print ('Avg islands 81-90: '+str(avg8190))

if len(islands_91100)==0:
    print ('Avg islands 91-100: 0')
else:
    avg91100=round(reduce(lambda x, y: x + y, islands_91100) / len(islands_91100),3)
    print ('Avg islands 91-100: '+str(avg91100))

I think you meant for column in cell_range_1 instead of for column in 'X'. You're iterating over the string 'X' which contains a single character. Same goes for 'Y'. — Brian Schlenker, Apr 05 '16 at 15:54
No this is not the problem. It works as this is what `openpyxl` suggests to do. Plus I don't get any error for that. — FaCoffee, Apr 05 '16 at 15:59
It might not be the only problem, but it certainly is **a** problem. The line `for column in 'X':` really does mean that you iterate over the 1-character string `'X'`. So the loop executes one time, and during that one iteration, the `column` variable has the value `'X'`. I don't mean the Excel column X, I mean **the letter 'X'**. This is fundamental Python syntax, it doesn't matter whether you are talking about `openpyxl` or anything else. Obviously, you have the same thing with your `for column in 'V':`. — John Y, Apr 05 '16 at 16:08
So how do you suggest to amend the code? Because if I enter `for column in cell_range_1` I get an error: `CellCoordinatesException: Invalid cell coordinates`. — FaCoffee, Apr 05 '16 at 16:12
Honestly, judging by your code, it think you would be best served by really investing the time to learn Python better, to really understand its data structures (such as lists) and its looping. You're throwing a bunch of stuff together without understanding it and just hoping it works. There are a ton of free, on-line tutorials. I personally learned the Python language just by reading the tutorial that is included as part of the standard documentation. — John Y, Apr 05 '16 at 16:41

score 2 · Accepted Answer · answered Apr 05 '16 at 17:03

Your code is, as it stands, rather poorly constructed, and this is clouding the problem.

The first problem is whitespace. You need some.

The next is the with the lines for column in 'X': and for column in 'V':. These two for loops are useless and they can be replaced with:

cell_name_1="X{}".format(row) #X variable
cell_name_2="V{}".format(row) #V variable

Furthermore, I would recommend grabbing the cell values, and then doing all of your comparisons:

x_val =  float(ws[cell_name_1].value)
v_val =  int(ws[cell_name_2].value)

Ranges in python are inclusive on the first number, and exclusive on the last. Therefore the range in your first loop should go to 1002 such that the last row is 1001.

for row in range(2, 1002):

I would recommend using ws = WB.get_sheet_by_name("sheet_name") to retrieve the sheet, rather than ws=wb.active to ensure you consistently get the sheet you want.

Finally, we get to the actual problem. Your current method is reading directly from excel straight into bins. What you should do is read all the data from excel, and then manipulate it to produce the bins you want. The first step is to get the data into a python structure which makes your life easiest, I recommend a list of tuples:

islands.append((x_val,v_val))

This will produce something like this:

[(99.9, 0), (100.0, 3), (25.17, 2), (39.45, 1), (66.52, 1), (17.17, 6), (9.25, 2), (86.11, 3), (84.09, 3)]

Now we should sort the data by the column X value:

islands.sort(key = lambda x: x[0])

To produce:

[(9.25, 2), (17.17, 6), (25.17, 2), (39.45, 1), (66.52, 1), (84.09, 3), (86.11, 3), (99.9, 0), (100.0, 3)]

Now that our data is sorted, we can easily produce a dictionary of values defined by the max value of each bin:

bins = [30, 80, 100]
binned_data = {key: [] for key in bins}
for item in islands:
    for bin in bins:
        if item[0] <= bin:
            binned_data[bin].apppend(item[1])
            break

This results in a dictionary like so:

{80: [1, 1], 100: [0, 3, 3, 3], 30: [2, 6, 2]}

From here you can trivially compute the averages with

averages = {bin: sum(binned_data[bin])/float(len(binned_data[bin])) for bin in binned_data}

Putting it all together:

from openpyxl import load_workbook

wb = load_workbook(filename = 'myfile.xlsx')
ws = wb.get_sheet_by_name("sheet_name")

islands = []

for row in range(2,1002):
    cell_name_1="X{}".format(row) #X variable
    cell_name_2="V{}".format(row) #V variable

    x_val =  float(ws[cell_name_1].value)
    v_val =  int(ws[cell_name_2].value)

    islands.append((x_val,v_val))

islands.sort(key = lambda x: x[0])

bins = [30, 80, 100]
binned_data = {key: [] for key in bins}

for item in islands:
    for bin in bins:
        if item[0] <= bin:
            binned_data[bin].apppend(item[1])
            break

averages = {bin: sum(binned_data[bin])/float(len(binned_data[bin])) for bin in binned_data}

Now this is a very exhaustive answer. Thank you very much for your effort, really appreciated. The only inconvenient is when `float(len(binned_data[bin]))=0`. — FaCoffee, Apr 05 '16 at 17:07
Sure, I didn't give you a perfect piece of code, but you should be able to make your own error handling for that case — wnnmaw, Apr 05 '16 at 17:15
This is a dictionary comprehension (similar to a [list comprehension](https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions)), and yes, it is simply a shorthand. If you are not familiar with these you really should read the linked doc page or consider going through a tutorial as John Y suggested — wnnmaw, Apr 05 '16 at 17:28

Python: how to split one variable according to bins from another variable?

1 Answers1