
Sorry if the title is confusing; I'm new to pandas and tried to be as concise as possible. Basically I have a dataframe I'm reading in, and for each attribute in the dataframe I need to quantize the values to the nearest multiple of 2 by rounding. My approach is to turn them into bins with the ranges (-1.01, 1.00], (1.00, 3.00], ... and then, from the bins, find out how many values fall in each one, which tells me what the quantized data is. I can see the counts using value_counts(), but I want to be able to do something with the bins similar to df['Some_Attribute'].loc[df['Some_Attribute'] < 20]; if I replace 'Some_Attribute' with the bin column it errors.
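
To make "quantize to the nearest multiple of 2" concrete, here's a minimal sketch of the rounding I have in mind (the Age values come from the sample in my edit below; dividing by 2, rounding, and multiplying back is just one way to express it):

import pandas as pd

df = pd.DataFrame({'Age': [1.9, 2.0, 2.4, 5.9, 6.0, 6.4]})
# round each value to the nearest multiple of 2: 1.9 -> 2.0, 2.4 -> 2.0, 5.9 -> 6.0, ...
quantized = (df['Age'] / 2).round() * 2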

I've tried using value_counts() and then turning the result into a list to do it manually, but while I can get a list of the counts, it isn't sorted and I'm not sure how I'd know which value in the array corresponds to which range. I've also tried messing around with .loc[] and googling to see if maybe I got the syntax wrong, but I haven't been able to figure that out.
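
Roughly what that attempt looks like (a sketch using the sample data from my edit below and made-up edges for two_bins):

import pandas as pd

df = pd.DataFrame({'Age': [1.9, 2.0, 2.4, 5.9, 6.0, 6.4]})
two_bins = [-1, 1, 3, 5, 7]  # made-up edges just for illustration
counts = pd.cut(df['Age'], two_bins).value_counts()
# value_counts() orders by count rather than by bin, so once this becomes a
# plain list I can no longer tell which count belongs to which range
print(counts.tolist())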

Edit: To provide better context

Sample_Input:
Age
1.9
2.0
2.4
5.9
6.0
6.4

import pandas as pd

# attributes (column names) and two_bins (bin edges) are defined earlier in my script
df = pd.read_csv("Sample_Input.csv", names=attributes, header=0)
df['Age_Bins'] = pd.cut(df['Age'], two_bins)
df['Age_Bins'].loc[df['Age_Bins'] < 8.0]
df['Age_Bins'].loc[df['Age_Bins'] < 6.0]

If I run this I will get the error

TypeError: Invalid comparison between dtype=category and int

The output I would like from the last two lines is 6 and then 3, respectively. If I try this with a column that wasn't cut it works, so I'm assuming it's trying to compare against the actual ranges instead of the values within each range. Ideally I would like to find a way to get this working with .loc[], but if that's not possible, how do I get the counts into an array sorted by their ranges?
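
In other words, what I want is equivalent to this comparison on the quantized values (spelled out by hand here just to show where the 6 and 3 come from):

import pandas as pd

# the quantized values the six sample rows should map to
quantized = pd.Series([2.0, 2.0, 2.0, 6.0, 6.0, 6.0])
print(len(quantized.loc[quantized < 8.0]))  # 6
print(len(quantized.loc[quantized < 6.0]))  # 3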

dmb1o3
  • might be easier for people to comprehend if you provide a sample input along with expected output as well as the code you've tried so far and any error tracebacks – bn_ln Oct 27 '22 at 00:14

1 Answer


I might be wrong, but I think you're looking for a cumulative sum at each bin:

import pandas as pd
# sample you provided
df = pd.DataFrame({'age': [1.9, 2.0, 2.4, 5.9, 6.0, 6.4]})
# some bins to show how it works
pd.cut(df['age'], bins=[0, 2, 4, 6, 8], right=False).value_counts(sort=False).cumsum()

Output:

[0, 2)    1
[2, 4)    3
[4, 6)    4
[6, 8)    6
Name: age, dtype: int64

To cut into bins and then see how many values each bin has:

df['age_bins'] = pd.cut(df['age'], bins=[0, 2, 4, 6, 8], right=False)
df.groupby('age_bins').agg('count')

Output:

         age
age_bins    
[0, 2)   1
[2, 4)   2
[4, 6)   1
[6, 8)   2

Again, .cumsum() is applicable here:

df.groupby('age_bins').agg('count').cumsum()

Output:

         age
age_bins    
[0, 2)   1
[2, 4)   3
[4, 6)   4
[6, 8)   6
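
If you specifically want the .loc[] comparison from your edit to work, one option is to map each interval to its midpoint first. This is only a sketch: it assumes bin edges like [-1, 1, 3, 5, 7], so every bin is 2 wide and its midpoint is exactly the quantized value, and it assumes every row falls inside some bin (an out-of-range row would give NaN instead of an Interval):

import pandas as pd

df = pd.DataFrame({'age': [1.9, 2.0, 2.4, 5.9, 6.0, 6.4]})
age_bins = pd.cut(df['age'], bins=[-1, 1, 3, 5, 7])
# each element of age_bins is a pd.Interval; its .mid is the quantized value here
quantized = age_bins.apply(lambda interval: interval.mid)
print(len(quantized.loc[quantized < 8.0]))  # 6
print(len(quantized.loc[quantized < 6.0]))  # 3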
Nikita Shabankin