I am working with a data set from SQL currently -

import pandas as pd
df = spark.sql("select * from donor_counts_2015")
df_info = df.toPandas()
print(df_info)

The output is a table of fund names and donation counts (I can't include the actual output for privacy reasons).

It's a data set that has the name of a fund and the number of people who have donated to that fund. What I am trying to do now is calculate what percent of funds have only 1 donation, what percent have 2, 3, 4, etc. I am wondering if there is an easy way to do this with pandas? I would also like to see the percentage for a range of funds, like what percentage of funds have between 50-100 donations, 500-1000, etc. Thanks!

2 Answers

You can make a histogram of the donation counts to visualize the distribution; np.histogram might help. Alternatively, you can sort the data and count manually.
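
A minimal sketch of the np.histogram approach on made-up toy data (the column name number_of_donations is assumed from the question; the real data comes from the Spark query):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real donor_counts_2015 data
df = pd.DataFrame({"number_of_donations": [1, 1, 2, 3, 60, 75, 600]})

# Count how many funds fall into each donation-count range
counts, edges = np.histogram(df["number_of_donations"], bins=[0, 50, 100, 1000])

# Convert the per-bin counts to percentages of all funds
percents = counts / counts.sum() * 100
for lo, hi, p in zip(edges[:-1], edges[1:], percents):
    print(f"{lo}-{hi}: {p:.1f}%")
```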

etudiant

For the first task, to get the percentage for each value in the column 'number_of_donations', you can do:

df['number_of_donations'].value_counts(normalize=True) * 100
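
For example, on a toy frame (the column name is taken from the answer; the data is made up):

```python
import pandas as pd

# Toy data: five funds with their donation counts
df = pd.DataFrame({"number_of_donations": [1, 1, 1, 2, 5]})

# Share of funds at each donation count, as percentages
pct = df["number_of_donations"].value_counts(normalize=True) * 100
print(pct)  # 60% of funds have 1 donation; 20% have 2; 20% have 5
```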

For the second task, you need to create a new column with categories and then do the same:

# Create a Series with categories
New_Series = pd.cut(df.number_of_donations, bins=[0, 100, 200, 500, 99999999], labels=['Few', 'Medium', 'Many', 'Too Many'])
# Change the name of the column (must be a string)
New_Series.name = 'Category'
# Concat df and New_Series
df = pd.concat([df, New_Series], axis=1)
# Get the percentage of the categories
df['Category'].value_counts(normalize=True) * 100
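
A self-contained sketch of this approach on toy data, assuming the same bins and labels (note that the Series name must be the quoted string 'Category'):

```python
import pandas as pd

# Toy data: five funds with their donation counts
df = pd.DataFrame({"number_of_donations": [5, 80, 150, 300, 700]})

# Bucket each fund by its donation count
category = pd.cut(df["number_of_donations"],
                  bins=[0, 100, 200, 500, 99999999],
                  labels=["Few", "Medium", "Many", "Too Many"])
category.name = "Category"

# Attach the bucket column and compute the percentage per bucket
df = pd.concat([df, category], axis=1)
print(df["Category"].value_counts(normalize=True) * 100)
```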
marc_s