-1

I am a bit new to Python and I have the following requirement.

Value         Date  Ticket  Source  Result
0.939531371 3/1/2018    T1  Source1  1
0.951619897 3/1/2018    T2  Source1  1
0.952681914 3/1/2018    T3  Source1  0
0.957009407 3/1/2018    T4  Source2  1
0.962669466 3/1/2018    T5  Source2  0 
0.963068552 3/1/2018    T6  Source3  1
0.963480195 3/1/2018    T7  Source4  1
0.951296258 3/2/2018    T11 Source5  1
0.962434762 3/2/2018    T12 Source5  1
0.950224149 3/2/2018    T13 Source5  1
0.961191873 3/2/2018    T14 Source5  1
0.952584896 3/2/2018    T15 Source5  0
0.962093595 3/2/2018    T16 Source5  1
0.975999737 3/2/2018    T17 Source5  1

From this data frame I wanted to generate the below dataframe-

Date      Source1 Source2 Source3 Source4 Source5 Overall_Result
3/1/2018   0.93     0.95   0.96     0.96      0        0.7
3/2/2018     0        0      0        0      0.95      0.85

The calculation involved here is of weighted average-

Source1 - For a given Date and for a given Source, we need to obtain the total number of tickets. For Source1, there are 3 records (tickets) on a given date(3/1/2018). The "Value" column for these tickets have to be sorted in ascending order. Then based on the count of tickets, the highest weightage has to be given to the least "Value"

0.945=(0.939531370722655*3) +(0.951619897215127*2)+(1*0.952681914218488)/3+2+1

Overall Result column is calculated as for a given date,

how many were 1s divided by overall count of tickets for that date Date - 3/1/2018 => 1+1+0+1+0+1+1+0+1+1/10=0.66

I have huge amount of data for which these calculations has to be done. The number of Source column values can also be immense. In modified dataframe, I want it as a column. One way to do is to write the logic in a function and call upon every record. Any suggestions or help are welcome. Thanks in advance.

2 Answers2

1

You can use pivot_table with a custom aggregation function to get the first columns. And groupby to add the 'Result' column.

import numpy as np
import pandas as pd

df2 = df.sort_values('Value').pivot_table(
        index='Date', 
        columns='Source',
        values='Value',
        aggfunc = lambda x: (x*np.arange(len(x), 0, -1)).sum()/np.arange(len(x), 0, -1).cumsum()[-1]).fillna(0)

df2['Result'] = df.groupby('Date').Result.apply(lambda x: x.sum()/np.size(x))

Output:

Source     Source1   Source2   Source3  Source4   Source5    Result
Date                                                               
3/1/2018  0.945753  0.958896  0.963069  0.96348  0.000000  0.714286
3/2/2018  0.000000  0.000000  0.000000  0.00000  0.955507  0.857143
ALollz
  • 57,915
  • 7
  • 66
  • 89
0

You may have forgotten to use brackets in your weighted average calculation :

> 0.93=[(0.939531370722655*3) +(0.951619897215127*2)+(1*0.952681914218488)]/(3+2+1)

Also, try using a pivot_table on the "Value" with a custom aggregation function :

def func(series) : 
  s = series.sort_values().reset_index(drop=True).reset_index()
  return s.apply(lambda x : (len(s) - x["index"]) * x["Value"] /sum(np.arange(1, len(s) + 1)), axis=1).sum()

The function above computes the weighted average on a pandas series:

  • First you sort the values and reset the index with the sorted order.
  • Then you compute the weighted average.

This aggregation function is then called as follows:

df1 = df.pivot_table(index="Date", columns="Source", aggfunc={"values" : func})

Which returns :

+----------+----------+----------+----------+---------+----------+
|          |  Value   |          |          |         |          |
+----------+----------+----------+----------+---------+----------+
| Source   | Source1  | Source2  | Source3  | Source4 | Source5  |
| Date     |          |          |          |         |          |
+----------+----------+----------+----------+---------+----------+
| 3/1/2018 | 0.945753 | 0.958896 | 0.963069 | 0.96348 | NaN      |
| 3/2/2018 | NaN      | NaN      | NaN      | NaN     | 0.955507 |
+----------+----------+----------+----------+---------+----------+

Then for the Overall_Result :

df2 = df.pivot_table(index="Date",  values="Result", aggfunc="mean")

Returns

+----------+----------+
|          |  Result  |
+----------+----------+
| Date     |          |
| 3/1/2018 | 0.714286 |
| 3/2/2018 | 0.857143 |
+----------+----------+

Finally you can concatenate your two dataFrame to get the desired dataframe :

df1.columns = df1.columns.droplevel()
df2.columns = ["Overall_Result" ]
dfResult = pd.concat([df1, df2], axis=1)

i.e.

+----------+----------------+----------+----------+----------+---------+----------+
|          | Overall_Result | Source1  | Source2  | Source3  | Source4 | Source5  |
+----------+----------------+----------+----------+----------+---------+----------+
| Date     |                |          |          |          |         |          |
| 3/1/2018 |       0.714286 | 0.945753 | 0.958896 | 0.963069 | 0.96348 | NaN      |
| 3/2/2018 |       0.857143 | NaN      | NaN      | NaN      | NaN     | 0.955507 |
+----------+----------------+----------+----------+----------+---------+----------+
Grant Shannon
  • 4,709
  • 1
  • 46
  • 36
Carole Mai
  • 46
  • 4