3

I'm relatively new to Pandas and after searching here for solutions I have not been able to find anything that translates well to my problem.

I have a DataFrame like this (df_transactions):

 col_type      col_ticker      col_transacted_value
 'buy'         'TSLA'          '100'                           
 'buy'         'TSLA'          '100'                           
 'buy'         'TSLA'          '100'                           
 'sell'        'TSLA'          '300'                              
 ...           ...             ...

And I would like to create two new columns, with the cumulative previous costs and cumulative costs.

To get the previous cost I have used:

df_transactions[col_prev_cost] = df_transactions.groupby(col_ticker)[col_transacted_value].shift().fillna(0)

but then realized this only finds the previous occurrence of that ticker and populate a new column with it´s transacted_costs.

My idea is too group by ticker and depending on the value of col_type populate a new column 'col_prev_costs', and with that column create the new column 'cml_cost'. Or if you see a simpler way - do share!

I want it to operate like this:


 col_type      col_ticker      col_transacted_value        New_col_prev_cost       New_col_cml_cost
 'buy'         'TSLA'          '100'                       '0'                     '100'
 'buy'         'TSLA'          '100'                       '100'                   '200'
 'buy'         'TSLA'          '100'                       '200'                   '300'
 'sell'        'TSLA'          '300'                       '300'                   '0'     
 ...           ...             ...                         ...                      ...

I'm open to all suggestions!

Thanks in advance! / Jakob


UPDATE: My first question was somewhat flawed so I will try to explain it again.

I have now discovered that I missed the correct handling of transactions with sell type, as this is dependent on even previous holdings of the ticker.

Let´s say I have the given dataframe:

import pandas as pd
import numpy as np

# Create the dataframe:
data = {'date'              : ['2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03', '2022-01-03', '2022-01-03', '2022-01-03', '2022-01-03', '2022-01-04'],
        'ticker'            : ['TSLA','TSLA', 'TSLA', 'TSLA', 'TSLA', 'TSLA', 'TSLA', 'AAPL', 'AAPL'],
        'type'              : ['buy', 'buy', 'split', 'sell', 'buy', 'sell', 'buy', 'buy', 'buy'],
        'units'             : [2, 2, 3, 2, 2, 2, 2, 2.5, 4],
        'price'             : [50.5, 50.5, 0, 52, 35, 52, 35, 50.5, 50.5],
        'fees'              : [4, 4, 0, 2.5, 3, 2.5, 3, 1.5, 2],
}

df = pd.DataFrame(data)
df
Out[2]: 
         date ticker   type  units  price  fees
0  2022-01-01   TSLA    buy    2.0   50.5   4.0
1  2022-01-02   TSLA    buy    2.0   50.5   4.0
2  2022-01-02   TSLA  split    3.0    0.0   0.0
3  2022-01-03   TSLA   sell    2.0   52.0   2.5
4  2022-01-03   TSLA    buy    2.0   35.0   3.0
5  2022-01-03   TSLA   sell    2.0   52.0   2.5
6  2022-01-03   TSLA    buy    2.0   35.0   3.0
7  2022-01-03   AAPL    buy    2.5   50.5   1.5
8  2022-01-04   AAPL    buy    4.0   50.5   2.0

Whereas I calculate the following columns:

# Create the column transaction value

def calc_transaction_value(type_transaction, price_unit, transacted_units, fees):
    """
    Calculate the transaction value from three columns in a dataframe depending on the value of type.
    """
    if type_transaction == 'buy':
        return price_unit * transacted_units + fees # i.e. the transacted cost
    if type_transaction == 'sell':
        return price_unit * transacted_units - fees # i.e. the gross income of capital
    else:
        return np.nan # If other return filler return NA (i.e. not available)

df['transacted_value'] = df.apply(lambda x: calc_transaction_value(x['type'], x['price'], x['units'], x['fees']), axis=1).fillna(0)

df
Out[3]: 
         date ticker   type  units  price  fees  transacted_value
0  2022-01-01   TSLA    buy    2.0   50.5   4.0            105.00
1  2022-01-02   TSLA    buy    2.0   50.5   4.0            105.00
2  2022-01-02   TSLA  split    3.0    0.0   0.0              0.00
3  2022-01-03   TSLA   sell    2.0   52.0   2.5            101.50
4  2022-01-03   TSLA    buy    2.0   35.0   3.0             73.00
5  2022-01-03   TSLA   sell    2.0   52.0   2.5            101.50
6  2022-01-03   TSLA    buy    2.0   35.0   3.0             73.00
7  2022-01-03   AAPL    buy    2.5   50.5   1.5            127.75
8  2022-01-04   AAPL    buy    4.0   50.5   2.0            204.00

# create the flow of units, depends on transaction type (buy or sell)
df["flow_units"] = df.apply(lambda x: -x["units"] if x["type"] == "sell" else x["units"], axis=1)

# Create the cml_units and prev_units column
df = df.groupby("ticker").apply(lambda grp: grp.assign(cml_units=grp["flow_units"].cumsum().abs(),
                                                           prev_units=grp["flow_units"].shift(1).cumsum().abs().fillna(0)))
df
Out[4]: 
         date ticker   type  units  price  fees  transacted_value  flow_units  \
0  2022-01-01   TSLA    buy    2.0   50.5   4.0            105.00         2.0   
1  2022-01-02   TSLA    buy    2.0   50.5   4.0            105.00         2.0   
2  2022-01-02   TSLA  split    3.0    0.0   0.0              0.00         3.0   
3  2022-01-03   TSLA   sell    2.0   52.0   2.5            101.50        -2.0   
4  2022-01-03   TSLA    buy    2.0   35.0   3.0             73.00         2.0   
5  2022-01-03   TSLA   sell    2.0   52.0   2.5            101.50        -2.0   
6  2022-01-03   TSLA    buy    2.0   35.0   3.0             73.00         2.0   
7  2022-01-03   AAPL    buy    2.5   50.5   1.5            127.75         2.5   
8  2022-01-04   AAPL    buy    4.0   50.5   2.0            204.00         4.0   

   cml_units  prev_units  
0        2.0         0.0  
1        4.0         2.0  
2        7.0         4.0  
3        5.0         7.0  
4        7.0         5.0  
5        5.0         7.0  
6        7.0         5.0  
7        2.5         0.0  
8        6.5         2.5  

Where this is from the help I got from the previous comment @SRawson.

But here I would like to create the columns prev_costs, cml_costs and cost_transaction. The logic is as follows:

I want to for each row, get the ticker from column "ticker", and:

  • if transaction type value is equal to "buy" in column "type" then compute the operation "prev_costs + transacted_value" and populate in column "cml_costs".
    • Where "prev_costs" for that row is "cml_costs" but shifted one row up (if grouped by ticker).
  • if transaction type value is equal to "sell" then compute the operation "prev_costs - cost_transaction" and populate in "cml_costs".
    • Where "prev_costs" for that row is "cml_costs" but shifted one row up (if grouped by ticker).
    • And where "cost_transaction" is the result of operation (for that row): (units/cml_units) * prev_costs
  • else return "prev_costs" and populate in "cml_costs".

I expect the dataframe:

data_2 = {'prev_costs'           : [0, 105, 210, 210, 150, 223, 159.29, 0, 127.75],
          'cml_costs'            : [105, 210, 210, 150, 223, 159.29, 232.29, 127.75, 331.75],
          'cost_transaction'     : [0, 0, 0, 60, 0, 63.71, 0, 0, 0],
}

df2 = pd.DataFrame(data_2)

df_expected = pd.concat([df, df2], axis=1, join='inner')

df_expected
Out[5]: 
         date ticker   type  units  price  fees  transacted_value  flow_units  \
0  2022-01-01   TSLA    buy    2.0   50.5   4.0            105.00         2.0   
1  2022-01-02   TSLA    buy    2.0   50.5   4.0            105.00         2.0   
2  2022-01-02   TSLA  split    3.0    0.0   0.0              0.00         3.0   
3  2022-01-03   TSLA   sell    2.0   52.0   2.5            101.50        -2.0   
4  2022-01-03   TSLA    buy    2.0   35.0   3.0             73.00         2.0   
5  2022-01-03   TSLA   sell    2.0   52.0   2.5            101.50        -2.0   
6  2022-01-03   TSLA    buy    2.0   35.0   3.0             73.00         2.0   
7  2022-01-03   AAPL    buy    2.5   50.5   1.5            127.75         2.5   
8  2022-01-04   AAPL    buy    4.0   50.5   2.0            204.00         4.0   

   cml_units  prev_units  prev_costs  cml_costs  cost_transaction  
0        2.0         0.0        0.00     105.00              0.00  
1        4.0         2.0      105.00     210.00              0.00  
2        7.0         4.0      210.00     210.00              0.00  
3        5.0         7.0      210.00     150.00             60.00  
4        7.0         5.0      150.00     223.00              0.00  
5        5.0         7.0      223.00     159.29             63.71  
6        7.0         5.0      159.29     232.29              0.00  
7        2.5         0.0        0.00     127.75              0.00  
8        6.5         2.5      127.75     331.75              0.00  

I was not sure if it is okey to "update" a question in such a major way, but hopes it is. Please comment if this is wrong from my part.

I have made this work in google sheets, but as I said in the first question I'm quite new to pandas and in desperate need for help again.

Thanks in advance (again)!

  • Can you explain the logic for calculating `New_col_prev_cost `? – Shubham Sharma Mar 27 '22 at 15:00
  • Yes absolutely! Calculation of New_col_prev_cost should: for each row, find the same ticker (in the ticker column) and depending on the transaction type in col_type sum or subtract the value in col_transacted_value (i.e we get the cumulative sum of the buy transaction subtracted by the cumulative sum of the sell transaction from the prev entry). Then populate a new column named New_col_prev_cost. – Jakob Rydberg Mar 28 '22 at 09:37
  • There might be another solution by calculating new_col_cml_cost first by, finding the cumsum by ticker of the col_transacted_value by col_type 'buy' and then by sell and calculate the difference and populate the New_col_cml_cost column. Then use that to create the New_col_prev_cost column from col_transacted_value and New_col_cml_cost. Hope I did explain it somewhat okey. – Jakob Rydberg Mar 28 '22 at 09:47
  • 1
    So the 'New_col_prev_cost' should show me the transacted cumulative value of that from previous entry of that ticker :) @ShubhamSharma – Jakob Rydberg Mar 28 '22 at 10:14

1 Answers1

1

Here is one way to do it:

import pandas as pd

df = pd.DataFrame(data={"col_type": ["buy", "buy", "buy", "sell", "buy", "sell", "buy"],
                        "col_ticker": ["TSLA", "TSLA", "TSLA", "TSLA", "AAPL", "AAPL", "AAPL"],
                        "col_transacted_value": ["100", "100", "100", "300", "150", "200", "100"]})

df["col_transacted_value"] = df["col_transacted_value"].astype(int)
df["col_transacted_value_signs"] = df.apply(lambda x: -x["col_transacted_value"] if x["col_type"] == "sell" else x["col_transacted_value"], axis=1)
df = df.groupby("col_ticker").apply(lambda grp: grp.assign(New_col_cml_cost=grp["col_transacted_value_signs"].cumsum(),
                                                           New_col_prev_cost=grp["col_transacted_value_signs"].shift(1).cumsum().fillna(0)))

The explanation:

  • Convert your string values to integers (else floats if you will not always have integers).
  • Create a new column with -ve values for "sell" and +ve for "buy" (you could overwrite the col_transacted_value column if you don't need all items to be +ve.
  • Finally, group by col_ticker and assign the new columns using apply lambda.

I took inspiration for the groupby apply assign idea from this comment from krassowski.

Output (TSLA is like your example):

#Out: 
#             col_type col_ticker  ...  New_col_cml_cost  New_col_prev_cost
#col_ticker                        ...                                     
#AAPL       4      buy       AAPL  ...               150                0.0
#           5     sell       AAPL  ...               -50              150.0
#           6      buy       AAPL  ...                50              -50.0
#TSLA       0      buy       TSLA  ...               100                0.0
#           1      buy       TSLA  ...               200              100.0
#           2      buy       TSLA  ...               300              200.0
#           3     sell       TSLA  ...                 0              300.0
#
#[7 rows x 6 columns]
Rawson
  • 2,637
  • 1
  • 5
  • 14
  • Thanks for the comprehensive answer and solution @SRawson. I really only have one thought and that is if it is common practice to create "temporary columns", i.e. columns I don´t have a direct use of (e.g. the column for "col_transacted_value_signs") and is it problematic to do this and thus should it be avoided? Thanks again! – Jakob Rydberg Mar 31 '22 at 18:33
  • 1
    I'm glad it works for you. That's a very interesting question, one I haven't ever thought of or seen a discussion on before. I don't think it is too common to create a single column that is not used much, I tend to see new dataframes created more frequently, and already-existing columns changed. I only think it would be a problem if you were returning the dataframe to be viewed by a user, and even then it would be easy to exclude the column. However, I don't think I am the most conventional user of Python, so don't take my word as final! – Rawson Apr 01 '22 at 19:46
  • 1
    Was there a problem with the code that caused you to "unaccept" this solution? – Rawson Apr 06 '22 at 20:21
  • Your answer was excellent, nothing wrong at all. When I updated the question (which I'm a little unsure if it's allowed) I was unsure if I should keep the label as accepted due to it is linked to it. I'm terribly sorry and was not meant as an insult, as I was unsure what the custom was if the question is updated. For some reason I tought it was meant for labelling a question as "complete", but of course I change back and thank you for the clarification of the label. :) – Jakob Rydberg Apr 07 '22 at 21:39
  • No problem at all. I just wanted to make sure something hadn't suddenly gone wrong! – Rawson Apr 08 '22 at 05:35