2

I have input data as given below

Date        Investment Type                                    Medium
1/1/2000    Mutual Fund, Stocks, Fixed Deposit, Real Estate    Own, Online,Through Agent
1/2/2000    Mutual Fund, Stocks, Real Estate                   Own
1/3/2000    Fixed Deposit                                      Online
1/3/2000    Mutual Fund, Fixed Deposit, Real Estate            Through Agent
1/2/2000    Stocks                                             Own, Online,                            Through Agent

The input to my function is Medium. It could be a single value of a list. I want to search the data based on Medium input and then aggregate the data as given below. For the values in Medium, checkout what the Investment types and then aggregate data for each Investment type

Medium                                Investment Type           Date
Own,Online                            Mutual Fund               1/1/2000,1/2/2000 
Own,Online                            Stocks                    1/1/2000,1/2/2000
Own,Online                            Fixed Deposit             1/1/2000,1/3/2000
Own,Online                            Real Estate               1/1/2000
Sid
  • 552
  • 6
  • 21

1 Answers1

2

You can use:

L = ['Online','Own']
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df['New_Medium'] = df.pop('Medium').str.findall('('+ pat + ')').str.join(', ')
#remove rows with empty values
df = df[df['New_Medium'].astype(bool)]

from  itertools import product
df1 = pd.DataFrame([j for i in df.apply(lambda x: x.str.split(',\s*')).values 
                      for j in product(*i)], columns=df.columns)
print (df1)
        Date Investment Type New_Medium
0   1/1/2000     Mutual Fund        Own
1   1/1/2000     Mutual Fund     Online
2   1/1/2000          Stocks        Own
3   1/1/2000          Stocks     Online
4   1/1/2000   Fixed Deposit        Own
5   1/1/2000   Fixed Deposit     Online
6   1/1/2000     Real Estate        Own
7   1/1/2000     Real Estate     Online
8   1/2/2000     Mutual Fund        Own
9   1/2/2000          Stocks        Own
10  1/2/2000     Real Estate        Own
11  1/3/2000   Fixed Deposit     Online
12  1/2/2000          Stocks        Own
13  1/2/2000          Stocks     Online

#get all combinations and aggregate join by unique values
df = df1.groupby('Investment Type').agg(lambda x: ', '.join(x.unique())).reset_index()
print (df)
  Investment Type                Date   New_Medium
0   Fixed Deposit  1/1/2000, 1/3/2000  Own, Online
1     Mutual Fund  1/1/2000, 1/2/2000  Own, Online
2     Real Estate  1/1/2000, 1/2/2000  Own, Online
3          Stocks  1/1/2000, 1/2/2000  Own, Online
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This solution worked well...I am new to Pandas and Python, so thank you for your patience and guidance. – Sid Sep 07 '18 at 05:36
  • By using pop we are taking out the original medium column, is there a way I can retain it? I want to have the original medium column and new medium column shown in the aggregate result – Sid Sep 07 '18 at 05:50
  • you can use `df.pop('Medium')` to `df['Medium']` – Pyd Sep 07 '18 at 07:19
  • Hi @Jezrael, I have run into space issue using your recommendation of using Cartesian product. The number of columns have been added to the input and so the output of Cartesian product is overwhelming. Is there any other solution you recommend? Thanks – Sid Jan 16 '19 at 11:55
  • @Sid - Yes, it create many values if larger data, unfortunately no another idea :( – jezrael Jan 16 '19 at 12:00
  • 1
    ok, I am working on using loops. If it works, I will post the solution here. Thanks for your help. – Sid Jan 16 '19 at 12:06