
Assume I have two rows where the values are the same for most of the columns, but not all. I would like to merge these two rows into one wherever the values are the same; where the values differ, create an extra column (named e.g. 'column1') to hold the second value.

Step 1: Assume the columns with the same value in both rows are 'a','b','c' and the columns with different values are 'd','e','f'. I group by 'a','b','c' and then unstack 'd','e','f'.

Step 2: Then I drop the extra column level and rename the columns to 'a','b','c','d','d1','e','e1','f','f1'.

But in my actual case I have 500+ columns and a million rows, and I don't know how to extend this to 500+ columns, because of two constraints: 1) I don't know which columns will have the same values, and 2) I don't know which columns will have different values that need to be turned into new columns after grouping by the columns with the same values.

df1 = df.groupby(['a','b','c'])[['d','e','f']].apply(lambda x: pd.DataFrame(x.values)).unstack().reset_index()

df1.columns = df1.columns.droplevel()
df1.columns = ['a','b','c','d','d1','e','e1','f','f1']
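
For reference, the two steps above can be run end to end on a small made-up frame (the column names are as in the question; the data values are hypothetical):

```python
import pandas as pd

# hypothetical data: 'a', 'b', 'c' are constant within the pair of rows,
# while 'd', 'e', 'f' differ between them
df = pd.DataFrame({'a': [100, 100], 'b': ['M', 'M'], 'c': ['x', 'x'],
                   'd': ['ok', 'not ok'], 'e': ['p', 'q'], 'f': [1, 2]})

# Step 1: group by the constant columns, turn each group's varying columns
# into a frame with one row per original row, then unstack those rows
df1 = (df.groupby(['a', 'b', 'c'])[['d', 'e', 'f']]
         .apply(lambda x: pd.DataFrame(x.values))
         .unstack()
         .reset_index())

# Step 2: flatten the MultiIndex columns by renaming them outright
df1.columns = ['a', 'b', 'c', 'd', 'd1', 'e', 'e1', 'f', 'f1']
print(df1)
```

This yields a single row in which 'd', 'e', 'f' each get a suffixed twin holding the second row's value.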


To be clearer, the code below creates the sample dataframe and the expected output:

df = pd.DataFrame({'Cust_id': [100, 100, 101, 101, 102, 103, 104, 104],
                   'gender': ['M', 'M', 'F', 'F', 'M', 'F', 'F', 'F'],
                   'Date': ['01/01/2019', '02/01/2019', '01/01/2019', '01/01/2019',
                            '03/01/2019', '04/01/2019', '03/01/2019', '03/01/2019'],
                   'Product': ['a', 'a', 'b', 'c', 'd', 'd', 'e', 'e']})



expected_output = pd.DataFrame({'Cust_id': [100, 101, 102, 103, 104],
                                'gender': ['M', 'F', 'M', 'F', 'F'],
                                'Date': ['01/01/2019', '01/01/2019', '03/01/2019',
                                         '04/01/2019', '03/01/2019'],
                                'Date1': ['02/01/2019', 'NA', 'NA', 'NA', 'NA'],
                                'Product': ['a', 'b', 'd', 'd', 'e'],
                                'Product1': ['NA', 'c', 'NA', 'NA', 'NA']})
Bala
  • If you don't know which columns are supposed to be grouped, it is impossible to do a groupby and turn the rest of the columns into rows. – Andy L. Aug 20 '19 at 02:35
  • Alright, thanks for the reply Andy. The column 'a' is more like a customer ID. For that customer ID there are multiple transactions, and each transaction has multiple columns like sex, date, price, product, etc. Basically, when the customer buys multiple products, each product creates a new row. I need to group by 'Customer ID', but since I have multiple transaction dates for the same 'Customer ID' it's a problem; not all customers have transactions on multiple dates. – Bala Aug 20 '19 at 02:40
  • I would like to group all the cols that has same value for that particular 'Customer ID' and then create new columns for the cols with different values example: Product,Product1,etc – Bala Aug 20 '19 at 02:44
  • could you provide a sample data and desired output for what you explained in the comment? – Andy L. Aug 20 '19 at 04:24
  • df = pd.DataFrame({'Cust_id':[100,100, 101,101,102,103,104,104], 'gender':['M', 'M', 'F','F','M','F','F','F'], 'Date':['01/01/2019', '02/01/2019','01/01/2019', '01/01/2019','03/01/2019','04/01/2019','03/01/2019','03/01/2019'], 'Product': ['a','a','b','c','d','d', 'e','e']}) – Bala Aug 20 '19 at 05:29
  • expected output = pd.DataFrame({'Cust_id':[100, 101,102,103,104], 'gender':['M', 'F','M','F','F'], 'Date':['01/01/2019','01/01/2019','03/01/2019','04/01/2019', '03/01/2019'], 'Date1': ['02/01/2019', 'NA','NA','NA','NA'] , 'Product': ['a', 'b', 'd', 'd','e'], 'Product1':['NA', 'c','NA','NA','NA' ]}) – Bala Aug 20 '19 at 05:31
  • Based on your desired output, I came up with a solution. Please check my answer. – Andy L. Aug 20 '19 at 07:33

2 Answers


You may do the following to get `expected_output` from `df`:

# suffix per occurrence within each Cust_id: '' for the first row, '1' for the second, ...
s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
# flatten the (value, suffix) MultiIndex columns: ('Date', '1') -> 'Date1'
df1.columns = df1.columns.map(''.join)

Out[57]:
                      Date       Date1 Product Product1
Cust_id gender
100     M       01/01/2019  02/01/2019       a        a
101     F       01/01/2019  01/01/2019       b        c
102     M       03/01/2019         NaN       d      NaN
103     F       04/01/2019         NaN       d      NaN
104     F       03/01/2019  03/01/2019       e        e

Next, replace values that duplicate the column to their left with 'NA':

df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()

Out[72]:
   Cust_id gender        Date       Date1 Product Product1
0      100      M  01/01/2019  02/01/2019       a       NA
1      101      F  01/01/2019          NA       b        c
2      102      M  03/01/2019          NA       d       NA
3      103      F  04/01/2019          NA       d       NA
4      104      F  03/01/2019          NA       e       NA
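
Since the question says there are 500+ columns and it isn't known in advance which ones vary, the same idea can be generalized by pivoting every non-key column and then discarding expanded columns that only repeat the first occurrence. A sketch, assuming a single key column `Cust_id` and that the original column names don't end in digits:

```python
import pandas as pd

df = pd.DataFrame({'Cust_id': [100, 100, 101, 101, 102, 103, 104, 104],
                   'gender': ['M', 'M', 'F', 'F', 'M', 'F', 'F', 'F'],
                   'Date': ['01/01/2019', '02/01/2019', '01/01/2019', '01/01/2019',
                            '03/01/2019', '04/01/2019', '03/01/2019', '03/01/2019'],
                   'Product': ['a', 'a', 'b', 'c', 'd', 'd', 'e', 'e']})

key = 'Cust_id'
value_cols = [c for c in df.columns if c != key]  # treat every other column as a value

# suffix per occurrence within each key: '', '1', '2', ...
s = df.groupby(key).cumcount().astype(str).replace('0', '')
wide = df.pivot_table(index=key, columns=s, values=value_cols, aggfunc='first')
wide.columns = wide.columns.map(''.join)          # ('Date', '1') -> 'Date1'

# blank out repeats: a suffixed column keeps a value only where it differs
# from its base column
for col in value_cols:
    for dup in (c for c in wide.columns if c != col and c.rstrip('0123456789') == col):
        wide[dup] = wide[dup].mask(wide[dup].eq(wide[col]))

# columns that were constant within every group (e.g. gender1) are now
# entirely empty, so drop them
wide = wide.dropna(axis=1, how='all').reset_index()
```

On the sample data this reproduces `expected_output` (with NaN instead of the string 'NA'): `gender1` disappears because gender never varies within a customer, while `Date1` and `Product1` survive because at least one customer has a second distinct value.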
Andy L.
  • It is really helpful that you provide the sample and `expected_output`. Therefore, I also upvoted your question. Happy coding. :) – Andy L. Aug 20 '19 at 07:43

You can try this code - it could be a little cleaner but I think it does the job


import pandas as pd

df = pd.DataFrame({'a': [100, 100], 'b': ['tue', 'tue'], 'c': ['yes', 'yes'],
                   'd': ['ok', 'not ok'], 'e': ['ok', 'maybe'], 'f': [55, 66]})

df_transformed = pd.DataFrame()

for column in df.columns:
    # the unique values of this column (the count itself is irrelevant;
    # only the groupby index is used)
    col_vals = df.groupby(column)['b'].count().index.values
    for ix, col_val in enumerate(col_vals):
        temp_df = pd.DataFrame({column + str(ix): [col_val]})
        df_transformed = pd.concat([df_transformed, temp_df], axis=1)

Output for df_transformed
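
Running the loop above as a self-contained script (with the pandas import it needs) produces a single row whose columns carry numeric suffixes:

```python
import pandas as pd

df = pd.DataFrame({'a': [100, 100], 'b': ['tue', 'tue'], 'c': ['yes', 'yes'],
                   'd': ['ok', 'not ok'], 'e': ['ok', 'maybe'], 'f': [55, 66]})

df_transformed = pd.DataFrame()
for column in df.columns:
    # unique values of this column; note that groupby sorts its keys
    col_vals = df.groupby(column)['b'].count().index.values
    for ix, col_val in enumerate(col_vals):
        df_transformed = pd.concat(
            [df_transformed, pd.DataFrame({column + str(ix): [col_val]})], axis=1)

print(df_transformed)
# one row: a0=100, b0='tue', c0='yes', d0='not ok', d1='ok',
#          e0='maybe', e1='ok', f0=55, f1=66
```

Note that the values come back sorted (`d0` is 'not ok', not 'ok'), since `groupby` sorts its keys, so the original row order within a column is lost.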

v_a
  • Note that I have taken the count for column 'b' but actually it could be any arbitrary column in the dataframe – v_a Aug 20 '19 at 03:27