Python datatable/pandas reshaping problem

Question

I need to reshape my df.

This is my input df:

import pandas as pd
import datatable as dt

DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
             date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
             type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
             value=[1, 2, 3, 4, 5, 6, 7, 8])

   | name   date        type  value
-- + -----  ----------  ----  -----
 0 | name1  2021-01-01  a         1
 1 | name1  2021-01-02  b         2
 2 | name1  2021-01-03  a         3
 3 | name1  2021-01-04  b         4
 4 | name2  2021-01-05  b         5
 5 | name2  2021-01-06  a         6
 6 | name2  2021-01-07  b         7
 7 | name2  2021-01-08  a         8

This is the desired output df:

DF_out = dt.Frame(name=['name1', 'name1', 'name2', 'name2'],
              date_a=['2021-01-01', '2021-01-03', '2021-01-06', '2021-01-08'],
              date_b=['2021-01-02', '2021-01-04', '2021-01-07', None],
              value_a=[1, 3, 6, 8],
              value_b=[2, 4, 7, None])

   | name   date_a      date_b      value_a  value_b
-- + -----  ----------  ----------  -------  -------
 0 | name1  2021-01-01  2021-01-02        1        2
 1 | name1  2021-01-03  2021-01-04        3        4
 2 | name2  2021-01-06  2021-01-07        6        7
 3 | name2  2021-01-08  NA                8       NA

If necessary the datatable Frames can be converted into a pandas DataFrame:

DF_in = DF_in.to_pandas()

Transformation:

This is a grouped transformation. The grouping column is 'name'.
The df is already sorted.
The number of rows in each group varies and can be even or odd.
If the first row in a group has a 'b' in the column 'type' it has to be removed (example: row 4 in DF_in)
It is also possible that the last row in a group has an 'a' in the column 'type', this row should not get lost (example: row 7 in DF_in)
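
For reference, the rules above could be sketched in pandas roughly as follows. This is only one reading of the requirements: it assumes that, once a leading 'b' row is dropped, the rows in each group alternate 'a', 'b', 'a', 'b', ... The function name `reshape_pairs` and the helper column `pair` are hypothetical and not part of the original data.

```python
import pandas as pd


def reshape_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Pair each 'a' row with the 'b' row directly below it, per name group."""
    # Position of each row inside its group (0 for the first row)
    pos = df.groupby("name").cumcount()
    # Drop the first row of a group when its type is 'b'
    kept = df[~((pos == 0) & (df["type"] == "b"))].copy()
    # Pair number inside each group: 0, 0, 1, 1, 2, ...
    kept["pair"] = kept.groupby("name").cumcount() // 2
    a = kept[kept["type"] == "a"].drop(columns="type")
    b = kept[kept["type"] == "b"].drop(columns="type")
    # A left join keeps a trailing 'a' row that has no 'b' partner
    out = a.merge(b, on=["name", "pair"], how="left", suffixes=("_a", "_b"))
    return out[["name", "date_a", "date_b", "value_a", "value_b"]]


df = pd.DataFrame({
    "name": ["name1"] * 4 + ["name2"] * 4,
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04",
             "2021-01-05", "2021-01-06", "2021-01-07", "2021-01-08"],
    "type": ["a", "b", "a", "b", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5, 6, 7, 8],
})

result = reshape_pairs(df)
print(result)
```

Because of the left join, `value_b` ends up as a float column in this sketch (the missing partner is stored as NaN).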

I hope this explanation is understandable.

Thank you in advance

Given that `name1` is the value for `name` in both of the first two rows, why is `2021-01-04` matched to `2021-01-03` instead of to `2021-01-01` for date and `4` to `3` instead of to `1` for value? Is this just by proximity? — semblable, Apr 03 '21 at 16:16
Exactly it is by proximity. The df is sorted and a row that contains the value 'a' in the column 'type' should be matched to the row below it if it contains the value 'b'. This has to happen groupwise. It gets a bit more difficult since the number of rows per group is not always even and they don't always start with the value 'a' and end with the value 'b' in column 'type'. — peter, Apr 03 '21 at 16:26

peter · Accepted Answer · 2021-04-09T15:45:07.123

Thank you all very much for your answers. In the meantime I developed a solution that uses only the datatable package and relies on some workarounds for its current limitations:

define a function to create id for adjacent rows: 1,1,2,2,...
create column id that contains row index
get id of rows to be deleted as list
subtract row id's to be deleted from all row id's
subset the Frame based on the remaining row id's
get number of rows per group
use the function for each group and use the number of rows as input, create a list with all results (same length as Frame after subset). Bind this to the Frame
create two subset Frames based on column type ('a' or 'b')
join df2 on df1

code:

import math
from datatable import dt, f, by, update, join

DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
                 date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
                 type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
                 value=[1, 2, 3, 4, 5, 6, 7, 8])



def group_id(n):
    l = [x for x in range(0, math.floor(n / 2))]
    l = sorted(l * 2)
    if n % 2 != 0:
        try:
            l.append(l[-1] + 1)
        except IndexError:
            l.append(0)
    return l


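# Row index, used to find the first row of each group
DF_in['id'] = range(DF_in.nrows)
first_row = f.id==dt.min(f.id)
row_eq_b = dt.first(f.type)=="b"
remove_rows = first_row & row_eq_b
# Flag the rows to keep per group, then filter on the flag (last column) and drop it
DF_in[:, update(remove_rows = ~remove_rows), by('name')]
DF_in = DF_in[f[-1]==1, :-1]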
group_count = DF_in[:, {"Count": dt.count()}, by('name')][:, 'Count'].to_list()[0]
group_id_column = []

for x in group_count:
    group_id_column = group_id_column + group_id(x)

DF_in['group_id'] = dt.Frame(group_id_column)
df1 = DF_in[f.type == 'a', ['name', 'date', 'value', 'group_id']]
df2 = DF_in[f.type == 'b', ['name', 'date', 'value', 'group_id']]

df2.key = ['name', 'group_id']
DF_out = df1[:, :, join(df2)]
DF_out.names = {'date': 'date_a', 'value': 'value_a', 'date.0': 'date_b', 'value.0': 'value_b'}

DF_out[:, ['name', 'date_a', 'date_b', 'value_a', 'value_b']]

   | name   date_a      date_b      value_a  value_b
-- + -----  ----------  ----------  -------  -------
 0 | name1  2021-01-01  2021-01-02        1        2
 1 | name1  2021-01-03  2021-01-04        3        4
 2 | name2  2021-01-06  2021-01-07        6        7
 3 | name2  2021-01-08  NA                8       NA
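
To make the pairing helper more concrete, the sketch below repeats `group_id` from the code above and compares it with a shorter formulation based on integer division. The name `group_id_short` is hypothetical and not part of the answer; it is only meant to show what the helper computes.

```python
import math


def group_id(n):
    # Same logic as in the answer: ids 0, 0, 1, 1, ... for n rows
    l = [x for x in range(0, math.floor(n / 2))]
    l = sorted(l * 2)
    if n % 2 != 0:
        try:
            l.append(l[-1] + 1)
        except IndexError:
            # n == 1: the list is still empty, so the single row gets id 0
            l.append(0)
    return l


def group_id_short(n):
    # Hypothetical equivalent: row i belongs to pair i // 2
    return [i // 2 for i in range(n)]


# Both versions agree for small group sizes, including 0 and 1
for n in range(6):
    assert group_id(n) == group_id_short(n)

print(group_id(4))  # [0, 0, 1, 1]
print(group_id(3))  # [0, 0, 1]
```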

@sammywemmy, thank you very much for cleaning up the solution - it needs only 30% of the time to run compared to the original code. — peter, Apr 05 '21 at 12:20
You are welcome @peter. Hopefully more functions are added to datatable to remove these limitations — sammywemmy, Apr 05 '21 at 20:38

score 1 · Answer 2 · answered Apr 03 '21 at 17:41

Let us work with dataframes, so load the data first

df = pd.DataFrame(dict(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
             date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
             type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
             value=[1, 2, 3, 4, 5, 6, 7, 8]))

Then we do the following steps:

drop any 'b' row that directly follows another 'b' row
assign the group number in column 'g'
pivot the table via set_index + unstack
rename the columns to the desired format
drop unneeded columns

import numpy as np

df1 = df[~((df['type'] == 'b') & (df['type'].shift() == 'b'))].copy()
df1['g'] = np.arange(len(df1))//2
df2 = df1.set_index(['g','type']).unstack(level=1)
df2.columns = ['_'.join(tup).rstrip('_') for tup in df2.columns.values]
df2.drop(columns = 'name_b').rename(columns = {'name_a':'name'})

output

    name    date_a      date_b      value_a value_b
g                   
0   name1   2021-01-01  2021-01-02  1.0     2.0
1   name1   2021-01-03  2021-01-04  3.0     4.0
2   name2   2021-01-06  2021-01-07  6.0     7.0
3   name2   2021-01-08  NaN         8.0     NaN

Thank you very much for your answer. The first step of your solution is not robust for my data however. The last row of a group might contain an 'a' and the first row of the next group might contain a 'b'. Getting rid of secondary 'b' would fail in this case. — peter, Apr 03 '21 at 18:30
Also, the second line of your code should be performed per group since the number of rows in a group can be uneven. — peter, Apr 03 '21 at 18:39
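
The two objections above can be illustrated with a small sketch. The input below is hypothetical (it is not the data from the question): the first group ends with an 'a' and the next one starts with a 'b', so the shift-based filter keeps the leading 'b'. Computing positions per group with `groupby(...).cumcount()` is one possible way to address both points.

```python
import pandas as pd

# Hypothetical input: name1 ends with 'a', name2 starts with 'b'
df = pd.DataFrame({
    "name": ["name1", "name1", "name1", "name2", "name2", "name2"],
    "type": ["a", "b", "a", "b", "a", "b"],
    "value": [1, 2, 3, 4, 5, 6],
})

# Filter from the answer: only drops a 'b' that follows another 'b',
# so the leading 'b' of name2 (value 4) survives
shift_kept = df[~((df["type"] == "b") & (df["type"].shift() == "b"))]

# Per-group alternative: drop a 'b' that is the first row of its group
pos = df.groupby("name").cumcount()
group_kept = df[~((pos == 0) & (df["type"] == "b"))].copy()
# The pair number restarts in every group, so an odd-sized group
# does not shift the pairing of the groups after it
group_kept["g"] = group_kept.groupby("name").cumcount() // 2

print(shift_kept["value"].tolist())
print(group_kept)
```

Since `g` restarts at 0 in every group in this variant, a later pivot would need both `name` and `g` in its index.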

sammywemmy · Answer 3 · 2021-04-03T23:51:57.730

datatable does not have reshaping functions that allow flipping between vertical and horizontal positions; as such, pandas is your best bet.

Below is my attempt at your challenge:

    from datatable import dt
    import pandas as pd

    df = DF_in.to_pandas()

    (df
     .assign(temp = df.index, # needed for ranking
             b_first = lambda df: df.groupby('name')['type'].transform('first'))
     .assign(temp = lambda df: df.groupby('name')['temp'].rank())
      # get rid of rows in groups where b is first
     .query('~(temp==1 and b_first=="b")')
      # needed to get unique values in index when pivoting
     .assign(temp = lambda df: df.groupby(['name','type']).cumcount())
      # keyword arguments are required by pivot in pandas 2.0 and later
     .pivot(index=['name','temp'], columns=['type'], values=['date','value'])
     .pipe(lambda df: df.set_axis(df.columns.to_flat_index(), axis='columns')
     .rename(columns = lambda col: "_".join(col)))
     .droplevel('temp')
     .reset_index()
      )

    name      date_a      date_b value_a value_b
0  name1  2021-01-01  2021-01-02       1       2
1  name1  2021-01-03  2021-01-04       3       4
2  name2  2021-01-06  2021-01-07       6       7
3  name2  2021-01-08         NaN       8     NaN

Summary:

Filter out the rows where 'b' is the first entry in the group
To avoid an error from duplicate index entries when pivoting (reshaping), create a temporary cumcount column
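
The role of the cumcount column can be seen in isolation with a minimal sketch. The small frame below is hypothetical and uses a single group; it only shows that `pivot` rejects repeated (name, type) combinations and that numbering the occurrences makes the index unique.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["name1"] * 4,
    "type": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})

# Without a helper column the (name, type) pairs repeat,
# and pivot raises a ValueError about duplicate entries
try:
    df.pivot(index="name", columns="type", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# cumcount numbers the occurrences of each (name, type) pair: 0, 0, 1, 1
df["temp"] = df.groupby(["name", "type"]).cumcount()
wide = df.pivot(index=["name", "temp"], columns="type", values="value")

print(pivot_failed)
print(wide)
```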

The rest relies on pivot and some renaming of the columns (the set_axis and rename functions). You can abstract a bit further with the pivot_wider function from pyjanitor:

 # pip install pyjanitor
 import janitor

 (df
 .assign(temp = df.index, 
         b_first = lambda df: df.groupby('name')['type'].transform('first'))
 .assign(temp = lambda df: df.groupby('name')['temp'].rank())
 .query('~(temp==1 and b_first=="b")')
 .assign(temp = lambda df: df.groupby(['name','type']).cumcount())
 .pivot_wider(index=['name', 'temp'], 
              names_from=['type'], 
              values_from=['date','value'],   
              names_sep="_",
              names_from_position='last')
 .drop(columns='temp')
  )

Hi @sammywemmy, thanks for your answer. I will check it out. Also I came up with a solution without converting to pandas. I am curious what you think. — peter, Apr 04 '21 at 16:05
