
I have a DataFrame

         date     price num_floors    house
1  2023-01-01  94.30076          3        A
2  2023-01-01  95.58771          2        B
3  2023-01-02 102.78559          1        C
4  2023-01-03  93.29053          3        D

and I want to reshape it so that each column contains the prices and num_floors of all houses for a given date. Within a column, the first two rows refer to the first house on that date, the next two to the second house, and so on. Entries without data are filled with the missing value NaN, like this:

  2023-01-01    2023-01-02  2023-01-03
1   94.30076     102.78559    93.29053
2          3             1           3         
3   95.58771            NA          NA
4          2            NA          NA

I can achieve this in R:

library(dplyr)
library(tidyr)

df_trans <- df %>%
    pivot_longer(-date) %>%
    mutate(index = row_number(), .by = date) %>%
    pivot_wider(id_cols = index, names_from = date, values_from = value) %>%
    select(-index)

but when I try the equivalent in Python, it does not produce what I want:

df_trans = df.melt(id_vars='date')
df_trans['n'] = df_trans.groupby('date').cumcount() + 1
df_trans = df_trans.pivot(index='n', columns='date', values='value')
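
For reference, a minimal sketch reconstructing the example DataFrame (the dates are assumed to be plain strings here; the real data may use datetime values):

import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-03"],
        "price": [94.30076, 95.58771, 102.78559, 93.29053],
        "num_floors": [3, 2, 1, 3],
        "house": ["A", "B", "C", "D"],
    },
    index=[1, 2, 3, 4],  # index as shown in the table above
)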

jonas Chen
    Aside: your desired result suggests an unclear data model or report. You lose all meaning of the data without an index identifier. Also, there are four houses in the original data, not two as mentioned in the text. With so many whys to ask, this feels like an [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Parfait May 28 '23 at 22:18

3 Answers


Try:

df = df.drop(columns='house')
df['tmp'] = df.groupby('date').cumcount()  # per-date row counter: 0, 1, ...
df = (
    df.set_index(['date', 'tmp'])
      .stack()            # long Series indexed by (date, tmp, price/num_floors)
      .unstack('date')    # dates become columns, NaN where a date has fewer rows
      .reset_index(drop=True)
)
df.columns.name = None

print(df)

Prints:

   2023-01-01  2023-01-02  2023-01-03
0    94.30076   102.78559    93.29053
1     3.00000     1.00000     3.00000
2    95.58771         NaN         NaN
3     2.00000         NaN         NaN
Andrej Kesely
pd.concat([j.set_index("date")[["price", "num_floors"]].T
           for i, j in df.groupby(df.groupby("date").cumcount())])

Broken into parts:

  1. Compute a cumulative count per date and group the rows by it
  2. For each group, select price and num_floors, index by date, and transpose so dates become columns
  3. Concatenate the groups, which aligns the date columns and fills gaps with NaN (a sketch with the index cleaned up follows below)
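
A minimal sketch of the same approach with reset_index(drop=True) appended, so the repeated price/num_floors row labels left over from the transpose are dropped (assuming the df from the question):

out = pd.concat(
    [j.set_index("date")[["price", "num_floors"]].T
     for _, j in df.groupby(df.groupby("date").cumcount())]
).reset_index(drop=True)  # drop the price/num_floors labels

print(out)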
Rawson

Another possible solution:

import numpy as np

(pd.DataFrame(np.vstack([np.vstack([[x, y], [x, z]]) for x, y, z in
              zip(df['date'], df['price'], df['num_floors'])]))
 .pivot(columns=0, values=1)                     # one column per unique date
 .rename_axis(None, axis=1)
 .apply(lambda x: x.dropna(ignore_index=True)))  # compact each column upwards

Alternatively,

(df.assign(
    price=[[x, y] for x, y in zip(df['price'], df['num_floors'])])  # pack price and num_floors into one list per row
 .pivot(columns='date', values='price')
 .apply(lambda x: x.explode(ignore_index=True))   # unpack each list into two rows
 .rename_axis(None, axis=1)
 .apply(lambda x: x.dropna(ignore_index=True)))   # compact each column upwards

Output:

  2023-01-01 2023-01-02 2023-01-03
0   94.30076  102.78559   93.29053
1          3          1          3
2   95.58771        NaN        NaN
3          2        NaN        NaN
PaulS