1

Suppose I have a pandas dataframe in which the columns index contains dates. I would like to split this dataframe: I want to obtain a sub-dataframe for each calendar week.

Here is the initialization of an example dataframe, along with my two attempts to tackle the problem:

# Imports
import numpy as np  # version 1.20.3
import pandas as pd  # version 1.3.4  

# Dataframe initialization
np.random.seed(0)
col_idx = pd.date_range(start="2022-12-19", end="2023-01-15", freq="D")
data = np.random.rand(5, len(col_idx))
df = pd.DataFrame(data, columns=col_idx)

#-----------------------------------------------------------------------------#
# Attempt 01                                                                  #
#-----------------------------------------------------------------------------#
# Intialize Excel writer
with pd.ExcelWriter("attempt_01.xlsx") as writer:
    # Initialize row index for keeping track of location in Excel sheet
    row_idx = 1

    # Perform grouping along columns axis
    grouper = df.groupby([df.columns.year, df.columns.week], axis=1)
    for weekyear, sub_df in grouper:
        # Here various operations on the data values in `sub_df` would be
        # performed (not relevant to my question)

        # Write `sub_df` to Excel file
        sub_df.to_excel(writer, startrow=row_idx, index=False)

        # Prepare for next loop iteration
        row_idx += len(sub_df) + 3

#-----------------------------------------------------------------------------#
# Attempt 02                                                                  #
#-----------------------------------------------------------------------------#
# Intialize Excel writer
with pd.ExcelWriter("attempt_02.xlsx") as writer:
    # Initialize row index for keeping track of location in Excel sheet
    row_idx = 1

    # Perform grouping along columns axis
    grouper = df.groupby(pd.Grouper(freq="W-MON", axis=1))
    for week, sub_df in grouper:
        # Here various operations on the data values in `sub_df` would be
        # performed (not relevant to my question)

        # Write `sub_df` to Excel file
        sub_df.to_excel(writer, startrow=row_idx, index=False)
        
        # Prepare for next loop iteration
        row_idx += len(sub_df) + 3

The problem with the first attempt is that a sub-dataframe with one single column for 01 Jan 2023 is generated during the last loop iteration. I instead want the 01 Jan 2023 column to be part of the second sub-dataframe (i.e. I want a sub-dataframe containing 26 Dec 2022 through 01 Jan 2023 columns, a full calendar week Monday through Sunday). Additionally this attempt raises a pandas FutureWarning which states that the week accessor has been deprecated (I'm using pandas version 1.3.4).

The second attempt generates a result that is quite confusing to me. I get sub-dataframes with 28 columns and varying amounts of rows.

Would anyone be able to point me in the right direction? Thanks.

Xukrao
  • 8,003
  • 5
  • 26
  • 52
  • I'm using pandas version 1.3.4 <- ... *why?* – Mark Aug 07 '23 at 03:02
  • 1
    you could use `[df.iloc[:, i:i + 7] for i in range(0, len(df.columns), 7)]`, assuming you don't have gaps in your data – Mark Aug 07 '23 at 03:03
  • @Mark Do you mean "why not a more recent pandas version"? – Xukrao Aug 07 '23 at 03:05
  • yes exactly Xukrao – Mark Aug 07 '23 at 03:06
  • @Mark Thanks for the suggestion. While there aren´t any data gaps in the simplified example that I've given here, unfortunately there will be data gaps in my 'real' dataframe. – Xukrao Aug 07 '23 at 03:07
  • 1
    @Mark In an ideal world I would always have the latest package versions so that I could take advantage of all the newest features. In reality I have constraints however. Right now I simply don´t have the time to test compatibility of all Python tools of my company department with new package versions (and fix potential issues). This would be especially critical for pandas, since it has seen a major version increment four months ago that introduced a lot of breaking changes. – Xukrao Aug 07 '23 at 04:16
  • fair enough! thanks for replying – Mark Aug 07 '23 at 04:16

2 Answers2

2

Here's one way of doing it:

# transpose data
df = df.T.reset_index()

# get the date of the previous Monday (Credit: https://stackoverflow.com/a/35613515/4145280)
df['weekstart'] = df['index'].dt.to_period('W').apply(lambda r: r.start_time)

# split on the unique weekstarts, drop the weekstart column, then transpose
[df[df['weekstart'] == week].drop('weekstart', axis = 1).T for week in df['weekstart'].unique()]
Mark
  • 7,785
  • 2
  • 14
  • 34
0

I have now found that a minor modification to my first attempt fixes the results as well as eliminates the FutureWarning. I just had to replace df.columns.year with df.columns.isocalendar().year and replace df.columns.week with df.columns.isocalendar().week. Full solution:

# Imports
import numpy as np
import pandas as pd

# Dataframe initialization
np.random.seed(0)
col_idx = pd.date_range(start="2022-12-19", end="2023-01-15", freq="D")
data = np.random.rand(5, len(col_idx))
df = pd.DataFrame(data, columns=col_idx)

# Intialize Excel writer
with pd.ExcelWriter("attempt_01.xlsx") as writer:
    # Initialize row index for keeping track of location in Excel sheet
    row_idx = 1

    # Perform grouping along columns axis
    grouper = df.groupby(
        [df.columns.isocalendar().year, df.columns.isocalendar().week],
        axis=1
    )
    for weekyear, sub_df in grouper:
        print(weekyear)
        # Here various operations on the data values in `sub_df` would be
        # performed (not relevant to my question)

        # Write `sub_df` to Excel file
        sub_df.to_excel(writer, startrow=row_idx, index=False)

        # Prepare for next loop iteration
        row_idx += len(sub_df) + 3
Xukrao
  • 8,003
  • 5
  • 26
  • 52