Suppose I have a pandas dataframe in which the columns index contains dates. I would like to split this dataframe: I want to obtain a sub-dataframe for each calendar week.
Here is the initialization of an example dataframe, along with my two attempts to tackle the problem:
# Imports
import numpy as np # version 1.20.3
import pandas as pd # version 1.3.4
# Dataframe initialization
np.random.seed(0)
col_idx = pd.date_range(start="2022-12-19", end="2023-01-15", freq="D")
data = np.random.rand(5, len(col_idx))
df = pd.DataFrame(data, columns=col_idx)
#-----------------------------------------------------------------------------#
# Attempt 01 #
#-----------------------------------------------------------------------------#
# Intialize Excel writer
with pd.ExcelWriter("attempt_01.xlsx") as writer:
# Initialize row index for keeping track of location in Excel sheet
row_idx = 1
# Perform grouping along columns axis
grouper = df.groupby([df.columns.year, df.columns.week], axis=1)
for weekyear, sub_df in grouper:
# Here various operations on the data values in `sub_df` would be
# performed (not relevant to my question)
# Write `sub_df` to Excel file
sub_df.to_excel(writer, startrow=row_idx, index=False)
# Prepare for next loop iteration
row_idx += len(sub_df) + 3
#-----------------------------------------------------------------------------#
# Attempt 02 #
#-----------------------------------------------------------------------------#
# Intialize Excel writer
with pd.ExcelWriter("attempt_02.xlsx") as writer:
# Initialize row index for keeping track of location in Excel sheet
row_idx = 1
# Perform grouping along columns axis
grouper = df.groupby(pd.Grouper(freq="W-MON", axis=1))
for week, sub_df in grouper:
# Here various operations on the data values in `sub_df` would be
# performed (not relevant to my question)
# Write `sub_df` to Excel file
sub_df.to_excel(writer, startrow=row_idx, index=False)
# Prepare for next loop iteration
row_idx += len(sub_df) + 3
The problem with the first attempt is that a sub-dataframe with one single column for 01 Jan 2023 is generated during the last loop iteration. I instead want the 01 Jan 2023 column to be part of the second sub-dataframe (i.e. I want a sub-dataframe containing 26 Dec 2022 through 01 Jan 2023 columns, a full calendar week Monday through Sunday). Additionally this attempt raises a pandas FutureWarning which states that the week
accessor has been deprecated (I'm using pandas version 1.3.4).
The second attempt generates a result that is quite confusing to me. I get sub-dataframes with 28 columns and varying amounts of rows.
Would anyone be able to point me in the right direction? Thanks.