1

I have a dataframe where the index is a datetimeindex, and every row is every day over the course of a couple years. I need to resample the dataframe by month where the two float columns are summed, but the string columns are all the unique values during that month. I can do the resampling to a single column, but I don't know how to do it to everything, or how to combine them back together if I do it one at a time.

For the floats I was trying:

# go through the column list
for col in col_list:
    # process all run time columns for month
    if "float" in str(col):
        # resample for one month and sum
        df[col] = df[col].resample('M').sum()
        # rename the column
        df.rename(columns={col: col + " MONTHLY"}, inplace=True)

and for the strings:

elif "string" in str(col):
    # get all the unique jobs run during the month
    df[col] = df[col].groupby(pd.Grouper(freq='M')).unique()
    df.rename(columns={col: col + " MONTHLY"}, inplace=True)

these were resulting in the monthly data being inserted into tho the dataframe with every day still existing though, and was hard to find and not what I need.

Some sample data:

        float_1 float_2 string_1    string_2
12/30/2019  1   2   a   a
12/31/2019  1   3   a   b
1/1/2020    2   4   a   c
1/2/2020    3   5   b   d

The expected output would be:

12/2019 2   5   a    a, b
1/2020  5   9   a, b c, d

Not sure if it matters but the real data does have NaN in random days throughout the data.

Chris Macaluso
  • 1,372
  • 2
  • 14
  • 33

1 Answers1

4

Try aggregate numeric columns and non numeric columns separately and then join them back:

df.index = pd.to_datetime(df.index)

numerics = df.select_dtypes('number').resample('M').sum()
strings = df.select_dtypes('object').resample('M').agg(lambda x: ','.join(set(x)))

numerics.join(strings)
#            float_1  float_2 string_1 string_2
#2019-12-31        2        5        a      a,b
#2020-01-31        5        9      a,b      d,c
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • That worked pretty well. I don't know why but the `strings` also returned the numeric columns, although combining the strings by month worked. The numeric column worked as intended and returned only the numeric columns summed by month. This was a problem because when joined all the string columns contained `_x`, `_y` columns. Solved with a loop to drop columns containing a list of words represented in all string column names: `strings.drop(list(strings.filter(regex=l)), axis = 1, inplace = True)`. The real dataset has ~80 columns. Thanks for the help!! Good solution. – Chris Macaluso Jun 27 '21 at 21:32
  • 1
    Ok, found my problem from the above comment, the `_x` and `_y` thing was my fault. No problems at all with your solution. Thank you. – Chris Macaluso Jun 27 '21 at 22:22