
I need to generate an array that represents monthly profits like this:

[
    [2008, None, None, None, 100, 100, 100, 100, 100, 100, 100, 100, 100],
    [2009, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
    # etc...
    [2019, 100, 100, 100, 100, 100, 100, 100, 100, None, None, None, None]
]

In other words: year, sum of profits for month 1, sum of profits for month 2, etc., with None where there is no info for that month.

This should be generated from a DataFrame of daily profits like this:

date
2008-04-01    0.0
2008-04-02    10.0
2008-04-03    10.0
2008-04-04    10.0
2008-04-05    10.0
Name: profit, dtype: float64

I call df["profit"].groupby(pd.Grouper(freq='M')).sum() and get:

date
2008-04-30    100.0
2008-05-31    100.0
2008-06-30    100.0
2008-07-31    100.0
2008-08-31    100.0
Freq: M, Name: profit, dtype: float64
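For reference, here's a minimal, self-contained reconstruction of this step. The data values are made up (one flat 10.0 per day, not the question's actual numbers), and the try/except accounts for newer pandas versions where the "M" alias was renamed to "ME":

```python
import pandas as pd

# Hypothetical daily-profit Series with a DatetimeIndex, mirroring the example above
idx = pd.date_range("2008-04-01", "2008-08-31", freq="D")
profit = pd.Series(10.0, index=idx, name="profit")

try:
    # "M" (month-end) as used in the question
    monthly = profit.groupby(pd.Grouper(freq="M")).sum()
except ValueError:
    # pandas >= 2.2 deprecates "M" in favor of "ME"
    monthly = profit.groupby(pd.Grouper(freq="ME")).sum()

print(monthly)
```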

Now I was thinking of doing something along these lines:

start = df.index.min().to_pydatetime()
end = df.index.max().to_pydatetime()

result = []
monthly = df["profit"].groupby(pd.Grouper(freq='M')).sum()
for date, monthly_profit in monthly.items():  # a Series is iterated with items(), not iterrows()
    if not result or date.year != result[-1][0]:
        result.append([date.year] + [None] * 12)  # new year row: 12 empty month slots
    result[-1][date.month] = monthly_profit  # months are 1-based; slot 0 holds the year

Is there any way to do this without iterating? If not, how can I iterate and read the date value?


EDIT, in response to Quickbeam2k1's answer: here's my solution to avoid the problem when not all 12 months are present in the original data:

if len(df.groupby([df.date.dt.month]).agg({"date": "count"})) < 12:
    # All months should be represented by at least one data point for the display to work properly
    # If not all months are present, we insert empty data
    min_year = df["date"].min().to_pydatetime().year
    for m in range(1, 13):
        if df.loc[df.date.dt.month == m].empty:
            df = df.append(pd.DataFrame({"date": datetime(min_year, m, 1), column_name: [np.nan]}))
        else:
            min_year = df.loc[df.date.dt.month == m]["date"].iloc[0].to_pydatetime().year
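The same placeholder-injection idea can be sketched without row-by-row appends. This is a hypothetical variant (the sample frame and the "profit" column name are my assumptions; the original uses a variable column_name), using pd.concat since df.append was removed in pandas 2.0:

```python
import numpy as np
import pandas as pd
from datetime import datetime

# Hypothetical frame with a "date" column, mirroring the EDIT above
df = pd.DataFrame({
    "date": pd.date_range("2008-04-01", periods=100, freq="D"),
    "profit": 10.0,
})

present = set(df["date"].dt.month)
min_year = df["date"].min().year
missing = [m for m in range(1, 13) if m not in present]

# One NaN placeholder row per missing month, dated the 1st of that month
filler = pd.DataFrame({
    "date": [datetime(min_year, m, 1) for m in missing],
    "profit": np.nan,
})
df = pd.concat([df, filler], ignore_index=True)
```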
Saturnix

1 Answer


More data would be nice, but I think you could do something like:

df.groupby([df.date.dt.year, df.date.dt.month])["profit"].sum()

then, you can easily use pivot to get the desired shape. Give me a moment, I'll generate some data :)


Okay, so I think I have a nice solution, but you don't need pivot.

import pandas as pd
import numpy as np
date_index = pd.date_range(start="2017-05-05", periods=700)
df = pd.DataFrame(data=np.random.rand(700), index=date_index, columns=["profit"])

The following now provides your desired output (as a DataFrame, not as a list of lists):

df.groupby([df.index.year, df.index.month]).agg({"profit":"sum"}).unstack(-1)

gives

         profit                                                         \
             1          2          3          4          5          6    
2017        NaN        NaN        NaN        NaN  13.671041  16.693129   
2018  16.780003  12.783907  17.340193  13.323846  16.897318  16.671774   
2019  13.718783  14.322513  15.163668   1.606801        NaN        NaN   


             7          8          9          10         11         12  
2017  15.781419  15.357254  16.392586  13.782561  15.242144  15.897317  
2018  15.854918  17.360759  11.516470  17.096427  15.096696  16.593045  
2019        NaN        NaN        NaN        NaN        NaN        NaN 

so you don't need pivot at all.
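If the list-of-lists shape from the question is really needed, the unstacked frame can be converted in one final step. This last conversion is my addition, not part of the answer; it maps NaN to None as the question requires:

```python
import numpy as np
import pandas as pd

# Same setup as in the answer above
date_index = pd.date_range(start="2017-05-05", periods=700)
df = pd.DataFrame(data=np.random.rand(700), index=date_index, columns=["profit"])
out = df.groupby([df.index.year, df.index.month]).agg({"profit": "sum"}).unstack(-1)

# Hypothetical final step: one row per year, [year, month_1, ..., month_12],
# with None where a month has no data
result = [
    [year] + [None if np.isnan(v) else float(v) for v in row]
    for year, row in zip(out.index, out.values)
]
```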

Note, however, that you won't get 12 columns if your data don't cover each month of the year at some point in time. But you could always append NaN data so that each month of the year appears at least once :)


So assume now that we don't cover all the months, as in the following example:

date_index = pd.date_range(start="2017-05-05", periods=100)
df = pd.DataFrame(data=np.random.rand(100), index=date_index, columns=["profit"])
df = df.groupby([df.index.year, df.index.month]).agg({"profit":"sum"}).unstack(-1)

Let's investigate the columns object:

df.columns
MultiIndex([('profit', 5),
            ('profit', 6),
            ('profit', 7),
            ('profit', 8)],
           )

So we don't simply have a list but a MultiIndex. Next, let's define the column labels (as a MultiIndex) we want to have:

required_columns_multiindex = pd.MultiIndex.from_tuples([("profit", month) for month in range(1, 13)])

Finally, we concatenate our DataFrame df with an empty data frame bearing only the column information:

pd.concat([df, pd.DataFrame(columns=required_columns_multiindex)])
     profit                                                                 \
         1    2    3    4          5          6          7         8    9    
2017    NaN  NaN  NaN  NaN  12.733439  13.965117  14.504708  5.650205  NaN   


       10   11   12  
2017  NaN  NaN  NaN  
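As an aside (my suggestion, not from the answer): reindex gets the same result in one call, both adding the missing months and guaranteeing the 1-12 column order:

```python
import numpy as np
import pandas as pd

# Same 100-day setup as above: only months 5-8 of 2017 are covered
date_index = pd.date_range(start="2017-05-05", periods=100)
df = pd.DataFrame(data=np.random.rand(100), index=date_index, columns=["profit"])
out = df.groupby([df.index.year, df.index.month]).agg({"profit": "sum"}).unstack(-1)

required_columns = pd.MultiIndex.from_tuples([("profit", m) for m in range(1, 13)])
out = out.reindex(columns=required_columns)  # missing months become NaN columns
```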
Quickbeam2k1
  • Holy moly... this is brilliant. Thank you!! – Saturnix Aug 31 '19 at 15:24
  • How would you suggest handling cases where I don't have all 12 months? Should I add in empty data in those cases? – Saturnix Aug 31 '19 at 19:44
  • Two options I can think of. First, add empty sample data for every month that is not in the data. Second, probably easier: just check which months are missing after performing the steps above, then append the remaining columns. I think I can post an example tomorrow – Quickbeam2k1 Aug 31 '19 at 19:57
  • Thanks! I've written an ugly script to inject placeholder data, posted in the question :) seems to work – Saturnix Aug 31 '19 at 20:20
  • You are using the discouraged [chained indexing](https://www.google.com/url?q=https://stackoverflow.com/questions/41253170/whats-the-alternative-to-pandas-chain-indexing&sa=U&ved=2ahUKEwinqNLZ-q3kAhVByaQKHZxWBP4QFjAAegQICRAB&usg=AOvVaw0N2WqgqJzYVxnOhWcsqgXb) . Anyway there is an easier solution. – Quickbeam2k1 Aug 31 '19 at 20:38