
I have a dataset that looks something like the following:

df = pd.DataFrame({"Date":['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05','2021-01-06','2021-01-07'],'Value':[0,0,14,0,0,0,9]})
df['Date']=pd.to_datetime(df['Date'])
df

Date    Value
2021-01-01  0
2021-01-02  0
2021-01-03  14
2021-01-04  0
2021-01-05  0
2021-01-06  0
2021-01-07  9

I know that the missing data (zeros) is due to a lack of reporting, so a row with a value represents that day plus the sum of the values from the preceding missing days. I want to randomly distribute each reported value backwards over those days, based on the existing values, example below:

df2 = pd.DataFrame({"Date":['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05','2021-01-06','2021-01-07'],'Value':[2,8,4,3,1,4,1]})
df2['Date']=pd.to_datetime(df2['Date'])
df2

Date    Value
2021-01-01  2
2021-01-02  8
2021-01-03  4
2021-01-04  3
2021-01-05  1
2021-01-06  4
2021-01-07  1

(The local 'totals' on 2021-01-03 and 2021-01-07 remain the same)

I know part of the problem is that the intervals of missing/present data aren't consistent...

Any ideas on how to get this done? All advice appreciated.

1 Answer

You can create a group label that runs up to each non-zero Value: compare with ne (not equal) to 0, shift by one so the non-zero value stays in the right group, and take the cumsum.
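To see what that grouping key looks like, here is a minimal sketch on the same sample df as above:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range('2021-01-01', periods=7),
                   "Value": [0, 0, 14, 0, 0, 0, 9]})

# ne(0) flags the reporting days; shift moves the flag down one row so
# each non-zero row stays in the group it closes; cumsum turns the
# flags into group labels
key = df['Value'].ne(0).shift(1, fill_value=False).cumsum()
print(key.tolist())  # [0, 0, 0, 1, 1, 1, 1]
```

So group 0 covers 2021-01-01 to 2021-01-03 (total 14) and group 1 covers 2021-01-04 to 2021-01-07 (total 9).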

Then, to create a split of different lengths summing to the non-zero value of each group, you can refer to this question for example.

So, using numpy.random.multinomial as in this answer, you get:

import numpy as np
np.random.seed(1)

df['new_value'] = (
    df.groupby(df['Value'].ne(0).shift(1,fill_value=False).cumsum())
      ['Value'].apply(lambda x: np.random.multinomial(x.max(), [1./len(x)]*len(x)))
      .explode() # create a Series
      .to_numpy() # because index alignment not possible
)
print(df)
        Date  Value new_value
0 2021-01-01      0         4
1 2021-01-02      0         6
2 2021-01-03     14         4
3 2021-01-04      0         0
4 2021-01-05      0         2
5 2021-01-06      0         2
6 2021-01-07      9         5
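One caveat: after explode, the new column has object dtype; if you need integers you can cast it afterwards. A sketch, assuming the same df (the random values depend on the seed, but the group totals are always preserved):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"Date": pd.date_range('2021-01-01', periods=7),
                   "Value": [0, 0, 14, 0, 0, 0, 9]})

df['new_value'] = (
    df.groupby(df['Value'].ne(0).shift(1, fill_value=False).cumsum())
      ['Value'].apply(lambda x: np.random.multinomial(x.max(), [1./len(x)]*len(x)))
      .explode()
      .to_numpy()
)
# explode yields object dtype; cast for a proper integer column
df['new_value'] = df['new_value'].astype(int)
# the group totals are preserved: 14 over the first three days, 9 over the rest
```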

Alternatively, you can use this answer, which seems a bit more popular:

import random

def constrained_sum_sample_pos(n, total):
    """Return a randomly chosen list of n positive integers summing to total.
    Each such list is equally likely to occur. Requires total >= n, since
    every entry is at least 1."""

    # place n-1 random dividers between 1 and total-1, then take the gaps
    dividers = sorted(random.sample(range(1, total), n - 1))
    return [a - b for a, b in zip(dividers + [total], [0] + dividers)]

df['new_value_r'] = (
    df.groupby(df['Value'].ne(0).shift(1,fill_value=False).cumsum())
      ['Value'].apply(lambda x: constrained_sum_sample_pos(len(x), x.max()))
      .explode()
      .to_numpy()
)
print(df)
        Date  Value new_value new_value_r
0 2021-01-01      0         4           5
1 2021-01-02      0         6           2
2 2021-01-03     14         4           7
3 2021-01-04      0         0           1
4 2021-01-05      0         2           2
5 2021-01-06      0         2           5
6 2021-01-07      9         5           1
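Note the behavioral difference: multinomial can assign 0 to some days, while constrained_sum_sample_pos always returns positive integers (and therefore needs total >= n). A quick standalone check, with the function repeated so the snippet runs on its own:

```python
import random

def constrained_sum_sample_pos(n, total):
    """Return a randomly chosen list of n positive integers summing to total."""
    dividers = sorted(random.sample(range(1, total), n - 1))
    return [a - b for a, b in zip(dividers + [total], [0] + dividers)]

random.seed(0)
parts = constrained_sum_sample_pos(4, 9)
# always 4 positive integers summing to 9
print(parts)
```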
Ben.T