1

I have time series data with each year's data stored in different columns. This data is known to have some data errors which are causing minor holes in the time series data that I'd like to correct.

For reference, the data looks like this

df = pd.DataFrame({'Firm': ['A', 'B', 'C', 'D', 'E', 'F'],
    '2003': [1, 1, 1, 0, 0, 1],
    '2004': [1, 0, 1, 0, 0, 1],
    '2005': [1, 1, 0, 0, 0, 0],
    '2006': [0, 1, 1, 0, 0, 1],
    '2007': [1, 0, 1, 1, 0, 1], 
    '2008': [1, 0, 1, 1, 0, 0]})

I'd like to write a programmatic way to fill the random zeroes that occasionally pop up. Essentially, if a firm has a 1 for 2 years, with a 0 in between in them, I'd like to assume that is a data error and fill in the data. The desired output for this sample data would look like this:

df = pd.DataFrame({'Firm': ['A', 'B', 'C', 'D', 'E', 'F'],
    '2003': [1, 1, 1, 0, 0, 1],
    '2004': [1, 1, 1, 0, 0, 1],
    '2005': [1, 1, 1, 0, 0, 1],
    '2006': [1, 1, 1, 0, 0, 1],
    '2007': [1, 0, 1, 1, 0, 1], 
    '2008': [1, 0, 1, 1, 0, 0]})

If I need to reshape this longways and format it as a time series, I will. First I wanted to see if there was an easy to do this in this wide format though.

Thanks!

mar56
  • 11
  • 1

1 Answers1

0

Try:

df = df.set_index('Firm')
df[:] = np.where(df.apply(lambda x: x.cumsum() * x[::-1].cumsum()[::-1], axis=1), 1, 0)
df = df.reset_index()

print(df)

Prints:

  Firm  2003  2004  2005  2006  2007  2008
0    A     1     1     1     1     1     1
1    B     1     1     1     1     0     0
2    C     1     1     1     1     1     1
3    D     0     0     0     0     1     1
4    E     0     0     0     0     0     0
5    F     1     1     1     1     1     0
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91