3

I've got a timeseries of intermitent daily data like this.

import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08','2020-01-08','2020-01-10','2020-01-13','2020-01-15'], 
               'Price': [200, 324, 320, 421, 240, np.NaN, 500, 520, 531, np.NaN, 571, np.NaN]})
df['Date']= pd.to_datetime(df['Date'])
df.set_index('Date')
df

Result:

+------------+-------+
| Date       | Price |
+------------+-------+
| 2020-01-01 | 200   |
+------------+-------+
| 2020-01-02 | 324   | 
+------------+-------+
| 2020-01-02 | 320   | -- 1st duplicate for 2020-01-02
+------------+-------+
| 2020-01-02 | 421   | -- 2nd duplicate for 2020-01-02
+------------+-------+
| 2020-01-03 | 240   |
+------------+-------+
| 2020-01-04 | NaN   |
+------------+-------+
| 2020-01-07 | 500   |
+------------+-------+
| 2020-01-08 | 520   |
+------------+-------+ 
| 2020-01-08 | 531   | -- 1st duplicate for 2020-01-08 
+------------+-------+ 
| 2020-01-10 | NaN   |
+------------+-------+ 
| 2020-01-13 | 571   |
+------------+-------+ 
| 2020-01-15 | NaN   |
+------------+-------+ 

I need to fill the NaN values with prices from nearest available date where there is more than 1 price recorded (duplicate) i.e.

  • 320 should be moved from 2020-01-02 to 2020-01-04
  • 421 from 2020-01-02 to 2020-01-10
  • 531 from 2020-01-08 to 2020-01-15
storeguy
  • 31
  • 5
  • 1
    Can you show us what you have tried? In what format is the data (maybe list of lists?) – urban Nov 24 '20 at 11:53
  • Assume the data is in a dataframe. I have not tried solving this myself yet, but did make sure that there is no similar question asked (and answered) anywhere else on the internet. – storeguy Nov 24 '20 at 11:56
  • 1
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html Help yourself – ombk Nov 24 '20 at 11:57
  • 1
    welcome to Stackoverflow. It is easier for others to test solutions to your problems if you include code to create your dataframe, rather than show it printed. See [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – piterbarg Nov 24 '20 at 11:58
  • `df.fillna(method='ffill')` – ombk Nov 24 '20 at 11:58
  • @storeguy now your data makes sense. – ombk Nov 24 '20 at 12:23

3 Answers3

3

Here is a Pandas solution, step by step

First, we groupby Price by Date and put them in a list for each date, that we then unwrap into separate columns, which we can then rename

df2 = (
    df.groupby('Date')['Price']
    .apply(list)
    .apply(pd.Series)
    .rename(columns = {0:'Price',1:'Other'})
)
df2

so we get


           Price    Other
Date        
2020-01-01  200.0   NaN
2020-01-02  324.0   320.0
2020-01-03  240.0   NaN
2020-01-04  NaN     NaN
2020-01-07  500.0   NaN
2020-01-08  520.0   NaN

Here Price has the first price for that date, and Other the second price for that date, if available

Now we ffill() Other, so that propagates second values forward until the new second value is found etc.

df2['Other'] = df2['Other'].ffill()

so we get

            Price   Other
Date        
2020-01-01  200.0   NaN
2020-01-02  324.0   320.0
2020-01-03  240.0   320.0
2020-01-04  NaN     320.0
2020-01-07  500.0   320.0
2020-01-08  520.0   320.0

Now we can replace NaNs in the Price column with the values from Other column, and drop Other:

df2['Price'] = df2['Price'].fillna(df2['Other'])
df2.drop(columns = ['Other'], inplace = True)
df2

to get


            Price
Date    
2020-01-01  200.0
2020-01-02  324.0
2020-01-03  240.0
2020-01-04  320.0
2020-01-07  500.0
2020-01-08  520.0
piterbarg
  • 8,089
  • 2
  • 6
  • 22
  • nice! I was also looking for a more pandas-like approach (will update my answer if I can complete it :) ) – urban Nov 24 '20 at 13:58
  • Great approach, simple and elegant! I know it is not a part of the sample data provided, but what if there are 3 or more duplicates per date? Is there a way to programatically address this for n duplicates? – storeguy Nov 24 '20 at 22:30
  • it will work for 3 or more, it will just always use the 2nd to fill. with 3 or more it becomes quite ambiguous as to which column do you want to roll where. probably should be a separate question! – piterbarg Nov 24 '20 at 22:41
  • @piterbarg Thanks, I went ahead and updated the question here. Happy to raise another question, but in my view it may be deemed as duplicates as the underlying problem is the same. – storeguy Nov 25 '20 at 01:02
0

I have never been good with pandas so I treat the frame more-or-less as 2D array (thus there might be more efficient ways to do this with pandas... My attempt on this is the 2nd solution)

So, the idea is:

  • Loop your frame row-by-row
  • Always keep a pointer to the previous row so you can compare Dates (and values)
  • When a duplicate Date is found, set the last_dup_row_index which is always points to the latest "movable" row (see inline comments for edge cases)
  • While iterating, if you hit a missing Price, add a "move" into to_move_indexes. This is a list of tuples of moves that can be performed

At the end of the above loop you have all you need to modify your frame:

  1. The possible price moves
  2. The indexes you move from so you can delete those rows if you want to

The code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08'],
               'Price': [200, 324, 320, 240, np.NaN, 500, 520]})
df['Date']= pd.to_datetime(df['Date'])
df.set_index('Date')

prev_row = None
last_dup_row_index = None
to_move_indexes = []

for index, row in df.iterrows():
    # Check if we have a None in the current row and
    # a duplicate row waiting for us
    if pd.isna(row['Price']) and last_dup_row_index is not None:
        print(f"To move price from {last_dup_row_index} to {index}")
        to_move_indexes.append((last_dup_row_index, index))


    # Check if this row and the prev one have the
    # same date
    if prev_row is not None and prev_row['Date'] == row['Date']:
        # There is a case where for the same Date you have
        # two entries out of which one is NaN. Here use the
        # other one
        if not pd.isna(row['Price']):
            print(f"Setting duplicate to: idx={index},\n{row}")
            last_dup_row_index = index
        elif not pd.isna(prev_row['Price']):
            print(f"Setting duplicate to: idx={index - 1},\n{prev_row}")
            last_dup_row_index = index - 1
        else:
            # There is an edge case where two NaNs follow each
            # other - not changing last duplicate
            print(f"Warning: two NaN @{row['Date']}")

    prev_row = row

print(to_move_indexes)

# Perform moves
for from_idx, to_idx in to_move_indexes:
    df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
print("\nFrame after moves:")
print(df)

# Preform deletes if you need to
df.drop([x for x, _ in to_move_indexes], inplace=True)
print("\nFrame after deletes:")
print(df)

And the output you get is:

# Here we detected that row index 2 is a duplicate (with index 1)
Setting duplicate to: idx=2,
Date     2020-01-02 00:00:00
Price                    320
Name: 2, dtype: object

# Here we see that row index 4 is missing Price. However
# we have the previous duplicate (2) waiting for us so we
# add a "move" as (2, 4) to our list  
To move price from 2 to 4

# The final list is
[(2, 4)]

Frame after moves:
        Date  Price
0 2020-01-01  200.0
1 2020-01-02  324.0
2 2020-01-02  320.0
3 2020-01-03  240.0
4 2020-01-04  320.0
5 2020-01-07  500.0
6 2020-01-08  520.0

Frame after deletes:
        Date  Price
0 2020-01-01  200.0
1 2020-01-02  324.0
3 2020-01-03  240.0
4 2020-01-04  320.0
5 2020-01-07  500.0
6 2020-01-08  520.0

UPDATE: Second way

# Calculate diff on dates column and keep the
# ones that are same (returns series)
dups = df.Date.diff() == "0 days"

# A cryptic (for me) way to get all the indexes
# where the value is True
dup_indexes = dups.index[dups].to_list()

# Now get the indexes where the Price is Nan
nans = pd.isnull(df).any(1)
nan_indexes = nans.index[nans].to_list()

# Create moves: the nan_index should be greater than 
# the dup_index but as close as possible
moves = []
for nan_index in nan_indexes:
    # dup_indexes are sorted so get the last one, 
    # smaller than the nan_index
    dup_index = [x for x in dup_indexes if x < nan_index][-1]
    if dup_index:
        moves.append((dup_index, nan_index))

# Do moves and deletes
for from_idx, to_idx in moves:
    df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
df.drop([x for x, _ in moves], inplace=True)
print(df)
urban
  • 5,392
  • 3
  • 19
  • 45
  • Second approach shows promise. Let me test it on a larger dataset. Same question that I asked @piterbarg what if there are 3 or more duplicates per date? Is there a way to programatically address this for n duplicates? – storeguy Nov 24 '20 at 22:36
-1

How to tackle and intuition behind ffill

What you are looking for is a method called forward fill. Forward fill locates a null value, then it checks if there's any valid values behind it. If so, it uses it.

To understand more on how to apply the method on your data, please check the documentation of pandas fillna here. It is detailed and provides examples, take a careful look at them and understand what each argument does.

Note, in case your previous value is also NaN, therefore, ffill won't change it (obviously).

Pseudo code

Since you have changed the data, I can think of a pseudo-code.

First, collect all the missing data in your table using df[df.Price.isnull()]

Then check for each missing value if there are duplicates prior to it.

If so, choose the closest duplicate and replace it, else, keep it nan.

ombk
  • 2,036
  • 1
  • 4
  • 16
  • @piterbarg there isnt any reference to a duplicate – ombk Nov 24 '20 at 12:08
  • @piterbarg Thank you, was just about to make the same comment. To elaborate the rationale here is to "dissolve" duplicate values into the days with no data. – storeguy Nov 24 '20 at 12:09
  • @storeguy explain more your question, it isn't clear at all, recheck your data table and re-ask the question. – ombk Nov 24 '20 at 12:10
  • @storeguy are you saying that the fill should be made if on a same day, more than 1 price has been recorded? – ombk Nov 24 '20 at 12:13
  • @storeguy so i got that -1 because your data was wrongly posted and unclear question... – ombk Nov 24 '20 at 12:23
  • @ombk The NaN values should be filled with values from days with duplicate data. question updated to make this more clear. – storeguy Nov 24 '20 at 12:23
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225035/discussion-between-ombk-and-storeguy). – ombk Nov 24 '20 at 14:34