I have never been good with pandas so I treat the frame more-or-less as 2D array (thus there might be more efficient ways to do this with pandas... My attempt on this is the 2nd solution)
So, the idea is:
- Loop your frame row-by-row
- Always keep a pointer to the previous row so you can compare Dates (and values)
- When a duplicate Date is found, set the
last_dup_row_index
which is always points to the latest "movable" row (see inline comments for edge cases)
- While iterating, if you hit a missing Price, add a "move" into
to_move_indexes
. This is a list of tuples of moves that can be performed
At the end of the above loop you have all you need to modify your frame:
- The possible price moves
- The indexes you move from so you can delete those rows if you want to
The code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08'],
'Price': [200, 324, 320, 240, np.NaN, 500, 520]})
df['Date']= pd.to_datetime(df['Date'])
df.set_index('Date')
prev_row = None
last_dup_row_index = None
to_move_indexes = []
for index, row in df.iterrows():
# Check if we have a None in the current row and
# a duplicate row waiting for us
if pd.isna(row['Price']) and last_dup_row_index is not None:
print(f"To move price from {last_dup_row_index} to {index}")
to_move_indexes.append((last_dup_row_index, index))
# Check if this row and the prev one have the
# same date
if prev_row is not None and prev_row['Date'] == row['Date']:
# There is a case where for the same Date you have
# two entries out of which one is NaN. Here use the
# other one
if not pd.isna(row['Price']):
print(f"Setting duplicate to: idx={index},\n{row}")
last_dup_row_index = index
elif not pd.isna(prev_row['Price']):
print(f"Setting duplicate to: idx={index - 1},\n{prev_row}")
last_dup_row_index = index - 1
else:
# There is an edge case where two NaNs follow each
# other - not changing last duplicate
print(f"Warning: two NaN @{row['Date']}")
prev_row = row
print(to_move_indexes)
# Perform moves
for from_idx, to_idx in to_move_indexes:
df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
print("\nFrame after moves:")
print(df)
# Preform deletes if you need to
df.drop([x for x, _ in to_move_indexes], inplace=True)
print("\nFrame after deletes:")
print(df)
And the output you get is:
# Here we detected that row index 2 is a duplicate (with index 1)
Setting duplicate to: idx=2,
Date 2020-01-02 00:00:00
Price 320
Name: 2, dtype: object
# Here we see that row index 4 is missing Price. However
# we have the previous duplicate (2) waiting for us so we
# add a "move" as (2, 4) to our list
To move price from 2 to 4
# The final list is
[(2, 4)]
Frame after moves:
Date Price
0 2020-01-01 200.0
1 2020-01-02 324.0
2 2020-01-02 320.0
3 2020-01-03 240.0
4 2020-01-04 320.0
5 2020-01-07 500.0
6 2020-01-08 520.0
Frame after deletes:
Date Price
0 2020-01-01 200.0
1 2020-01-02 324.0
3 2020-01-03 240.0
4 2020-01-04 320.0
5 2020-01-07 500.0
6 2020-01-08 520.0
UPDATE: Second way
# Calculate diff on dates column and keep the
# ones that are same (returns series)
dups = df.Date.diff() == "0 days"
# A cryptic (for me) way to get all the indexes
# where the value is True
dup_indexes = dups.index[dups].to_list()
# Now get the indexes where the Price is Nan
nans = pd.isnull(df).any(1)
nan_indexes = nans.index[nans].to_list()
# Create moves: the nan_index should be greater than
# the dup_index but as close as possible
moves = []
for nan_index in nan_indexes:
# dup_indexes are sorted so get the last one,
# smaller than the nan_index
dup_index = [x for x in dup_indexes if x < nan_index][-1]
if dup_index:
moves.append((dup_index, nan_index))
# Do moves and deletes
for from_idx, to_idx in moves:
df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
df.drop([x for x, _ in moves], inplace=True)
print(df)