I have two different suggestions. Because two or more NaN values can follow each other in col_1, you cannot use a single vectorized pass.
Solution 1 simply iterates and solves it one row at a time.
Solution 2 uses a vectorized approach, but only on the subset of rows where it is possible (i.e. rows whose col_1 is missing but whose previous row's col_1 is not), and then repeats until all values are filled.
I am not sure which will be faster for your dataset. (I have assumed that col_2 has no missing values.)
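To make the fill rule concrete: if the previous row has col_1 = 90 and col_2 = 100, and the current row has col_2 = 105, the missing col_1 is filled with 90 * 105 / 100 = 94.5. In other words, the previous col_1 value is scaled by the ratio of the two col_2 values.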
Solution 1:
# Iterate over rows by position so this also works with a DatetimeIndex,
# where label arithmetic like df.loc[i - 1, ...] would fail
for i in range(1, len(df)):
    # Check if the value in col_1 is NaN
    if pd.isnull(df['col_1'].iloc[i]):
        # Scale the previous col_1 value by the ratio of the col_2 values
        df.loc[df.index[i], 'col_1'] = (df['col_1'].iloc[i - 1] *
                                        df['col_2'].iloc[i] / df['col_2'].iloc[i - 1])
Solution 2:
while df['col_1'].isnull().any():
    # Identify rows where col_1 is NaN and the previous row's col_1 is not NaN
    mask = df['col_1'].isnull() & df['col_1'].shift().notna()
    # Shift on the full columns (shifting only the masked subset would skip
    # over the rows in between) and apply the formula to the masked rows
    df.loc[mask, 'col_1'] = (df['col_1'].shift() * df['col_2']
                             / df['col_2'].shift())[mask]
Solution 2 identifies all rows where col_1 is NaN and the previous row's col_1 is not NaN. The formula is then applied only to those rows. This process is repeated until all NaN values in col_1 are filled.
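To see the loop converge, here is a small made-up example (the numbers are arbitrary) containing a double NaN: the first pass fills every NaN that sits directly below a known value, and the second pass fills the one that remains.
import pandas as pd
import numpy as np

# Tiny frame with a double NaN in col_1 (rows 1 and 2) and a single NaN (row 4)
df = pd.DataFrame({
    'col_1': [90.0, np.nan, np.nan, 95.0, np.nan],
    'col_2': [100.0, 102.0, 101.0, 103.0, 104.0],
})

while df['col_1'].isnull().any():
    mask = df['col_1'].isnull() & df['col_1'].shift().notna()
    df.loc[mask, 'col_1'] = (df['col_1'].shift() * df['col_2']
                             / df['col_2'].shift())[mask]

print(df)  # pass 1 fills rows 1 and 4, pass 2 fills row 2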
With a 50000 row set of synthetically generated data similar to yours, with approx. 10000 NaN values in col_1 (including some double and some triple runs of consecutive NaNs) and no NaN values in col_2, Solution 1 executed in 2s 60ms and Solution 2 in 1s 388ms. I also tried both solutions on data with fewer and with more NaN values; both ran faster and slower respectively, the speed difference stayed roughly the same, and Solution 2 was always faster.
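If you want to reproduce the timings, a minimal harness along these lines should work; it wraps the two solutions in helper functions (fill_iterative and fill_vectorized are just names I made up) and assumes df has been built with the generator below.
import time
import pandas as pd

def fill_iterative(df):
    # Solution 1: fill row by row
    for i in range(1, len(df)):
        if pd.isnull(df['col_1'].iloc[i]):
            df.loc[df.index[i], 'col_1'] = (df['col_1'].iloc[i - 1] *
                                            df['col_2'].iloc[i] / df['col_2'].iloc[i - 1])

def fill_vectorized(df):
    # Solution 2: repeated masked vectorized fill
    while df['col_1'].isnull().any():
        mask = df['col_1'].isnull() & df['col_1'].shift().notna()
        df.loc[mask, 'col_1'] = (df['col_1'].shift() * df['col_2']
                                 / df['col_2'].shift())[mask]

for name, fill in [('Solution 1', fill_iterative), ('Solution 2', fill_vectorized)]:
    frame = df.copy()  # fresh copy so each solution does the same work
    start = time.perf_counter()
    fill(frame)
    print(f"{name}: {time.perf_counter() - start:.3f}s")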
Code for generating similar data:
import pandas as pd
import numpy as np
# Set seed for reproducibility
np.random.seed(0)
# Create date range
dates = pd.date_range('2022-10-31', periods=50000)
# Create random data for col_2
col_2 = np.random.uniform(100, 110, 50000)
# Create data for col_1, with NaN at specific intervals
col_1 = np.where((np.arange(50000) % 10) == 0, np.nan, np.random.uniform(90, 100, 50000))
# Adding consecutive NaNs at different intervals
for i in range(len(col_1)):
    if i % 15 == 0 and i + 1 < len(col_1):   # stay within bounds
        col_1[i + 1] = np.nan
    if i % 40 == 0 and i + 2 < len(col_1):
        col_1[i + 2] = np.nan
    if i % 60 == 0 and i + 3 < len(col_1):
        col_1[i + 3] = np.nan
    if i % 120 == 0 and i + 4 < len(col_1):
        col_1[i + 4] = np.nan
# Create DataFrame
df = pd.DataFrame({'col_1': col_1, 'col_2': col_2}, index=dates)
# Making sure the first value is not NaN
df.loc[dates[0], 'col_1'] = 99.094
print(df.head(20))
# Count NaNs in col_1
num_nan = df['col_1'].isnull().sum()
print(f"The number of NaN values in col_1 is {num_nan}")
This generates some stretches with up to 5 consecutive NaN values in col_1, and Solution 2 was still faster (Solution 1: 2s 460ms, Solution 2: 1s 621ms).
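If you want to check the longest NaN run in your own data, one quick way (a sketch that numbers the runs with a cumulative count of the non-NaN positions; run it on the freshly generated frame, before filling) is:
# Number each NaN run by counting the non-NaN values seen so far,
# then count the NaNs within each run and take the maximum
is_nan = df['col_1'].isnull()
run_id = (~is_nan).cumsum()
print(f"Longest consecutive NaN run: {int(is_nan.groupby(run_id).sum().max())}")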