3

I have a DataFrame in which one column contains several NaN values. The DataFrame looks like this:

            col_1   col_2
2022-10-31  99.094  102.498
2022-11-30  99.001  101.880
2022-12-31     NaN  108.498
2023-01-31     NaN  100.500

I want to fill those NaNs with the simple calculation below:

desired_val = (previous value in col_1 * current value in col_2) / previous value in col_2

which means,

df.loc['2022-12-31', 'col_1'] should be = (99.001 * 108.498) / 101.880 = 105.432

and df.loc['2023-01-31', 'col_1'] should be = (105.432 * 100.500) / 108.498 = 97.660

I found a solution using a row-by-row operation, but it is slow when the dataset is big. I tried a column-wise operation using this:

df['col_1'] = df['col_1'].fillna(
    (df['col_1'].shift(1) * df['col_2'])
    / df['col_2'].shift(1)
)

But it only works for the first NaN row and does not go further. Is there a column-wise pandas solution for this?

EMT
  • Note that if you repeat your code in a loop this works, it's just that the previous col_1 value is not yet calculated, thus my use of `ffill` in my answer ;) – mozway Jun 13 '23 at 13:49

4 Answers

4

If you look at your operation, you multiply by a col_2 value in one row and then divide by that same value in the next row, so consecutive steps telescope.

Thus you can simplify the result to:

col1_value = (last_valid_col1_value * current_col2_value)
             / col2_value_at_last_valid_col1_position

Which can be translated as:

# is the row a NA?
m1 = df['col_1'].isna()
# is the next row a NA?
m2 = df['col_1'].shift(-1).isna()

df.loc[m1, 'col_1'] = (df['col_1'].div(df['col_2'])
                       .where(m2 & ~m1).ffill()
                       .mul(df['col_2'])[m1]
                      )

Output:

                 col_1    col_2
2022-10-31   99.094000  102.498
2022-11-30   99.001000  101.880
2022-12-31  105.431984  108.498
2023-01-31   97.659997  100.500

Intermediates:

             col_1    col_2     m1     m2  m2&~m1  ffilled(col1/col2)      result
2022-10-31  99.094  102.498  False  False   False                 NaN         NaN
2022-11-30  99.001  101.880  False   True    True            0.971741         NaN
2022-12-31     NaN  108.498   True   True   False            0.971741  105.431984
2023-01-31     NaN  100.500   True   True   False            0.971741   97.659997
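As a self-contained sanity check, the same steps can be run end-to-end on the sample data from the question (this just repackages the code above into one runnable script):

```python
import pandas as pd

df = pd.DataFrame(
    {'col_1': [99.094, 99.001, None, None],
     'col_2': [102.498, 101.880, 108.498, 100.500]},
    index=pd.to_datetime(['2022-10-31', '2022-11-30',
                          '2022-12-31', '2023-01-31']),
)

m1 = df['col_1'].isna()            # is the row a NA?
m2 = df['col_1'].shift(-1).isna()  # is the next row a NA?

# ratio col_1/col_2 at the last valid row, forward-filled over the
# NaN block, then multiplied back by the current col_2
df.loc[m1, 'col_1'] = (df['col_1'].div(df['col_2'])
                       .where(m2 & ~m1).ffill()
                       .mul(df['col_2'])[m1])

print(df['col_1'].round(3).tolist())  # [99.094, 99.001, 105.432, 97.66]
```

Because the intermediate col_2 terms cancel, the ratio col_1/col_2 is constant across a run of consecutive NaNs, which is exactly what the `ffill` relies on.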
mozway
1

Another possible solution, which uses a closed-form formula (under the assumption that once col_1 becomes NaN it stays NaN, i.e. all the NaNs trail the last valid value):

col1_NaN = last_not_NaN_of_col1 * 
           cumprod(col2_after_last_not_NaN_col1/col2.shift)

df.assign(col_1 = df['col_1'].combine_first(
    df['col_1'][df['col_1'].last_valid_index()] * 
    df['col_2'].div(df['col_2'].shift())[df['col_1'].isna()].cumprod()))

Output:

                 col_1    col_2
2022-10-31   99.094000  102.498
2022-11-30   99.001000  101.880
2022-12-31  105.431984  108.498
2023-01-31   97.659997  100.500
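For reference, a self-contained run of this cumprod approach on the sample data (same logic as above, just packaged together with the inputs):

```python
import pandas as pd

df = pd.DataFrame(
    {'col_1': [99.094, 99.001, None, None],
     'col_2': [102.498, 101.880, 108.498, 100.500]},
    index=pd.to_datetime(['2022-10-31', '2022-11-30',
                          '2022-12-31', '2023-01-31']),
)

# Anchor: the last non-NaN value in col_1
last = df['col_1'][df['col_1'].last_valid_index()]

# Cumulative product of the col_2 ratios, restricted to the NaN rows
ratios = df['col_2'].div(df['col_2'].shift())[df['col_1'].isna()].cumprod()

out = df.assign(col_1=df['col_1'].combine_first(last * ratios))
print(out['col_1'].round(3).tolist())  # [99.094, 99.001, 105.432, 97.66]
```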
PaulS
0

This works! Unsure how fast it will be, but it works.

It works by going row by row through the DataFrame, checking for NaNs, and updating the DataFrame as it goes. The issue with vectorised solutions like the one you posted is that they update everything simultaneously, hence only the first match gets updated.

import pandas as pd

data = {'col_1': [99.094, 99.001, None, None],
        'col_2': [102.498, 101.880, 108.498, 100.500]}
index = pd.to_datetime(['2022-10-31', '2022-11-30', '2022-12-31', '2023-01-31'])

df = pd.DataFrame(data, index=index)

previous_col1 = None
previous_col2 = None

for index, row in df.iterrows():
    if pd.isnull(row['col_1']) and previous_col1 is not None and previous_col2 is not None:
        df.loc[index, 'col_1'] = (previous_col1 * row['col_2']) / previous_col2
    previous_col1 = df.loc[index, 'col_1']
    previous_col2 = row['col_2']

"""
                 col_1    col_2
2022-10-31   99.094000  102.498
2022-11-30   99.001000  101.880
2022-12-31  105.431984  108.498
2023-01-31   97.659997  100.500
"""
Mark
  • Actually I already solved it with the iterrows function. As I said, I am trying to avoid row-wise operations. – EMT Jun 13 '23 at 13:36
  • Ah sorry, I misread your post! I think it's because you are indeed updating the column row by row; it's impossible for it to both update each row and not be row by row (unless one were to repeatedly run the vectorised version until no NaNs remained, but I imagine that would be the slowest option by far) – Mark Jun 13 '23 at 13:43
  • 1
    You don't need to be! I appreciate your effort and welcome to the Stack Overflow world! – EMT Jun 13 '23 at 13:44
0

I have two different suggestions. Because two or more NaN values may follow each other in col_1, you cannot do it in a single vectorized pass.

Solution 1 simply iterates and solves it one row at a time. Solution 2 performs a vectorized pass, but only on the subset of values that allow it (i.e. missing values in col_1 that have a non-missing value directly above them), and then repeats until all values are filled.

I am not sure which will be faster for your dataset. (I have assumed that col_2 has no missing values.)

Solution 1:

# Iterate positionally so "previous row" works with any index type
col_1_pos = df.columns.get_loc('col_1')
col_2_pos = df.columns.get_loc('col_2')
for i in range(1, len(df)):
    # check if value in col_1 is NaN
    if pd.isnull(df.iloc[i, col_1_pos]):
        df.iloc[i, col_1_pos] = (df.iloc[i - 1, col_1_pos] * df.iloc[i, col_2_pos]) / df.iloc[i - 1, col_2_pos]

Solution 2:

while df['col_1'].isnull().any():
    # Identify rows where col_1 is NaN and the previous row in col_1 is not NaN
    mask = df['col_1'].isnull() & df['col_1'].shift().notna()

    # Compute the formula on the full shifted columns, then apply it only on those rows
    fill = (df['col_1'].shift() * df['col_2']) / df['col_2'].shift()
    df.loc[mask, 'col_1'] = fill[mask]

Solution 2 identifies all rows where col_1 is NaN and the previous row in col_1 is not NaN. The formula is then applied only to those rows. This process is repeated until all NaN values in col_1 are filled.
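Run end-to-end, with the columns shifted in full before masking so each pass compares true adjacent rows, the loop converges even across a run of NaNs. A minimal sketch, where the fifth row (col_2 = 103.000) is an invented value added purely to force three consecutive trailing NaNs:

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': [99.094, 99.001, None, None, None],
    'col_2': [102.498, 101.880, 108.498, 100.500, 103.000],
})

while df['col_1'].isnull().any():
    # NaN rows whose previous col_1 value is already known
    mask = df['col_1'].isnull() & df['col_1'].shift().notna()

    # Candidate values from the full shifted columns, written only
    # into the masked rows; each pass fills one more layer of NaNs
    fill = (df['col_1'].shift() * df['col_2']) / df['col_2'].shift()
    df.loc[mask, 'col_1'] = fill[mask]

print(df['col_1'].round(3).tolist())
# [99.094, 99.001, 105.432, 97.66, 100.089]
```

Here the loop runs three times, filling one NaN per pass.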

With a 50000-row set of synthetically generated data similar to yours, with approx. 10000 NaN values, including some runs of two and three consecutive NaNs in col_1 but no NaN values in col_2, Solution 1 executed in 2s 60ms and Solution 2 in 1s 388ms. I also tried both solutions on data with fewer and with more NaN values. Both ran faster and slower respectively, and the speed difference stayed about the same, with Solution 2 always being faster.

Code for generating similar data:

import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(0)

# Create date range
dates = pd.date_range('2022-10-31', periods=50000)

# Create random data for col_2
col_2 = np.random.uniform(100, 110, 50000)

# Create data for col_1, with NaN at specific intervals
col_1 = np.where((np.arange(50000) % 10) == 0, np.nan, np.random.uniform(90, 100, 50000))

# Adding consecutive NaNs at different intervals
for i in range(len(col_1)):
    for step, offset in ((15, 1), (40, 2), (60, 3), (120, 4)):
        if i % step == 0 and i + offset < len(col_1):  # stay within bounds
            col_1[i + offset] = np.nan

# Create DataFrame
df = pd.DataFrame({'col_1': col_1, 'col_2': col_2}, index=dates)

# Making sure the first value is not NaN
df.loc[dates[0], 'col_1'] = 99.094

print(df.head(20))

# Count NaNs in col_1
num_nan = df['col_1'].isnull().sum()
print(f"The number of NaN values in col_1 is {num_nan}")

This generates some sections with up to five consecutive NaN values, and Solution 2 was still faster (Solution 1: 2s 460ms, Solution 2: 1s 621ms).

Glogg
  • I tried mozway's code on the 50000 and it executes in around 10ms so definitely the better option. – Glogg Jun 13 '23 at 14:24
  • As I mentioned, the row-wise solution is done, but I discarded it due to its lower speed. Both of these either go row by row or run a for loop, which reduces the benefit of using pandas. – EMT Jun 13 '23 at 14:51
  • Yes, but to get to the solution that mozway did, you need huge brains and huge experience and I clearly lack at least one of those. – Glogg Jun 14 '23 at 06:26
  • After thinking about the logic in mozway's code, I believe the forward fill will carry the same value to all of the following NaNs in the calculation, so if there are more than two NaNs in a row in your data, you will not get the correct values in any row that has two or more NaNs above it. Please correct me if I am wrong. – Glogg Jun 15 '23 at 22:17