3

I want to forward fill 2 columns: Time and X in df:

   Time                     X   Y   Z
0  2020-01-15 06:12:49.213  0   0   0
1  2020-01-15 08:12:49.213  1   2   2
2  2020-01-15 10:12:49.213  3   6   9
3  2020-01-15 12:12:49.213  12  15  4
4  2020-01-15 14:12:49.213  8   4   3   

but keeping remaining columns Y and Z constant, or fill the additional rows with NaN.

I checked Pandas docs for .fillna and .asfreq but they didn't cover forward fill certain columns. While this answer did, it didn't specify frequency.

Expected output(using 10s frequency):

    Time                     X   Y   Z
0   2020-01-15 06:12:49.213  0   0   0
1   2020-01-15 06:12:59.213  0   NaN NaN  # forward filled 
2   2020-01-15 06:13:09.213  0   NaN NaN  # forward filled 
               ...
11  2020-01-15 08:12:49.213  1   2   2
12  2020-01-15 08:12:59.213  1   NaN NaN  # forward filled 
13  2020-01-15 08:13:09.213  1   NaN NaN  # forward filled 
               ...
22  2020-01-15 10:12:49.213  3   6   9
23  2020-01-15 10:12:59.213  3   NaN NaN  # forward filled 

               ...
nilsinelabore
  • 4,143
  • 17
  • 65
  • 122

1 Answers1

2

You can try asfreq to resample times.

Workflow:

  • First we set the Time column as index
  • Sort the index (if there are not, the asfreq method will failed)
  • Let's now extend the dataframe. We operate the resample two times according the method used:

    • If no method is provided (e.g. None), new values are filled with NaN. We use this for columns Y and Z
    • For the X column, the method ffill "propagates last valid observation forward to next valid" doc.

    • As you highlight in comments, the frequency used is important to know if all the values will be kept. If the frequency is too big, some values might not match the interval. As a result, these values will be skipped. To overcome this, a solution might be to use smaller interval (let's say 1s). Using it, the ffill will be correctly applied to all values.

    • However, if you actually want a 10S daterange dataframe, we need to resample. Here, we start to understand, by doing that, we will again delete values not in daterange. But it's not problematic because we already have these values (they are our input). So we can append them to our dataframe using append (like this, we will be sure to have all the values). We might even have duplicates, so remove them using drop_duplicates.

Full example:

# Be sure it's a datetime object
df["Time"] = pd.to_datetime(df["Time"])
print(df)

# Set tme column as index
df.set_index(["Time"], inplace=True)
df = df.sort_index()
print(df)
#                      Time   X   Y  Z
# 0 2020-01-15 06:12:49.213   0   0  0
# 1 2020-01-15 08:12:49.213   1   2  2
# 2 2020-01-15 10:12:49.213   3   6  9
# 3 2020-01-15 11:45:24.213   4   6  9
# 4 2020-01-15 12:12:49.213  12  15  4
# 5 2020-01-15 12:12:22.213  12  15  4
# 6 2020-01-15 14:12:49.213   8   4  3

# Resample
out = df[["Y", "Z"]].asfreq('10S')
out["X"] = df["X"].asfreq('1S', method="ffill").asfreq('10S')

# Reset index
out = out.append(df, sort=True).reset_index().drop_duplicates().reset_index(drop=True)
print(out)
#                         Time   X     Y    Z
# 0    2020-01-15 06:12:49.213   0   0.0  0.0
# 1    2020-01-15 06:12:59.213   0   NaN  NaN
# 2    2020-01-15 06:13:09.213   0   NaN  NaN
# 3    2020-01-15 06:13:19.213   0   NaN  NaN
# 4    2020-01-15 06:13:29.213   0   NaN  NaN
# ...                      ...  ..   ...  ...
# 2878 2020-01-15 14:12:29.213  12   NaN  NaN
# 2879 2020-01-15 14:12:39.213  12   NaN  NaN
# 2880 2020-01-15 14:12:49.213   8   4.0  3.0
# 2881 2020-01-15 11:45:24.213   4   6.0  9.0
# 2882 2020-01-15 12:12:22.213  12  15.0  4.0

# [2883 rows x 4 columns]
Alexandre B.
  • 5,387
  • 2
  • 17
  • 40
  • I think so. You can keep the index in another column. However, since `asfreq` is expending the dataframe, the index are changed – Alexandre B. Apr 13 '20 at 12:34
  • Thank you for the answer! I have a question: can I still set constraint on time to apply further operations? For example, can I calculate the mean of `data` for `time` between `0am` and `4am`? – nilsinelabore Apr 13 '20 at 12:43
  • 1
    Yes it is. That depends on what you want to do. If you're only looking for a specific daterange, you can subset the dataset and apply [`mean`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html). However, if you're looking for a summary for all *4h* daterange, you should have a look at [`pd.Grouper`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html) – Alexandre B. Apr 13 '20 at 12:49
  • Btw, it seems to have removed the original data, which are not at `10second` intervals - is it possible to keep these original data? – nilsinelabore Apr 13 '20 at 12:50