0

I have data that looks like this:

Timestamp Variable A Variable B Variable C
2023-01-01 00:00:00.000 Value Nan Nan
2023-01-01 00:00:00.050 Value Nan Nan
2023-01-01 00:00:00.150 Nan Value Value
2023-01-01 05:00:00.450 Nan Value Value
2023-01-01 05:00:00.500 Value Nan Nan

Variable A comes from one source and Variable B and C from a second, so the timestamps are different which gives me a lot of missing data (the data comes from a dozen sources, this is just a simplified example).

I tried to use the resample method of pandas. But this method creates too many lines for me, in the example above there would be new lines between 2023-01-01 00:00:00.150 and 2023-01-01 05:00:00.450, which I do not want.

Can you help me? Thank you

guir
  • 13
  • 1

2 Answers2

0

One option could be to use the successive time difference and a threshold to group your data with groupby.first (or groupby.mean):

out = df.groupby(df['Timestamp'].diff().ge('1s').cumsum(), as_index=False).first()

Output:

                Timestamp  Variable A  Variable B  Variable C
0 2023-01-01 00:00:00.000         1.0         3.0         4.0
1 2023-01-01 05:00:00.450         7.0         5.0         6.0

With resample:

out = df.resample('5H', on='Timestamp').mean().reset_index()

Output:

            Timestamp  Variable A  Variable B  Variable C
0 2023-01-01 00:00:00         1.5         3.0         4.0
1 2023-01-01 05:00:00         7.0         5.0         6.0

Used input:

                Timestamp  Variable A  Variable B  Variable C
0 2023-01-01 00:00:00.000         1.0         NaN         NaN
1 2023-01-01 00:00:00.050         2.0         NaN         NaN
2 2023-01-01 00:00:00.150         NaN         3.0         4.0
3 2023-01-01 05:00:00.450         NaN         5.0         6.0
4 2023-01-01 05:00:00.500         7.0         NaN         NaN
mozway
  • 194,879
  • 13
  • 39
  • 75
0

I do not know what is your context, but .fillna is a good method.

df = df.astype(str).fillna('')

It will change the columns to string and fill missing values with an empty string. I you need to keep a specific datatype you can do it separately and provide a filling value for each.

David
  • 47
  • 2
  • 11