
I have an 80,000-row CSV file made up of four columns: ID, Date, Time and Flow. Flow measurements are taken every 15 minutes, but whenever flow data is missing the row is simply skipped, and recording resumes at the next available measurement.

Example:

USGS    2/12/2023   0:45    167
USGS    2/12/2023   1:00    170
USGS    2/12/2023   1:15    177
USGS    2/12/2023   1:45    170
USGS    2/12/2023   2:00    164
USGS    2/12/2023   2:15    177
USGS    2/12/2023   2:30    170
USGS    2/12/2023   2:45    180

Here 1:30 is missing from the Feb 12th 2023 record. These gaps can be a one-off or can span multiple hours or days.

I'm trying to write a Python script that searches out the missing timesteps and, whenever it finds a skipped/missing row, inserts a replacement row at the missing location(s) with the correct ID, date and time, and NA for flow.

Example:

USGS    2/12/2023   0:45    167
USGS    2/12/2023   1:00    170
USGS    2/12/2023   1:15    177
USGS    2/12/2023   1:30    NA
USGS    2/12/2023   1:45    170
USGS    2/12/2023   2:00    164
USGS    2/12/2023   2:15    177
USGS    2/12/2023   2:30    170
USGS    2/12/2023   2:45    180

or

USGS    1/16/2023   23:00   329
USGS    1/16/2023   23:15   329
USGS    1/16/2023   23:30   329
USGS    1/16/2023   23:45   NA
USGS    1/17/2023   0:00    NA
USGS    1/17/2023   0:15    NA
USGS    1/17/2023   0:30    329
USGS    1/17/2023   0:45    329
USGS    1/17/2023   1:00    329
USGS    1/17/2023   1:15    329
USGS    1/17/2023   1:30    329

So far I've only been able to find solutions that replace single values within the dataset, such as one time value of 1:30 or one flow value, but nothing describing how to insert an entire missing row.

  • Create a dictionary that uses the time as the keys. Then you can loop over all the 15-minute timestamps in the time range. If that key is missing, add it with the default data. – Barmar Feb 21 '23 at 21:27
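
A minimal sketch of that dictionary approach, assuming the rows have already been read into a list of (ID, Date, Time, Flow) string tuples and that there is a single station ID (both assumptions, adapt to the real CSV):

from datetime import datetime, timedelta

# Rows as read from the CSV (sample data only).
rows = [
    ('USGS', '2/12/2023', '0:45', '167'),
    ('USGS', '2/12/2023', '1:00', '170'),
    ('USGS', '2/12/2023', '1:15', '177'),
    ('USGS', '2/12/2023', '1:45', '170'),
]

# Dictionary keyed by the full timestamp of each row.
by_ts = {datetime.strptime(f"{d} {t}", "%m/%d/%Y %H:%M"): (i, d, t, f)
         for i, d, t, f in rows}

site_id = rows[0][0]                 # assume one station ID throughout
ts, end = min(by_ts), max(by_ts)

filled = []
while ts <= end:
    if ts in by_ts:
        filled.append(by_ts[ts])
    else:
        # Rebuild Date/Time in the same no-leading-zero style as the input.
        filled.append((site_id, f"{ts.month}/{ts.day}/{ts.year}",
                       f"{ts.hour}:{ts.minute:02d}", 'NA'))
    ts += timedelta(minutes=15)      # step to the next 15-minute slot

for row in filled:
    print('\t'.join(row))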

1 Answer


You could do this with pandas and it should be pretty fast.

import pandas as pd

# Sample data with the 1:30 row missing.
df = pd.DataFrame({'ID': ['USGS'] * 8,
                   'Date': ['2/12/2023'] * 8,
                   'Time': ['0:45', '1:00', '1:15', '1:45', '2:00', '2:15', '2:30', '2:45'],
                   'Flow': [167, 170, 177, 170, 164, 177, 170, 180]})

# Combine Date and Time into a single timestamp column.
df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

# Resample each ID onto a 15-minute grid; missing timesteps become NaN.
df = df.set_index('dt').groupby('ID')['Flow'].resample('15min').mean().reset_index()

# Rebuild Date and Time. %#m and %#H drop leading zeros on Windows;
# use %-m and %-H on Linux/macOS.
df['Date'] = df['dt'].dt.strftime('%#m/%d/%Y')
df['Time'] = df['dt'].dt.strftime('%#H:%M')

df = df.drop(columns=['dt'])[['ID', 'Date', 'Time', 'Flow']]

print(df)

Output

     ID       Date  Time   Flow
0  USGS  2/12/2023  0:45  167.0
1  USGS  2/12/2023  1:00  170.0
2  USGS  2/12/2023  1:15  177.0
3  USGS  2/12/2023  1:30    NaN
4  USGS  2/12/2023  1:45  170.0
5  USGS  2/12/2023  2:00  164.0
6  USGS  2/12/2023  2:15  177.0
7  USGS  2/12/2023  2:30  170.0
8  USGS  2/12/2023  2:45  180.0
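
For the real 80,000-row file, the same steps can be wrapped between pd.read_csv and to_csv. A rough sketch, assuming the file is tab-separated with no header row and named flow.csv (all assumptions, adjust to the actual file); na_rep='NA' writes the literal NA string the question asks for:

import pandas as pd

# Hypothetical filename and separator; adjust to the real file layout.
df = pd.read_csv('flow.csv', sep='\t', names=['ID', 'Date', 'Time', 'Flow'])

# Same gap-filling as above.
df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df = df.set_index('dt').groupby('ID')['Flow'].resample('15min').mean().reset_index()
df['Date'] = df['dt'].dt.strftime('%#m/%d/%Y')   # %-m on Linux/macOS
df['Time'] = df['dt'].dt.strftime('%#H:%M')      # %-H on Linux/macOS
df = df[['ID', 'Date', 'Time', 'Flow']]

# na_rep writes the literal string NA for the inserted rows.
df.to_csv('flow_filled.csv', sep='\t', index=False, na_rep='NA')

Note that Flow comes out as float after mean(); cast it to the nullable 'Int64' dtype first if whole-number output is preferred.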
Chris
  • I've tried the same resampling on the OP's 2nd sample, which starts at Time `23:00`, and it does not give valid initial values; the result starts at `00:30` instead of `23:00` – RomanPerekhrest Feb 21 '23 at 21:57