1

I have dataset which has forecast data which is updated ambiguously over a 42 hour period. Here is a sample:

df_old = pd.DataFrame({'IssueDatetime': ['2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00','2010-01-01 09:00:00','2010-01-01 09:00:00'],
                   'endtime':['2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00'],
                   'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH','NORTHEAST COAST','NORTHEAST COAST'],
                  'forecastTime': ['2010-01-01 09:00:00','2010-01-01 15:00:00','2010-01-01 19:00:00','2010-01-01 09:00:00','2010-01-01 12:00:00'],
                   'forecast_Dir':[150,180,45,45,45],
                   'windSpeed':[20,90,35,45,15]})

The issue is the gaps between the hours of df['forecastTime'] and df['endtime]. I have tried using my limited pandas knowledge to group and resample the data, but, because the dates are repeated I cannot get a datetime index.

Ultimately my goal is to expand the dataframe so the hours in between the original hours in the dataframe have rows of their own right up until the end period...

Example of desired output:

df_new = pd.DataFrame({'IssueDatetime': [ '2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00','2010-01-01 09:00:00'],
                   'endtime':['2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00','2010-01-03 03:00:00'],
                   'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH','EAST COAST-CAPE ST FRANCIS AND SOUTH'],
                  'forecastTime': ['2010-01-01 09:00:00','2010-01-01 10:00:00','2010-01-01 11:00:00','2010-01-01 12:00:00','2010-01-01 13:00:00','2010-01-01 14:00:00','2010-01-01 15:00:00'],
                   'forecast_Dir':[150,150,150,150,150,150,180],
                   'windSpeed':[20,20,20,20,20,20,90]})

Note for the first region, the hours between df['forecastTime'] = '2010-01-01 09:00:00' and df['forecastTime'] = '2010-01-01 15:00:00' should be rows of their own. Essentially I'm looking to upsample to fill in missing hours.


EDIT: - Orginal dataframe

            IssueDatetime             endtime  \
0     2013-01-01 09:00:00 2013-01-03 03:00:00   
1     2013-01-01 09:00:00 2013-01-03 03:00:00   
2     2013-01-01 09:00:00 2013-01-03 03:00:00   
3     2013-01-01 09:00:00 2013-01-03 03:00:00   
4     2013-01-01 09:00:00 2013-01-03 03:00:00   
...                   ...                 ...   
53585 2016-12-30 09:00:00 2017-01-01 03:00:00   
53586 2016-12-30 09:00:00 2017-01-01 03:00:00   
53587 2016-12-30 09:00:00 2017-01-01 03:00:00   
53588 2016-12-30 09:00:00 2017-01-01 03:00:00   
53589 2016-12-30 09:00:00 2017-01-01 03:00:00   

                                     Regions        forecastTime  \
0                                SOUTH COAST 2013-01-01 09:00:00   
1                                SOUTH COAST 2013-01-01 18:00:00   
2                                SOUTH COAST 2013-01-02 06:00:00   
3                                SOUTH COAST 2013-01-02 13:00:00   
4      EAST COAST-CAPE ST  FRANCIS AND SOUTH 2013-01-01 09:00:00   
...                                      ...                 ...   
53585               SOUTHWESTERN GRAND BANKS 2016-12-30 18:00:00   
53586               SOUTHWESTERN GRAND BANKS 2016-12-31 09:00:00   
53587               SOUTHWESTERN GRAND BANKS 2016-12-31 15:00:00   
53588               SOUTHWESTERN GRAND BANKS 2016-12-31 18:00:00   
53589               SOUTHWESTERN GRAND BANKS 2017-01-01 00:00:00   

       forecastHour forecast_Dir forecast_WindSpeed_low  \
0               0.0          270                     35   
1               9.0          270                     25   
2              21.0          225                     15   
3              28.0          270                     35   
4               0.0          270                     35   
...             ...          ...                    ...   
53585           9.0          135                     40   
53586          24.0          135                     40   
53587          30.0          135                     40   
53588          33.0          315                     25   
53589          39.0          315                     25   

      forecast_WindSpeed_gust forecast_WindSpeed_high  \
0                        None                    None   
1                        None                    None   
2                        None                    None   
3                        None                    None   
4                        None                    None   
...                       ...                     ...   
53585                    None                      50   
53586                    None                     50    
53587                    None                      50   
53588                    None                      35   
53589                    None                    None   

      forecast_WindSpeed_exception_1_type forecast_Dir_exception_1  \
0                                     NaN                      NaN   
1                                     NaN                      NaN   
2                                     NaN                      NaN   
3                                     NaN                      NaN   
4                                     NaN                      NaN   
...                                   ...                      ...   
53585                                 NaN                      NaN   
53586          OVER NORTHWESTERN SECTIONS                      315   
53587                                 NaN                      NaN   
53588                                 NaN                      NaN   
53589                                 NaN                      NaN   

      forecast_WindSpeed_low_exception_1 forecast_WindSpeed_high_exception_1  
0                                    NaN                                 NaN  
1                                    NaN                                 NaN  
2                                    NaN                                 NaN  
3                                    NaN                                 NaN  
4                                    NaN                                 NaN  
...                                  ...                                 ...  
53585                                NaN                                 NaN  
53586                                 25                                None  
53587                                NaN                                 NaN  
53588                                NaN                                 NaN  
53589                                NaN                                 NaN  
MrNobody33
  • 6,413
  • 7
  • 19

2 Answers2

2

IIUC, first you need to convert the 'forecastTime' column to datetime, then you set 'forecastTime' column as index to do the resample, after that group by 'Regions', do the resample by hours, and fill the NaN values with ffill:

df_old['forecastTime'] = pd.to_datetime(df_old['forecastTime'])
df_new = df_old.set_index('forecastTime')
               .groupby('Regions',as_index=False)
               .resample('H').ffill().droplevel(0).reset_index()
print(df_new.head())

To avoid a possible error(non-unique index) because of duplicate dates, you can try this:

df_new = df_old.groupby('Regions',as_index=False)
   .apply(lambda x: x.set_index('forecastTime')
   .resample('H').ffill()).droplevel(0).reset_index()

Output:

         forecastTime        IssueDatetime              endtime                               Regions  forecast_Dir  windSpeed
0 2010-01-01 09:00:00  2010-01-01 09:00:00  2010-01-03 03:00:00  EAST COAST-CAPE ST FRANCIS AND SOUTH           150         20
1 2010-01-01 10:00:00  2010-01-01 09:00:00  2010-01-03 03:00:00  EAST COAST-CAPE ST FRANCIS AND SOUTH           150         20
2 2010-01-01 11:00:00  2010-01-01 09:00:00  2010-01-03 03:00:00  EAST COAST-CAPE ST FRANCIS AND SOUTH           150         20
3 2010-01-01 12:00:00  2010-01-01 09:00:00  2010-01-03 03:00:00  EAST COAST-CAPE ST FRANCIS AND SOUTH           150         20
4 2010-01-01 13:00:00  2010-01-01 09:00:00  2010-01-03 03:00:00  EAST COAST-CAPE ST FRANCIS AND SOUTH           150         20
MrNobody33
  • 6,413
  • 7
  • 19
  • When running this on the complete dataset I am getting the following error: ValueError: cannot reindex a non-unique index with a method or limit... I believe this is because the dates occur for several different regions not just "EAST COAST-CAPE ST FRANCIS AND SOUTH" – Jordan Ford Aug 05 '20 at 02:01
  • I guess maybe that could happen. And no, that happens because as you can see there are duplicate dates in the 'forecastTime' column, as in row 0 and 3 of df_old. See the edit! @JordanFord – MrNobody33 Aug 05 '20 at 02:08
  • ah yes, this is the problem I have been running into since the start. Still getting the stubborn error! – Jordan Ford Aug 05 '20 at 02:17
  • Sorry yes, you said the error before. But with the second solution I´m not getting the error, are you testing it with the original sample you posted? Or with your original dataframe?... And if it is with your original dataframe, do you have duplicate dates for a single region? @JordanFord – MrNobody33 Aug 05 '20 at 02:20
  • Testing on the original dataset, and the forecast time only occurs once per region for each specific date. The set spans from 3 years in a datetime format as shown in my post. Thanks for this btw! – Jordan Ford Aug 05 '20 at 02:24
  • And with the second solution you still getting the error? I´m not sure why it is because if the dates only occurs one time by region, it means there isn´t duplicated dates when doing the set_index. So, there is no reason for the error. @JordanFord – MrNobody33 Aug 05 '20 at 02:30
  • Yeah, it does not make sense to me. Your second solution should have done it from what I can tell. I'm lost. Spent about 12 hrs on this issue today lol. – Jordan Ford Aug 05 '20 at 02:41
  • maybe you could add the head of the original dataframe to test on it @JordanFord – MrNobody33 Aug 05 '20 at 03:29
  • Alright a portion of the dataset has been added to the post. – Jordan Ford Aug 05 '20 at 12:15
  • I managed to find a way to almost make it work using your solution. – Jordan Ford Aug 10 '20 at 11:28
0

I've partially answered my own question... I was still getting a multi-index error so I added a millisecond to each forecastTime per row to make them unique:

df_old['forecastTime'] = df_old['forecastTime'] + pd.to_timedelta(df_old.groupby('forecastTime').cumcount(), unit='ms')


df_old['forecastTime'] = pd.to_datetime(df_old['forecastTime'])
df_new = df_old.set_index('forecastTime')
               .groupby('Regions',as_index=False)
               .resample('H').ffill().droplevel(0).reset_index()

Now, I need to add an end time. Any suggestions to ending the .ffill() exactly 42 hours after the stat time of each forecast?