
I am trying to use pandas to resample vessel tracking data from seconds to minutes using how='first'. The dataframe is called hg1s. The unique ID is called MMSI. The datetime index is TX_DTTM. Here is a data sample:

            TX_DTTM       MMSI        LAT        LON         NS
2013-10-01 00:00:02  367542760  29.660550 -94.974195         15   
2013-10-01 00:00:04  367542760  29.660550 -94.974195         15   
2013-10-01 00:00:07  367451120  29.614161 -94.954459          0   
2013-10-01 00:00:15  367542760  29.660210 -94.974069         15   
2013-10-01 00:00:13  367542760  29.660210 -94.974069         15   

The code to resample:

hg1s1min = hg1s.groupby('MMSI').resample('1Min', how='first')

And a data sample of the output:

 hg1s1min[20000:20004]
             MMSI             TX_DTTM                  NS      LAT  LON
        367448060 2013-10-21 00:42:00                 NaN      NaN  NaN        
                  2013-10-21 00:43:00                 NaN      NaN  NaN        
                  2013-10-21 00:44:00                 NaN      NaN  NaN      
                  2013-10-21 00:45:00                 NaN      NaN  NaN   

It's safe to assume that there are several data points within each minute, so I don't understand why this isn't picking up the first record for each minute. I looked at this link: Pandas Downsampling Issue, because it seemed similar to my problem. I tried passing label='left' and label='right'; neither worked.

How do I return the first record in every minute for each MMSI?

  • I can't seem to replicate the issue on the small sample of data that was provided. Could you post a minimal example that demonstrates the `NaN`s? – jme Nov 28 '15 at 16:07

1 Answer


As it turns out, the problem isn't with the method, but with my assumption about the data. The dataset spans a month, or 44,640 minutes. While every record in my dataset has the relevant values, there isn't 100% overlap in time across vessels. In this case, MMSI = 367448060 is present at 2013-10-17 23:24:31 and not again until 2013-10-29 20:57:32. Between those two data points there is no data to sample, so the resampled minutes come back as NaN, which is correct.
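The gap-produces-NaN behavior can be reproduced with a small sketch (hypothetical data, one MMSI with a multi-minute gap). Note that `how='first'` was removed in later pandas versions; the modern spelling is `.resample(...).first()`:

```python
import pandas as pd

# Hypothetical sample: one vessel, with no reports during minutes 00:01-00:02.
df = pd.DataFrame(
    {
        "TX_DTTM": pd.to_datetime(
            [
                "2013-10-01 00:00:02",
                "2013-10-01 00:00:04",
                "2013-10-01 00:03:10",  # gap: nothing in minutes 00:01 and 00:02
            ]
        ),
        "MMSI": [367542760, 367542760, 367542760],
        "LAT": [29.660550, 29.660550, 29.660210],
        "LON": [-94.974195, -94.974195, -94.974069],
        "NS": [15, 15, 15],
    }
).set_index("TX_DTTM")

# First record in each minute, per MMSI (modern equivalent of how='first').
per_min = df.groupby("MMSI").resample("1min").first()

# The empty minutes appear as all-NaN rows; drop them if you only want
# minutes that actually contain data.
per_min_clean = per_min.dropna(subset=["LAT"])
```

Here `per_min` has four rows (minutes 00:00 through 00:03), two of which are NaN because the vessel simply wasn't reporting then, while `per_min_clean` keeps only the two minutes with observations.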
