Pandas resample function adds dates that were not in the Dataframe

Question

I have a January 2019 Dataframe with a lot of data.

in the photo the data -only january- is grouped -using .loc function- by Ciclo_Estacion_Retiro== 271

series=read_csv('2019-01.csv',header=0,parse_dates=True,squeeze=True)

series=series.dropna(how='all')

print(series.head())

Out[]:
Genero_Usuario  Edad_Usuario   Bici  Ciclo_Estacion_Retiro Fecha_Retiro  \
0              M            36  11444                     67     01/01/19   
1              M            36   9196                     55     01/01/19   
2              M            38   2612                     29     01/01/19   
3              M            43   7463                     29     01/01/19   
4              M            37   3987                    136     01/01/19   

  Hora_Retiro  Ciclo_Estacion_Arribo Fecha_Arribo Hora_Arribo  
0     0:08:44                     36     01/01/19     0:14:03  
1     0:12:05                    136     01/01/19     0:42:11  
2     0:13:15                    115     01/01/19     0:37:47  
3     0:13:36                    115     01/01/19     0:38:04  
4     0:15:30                     35     01/01/19     0:22:47  

In[]:
series['fecha_solicitud']=series[['Fecha_Retiro','Hora_Retiro']].agg(' '.join,axis=1)
series['fecha_entrega']=series[['Fecha_Arribo','Hora_Arribo']].agg(' '.join,axis=1)

series['fecha_solicitud']=pandas.to_datetime(series['fecha_solicitud'],infer_datetime_format=True)

series['fecha_entrega']=pandas.to_datetime(series['fecha_entrega'],infer_datetime_format=True)

series=series.drop('Fecha_Arribo',axis=1)
series=series.drop('Hora_Arribo',axis=1)
series=series.drop('Fecha_Retiro',axis=1)
series=series.drop('Hora_Retiro',axis=1)

first_column = series.pop('fecha_entrega')
series.insert(0, 'fecha_entrega', first_column)

first_column = series.pop('fecha_solicitud')
series.insert(0, 'fecha_solicitud', first_column)

print(series.head())

Out[]:

fecha_solicitud       fecha_entrega Genero_Usuario  Edad_Usuario   Bici  \
0 2019-01-01 00:08:44 2019-01-01 00:14:03              M            36  11444   
1 2019-01-01 00:12:05 2019-01-01 00:42:11              M            36   9196   
2 2019-01-01 00:13:15 2019-01-01 00:37:47              M            38   2612   
3 2019-01-01 00:13:36 2019-01-01 00:38:04              M            43   7463   
4 2019-01-01 00:15:30 2019-01-01 00:22:47              M            37   3987   

   Ciclo_Estacion_Retiro  Ciclo_Estacion_Arribo  
0                     67                     36  
1                     55                    136  
2                     29                    115  
3                     29                    115  
4                    136                     35  


In[]:
repetidos=series['Ciclo_Estacion_Retiro'].value_counts()
repetidos10=DataFrame(repetidos.head(10))
repetidos10

Out[]:

Ciclo_Estacion_Retiro
271 7442
27  6533
1   6370
182 5044
36  4493
64  4453
43  4402
18  4363
47  4299
266 4221

In[]:

repetidos10=repetidos10.reset_index()
repetidos10.columns=['Estacion','Frecuencia']
repetidos10

Out[]:

    Estacion    Frecuencia
0   271 7442
1   27  6533
2   1   6370
3   182 5044
4   36  4493
5   64  4453
6   43  4402
7   18  4363
8   47  4299
9   266 4221

repetidos10.Estacion[0]=271 # repetidos10 is a DataFrame with the 10 stations with the highest demand

In[]:

origen_Test=pandas.DataFrame(series.loc[(series['Ciclo_Estacion_Retiro'] ==repetidos10.Estacion[0])])

origen_Test

Out[]:
fecha_solicitud fecha_entrega   Genero_Usuario  Edad_Usuario    Bici    Ciclo_Estacion_Retiro   Ciclo_Estacion_Arribo
107 2019-01-01 07:31:34 2019-01-01 07:38:09 M   53  8064    271 26
117 2019-01-01 07:42:57 2019-01-01 07:47:05 M   63  7825    271 270
124 2019-01-01 07:45:03 2019-01-01 07:55:09 M   44  3936    271 107
137 2019-01-01 07:57:03 2019-01-01 08:01:21 M   39  10213   271 10
157 2019-01-01 08:14:09 2019-01-01 08:26:28 M   41  10734   271 46
... ... ... ... ... ... ... ...
695937  2019-01-31 21:29:27 2019-01-31 21:40:18 M   49  1849    271 26
696011  2019-01-31 21:33:05 2019-01-31 21:57:46 F   29  12373   271 74
696467  2019-01-31 22:05:15 2019-01-31 22:13:31 M   38  9491    271 113
696929  2019-01-31 22:52:38 2019-01-31 23:02:47 M   28  7914    271 174
696957  2019-01-31 22:57:00 2019-01-31 23:25:39 M   37  10974   271 271
7442 rows × 7 columns

I want to group them every 30 minutes with the Pandas resample function, but when doing so it adds months that I don't have, that is, it changes the date...

origen_30min_Test=origen_Test.resample('30min',on='fecha_solicitud').Ciclo_Estacion_Arribo.count()

print(origen_30min_Test)

fecha_solicitud
2019-01-01 07:30:00    4
2019-01-01 08:00:00    2
2019-01-01 08:30:00    0
2019-01-01 09:00:00    0
2019-01-01 09:30:00    1
                      ..
2019-12-01 21:00:00    1
2019-12-01 21:30:00    1
2019-12-01 22:00:00    0
2019-12-01 22:30:00    0
2019-12-01 23:00:00    2
Freq: 30T, Name: Ciclo_Estacion_Arribo, Length: 16064, dtype: int64

in the second photo is the result of executing the resample function 30min

I have tried several things ... but they do not work for me, can you help me please

These pictures do not demonstrate the problem - Can you provide sample data that reproduces the problem? Mainly, your dataframe `origen_Test` could very easily contain dates from Dec 2019 in the "hidden" rows that are not displayed in the output and that could explain the result. — topsail, Jul 30 '22 at 19:48
Please provide enough code so others can better understand or reproduce the problem. — Community, Jul 30 '22 at 21:11
series is the main DataFrame (with all data for January 2019 and all stations). origin_Test is a DataFrame that contains the data filtered by Ciclo_Estacion_Retiro = 271 (It is the station with the highest demand). I Filtered, confirmed in Excel, and double-checked that the data -in the original .CSV file- only contains data from January 2019. — JorgeChiles, Jul 31 '22 at 03:25
when you say "hidden" what do you mean? or how can I know if there are hidden rows in DataFrame? help me, please. — JorgeChiles, Jul 31 '22 at 03:27
I modified the code, if it helps you and you can help me.. thanks — JorgeChiles, Jul 31 '22 at 04:00

score 0 · Answer 1 · answered Jul 31 '22 at 05:33

I plot the original data at the beginning and only data from January 2019 appears. it's OK

the "error" appears after using the resample function after this line...

series['fecha_solicitud']=pandas.to_datetime(series['fecha_solicitud'],infer_datetime_format=True)

I had to use infer_datetime_format=True because the original Hour data only has one digit, that is:

01/01/19     0:14:03

and I couldn't change it to (throwing an error)

01/01/19     00:14:03

Do you know how I can add the 0 to the HOUR data of the time, I think with this I can solve it. thanks

Pandas resample function adds dates that were not in the Dataframe

1 Answers1