0

I have tried to resample my values to hour. However, since I have changed the format of the date in csv file because of automatic swapping of months and days with low numbers (2003-04-01 is suddenly 2003-01-04). Now the date format is fine (when showing the csv file in Python) but while using resample, the values appear in NaN values.

df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')

`hour_avg = df_2.resample('H').mean()` 

Sample of my data: Raw data with time as index

Afterwards: even when time is datetime it shows 99% of the data as NaN values (one value per day is shown) Data with NaN values after resample per hours

When I used resample for day values, all values are back. So it seems there is a problem with the Time. When I use the format at the beginning, the error "The format doesn't fit" comes up. I tried a different way before (not sure what was different) but resample worked per hour.

What do I need to change to be able to use resample for hour again?

NadineH33
  • 1
  • 1

1 Answers1

0

Can you share a sample of your data? Assuming that your data consists of a DateTime feature (i.e. yyyy-mm-dd hh-mm-ss) and some other features that you are trying to resample by hour, NaN values can occur due to two reasons: incorrect formatting by Pandas or missing hour values in data.

(1) It is possible that pandas is not reading your dates correctly. Once you read the file, make sure the date column is in the right format (i.e. yyyy-mm-dd).

df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')

(2) If there are any gaps in your data, NaN values will pop up. For instance, assume the data is of this form:

2000-01-01 00:00:00    1
2000-01-01 00:01:00    1
2000-01-01 00:03:00    1
2000-01-01 00:04:00    1
2000-01-01 00:06:00    1

If you try hour_avg = df_2.resample('H').mean(), your output will look like:

2000-01-01 00:00:00    1
2000-01-01 00:01:00    1
2000-01-01 00:02:00    NaN
2000-01-01 00:03:00    1
2000-01-01 00:04:00    1
2000-01-01 00:05:00    NaN
2000-01-01 00:06:00    1

I suspect the problem is the latter. If it is the latter, you can simply remove the NaN values using df_2.dropna(). Otherwise, if you do need the hourly bins regardless of missing data, you can avoid the NaN values by padding the missing values first and then attempting to get the mean:

hour_pad = df_2.resample('H').pad()
hour_avg = hour_pad.resample('H').mean()
Akanksha Atrey
  • 780
  • 4
  • 8
  • Thank you for your quick response! The weird thing is that I don't have NaN values before applying resample. My data has no NaN values. It seems like that the mean cannot be made and so NaN values appear. I also realised that only my date column is in datetime index and not the time column. I tried to format both but all suggestions found in the web didn't work. Do you know what to try next? – NadineH33 Aug 19 '20 at 07:57
  • Can you post a sample of your data along with a sample output (where NaNs pop up)? First thing would be to make sure your date column is in the right format (following suggestion (1) in the post) and your other columns are in a format where the mean could be taken (float, int, etc). You can type `print(df_2.dtypes)` to check the column types. Once that is sorted, I would recommend looking at where the NaN values pop up after resample. There may not be NaNs in your original data but your data could be "missing" certain hours (i.e. no rows for 02 hour) and thus resample would put NaNs for those. – Akanksha Atrey Aug 19 '20 at 19:17
  • I eddited my question with two pics (before and after) =) – NadineH33 Aug 21 '20 at 20:32