
I am aiming to calculate daily climatology from a dataset, i.e. obtain the sea surface temperature (SST) for each day of the year by averaging all the years (for example, for January 1st, the average SST of all January 1st from 1982 to 2018). To do so, I made the following steps:

DATA PREPARATION STEPS

Here is a Drive link to both datasets to make the code reproducible: link to datasets

First, I load two datasets:

import xarray as xr
import numpy as np

ds1 = xr.open_dataset('./anomaly_dss/archive_to2018.nc')  # from 1982 to 2018
ds2 = xr.open_dataset('./anomaly_dss/realtime_from2018.nc')  # from 2018 to present

Then I filter each dataset by time, convert both to pandas DataFrames, and merge them into one:

ds1 = ds1.where(ds1.time > np.datetime64('1982-01-01'), drop=True) # Grab all data since 1/1/1982
ds2 = ds2.where(ds2.time > ds1.time.max(), drop=True) # Grab all data since the end of the archive

# Convert to Pandas Dataframe
df1 = ds1.to_dataframe().reset_index().set_index('time')
df2 = ds2.to_dataframe().reset_index().set_index('time')

# Merge these datasets
df = df1.combine_first(df2)
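To illustrate what `combine_first` does here, a tiny toy example (made-up values, not the real SST data): values from `df1` take priority, and gaps (missing index entries or NaNs) are filled from `df2`:

```python
import numpy as np
import pandas as pd

# Toy frames: df1 covers the first two days, df2 overlaps on the second
df1 = pd.DataFrame({'sst': [1.0, np.nan]},
                   index=pd.to_datetime(['2018-01-01', '2018-01-02']))
df2 = pd.DataFrame({'sst': [9.0, 2.0]},
                   index=pd.to_datetime(['2018-01-02', '2018-01-03']))

# df1's values win; its NaN on 2018-01-02 and the missing 2018-01-03 come from df2
merged = df1.combine_first(df2)
print(merged['sst'].tolist())   # [1.0, 9.0, 2.0]
```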

So far, this is what my dataframe looks like:

(screenshot of the merged dataframe)

NOTE THAT LAT RANGES OVER (35, 37.7) AND LON OVER (-10, -5); THIS MUST REMAIN LIKE THAT

ANOMALY CALCULATION STEPS

# Anomaly calculation
def standardize(x):
    return (x - x.mean())/x.std()

# Calculate a daily average
df_daily = df.resample('1D').mean()

# Calculate the anomaly for each yearday
df_daily['anomaly'] = df_daily['analysed_sst'].groupby(df_daily.index.dayofyear).transform(standardize)

I obtain the following dataframe:

(screenshot of the resulting dataframe)

As you can see, I obtain the mean values of all three variables.
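A tiny synthetic example (toy values, not the real SST grid) showing why this happens: `resample('1D').mean()` averages *all* rows that share a timestamp, so the different lat/lon points are collapsed into one:

```python
import pandas as pd

# Toy frame: one timestamp, two distinct lat/lon grid points
idx = pd.to_datetime(['2020-01-01'] * 2)
df = pd.DataFrame({'lat': [35.0, 36.0],
                   'lon': [-10.0, -9.0],
                   'analysed_sst': [290.0, 292.0]}, index=idx)

daily = df.resample('1D').mean()
print(len(daily))             # 1 -- both grid points averaged into a single row
print(daily['lat'].iloc[0])   # 35.5 -- even lat itself got averaged
```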

QUESTION

As I want to plot the climatology data on a map, I DO NOT want lat/lon variables to be averaged to one point. I need the anomaly from all the points lat/lon points, and I don't really know how to achieve that.
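In pandas terms, I believe something like the following toy sketch (made-up values, not my real dataframe) shows the behaviour I am after — each (lat, lon) point keeping its own day-of-year statistics — but I am not sure how to apply it to my real data:

```python
import numpy as np
import pandas as pd

def standardize(x):
    return (x - x.mean()) / x.std()

# Toy frame: the same two grid points observed on Jan 1st of two years
times = pd.to_datetime(['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01'])
df = pd.DataFrame({'lat': [35.0, 36.0, 35.0, 36.0],
                   'lon': [-10.0, -9.0, -10.0, -9.0],
                   'analysed_sst': [290.0, 292.0, 291.0, 293.0]}, index=times)

# Grouping by (dayofyear, lat, lon) standardizes each grid point separately,
# so no spatial averaging takes place
df['anomaly'] = (df.groupby([df.index.dayofyear, 'lat', 'lon'])['analysed_sst']
                   .transform(standardize))
print(df)
```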

Any help would be very appreciated!!

OlegRuskiy

1 Answer

I think you can do all that in a simpler and more straightforward way without converting your dataarray to a dataframe:

import os
import xarray as xr

# Will open and combine the 2 datasets automatically
DS = xr.open_mfdataset(os.path.join('./anomaly_dss', '*.nc'))

da = DS.analysed_sst

#Resampling
da = da.resample(time = '1D').mean()

# Anomaly calculation
def standardize(x):
    return (x - x.mean())/x.std()

da_anomaly = da.groupby(da.time.dt.dayofyear).apply(standardize)

Then you can plot the anomaly for any day with:

da_anomaly[da_anomaly.dayofyear == 1].plot()
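As a minimal check with synthetic data (random values on a hypothetical 2×2 grid, not the real files) that this approach keeps the lat/lon dimensions intact:

```python
import numpy as np
import pandas as pd
import xarray as xr

def standardize(x):
    return (x - x.mean()) / x.std()

time = pd.date_range('2020-01-01', periods=730, freq='D')  # roughly two years
rng = np.random.default_rng(0)
da = xr.DataArray(rng.normal(290.0, 1.0, (730, 2, 2)),
                  coords={'time': time, 'lat': [35.0, 36.0], 'lon': [-10.0, -9.0]},
                  dims=('time', 'lat', 'lon'), name='analysed_sst')

# Note: inside each day-of-year group, mean()/std() with no `dim` argument
# reduce over time *and* space; pass dim='time' inside standardize to treat
# each grid point independently
da_anomaly = da.groupby(da.time.dt.dayofyear).map(standardize)
print(dict(da_anomaly.sizes))   # {'time': 730, 'lat': 2, 'lon': 2}
```

The spatial dimensions survive the groupby, which is what the question asks for.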
Thrasy
  • Thanks for the answer @Thrasy, however, I am getting the following error message: "Coordinate variable time is neither monotonically increasing nor monotonically decreasing on all datasets". Any ideas? – OlegRuskiy Jan 12 '22 at 12:55
  • @OlegRuskiy It's difficult to say without access to your datasets; you may have to reindex. – Thrasy Jan 13 '22 at 10:34
  • I have provided the drive link to the datasets right below data preparation steps :) – OlegRuskiy Jan 13 '22 at 11:32
  • OK. I don't know exactly why you get this exception, but there are at least two issues here: 1/ the two netCDF files use different calendars (and the calendar used by the first file is not standard), 2/ the spatial grids used by the two files are different. Before combining the two files, either as dataarray or dataframe, you will have to reindex them so that their coordinate systems are coherent. Anyway, it's far beyond the scope of your initial question. – Thrasy Jan 13 '22 at 21:13
  • Oh, okay, I didn't notice the calendar and grid issues, I will work on that first. Thanks @Thrasy !! – OlegRuskiy Jan 14 '22 at 10:38
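The reindexing step mentioned in the comments could be sketched roughly like this (synthetic 2-D fields on made-up grids, standing in for the two files; with smoothly varying data, `interp_like` would be the interpolating alternative):

```python
import numpy as np
import xarray as xr

# Two toy fields on slightly different lat/lon grids
a = xr.DataArray(np.arange(6.0).reshape(2, 3),
                 coords={'lat': [35.0, 36.0], 'lon': [-10.0, -9.0, -8.0]},
                 dims=('lat', 'lon'))
b = xr.DataArray(np.ones((3, 4)),
                 coords={'lat': [35.0, 35.5, 36.0],
                         'lon': [-10.0, -9.5, -9.0, -8.5]},
                 dims=('lat', 'lon'))

# Snap b onto a's grid (nearest-neighbour lookup) so the two can be combined
b_on_a = b.reindex_like(a, method='nearest')
print(b_on_a.shape)   # (2, 3) -- now matches a's grid
```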