
I am trying to merge multiple .nc files containing physical oceanographic data for different depths at different latitudes and longitudes. I am using `xr.open_mfdataset` to do this, but the files are not merging correctly, and when I try to plot the result there seems to be only a single value for the merged files. This is the code I am using:

##Combining using concat_dim and nested method
import xarray as xr

ds = xr.open_mfdataset("33HQ20150809*.nc", concat_dim=["latitude"], combine="nested")
ds.to_netcdf("geotraces2015_combined.nc")
df = xr.open_dataset("geotraces2015_combined.nc")

##Setting up values. Oxygen values are transposed so the array matches the shape of lat and pressure.
import matplotlib.pyplot as plt

oxygen = df["oxygen"].values.transpose()
##Plotting using contourf
fig = plt.figure()
ax = fig.add_subplot(111)
plt.contourf(oxygen, cmap="inferno")
plt.gca().invert_yaxis()
cbar = plt.colorbar(label="Oxygen Concentration (umol kg-1)")

You can download the nc files from here under CTD https://cchdo.ucsd.edu/cruise/33HQ20150809

This is what each file looks like:

<xarray.Dataset>
Dimensions:         (pressure: 744, time: 1, latitude: 1, longitude: 1)
Coordinates:
  * pressure        (pressure) float64 0.0 1.0 2.0 3.0 ... 741.0 742.0 743.0
  * time            (time) datetime64[ns] 2015-08-12T18:13:00
  * latitude        (latitude) float32 60.25
  * longitude       (longitude) float32 -179.1
Data variables: (12/19)
    pressure_QC     (pressure) int16 ...
    temperature     (pressure) float64 ...
    temperature_QC  (pressure) int16 ...
    salinity        (pressure) float64 ...
    salinity_QC     (pressure) int16 ...
    oxygen          (pressure) float64 ...
    ...              ...
    CTDNOBS         (pressure) float64 ...
    CTDETIME        (pressure) float64 ...
    woce_date       (time) int32 ...
    woce_time       (time) int16 ...
    station         |S40 ...
    cast            |S40 ...
Attributes:
    EXPOCODE:                   33HQ20150809
    Conventions:                COARDS/WOCE
    WOCE_VERSION:               3.0
...

Another file would look like this:

<xarray.Dataset>
Dimensions:         (pressure: 179, time: 1, latitude: 1, longitude: 1)
Coordinates:
  * pressure        (pressure) float64 0.0 1.0 2.0 3.0 ... 176.0 177.0 178.0
  * time            (time) datetime64[ns] 2015-08-18T19:18:00
  * latitude        (latitude) float32 73.99
  * longitude       (longitude) float32 -168.8
Data variables: (12/19)
    pressure_QC     (pressure) int16 ...
    temperature     (pressure) float64 ...
    temperature_QC  (pressure) int16 ...
    salinity        (pressure) float64 ...
    salinity_QC     (pressure) int16 ...
    oxygen          (pressure) float64 ...
    ...              ...
    CTDNOBS         (pressure) float64 ...
    CTDETIME        (pressure) float64 ...
    woce_date       (time) int32 ...
    woce_time       (time) int16 ...
    station         |S40 ...
    cast            |S40 ...
Attributes:
    EXPOCODE:                   33HQ20150809
    Conventions:                COARDS/WOCE
    WOCE_VERSION:               3.0

EDIT: This is my new approach, which is still not working. I'm trying to use preprocess to set_coords, squeeze, and expand_dims, following Michael's approach:

def preprocess(ds):
    return ds.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')

ds = xr.open_mfdataset('33HQ20150809*.nc', concat_dim='station', combine='nested', preprocess=preprocess)

But I'm still having the same problem...

Solution: First, I had to identify the coordinate with the unique value; in my case it was 'station'. Then I used preprocess to apply the set_coords, squeeze, and expand_dims functions to each file, following Michael's answer.

import pandas as pd
import numpy as np
import os
import netCDF4
import pathlib
import xarray as xr
import matplotlib.pyplot as plt

def preprocess(ds):
    return ds.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')

ds = xr.open_mfdataset('filename*.nc', preprocess=preprocess, parallel=True)
ds = ds.sortby('latitude').transpose()

ds.oxygen.plot.contourf(x="latitude", y="pressure")
plt.gca().invert_yaxis()
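For anyone debugging a similar merge: the alignment check Michael suggests in the comments (`xr.align` with `join='exact'`) is what finally pointed me to the problem. A minimal sketch with two toy datasets standing in for real files (the mismatched `pressure` indexes mimic CTD casts of different depths):

```python
import numpy as np
import xarray as xr

# Two stand-in datasets whose pressure indexes differ in length,
# mimicking CTD casts of different depths.
a = xr.Dataset(coords={"pressure": np.arange(3.0), "latitude": [60.0]})
b = xr.Dataset(coords={"pressure": np.arange(5.0), "latitude": [70.0]})

try:
    # join='exact' refuses to align unless every index (except the
    # excluded dimension) matches exactly across all datasets.
    xr.align(a, b, join="exact", exclude="latitude")
except ValueError as err:
    print("not aligned:", err)
```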
  • can you open the files one by one using `xr.open_dataset` and check if they're aligned along all dimensions except latitude, with `xr.align(list_of_datasets, join='exact', exclude='latitude')`? it's hard to debug the merge without knowing what the data looks like before and after :/ – Michael Delgado Aug 12 '22 at 00:09
  • Oh - and if your data needs to be joined in both latitude and longitude, either provide the structure explicitly with nested lists or use `combine='by_coords'` and skip the concat dim argument – Michael Delgado Aug 12 '22 at 15:59
  • If I use combine='by_coords' it crashes the kernel. The dataset contains 4 coordinates but I would like the merging to be on latitude and pressure, but it also does not allows me to do this. – Maria Cristina Alvarez Aug 12 '22 at 17:18
  • There are 106 files to merge, so I tried with only four. When I do the "list_of_datasets": `ds1 = xr.open_dataset('33HQ20150809_00001_00002_ctd.nc')` `ds2 = xr.open_dataset('33HQ20150809_00001_00005_ctd.nc')` `ds3 = xr.open_dataset('33HQ20150809_00001_00007_ctd.nc')` `ds4 = xr.open_dataset('33HQ20150809_00002_00004_ctd.nc')` `list_of_datasets = (ds1, ds2, ds3, ds4)` `xr.align(list_of_datasets, join='exact', exclude='latitude')` I got `AttributeError: 'tuple' object has no attribute 'copy'` – Maria Cristina Alvarez Aug 12 '22 at 17:19
  • oh sorry - should be `xr.align(*list_of_datasets, ...)` with the asterisk to expand the list into positional arguments – Michael Delgado Aug 12 '22 at 17:32
  • I got the following error: ValueError: cannot align objects with join='exact' where index/labels/sizes are not equal along these coordinates (dimensions): 'pressure' ('pressure',) – Maria Cristina Alvarez Aug 12 '22 at 18:38
  • well there you go :) they're not aligned along the pressure dimension. you need to make sure the data are exactly aligned if they're going to be joined automatically. I can't really help more without having a lot more info about the datasets. But maybe something along these lines is tripping you up? https://stackoverflow.com/questions/69866469/subtract-two-xarrays-while-keeping-all-dimensions/69867005#69867005 – Michael Delgado Aug 12 '22 at 18:40
  • I see, all nc files will have different pressure dimensions since the data comes from different stations at different depths; some stations are 300 m deep and others 4000 m deep. I thought I could merge by latitude and every point would keep its own pressure. – Maria Cristina Alvarez Aug 12 '22 at 18:55
  • are the latitudes the same in each file? in that case, you'd want concat dim to be `pressure`, not `latitude` – Michael Delgado Aug 12 '22 at 19:14
  • it would be really helpful if you showed us what your data looks like. just `print(ds)` and paste that as a code block. debugging through comments is super tough :) – Michael Delgado Aug 12 '22 at 19:15
  • Thank you Michael, I have added what the dataset looks like – Maria Cristina Alvarez Aug 12 '22 at 20:58
  • is this observational data, where each dataset has a bunch of different variables with dimension (`pressure, time`) but pressure differs by dataset, and lat/lon are just metadata about the station? I really need more information about what *each* dataset looks like, and how you expect them to come together. can you show a couple examples of these files, without concatenating them, but just what is returned by `xr.open_dataset`? – Michael Delgado Aug 12 '22 at 21:38
  • I have added what one individual file looks like. Each file will have a different pressure size with a different latitude, longitude and time. My goal with this merging is to plot a vertical section with latitude on the x-axis and pressure on the y-axis using contourf, to see how oxygen values change across different latitudes and pressures. – Maria Cristina Alvarez Aug 12 '22 at 21:47
  • whoa. thanks for the examples. yeah this is a very non-standard use of "dimensions" for a netcdf. are the pressure values actually just a positional index, from 0 to the number of observations? or are those observational data as well, and they're not always integer values? – Michael Delgado Aug 12 '22 at 21:53
  • Yes, I'm trying to get this done for a research cruise so I can apply it to newly collected CTD data. Pressure is a positional index, so its values won't change, only the size, which depends on the file. – Maria Cristina Alvarez Aug 12 '22 at 22:40
  • thanks for walking me through all of this! it's an unusual structure so it took me a while to understand. I took a stab at clarifying the title so others might find it better - would appreciate it if you let me know whether the answer helps :) – Michael Delgado Aug 13 '22 at 00:41
  • Okay, I understand where my problem was; now I have to figure out a way to import each file without merging them so I can apply the squeeze function. – Maria Cristina Alvarez Aug 15 '22 at 16:29
  • the other option is to use the preprocess argument to [`open_mfdataset`](https://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html) – Michael Delgado Aug 15 '22 at 22:06

1 Answer


The xarray data model requires that all data dimensions be perpendicular and complete. In other words, every combination of every coordinate along each dimension will be present in the data array (either as data or NaNs).

You can work with observational data such as yours using xarray, but you have to be careful with the indices to ensure you don't explode the data dimensionality. Specifically, whenever data is not truly a dimension of the data, but is simply an observation or attribute tied to a station or monitor, you should think of this more as a data variable than a coordinate. In your case, your dimensions seem to be station ID and pressure level (which does not have a full set of observations for each station, but is a dimension of the data). On the other hand, time, latitude, and longitude are attributes of each station, and should not be treated as dimensions.
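To see why treating station attributes as dimensions explodes the data, here is a minimal sketch (toy data, not the CTD files): two profiles with different latitudes and different pressure lengths, concatenated along latitude. xarray must build the full latitude × pressure grid, padding every unobserved combination with NaN:

```python
import numpy as np
import xarray as xr

# Two toy profiles: different latitudes, different pressure lengths.
a = xr.Dataset(
    {"oxygen": (("pressure",), np.arange(3.0))},
    coords={"pressure": [0.0, 1.0, 2.0], "latitude": [60.0]},
)
b = xr.Dataset(
    {"oxygen": (("pressure",), np.arange(5.0))},
    coords={"pressure": [0.0, 1.0, 2.0, 3.0, 4.0], "latitude": [70.0]},
)

# Concatenating along latitude builds the full latitude x pressure grid;
# combinations that were never observed are padded with NaN.
merged = xr.concat([a, b], dim="latitude")
print(merged.oxygen.shape)                 # (2, 5)
print(int(merged.oxygen.isnull().sum()))   # 2 NaN cells from the shorter profile
```

With two stations this is harmless, but with 106 stations each carrying a distinct lat/lon/time, indexing on those coordinates produces a mostly-empty hypercube.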

I'll generate some random data that looks like yours:

import numpy as np
import pandas as pd
import xarray as xr

def generate_random_station():
    station_id = "{:09d}".format(np.random.randint(0, int(1e9)))
    time = np.random.choice(pd.date_range("2015-08-01", "2015-08-31", freq="H"))
    plevs = np.arange(np.random.randint(1, 1000)).astype(float)
    lat = np.random.random() * 10 + 30
    lon = np.random.random() * 10 - 80

    ds = xr.Dataset(
        {
            "salinity": (('pressure', ), np.sin(plevs / 200 + lat)),
            "woce_date": (("time", ), [time]),
            "station": station_id,
        },
        coords={
            "pressure": plevs,
            "time": [time],
            "latitude": [lat],
            "longitude": [lon],
        },
    )

    return ds

This ends up looking like the following:

In [11]: single = generate_random_station()

In [12]: single
Out[12]:
<xarray.Dataset>
Dimensions:    (pressure: 37, time: 1, latitude: 1, longitude: 1)
Coordinates:
  * pressure   (pressure) float64 0.0 1.0 2.0 3.0 4.0 ... 33.0 34.0 35.0 36.0
  * time       (time) datetime64[ns] 2015-08-21T01:00:00
  * latitude   (latitude) float64 39.61
  * longitude  (longitude) float64 -72.19
Data variables:
    salinity   (pressure) float64 0.9427 0.941 0.9393 ... 0.8726 0.8702 0.8677
    woce_date  (time) datetime64[ns] 2015-08-21T01:00:00
    station    <U9 '233136181'

The problem is the latitude, longitude, and time coords aren't really dimensions which can be used to index a larger array. They aren't evenly spaced, and each combination of lat/lon/time does not have a station at it. Because of this, we need to be extra careful to make sure that when we combine the data, the lat/lon/time dimensions are not expanded.

To do this, we'll squeeze these dimensions, and expand the datasets along a new dimension, station:

In [13]: single.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')
Out[13]:
<xarray.Dataset>
Dimensions:    (pressure: 37, station: 1)
Coordinates:
  * station    (station) <U9 '233136181'
  * pressure   (pressure) float64 0.0 1.0 2.0 3.0 4.0 ... 33.0 34.0 35.0 36.0
    time       datetime64[ns] 2015-08-21T01:00:00
    latitude   float64 39.61
    longitude  float64 -72.19
Data variables:
    salinity   (station, pressure) float64 0.9427 0.941 0.9393 ... 0.8702 0.8677
    woce_date  (station) datetime64[ns] 2015-08-21T01:00:00

This can be done to all of your datasets, then they can be concatenated along the "station" dimension:

In [14]: all_stations = xr.concat(
   ...:     [
   ...:         generate_random_station()
   ...:         .set_coords('station')
   ...:         .squeeze(["latitude", "longitude", "time"])
   ...:         .expand_dims('station')
   ...:         for i in range(10)
   ...:     ],
   ...:     dim="station",
   ...: )

This results in a dataset indexed by pressure level and station:

In [15]: all_stations
Out[15]:
<xarray.Dataset>
Dimensions:    (pressure: 657, station: 10)
Coordinates:
  * pressure   (pressure) float64 0.0 1.0 2.0 3.0 ... 653.0 654.0 655.0 656.0
  * station    (station) <U9 '197171488' '089978445' ... '107555081' '597650083'
    time       (station) datetime64[ns] 2015-08-19T06:00:00 ... 2015-08-24T15...
    latitude   (station) float64 37.96 34.3 38.74 39.28 ... 37.72 33.89 36.46
    longitude  (station) float64 -74.28 -73.59 -78.33 ... -76.6 -76.47 -77.96
Data variables:
    salinity   (station, pressure) float64 0.2593 0.2642 0.269 ... 0.8916 0.8893
    woce_date  (station) datetime64[ns] 2015-08-19T06:00:00 ... 2015-08-24T15...

You can now plot along the latitude and pressure level dimensions:

In [16]: all_stations.salinity.plot.contourf(x="latitude", y="pressure")

Map showing contour lines giving salinity as a function of latitude and pressure

Michael Delgado
  • I tried applying this code: `import os` `import glob` `folder = "/Users/mariacristinaalvarez/Documents/Documents/Projects/GEOTRACES2015"` `for filename in glob.glob(os.path.join(folder,'*.nc')): data = xr.open_dataset(filename) all_sttions = xr.concat([data.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')], dim='station')`, but the resulting dataset only contains data from one of the 106 files – Maria Cristina Alvarez Aug 15 '22 at 21:00
  • did you check to see if your list of files has more than one file in it? – Michael Delgado Aug 15 '22 at 22:04
  • also - is station different on each file? I was assuming that was a unique station ID, but if that's not varying by file then you should concat using some other unique identifier. again - I can't help more without a more complete [mre]. – Michael Delgado Aug 15 '22 at 22:06
  • I have inspected the files individually to make sure the station ids are unique. I will add a minimal reproducible example; hopefully it makes sense. I have been teaching myself Python for oceanographic data analysis, so these formats are new to me. – Maria Cristina Alvarez Aug 17 '22 at 16:41
  • With the new code, the plot looks so much better than before, but it's still missing the lowest oxygen values, as if the plotting were slicing the values. When I inspect the minimum value of the array it says 26; however, the plot only reaches 200. – Maria Cristina Alvarez Aug 17 '22 at 16:45
  • totally understand re: learning, and this is a tough one! it can be hard for us to help if we don't have the full picture because there are *so many* ways things can cause errors. but feel free to keep at it! and glad it's looking better. I don't understand the rest of your comment though :/ feel free to ask another question if your new question is more about plotting than concatenating – Michael Delgado Aug 17 '22 at 16:49
  • I added the image of the plot and what the oxygen value variable looks like. I'm not sure whether this is a matter of plotting or still a matter of concatenating. When I inspected each file, there are definitely values above 400, so I'm leaning more towards a concat issue. – Maria Cristina Alvarez Aug 17 '22 at 18:05
  • Actually, I think it might be a plotting issue. Haha! I have been plotting each single file and comparing values to make sure it makes sense, and it kind of does. Since it's a huge range (4000), it is difficult to see the minimum O2 zones, but I will manually set up ranges to see the desired depths – Maria Cristina Alvarez Aug 17 '22 at 21:55
  • I have re-edited my question with the solution. Thanks Michael for your constant support; this has been a nightmare for 3 weeks and I have finally found an answer! – Maria Cristina Alvarez Aug 17 '22 at 21:58