
I am trying to read available data and write it to a NetCDF file. Say I am reading temperature at different time, depth, latitude and longitude values; I will then have to create a whole 4D grid with time, depth, latitude and longitude as dimensions.

However, the data I am recording has values at very few points. For example, in one case I had data at 155 points, while the grid was 50x16x16x18 along time, depth, latitude and longitude respectively. So I had data at only 155 of the grid's 230400 cells; all the remaining cells held fill values.

It seems quite wasteful to store so many fill values. Is it possible to write a legitimate NetCDF file containing only the points that have data, or at least one that uses fewer fill values?

I am using NetCDF Java library for the process.

Thank you so much in advance.

Joakim Danielson
  • You could create a central object that stores the points, then have multiple associative means of accessing it: just create references between the various access objects and the central object. There is no limit, and all of them will point to the same data, giving you multiple ways of getting to the same locations. – SPlatten Jan 16 '19 at 07:44

2 Answers


Any N-dimensional sparse array can be represented as a list (or 1-D array) of tuples, where each tuple has N coordinate values and one data value.
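For the 4-D case in the question, a minimal sketch of this in Java (using a small value class of my own devising, not anything from the NetCDF API; assumes Java 16+ for records) might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class SparseList {
    // One sparse entry: N = 4 coordinate values plus the data value.
    record Entry(int time, int depth, int lat, int lon, double value) {}

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        // Store only the cells that have data, e.g. 155 entries
        // instead of a dense 50x16x16x18 = 230400-cell array.
        entries.add(new Entry(0, 3, 7, 11, 12.5));
        entries.add(new Entry(1, 3, 7, 11, 12.7));
        System.out.println(entries.size());
    }
}
```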

If the array is sufficiently sparse, the list-based representation occupies less space, both on disk and in memory.

Now the simple list-based representation is NOT good for random access, because you need to scan through the list to find the value at any point of the original array. You can improve on this (in the in-memory version):

  • If you order the list based on the coordinates and use an ArrayList, you can perform a binary search to find the value for a set of coordinates. This gives O(log N) indexing, with no additional memory overhead.

  • If you use a HashMap<Coords, Value>, you can get O(1) lookup. However, this comes at a significant additional memory cost: probably around 50 to 80 extra bytes per entry compared to the ArrayList representation.
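As a sketch of the HashMap option (Coords is a hypothetical key class, not part of any library; a record gives correct equals()/hashCode() for free, which the map lookup depends on):

```java
import java.util.HashMap;
import java.util.Map;

public class SparseMap {
    // Hypothetical key class for a point in the 4-D grid.
    record Coords(int time, int depth, int lat, int lon) {}

    public static void main(String[] args) {
        Map<Coords, Double> grid = new HashMap<>();
        grid.put(new Coords(0, 3, 7, 11), 12.5);

        Double v = grid.get(new Coords(0, 3, 7, 11));       // O(1) hit
        Double missing = grid.get(new Coords(9, 9, 9, 9));  // null plays the role of a fill value
        System.out.println(v + " " + missing);
    }
}
```

For the sorted-list alternative, you would keep the entries ordered by coordinates and use something like Collections.binarySearch with a comparator instead of the map.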

Stephen C
  • I was about to write a similar answer. Just to add that if you don't want to search for a specific point, and simply want to carry out an operation at each of the points where you have data, then the list representation will be fine. Moreover, you save not only space on disk but also in program memory by not having to handle a very sparse 4D array – ClimateUnboxed Jan 17 '19 at 08:58

It should be possible to represent the data at each grid point using one of the discrete sampling geometries (DSG) outlined by the CF Conventions (here are some examples). Perhaps one of these representations would work for your case (maybe timeSeries or timeSeriesProfile)? The DSGs are often talked about in the context of observational data, but they should apply to sub-sampled model output as well.
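As a rough sketch of the data layout (not of the NetCDF-Java write calls, which are omitted here): a CF point-style DSG stores one 1-D array per coordinate plus the data variable, all sharing a single observation dimension. Flattening the question's 155 points would look something like this, with all names and values illustrative:

```java
public class PointDsgLayout {
    public static void main(String[] args) {
        int nObs = 3; // e.g. 155 in the question's case
        // One parallel 1-D array per coordinate, all along the "obs"
        // dimension, instead of a dense 50x16x16x18 grid of fill values.
        double[] time        = {0.0,  6.0,  12.0};
        double[] depth       = {5.0,  5.0,  10.0};
        double[] lat         = {59.1, 59.1, 60.2};
        double[] lon         = {10.5, 10.5, 11.0};
        double[] temperature = {12.5, 12.7, 11.9};
        // Each index i is one observation:
        // (time[i], depth[i], lat[i], lon[i]) -> temperature[i]
        for (int i = 0; i < nObs; i++) {
            System.out.println("obs " + i + ": T=" + temperature[i]);
        }
    }
}
```

Each of these arrays would then become a 1-D variable in the file, with the CF featureType attribute identifying which DSG is in use.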

Sean A.