
Apologies if this is a basic question; I'm new to these tools.

I have a NetCDF file with eight variables containing data from the same source, but for different time periods. There is no overlap between the variables along the time dimension. How do I combine all 8 variables into one "CHIRPS_p_d" variable that does not contain missing values?

print(ds)
<xarray.Dataset>
Dimensions:      (time: 14244, cluster: 5548)
Coordinates:
  * time         (time) datetime64[ns] 1981-01-01 1981-01-02 ... 2019-12-31
  * cluster      (cluster) object 'Ethiopia 1' 'Ethiopia 2' ... 'Uganda 619'
Data variables:
    lat          (time, cluster) float64 3.456 3.55 3.864 3.983 ... nan nan nan
    lon          (time, cluster) float64 39.52 39.18 39.1 38.49 ... nan nan nan
    CHIRPS_p_d1  (time, cluster) float64 0.0 0.0 0.0 0.0 0.0 ... nan nan nan nan
    CHIRPS_p_d2  (time, cluster) float64 nan nan nan nan nan ... nan nan nan nan
    CHIRPS_p_d3  (time, cluster) float64 nan nan nan nan nan ... nan nan nan nan
    CHIRPS_p_d4  (time, cluster) float64 nan nan nan nan nan ... nan nan nan nan
    CHIRPS_p_d5  (time, cluster) float64 nan nan nan nan nan ... nan nan nan nan
    CHIRPS_p_d6  (time, cluster) float64 nan nan nan nan nan ... nan nan nan nan
    CHIRPS_p_d7  (time, cluster) float64 nan nan nan nan nan ... nan nan nan nan
    CHIRPS_p_d8  (time, cluster) float64 nan nan nan nan nan ... 0.0 0.0 0.0 0.0

Right now my data looks like this:

>>> print(df.sample(5))
                         CHIRPS_p_d1  CHIRPS_p_d2       lat       lon  CHIRPS_p_d3  CHIRPS_p_d4  CHIRPS_p_d5  CHIRPS_p_d6  CHIRPS_p_d7  CHIRPS_p_d8
time       cluster
2014-10-16 Tanzania 265          NaN          NaN  -8.83643  39.47150          NaN          NaN          NaN          NaN          0.0          NaN
2018-02-28 Mali 122              NaN          NaN  12.12839  -4.68048          NaN          NaN          NaN          NaN          NaN          0.0
1999-10-26 Tanzania 77           NaN          NaN -10.72684  39.50261          NaN          0.0          NaN          NaN          NaN          NaN
1985-08-17 Nigeria 504           NaN     0.000000   9.09914   7.27965          NaN          NaN          NaN          NaN          NaN          NaN
1986-08-02 Niger 181             NaN     0.672992  15.38926   5.25865          NaN          NaN          NaN          NaN          NaN          NaN

Ideally, I want to obtain something like this:

                         CHIRPS_p_d        lat       lon
time       cluster
2014-10-16 Tanzania 265         0.0   -8.83643  39.47150
2018-02-28 Mali 122             0.0   12.12839  -4.68048
1999-10-26 Tanzania 77          0.0  -10.72684  39.50261
1985-08-17 Nigeria 504          0.0    9.09914   7.27965
1986-08-02 Niger 181       0.672992   15.38926   5.25865

  • Welcome to SO. What do you mean by "collapse"? Do you want to average them? – Robert Wilson Nov 18 '22 at 13:18
  • Thank you. Broadly, I want to combine them into one new variable, where each value will be equal to the only non-missing value found in one of the 8 variables. Averaging should work, I think? – Dansmabentz Nov 18 '22 at 14:06
  • Please clarify this in the question. Also, please ensure that you know what you are asking before asking – Robert Wilson Nov 18 '22 at 14:07

1 Answer


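You can replace the NaN values with 0 and then add up all the CHIRPS_p_d variables. Since only one CHIRPS_p_d variable has a non-NaN value at each time step, this should do the trick. Note that fillna returns a new object rather than modifying ds in place, so its result has to be assigned:

names = [f"CHIRPS_p_d{i}" for i in range(1, 9)]
filled = ds[names].fillna(0.0)  # fill only the CHIRPS variables, leaving lat/lon untouched
result = sum(filled[name] for name in names)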
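One caveat with filling NaN with 0 first: a cell that is missing in all eight variables ends up as 0 rather than NaN. A possible alternative, sketched here on a small made-up dataset (three variables instead of eight; the toy values and the names `combined` and `out` are assumptions for illustration), is to stack the variables along a new dimension and sum across it with `min_count=1`, so that all-NaN cells stay NaN:

```python
import numpy as np
import xarray as xr

nan = np.nan

# Toy stand-in for the dataset in the question (hypothetical values).
ds = xr.Dataset(
    {
        "CHIRPS_p_d1": (("time", "cluster"), [[0.0, 1.5], [nan, nan], [nan, nan], [nan, nan]]),
        "CHIRPS_p_d2": (("time", "cluster"), [[nan, nan], [2.0, 0.0], [nan, nan], [nan, nan]]),
        "CHIRPS_p_d3": (("time", "cluster"), [[nan, nan], [nan, nan], [0.7, nan], [nan, nan]]),
    },
    coords={"time": np.arange(4), "cluster": ["a", "b"]},
)

# Select every variable whose name starts with the common prefix.
names = [n for n in ds.data_vars if n.startswith("CHIRPS_p_d")]

# Stack the selected variables along a new "variable" dimension and sum over it.
# skipna=True ignores NaN; min_count=1 returns NaN when no valid value exists.
combined = ds[names].to_array(dim="variable").sum("variable", skipna=True, min_count=1)

# Replace the original variables with the single combined one.
out = ds.drop_vars(names).assign(CHIRPS_p_d=combined)
print(out)
```

Because the variables do not overlap in time, the sum at each cell is just the single non-missing value. If they did overlap, this would add the overlapping values together, so it relies on the assumption stated in the question.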
Louis Lac