I am converting a text file to netCDF format using xarray. With the NETCDF4 format on Python 3, string variables are stored as strings, but on Python 2 they are stored as n-dimensional character arrays. I tried setting dtype='str' in encoding, but that made no difference. Is there a way to give these variables a string data type under Python 2? Any thoughts would be appreciated.

Here is my code:

import pandas as pd
import xarray as xr

column_names = ['timestamp', 'air_temp', 'vtempdiff', 'rh', 'pressure', 'wind_dir', 'wind_spd']

df = pd.read_csv(args.input_file, skiprows = 1, header=None, names = column_names)
ds = xr.Dataset.from_dataframe(df)

encoding = {'timestamp': {'dtype': 'str'},
            'air_temp': {'_FillValue': 9.96921e+36, 'dtype': 'f4'}
            }

ds.to_netcdf('op_file.nc', format='NETCDF4', unlimited_dims={'time': True}, encoding=encoding)

When I run ncdump on the op_file.nc written under Python 3.6, I get:

netcdf op_file {
dimensions:
    time = UNLIMITED ; // (24 currently)
variables:
    string timestamp(time) ;
    float air_temp(time) ;
    .
    .
    .

And with the file written under Python 2.7, I get:

netcdf op_file {
dimensions:
    time = UNLIMITED ; // (24 currently)
    string20 = 20 ;
variables:
    char timestamp(time, string20) ;
        timestamp:_Encoding = "utf-8" ;
    float air_temp(time) ;
    .
    .
    .

The sample input file looks like this:

# Fields: stamp,AGO-4.air_temp,AGO-4.vtempdiff,AGO-4.rh,AGO-4.pressure,AGO-4.wind_dir,AGO-4.wind_spd
2016-11-30T00:00:00Z,-36.50,,56.00,624.60,269.00,5.80
2016-11-30T01:00:00Z,-35.70,,55.80,624.70,265.00,5.90
sainiak
  • That's a ton of irrelevant code. Give a simple toy example that shows your issue with minimal data size. Probably a single value is enough in this case, not half a dozen columns. – John Zwinck Feb 24 '18 at 08:18

1 Answer

Xarray maps Python 2's str/bytes type to netCDF's NC_CHAR type. Both of these types represent single-byte character data (generally ASCII), so this makes a certain amount of sense.

To get a netCDF string (NC_STRING), you need to pass unicode data (str on Python 3). You can get this by explicitly coercing your timestamp column to unicode, either with .astype(unicode) or by passing {'dtype': unicode} in encoding.
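
A minimal sketch of the .astype route, assuming a trimmed-down two-column version of the sample input from the question. The names text_type, SAMPLE, load_dataset and write_netcdf are made up for illustration; text_type is just a way to spell "unicode on Python 2, str on Python 3" so the same script runs on both.

```python
import io

import pandas as pd
import xarray as xr

# Text type on both interpreters: `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")

# Two rows from the sample input in the question, trimmed to two columns (assumption).
SAMPLE = u"""# Fields: stamp,air_temp
2016-11-30T00:00:00Z,-36.50
2016-11-30T01:00:00Z,-35.70
"""


def load_dataset(buf):
    """Read the CSV and coerce the timestamp column to unicode text."""
    df = pd.read_csv(buf, skiprows=1, header=None, names=["timestamp", "air_temp"])
    # On Python 2 this turns byte strings into unicode objects, which is
    # what should end up as NC_STRING instead of NC_CHAR.
    df["timestamp"] = df["timestamp"].astype(text_type)
    return xr.Dataset.from_dataframe(df)


def write_netcdf(ds, path):
    """Write the dataset; only air_temp needs an explicit encoding here."""
    encoding = {"air_temp": {"_FillValue": 9.96921e+36, "dtype": "f4"}}
    ds.to_netcdf(path, format="NETCDF4", encoding=encoding)


ds = load_dataset(io.StringIO(SAMPLE))
```

Calling write_netcdf(ds, 'op_file.nc') and then running ncdump -h on the result is the quickest way to check whether timestamp came out as string or char on your setup.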

shoyer
  • Can the underlying Numpy array represent a proper `str`? In my experiments, Numpy casts Python3's `str`s into, e.g., `dtype=' – Ahmed Fasih Apr 01 '18 at 00:48
  • NumPy supports two types of arrays: "object" arrays storing references to arbitrary Python objects, and arrays with fixed-sized data types (everything else). Neither of these is ideal for Python strings, which have variable size but are not just completely arbitrary Python objects. In practice, libraries like pandas and xarray do often use object arrays for strings because it's the best we can do. – shoyer Apr 01 '18 at 21:16
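
To illustrate the two kinds of arrays described in the last comment, here is a small sketch using only NumPy. The variable names fixed and boxed are made up for the example.

```python
import numpy as np

# Fixed-width unicode array: every element reserves room for the longest string.
fixed = np.array([u"a", u"2016-11-30T00:00:00Z"])

# Object array: each element is a reference to an ordinary Python string.
boxed = np.array([u"a", u"2016-11-30T00:00:00Z"], dtype=object)

# Inspect how each array stores its elements.
print(fixed.dtype.kind, fixed.dtype.itemsize)
print(boxed.dtype, type(boxed[0]))
```

The fixed-width array pads the one-character string out to the width of the 20-character timestamp, while the object array keeps the original Python strings and only stores references to them.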