
There is a similar question to mine, but the data has a different structure and I run into errors. I have multiple .dat files that contain tables for different, arbitrary times t=1, 3, 9, 10, 12, etc. The tables in the different .dat files have the same columns M_star, M_planet and separation, and M_star can be viewed as an index in steps of 0.5. Nevertheless, the length of the tables and the values of M_star vary from file to file, e.g. for time t=1 I have

M_star M_planet separation
10.0   0.022    7.11
10.5   0.019    2.30
11.0   0.008    14.01

while for t=3 I have

M_star M_planet separation
9.5    0.308    1.32
10.0   0.522    4.18
10.5   0.019    3.40
11.0   0.338    0.91
11.5   0.150    1.20

What I would like to do is to load all the .dat files into an xarray Dataset (at least I think this would be useful), so that I can access data in the columns M_planet and separation by providing precise values for t and M_star, e.g. I would like to do something like ds.sel(t=9, M_star=10.5)['M_planet'] to get the value of M_planet at the given t and M_star coordinates. What I have tried so far, unsuccessfully, is:

from glob import glob
import pandas as pd
import xarray as xr

fnames = glob('table_t=*.dat')
fnames.sort()
kw = dict(delim_whitespace=True, names=['M_star', 'M_planet', 'separation'], skiprows=1)

# first I load all the tables into a list of dataframes
dfs = [pd.read_csv(fname, **kw) for fname in fnames]
# then I add the time as a column to each dataframe; all t-entries are the same within a dataframe
dfs2 = [df_i.assign(t=t) for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])]
# I try to make an xarray Dataset, but I run into an error
d = xr.concat([df_i.to_xarray() for df_i in dfs2], dim='t')

The last line throws an error: `t already exists as coordinate or variable name`.

How can I load my .dat files into xarray and make t and M_star the dimensions/coordinates? Thanks

NeStack
  • If each file has different values of M_star, I don’t think you want it to be a coordinate. Xarray requires the array dimensions to be orthogonal, so creating this array would expand the data to include every combination of t and M_star. Pandas seems pretty appropriate for this to me, though I don’t know much about your workflow. The narrow issue you’re facing is that t must be a coordinate, not a data variable, on each concatenated dataset, so you should use assign_coords after converting to xarray rather than df.assign. – Michael Delgado Sep 01 '22 at 05:48
  • @MichaelDelgado thank you, I will try to implement your suggestion about assign_coords. The `M_star` column is actually not completely arbitrary as I stated before; it goes in steps of 0.5, but the number of steps varies between the files. I corrected my question. Do you think it might work with this new information? – NeStack Sep 01 '22 at 07:47
  • Ok - in that case you’ll still end up with NaNs in your results but maybe you’re fine with it – Michael Delgado Sep 01 '22 at 14:11
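
A minimal sketch of the approach suggested in the comments above, assuming the list of dataframes `dfs` from the question and putting `M_star` in the index so it becomes a dimension (as also done in the answers below); `assign_coords` attaches `t` as a scalar coordinate, and `concat` then stacks along it:

# sketch only: dfs and the list of times are taken from the question above
d = xr.concat(
    [
        df_i.set_index('M_star').to_xarray().assign_coords(t=t)
        for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
    ],
    dim='t',
)
# the differing M_star values are outer-joined, so missing (t, M_star) combinations become NaN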

2 Answers


The problem occurs because you are assigning t as a column in the dataframes, which is converted to a data variable in the xarray datasets (indexed only by M_star), so the t values are interpreted as conflicts during the merge.

Additionally, since you’re combining along both M_star and t, you should use xr.combine_by_coords rather than concat, which only works along one dimension. See the merging and combining data docs for an overview of the different options.

You can fix this by making sure t becomes a dimension/coordinate before merging. You could assign it as a dimension right away by adding it to the pandas index rather than the columns:

dfs2 = [
    df_i.assign(t=t).set_index('t', append=True)
    for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
]
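
A sketch of how this first route could then be completed, assuming `M_star` is also put in the index so that both index levels become dimensions:

dfs2 = [
    df_i.assign(t=t).set_index(['M_star', 't'])
    for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
]
# each MultiIndex level becomes a dimension in to_xarray(),
# and combine_by_coords then aligns the pieces along both M_star and t
d = xr.combine_by_coords([df_i.to_xarray() for df_i in dfs2])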

Alternatively you could move the t coordinate assignment into xarray:

d = xr.combine_by_coords(
    [
        df_i.to_xarray().expand_dims(t=[t])
        for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
    ],
)
Michael Delgado
  • Thank you! I now had the time to test your solution and it does what I want when I alter your second suggestion to `df_i.set_index('M_star').to_xarray()......` :) Hence, I accept your answer as the solution! – NeStack Sep 02 '22 at 09:25

Using Michael Delgado's comments, a solution to my problem can be coded this way:

from glob import glob
import pandas as pd
import xarray as xr

fnames = glob('table_t=*.dat')
fnames.sort()
kw = dict(delim_whitespace=True, names=['M_star', 'M_planet', 'separation'], skiprows=1)

# first I load all the tables into a list of dataframes
dfs = [pd.read_csv(fname, **kw) for fname in fnames]

# set_index('M_star') turns M_star from a df column into an index, which, as Michael said, is necessary
# expand_dims(t=[t]) turns t into a dim/coordinate of the Dataset, which I also want
d = xr.combine_by_coords(
    [
        df_i.set_index('M_star').to_xarray().expand_dims(t=[t])
        for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
    ],
)

With this I have the Dataset d in the form that I wanted, with both t and M_star being my coordinates/dimensions, see below (the naming in my actual data is different):

[screenshot of the resulting Dataset, with t and M_star (named differently there) as dimensions]

This allows me to do what I wanted: access values in the Dataset by providing precise values along both M_star and t:

print(float(d.sel(t=9, Log10M_h=11.5)['M_planet'].values))
>>> 0.019

But as Michael stated, I can also get an alternative solution by using only a pandas DataFrame instead of an xarray Dataset. For that I concatenate all the dataframes into one long one and assign an additional column t to keep track of this value; I actually don't need t to be an index. This is what the alternative using exclusively pandas looks like:

# in the 3 lines below we create a df with all the data files concatenated
df_s = [pd.read_csv(fname, **kw) for fname in fnames]
df_s2 = [df_i.assign(t=t) for df_i, t in zip(df_s, [1, 2, 3, 4, 9, 10, 12])]
df = pd.concat(df_s2).reset_index(drop=True)

print(df[(df.t == 3) & (df.M_star == 10.5)]['M_planet'].values[0])
>>> 0.171
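
If label-based lookups similar to `ds.sel` are wanted in the pandas-only variant as well, a possible sketch is to index the concatenated dataframe by t and M_star:

# sketch: a MultiIndex on (t, M_star) allows exact-label lookups, mirroring d.sel(t=..., M_star=...)
df_idx = df.set_index(['t', 'M_star'])
print(df_idx.loc[(3, 10.5), 'M_planet'])
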
NeStack