I would like to remove the duplicated ticker entries from this xarray:
<xarray.QFDataArray (dates: 61, tickers: 4, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan]],
       [[ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ]],
       ...,
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]],
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity ... BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
In the example above, the ticker is duplicated 4 times. My goal is to obtain an output that looks like the following:
<xarray.QFDataArray (dates: 61, tickers: 1, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        ...,
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
Note that the "tickers" dimension was reduced from 4 to 1.
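For reference, since every duplicated row holds identical values, keeping just the first occurrence would produce exactly this output. Here is a minimal sketch of that idea, using a plain xarray.DataArray in place of QFDataArray (the toy data below is made up, not taken from the cache):

import numpy as np
import xarray as xr

# toy cube mimicking the structure above: one ticker label repeated 4 times,
# with identical values in every duplicated row
values = np.tile(np.array([[4.9167, 2.1695], [5.0, 2.1333]])[:, None, :], (1, 4, 1))
da = xr.DataArray(
    values,
    dims=("dates", "tickers", "fields"),
    coords={"tickers": ["0000630D US Equity"] * 4,
            "fields": ["PX_LAST", "VOLATILITY_360D"]},
)

# since the duplicated rows are identical, keeping the first one is lossless;
# the list index [0] preserves the tickers dimension (length 1)
deduped = da.isel(tickers=[0])
print(dict(deduped.sizes))  # {'dates': 2, 'tickers': 1, 'fields': 2}

Recent xarray releases also provide DataArray.drop_duplicates('tickers'), which collapses the duplicates in one call, if your version has it.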
Here is the code (library imports not included):
def _get_historical_data_cache():
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'cached_values_v2_clean.cache')
    data = cached_value(_get_historical_data_bloomberg, path)  # load the data from the cache; if unavailable, fetch it directly from the data provider
    return data

def _slice_by_ticker():
    tickers = _get_historical_data_cache().indexes['tickers']
    for k in tickers:
        slice = _get_historical_data_cache().loc[:, k, :]  # this gives me duplicated tickers
From the data provider, I get a 3-D data array (xarray) with the dimensions dates, tickers, and fields. The goal is to slice this cube plane by plane, i.e. ticker by ticker, so that each iteration yields a 2-D data array (or a 3-D xarray like the desired output shown above) representing one ticker with its corresponding data (dates and fields).
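As an aside, one way to get a clean 2-D slice per ticker is to iterate by integer position with .isel instead of by label, so the lookup never touches the non-unique ticker labels; a rough sketch, assuming a plain xarray object (the function name is mine):

def _slice_by_ticker_positional(data):
    # iterate over integer positions rather than (possibly duplicated) labels
    for i in range(data.sizes["tickers"]):
        yield data.isel(tickers=i)  # 2-D slice with dims (dates, fields)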
Here is what the xarray looks like in the first iteration (as shown above). The problem is that the single ticker is duplicated:
In[2]: slice
Out[2]:
<xarray.QFDataArray (dates: 61, tickers: 4, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan]],
       [[ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ]],
       ...,
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]],
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity ... BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
When I try the solution proposed by Ryan, here is the code:
def _slice_by_ticker():
    tickers = _get_historical_data_cache().indexes['tickers']
    for k in tickers:
        slice = _get_historical_data_cache().loc[:, k, :]  # this gives me duplicated tickers

        # get the unique ticker values as a numpy array
        unique_tickers = np.unique(slice.tickers.values)
        da_reindexed = slice.reindex(tickers=unique_tickers)
And here is the error:
ValueError: cannot reindex or align along dimension 'tickers' because the index has duplicate values
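For context, reindex (like align) requires a unique index, so it cannot run once the duplicates already sit in the coordinate; selecting the first occurrence of each label by integer position avoids the alignment step entirely. A hedged sketch of that workaround, applied inside the loop above:

# positions of the first occurrence of each ticker label
_, first_pos = np.unique(slice.tickers.values, return_index=True)
deduped = slice.isel(tickers=np.sort(first_pos))  # positional selection, no reindex needed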
Thanks for your help! :)