I would like to remove the duplicated ticker entries from this xarray:
<xarray.QFDataArray (dates: 61, tickers: 4, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan]],
       [[ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ]],
       ...,
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]],
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity ... BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
In the example above, the ticker is duplicated 4 times. My goal is to obtain an output that looks like the following:
<xarray.QFDataArray (dates: 61, tickers: 1, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        ...,
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
Note that the "tickers" dimension was reduced from 4 to 1.
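For reference, since every duplicated row holds identical values, keeping just the first occurrence would produce exactly this output. Here is a minimal sketch of that idea, using a plain xarray.DataArray in place of QFDataArray (the toy data below is made up, not taken from the cache):

import numpy as np
import xarray as xr

# toy cube mimicking the structure above: one ticker label repeated 4 times,
# with identical values in every duplicated row
values = np.tile(np.array([[4.9167, 2.1695], [5.0, 2.1333]])[:, None, :], (1, 4, 1))
da = xr.DataArray(
    values,
    dims=("dates", "tickers", "fields"),
    coords={"tickers": ["0000630D US Equity"] * 4,
            "fields": ["PX_LAST", "VOLATILITY_360D"]},
)

# since the duplicated rows are identical, keeping the first one is lossless;
# the list index [0] preserves the tickers dimension (length 1)
deduped = da.isel(tickers=[0])
print(dict(deduped.sizes))  # {'dates': 2, 'tickers': 1, 'fields': 2}

Recent xarray releases also provide DataArray.drop_duplicates('tickers'), which collapses the duplicates in one call, if your version has it.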
Here is the code (library imports not included):
def _get_historical_data_cache():
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'cached_values_v2_clean.cache')
    data = cached_value(_get_historical_data_bloomberg, path)  # load the data from the cache; if unavailable, fetch it directly from the data provider
    return data

def _slice_by_ticker():
    tickers = _get_historical_data_cache().indexes['tickers']
    for k in tickers:
        slice = _get_historical_data_cache().loc[:, k, :]  # this gives me duplicated tickers
From the data provider, I get a 3-D data array (xarray) with the dimensions dates, tickers, and fields. The goal is to slice this cube plane by plane, i.e. ticker by ticker, so that each iteration yields a 2-D data array (or a 3-D xarray like the desired output shown above) representing one ticker with its corresponding data (dates and fields).
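As an aside, one way to get a clean 2-D slice per ticker is to iterate by integer position with .isel instead of by label, so the lookup never touches the non-unique ticker labels; a rough sketch, assuming a plain xarray object (the function name is mine):

def _slice_by_ticker_positional(data):
    # iterate over integer positions rather than (possibly duplicated) labels
    for i in range(data.sizes["tickers"]):
        yield data.isel(tickers=i)  # 2-D slice with dims (dates, fields)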
Here is what the xarray looks like in the first iteration (as shown above). The problem is that the single ticker is duplicated:
In[2]: slice
Out[2]:
<xarray.QFDataArray (dates: 61, tickers: 4, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan]],
       [[ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ]],
       ...,
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]],
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity ... BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
When I try the solution proposed by Ryan, here is the code:
def _slice_by_ticker():
    tickers = _get_historical_data_cache().indexes['tickers']
    for k in tickers:
        slice = _get_historical_data_cache().loc[:, k, :]  # this gives me duplicated tickers

        # get the unique ticker values as a numpy array
        unique_tickers = np.unique(slice.tickers.values)
        da_reindexed = slice.reindex(tickers=unique_tickers)
And here is the error:
ValueError: cannot reindex or align along dimension 'tickers' because the index has duplicate values
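For context, reindex (like align) requires a unique index, so it cannot run once the duplicates already sit in the coordinate; selecting the first occurrence of each label by integer position avoids the alignment step entirely. A hedged sketch of that workaround, applied inside the loop above:

# positions of the first occurrence of each ticker label
_, first_pos = np.unique(slice.tickers.values, return_index=True)
deduped = slice.isel(tickers=np.sort(first_pos))  # positional selection, no reindex needed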
Thanks for your help! :)