Out of bounds error when unstacking MultiIndex pandas dataframe after filtering

Question

I have a MultiIndex pandas DataFrame that I perform some operations on (including dropping columns with null values) and then try to unstack. However, this results in an IndexError. Is there any way to fix this? Code below:

ds = ds.unstack(level='Symbol')
ds.columns = ds.columns.swaplevel(0, 1)
ds = ds[start:end]
ds = ds[equities]
ds = ds.stack(level='Symbol')
ds.dropna(axis=1, inplace=True) # this line breaks the code
ds = ds.unstack(level='Symbol')
ds.head()

Without the dropna line the code runs fine, so something about that step is breaking the indexing, which seems like a bug to me. The error occurs with some DataFrames but not all, so it is probably specific to certain circumstances. Any help would be greatly appreciated!

Dumping error log below:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-341-efb7e680485a> in <module>()
      9 #ds.dropna(axis=1, inplace=True)
     10 
---> 11 ds = ds.unstack(level='Symbol')
     12 
     13 ds.head()

~/.local/lib/python3.5/site-packages/pandas/core/frame.py in unstack(self, level, fill_value)
   4567         """
   4568         from pandas.core.reshape.reshape import unstack
-> 4569         return unstack(self, level, fill_value)
   4570 
   4571     _shared_docs['melt'] = ("""

~/.local/lib/python3.5/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    467     if isinstance(obj, DataFrame):
    468         if isinstance(obj.index, MultiIndex):
--> 469             return _unstack_frame(obj, level, fill_value=fill_value)
    470         else:
    471             return obj.T.stack(dropna=False)

~/.local/lib/python3.5/site-packages/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    480         unstacker = partial(_Unstacker, index=obj.index,
    481                             level=level, fill_value=fill_value)
--> 482         blocks = obj._data.unstack(unstacker)
    483         klass = type(obj)
    484         return klass(blocks)

~/.local/lib/python3.5/site-packages/pandas/core/internals.py in unstack(self, unstacker_func)
   4349         new_columns = new_columns[columns_mask]
   4350 
-> 4351         bm = BlockManager(new_blocks, [new_columns, new_index])
   4352         return bm
   4353 

~/.local/lib/python3.5/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
   3035         self._consolidate_check()
   3036 
-> 3037         self._rebuild_blknos_and_blklocs()
   3038 
   3039     def make_empty(self, axes=None):

~/.local/lib/python3.5/site-packages/pandas/core/internals.py in _rebuild_blknos_and_blklocs(self)
   3123         for blkno, blk in enumerate(self.blocks):
   3124             rl = blk.mgr_locs
-> 3125             new_blknos[rl.indexer] = blkno
   3126             new_blklocs[rl.indexer] = np.arange(len(rl))
   3127 

IndexError: index 100352 is out of bounds for axis 1 with size 100352

Unfortunately it is. I have been trying to come up with an example case to illustrate the issue, but I haven't been able to emulate one without identifying the underlying cause of the issue yet :( — J. Vasquez, Apr 02 '18 at 13:53
Definitely not a Series, it's a DataFrame object. I'm dropping columns with any null value as a pre-processing step for my machine learning models. — J. Vasquez, Apr 02 '18 at 14:18
What about an alternative solution instead of `dropna`? `ds = ds[ds.notnull().all(axis=1)]`? — jezrael, Apr 02 '18 at 14:38
Worth a try but unfortunately no difference. Still causes an IndexError in the following unstack. — J. Vasquez, Apr 02 '18 at 14:53
Hmm, 2 ideas: What is your pandas version? Is it possible that there are some `NaN`s in the MultiIndex? — jezrael, Apr 02 '18 at 15:00
OK, is the MultiIndex in the columns or in the index? What does `print (ds.info())` return before dropna? — jezrael, Apr 02 '18 at 15:06
Right, so after stacking the DataFrame has a MultiIndex (index) of Date and Company, for each Date and Company there are several features (columns) that tell me about the performance of said company on that date. Missing information can either be missing for all companies or a single company for a given date. Not very experienced with multi-index/multi-column so not sure how these two might cause issues for unstack(). — J. Vasquez, Apr 02 '18 at 15:07
ds.info() output: MultiIndex: 1636544 entries, (2002-01-01 00:00:00, 1436513D) to (2015-12-31 00:00:00, ZION) Columns: 224 entries, ADV$_21D to px_volume dtypes: float64(223), object(1) memory usage: 2.7+ GB — J. Vasquez, Apr 02 '18 at 15:10
Another idea: after `dropna`, use `df.index = df.index.remove_unused_levels()` — jezrael, Apr 02 '18 at 15:15
AttributeError: 'DatetimeIndex' object has no attribute 'remove_unused_levels' — J. Vasquez, Apr 02 '18 at 15:35
Sorry, did something stupid... copy and pasted your code (df is something else). Will try again.. — J. Vasquez, Apr 02 '18 at 15:44
That did the trick! Could you maybe elucidate a bit why remove_unused_levels() helps things after dropna for me? — J. Vasquez, Apr 02 '18 at 15:48

score 4 · Accepted Answer · answered Apr 02 '18 at 15:54

The problem is that dropna removes some rows, and with them some values of the MultiIndex, but the levels stored in the MultiIndex are not updated by default. So you need to remove these unused values from the MultiIndex with MultiIndex.remove_unused_levels before unstacking.

ds = ds.stack(level='Symbol')
ds.dropna(axis=1, inplace=True)

ds.index = ds.index.remove_unused_levels()

ds = ds.unstack(level='Symbol')
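
To illustrate the point about stale levels, here is a minimal sketch on toy data. The frame, the symbol names `AAA` and `BBB`, and the boolean filter are hypothetical and only stand in for the real filtering steps. It shows that filtering leaves the old values in `index.levels`, and that `remove_unused_levels()` drops them. The IndexError itself may not reproduce on current pandas versions, so the sketch only demonstrates the level bookkeeping.

```python
import pandas as pd

# Hypothetical toy data: 2 dates x 3 symbols, one feature column.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2002-01-01", "2002-01-02"]), ["AAA", "BBB", "ZION"]],
    names=["Date", "Symbol"],
)
df = pd.DataFrame({"px_volume": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Filter out every row for one symbol (stands in for the real filtering).
filtered = df[df.index.get_level_values("Symbol") != "BBB"].copy()

# The rows are gone, but the level values still list the removed symbol.
levels_before = list(filtered.index.levels[1])

# Rebuild the levels so they contain only values that are actually used.
filtered.index = filtered.index.remove_unused_levels()
levels_after = list(filtered.index.levels[1])

# Unstacking now yields one column per remaining symbol.
wide = filtered.unstack(level="Symbol")

print(levels_before)
print(levels_after)
print(wide.shape)
```

Comparing `index.levels` with `index.get_level_values(...).unique()` is a quick way to check whether a MultiIndex carries unused entries before calling unstack.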

jezrael

822,522
95
1,334
1,252

This works great, thanks for your answer and explanation (and patience)! – J. Vasquez Apr 02 '18 at 15:57
@J.Vasquez - Yes, it is not easy without data, but it is really good that it is working for you. Have a nice day! – jezrael Apr 02 '18 at 15:57
