I have a multi-index pandas DataFrame that I perform some operations on (including dropping columns with null values) and then try to unstack. However, this results in an IndexError. Is there any way to fix this? Code below:

ds = ds.unstack(level='Symbol')
ds.columns = ds.columns.swaplevel(0, 1)
ds = ds[start:end]
ds = ds[equities]
ds = ds.stack(level='Symbol')
ds.dropna(axis=1, inplace=True) # this line breaks the code
ds = ds.unstack(level='Symbol')
ds.head()

Without the dropna line the code runs fine, so something about it is breaking the indexing, which seems like a bug to me. The error is thrown with some DataFrames but not all, so it is probably specific to certain circumstances. Any help would be greatly appreciated!

Dumping error log below:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-341-efb7e680485a> in <module>()
      9 #ds.dropna(axis=1, inplace=True)
     10 
---> 11 ds = ds.unstack(level='Symbol')
     12 
     13 ds.head()

~/.local/lib/python3.5/site-packages/pandas/core/frame.py in unstack(self, level, fill_value)
   4567         """
   4568         from pandas.core.reshape.reshape import unstack
-> 4569         return unstack(self, level, fill_value)
   4570 
   4571     _shared_docs['melt'] = ("""

~/.local/lib/python3.5/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    467     if isinstance(obj, DataFrame):
    468         if isinstance(obj.index, MultiIndex):
--> 469             return _unstack_frame(obj, level, fill_value=fill_value)
    470         else:
    471             return obj.T.stack(dropna=False)

~/.local/lib/python3.5/site-packages/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    480         unstacker = partial(_Unstacker, index=obj.index,
    481                             level=level, fill_value=fill_value)
--> 482         blocks = obj._data.unstack(unstacker)
    483         klass = type(obj)
    484         return klass(blocks)

~/.local/lib/python3.5/site-packages/pandas/core/internals.py in unstack(self, unstacker_func)
   4349         new_columns = new_columns[columns_mask]
   4350 
-> 4351         bm = BlockManager(new_blocks, [new_columns, new_index])
   4352         return bm
   4353 

~/.local/lib/python3.5/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
   3035         self._consolidate_check()
   3036 
-> 3037         self._rebuild_blknos_and_blklocs()
   3038 
   3039     def make_empty(self, axes=None):

~/.local/lib/python3.5/site-packages/pandas/core/internals.py in _rebuild_blknos_and_blklocs(self)
   3123         for blkno, blk in enumerate(self.blocks):
   3124             rl = blk.mgr_locs
-> 3125             new_blknos[rl.indexer] = blkno
   3126             new_blklocs[rl.indexer] = np.arange(len(rl))
   3127 

IndexError: index 100352 is out of bounds for axis 1 with size 100352
jezrael
J. Vasquez
  • Are the data confidential? Without data it is hard to find the problem. – jezrael Apr 02 '18 at 13:41
  • Unfortunately it is. I have been trying to come up with an example case to illustrate the issue, but I haven't been able to emulate one without identifying the underlying cause of the issue yet :( – J. Vasquez Apr 02 '18 at 13:53
  • OK, what is `print (type(ds))` before `dropna` ? – jezrael Apr 02 '18 at 13:54
  • Because if it is a Series, try `ds.dropna(inplace=True)` – jezrael Apr 02 '18 at 13:55
  • Definitely not a Series, it's a DataFrame object. I'm dropping columns with any null value as a pre-processing step for my machine learning models. – J. Vasquez Apr 02 '18 at 14:18
  • What about an alternative solution instead of `dropna`? `ds = ds[ds.notnull().all(axis=1)]` – jezrael Apr 02 '18 at 14:38
  • Worth a try but unfortunately no difference. Still causes an IndexError in the following unstack. – J. Vasquez Apr 02 '18 at 14:53
  • Hmm, 2 ideas: What is your pandas version? Is it possible there are some `NaN`s in the MultiIndex? – jezrael Apr 02 '18 at 15:00
  • Pandas version 0.22.0 – J. Vasquez Apr 02 '18 at 15:04
  • OK, is the MultiIndex in the columns or in the index? What does `print (ds.info())` return before dropna? – jezrael Apr 02 '18 at 15:06
  • Right, so after stacking the DataFrame has a MultiIndex (index) of Date and Company, for each Date and Company there are several features (columns) that tell me about the performance of said company on that date. Missing information can either be missing for all companies or a single company for a given date. Not very experienced with multi-index/multi-column so not sure how these two might cause issues for unstack(). – J. Vasquez Apr 02 '18 at 15:07
  • ds.info() output: MultiIndex: 1636544 entries, (2002-01-01 00:00:00, 1436513D) to (2015-12-31 00:00:00, ZION) Columns: 224 entries, ADV$_21D to px_volume dtypes: float64(223), object(1) memory usage: 2.7+ GB – J. Vasquez Apr 02 '18 at 15:10
  • Another idea: after `dropna`, use `df.index = df.index.remove_unused_levels()` – jezrael Apr 02 '18 at 15:15
  • AttributeError: 'DatetimeIndex' object has no attribute 'remove_unused_levels' – J. Vasquez Apr 02 '18 at 15:35
  • Is there a MultiIndex? – jezrael Apr 02 '18 at 15:37
  • Sorry, did something stupid... copy and pasted your code (df is something else). Will try again.. – J. Vasquez Apr 02 '18 at 15:44
  • That did the trick! Could you maybe elucidate a bit why remove_unused_levels() helps things after dropna for me? – J. Vasquez Apr 02 '18 at 15:48
  • Super :) I'll write up an answer. – jezrael Apr 02 '18 at 15:50

1 Answer

The problem is that `dropna` removes some data, so some values in the MultiIndex levels are no longer used, but the levels themselves are not updated by default. You need to remove these unused values from the MultiIndex with `MultiIndex.remove_unused_levels`.

ds = ds.stack(level='Symbol')
ds.dropna(axis=1, inplace=True)

ds.index = ds.index.remove_unused_levels()

ds = ds.unstack(level='Symbol')
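
To illustrate what "unused values" means here, below is a minimal sketch on made-up data. The frame, the `px` column and the symbols `AAA`, `BBB`, `CCC` are hypothetical and not from the question, and the sketch drops rows rather than columns purely to make the stale level easy to see. It does not reproduce the IndexError itself, only the leftover level values that `remove_unused_levels` cleans up.

```python
import pandas as pd

# Hypothetical toy data: two dates, three symbols, one feature column.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2018-01-01", "2018-01-02"]), ["AAA", "BBB", "CCC"]],
    names=["Date", "Symbol"],
)
df = pd.DataFrame({"px": [1.0, 2.0, None, 4.0, 5.0, None]}, index=idx)

# Dropping the rows with nulls removes every row for symbol CCC ...
filtered = df.dropna(axis=0)

# ... but the 'Symbol' level still lists CCC, even though no row uses it.
print(list(filtered.index.levels[1]))  # ['AAA', 'BBB', 'CCC']

# remove_unused_levels() returns a new MultiIndex without the stale value.
cleaned_index = filtered.index.remove_unused_levels()
print(list(cleaned_index.levels[1]))   # ['AAA', 'BBB']

# Assign the cleaned index back before unstacking, as in the answer above.
filtered.index = cleaned_index
wide = filtered.unstack(level="Symbol")
print(wide)
```

Comparing `index.levels` before and after the call is a quick way to check whether a filtered frame is carrying stale level values.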
jezrael