7

Question + MWE

How can I drop/remove multiple ranges of rows from a Pandas DataFrame with a (two level) multi index looking like this:

idx1    idx2  |  value(s)   ...
------------------------------------------
4       0     |  1.123456   ...
        1     |  2.234567   ...
        2     |  0.012345   ...
8       0     | -1.123456   ...
        1     | -0.973915   ...
        2     |  1.285553   ...
        3     | -0.194625   ...
        4     | -0.144112   ...
...     ...   | ...         ...

The ranges to drop/remove are currently residing in a list like this:

ranges = [[(4, 1), (4, 2)],          # range (4,1):(4,2)
          [(8, 0), (8, 3)],          # range (8,0):(8,3)
          [(8, 5), (8, 10)], ...]    # range (8,5):(8,10)

The main problem is, that most methods I found, don't support either multi indexing or slicing or multiple slices/ranges.

What is the best/fastest way to do so.

Current ugly solution

for range in ranges:
    df = df.drop(df.loc[range[0]:range[1]].index)

Is slow and ugly, but is the only workable solution I found, combining multi indexing, slicing and in a way multiple ranges (by looping over the ranges).

Solution comparison

All three proposed solutions work. They all solve the issue by transforming the list of slices into a list of all individual tuples within those slice.

Slices to complete set of tuples

The fastest way to do so is @ALollz solution:

idx = [(x, z) for (x, i), (_, j) in ranges for z in np.arange(i,j+1,1)]

Performance

Regarding the removal of the rows, all solutions work, but there's a big difference in performance (all performance data based on my data set with ~10 Mio. entries)

  1. @ALollz + @Ben. T's combined solution (~19 sec.)

    df.drop(pd.MultiIndex.from_tuples(idx))
    

    or without creating a MultiIndex object

    df.drop(idx)
    
  2. @ALollz first solution (~75 sec.)

    df.loc[list(set(df.index.values) - set(idx))]
    
  3. @user3471881's solution (~95 sec.)

    df.loc[~df.index.isin(ranges)]
    
  4. my ugly solution (~350 sec.)

    see above
    
Ichixgo
  • 259
  • 3
  • 11
  • Would using `loc` work? like [this question](https://stackoverflow.com/questions/24435788/using-loc-with-a-multiindex-in-pandas) – MattR Nov 19 '18 at 16:37
  • 1
    It works, it's part of my currently working ugly solution, but it only works for one range at a time like this: `df.drop(df.loc[(4,1):(4,2)].index)` I haven't found a way to get multiple ranges into it and it's also hella slow... am also gonna add my current ugly solution to the post. – Ichixgo Nov 19 '18 at 16:41
  • 1
    Have you lloked into `pd.slice` or, maybe even better, [indexslice](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.IndexSlice.html)? – G. Anderson Nov 19 '18 at 16:45
  • Yeah, they just give a nicer interface to slicing and some functions require slices to be in the form of `(4, slice(1,2))` instead of `(4,1):(4,2)` but they didn't enable me to circumvent the multiple ranges problem. – Ichixgo Nov 19 '18 at 16:54
  • How do you construct your range list and does it need to contain lists of tuples (pairs that become a sliceable "range") or did you design it that way in order to use `loc`? – user3471881 Nov 19 '18 at 17:30
  • It doesn't necessarily has to be a list of tuples, but it was the most straight forward solution so far. The way I get the ranges is, by looking for a certain value of a certain variable. Every occurrence of this var inside an idx1 marks the beginning or end of a range. So I just loop the whole df once and mark all the points where a range begins or ends. – Ichixgo Nov 19 '18 at 17:38
  • Sorry maybe I was unclear, does it need to be a list of pairs that combined become a range or could it be `[(8,1), (8,2), (8,3), (8,4), (8,5)]`? – user3471881 Nov 19 '18 at 17:40
  • Yeah I get your point. It's just that when going through the df I'm just marking the end and beginning of every range. I could also track all the points in between. It just seemed unnecessary so far. – Ichixgo Nov 19 '18 at 17:42

2 Answers2

2

You could create a new list of indices and as Ben.T points out, just drop them.

import numpy as np
import pandas as pd

idx = [(x, z) for (x, i), (_, j) in ranges for z in np.arange(i,j+1,1)]
df.drop(pd.MultiIndex.from_tuples(idx))

Output:

           value(s)
idx1 idx2          
4    0            4
8    4           11
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • For me, your solution removes everything except the ranges, I want to remove. So it essentially does the opposite of what I wanted to do. – Ichixgo Nov 19 '18 at 17:03
  • 1
    you can also use `drop` with your `idx` such as `df.drop(pd.MultiIndex.from_tuples(idx))` – Ben.T Nov 19 '18 at 17:29
  • 2
    Thanks guys! Creating the complete index list instead of the ranges made the trick :). I tried all three solutions: your first one @ALollz, yours + @Ben.T s and the one below proposed by @user3471881. They all do the trick, based on transforming the slices into a complete tuple list. Performance wise you can see a comparison up top :)! Thanks again. Btw. your combined solution was the fastest and you can also just call `df.drop(idx)` it's in the same speed range as your solution. – Ichixgo Nov 19 '18 at 18:03
1

The range list you are using forces us to use multiple slices, which could be fine but doesn't seem to be what you want.

If you instead fill your list with all indexes that you want removed (you said in a comment that you could do this):

ranges = [(4, 1), (4, 2), (8, 0), (8, 1), (8, 2), (8, 3) ... ]

You could just access the index of the DataFrame and check if it isin() your list of tuples.

df.index.isin(ranges)

To remove the indexes that are in your list of ranges, add a tilde and then use as mask.

df[~df.index.isin(ranges)]
user3471881
  • 2,614
  • 3
  • 18
  • 34