2

I'm working with two dataframes, df1 and df2. I used this code:

df1.index.searchsorted(df2.index)

But I'm not sure how it works. Could someone please explain?

Snowfire777

1 Answer

9

The method applies a binary search to the index. This is a well-known algorithm that uses the fact that values are already in sorted order to find an insertion index in as few steps as possible.

Binary search works by picking the middle element of the values and comparing it to the searched-for value; if the value is lower than that middle element, you narrow your search to the first half, otherwise you look at the second half. You then repeat the same step on the remaining half until there is nothing left to split.
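To make the halving concrete, here is a minimal pure-Python sketch of a leftmost-insertion binary search. The function name `insertion_index` and the sample values are made up for illustration; this is not the pandas or numpy source, only the same idea written out by hand.

```python
def insertion_index(sorted_values, target):
    """Return the leftmost position where target can be inserted to keep order."""
    lo, hi = 0, len(sorted_values)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_values[mid] < target:
            # target must go somewhere after mid: drop the first half
            lo = mid + 1
        else:
            # target goes at mid or before it: drop the second half
            hi = mid
    return lo


values = [10, 20, 30, 40, 50]             # must already be sorted
found_at = insertion_index(values, 30)    # 30 is present in the list
missing_at = insertion_index(values, 35)  # 35 is not present
print(found_at, missing_at)
```

For a value that is present, the result is its position; for a missing value, it is the position it would occupy.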

This way you reduce the number of steps needed to find your element to roughly the base-2 logarithm of the length of the index. For 1,000 elements, that's about 10 steps; for a million elements, about 20.
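As a rough check on how slowly the step count grows, the sketch below counts how many times a range of n candidates can be halved before it is empty. The helper name `max_halving_steps` is hypothetical, and the count assumes a base-2 halving search like the one described above.

```python
def max_halving_steps(n):
    """Count how many times a range of n candidates can be halved until none remain."""
    steps = 0
    while n > 0:
        n //= 2
        steps += 1
    return steps


steps_1000 = max_halving_steps(1000)
steps_million = max_halving_steps(1_000_000)
print(steps_1000, steps_million)
```

Multiplying the length by a thousand only doubles the number of steps, which is why the search stays fast on large indexes.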

The insertion index is the place to add your value to keep the index in sorted order. With the default side='left', it is also the position of the first matching value when one exists, so you can use it both to find where to insert missing or duplicate values and to test whether a given value is present in the index.
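A small example may help here. It uses `Index.searchsorted` with its `side` argument on a made-up sorted index that contains a duplicate; the helper `is_present` is a hypothetical name used only to illustrate the membership test, not something pandas provides.

```python
import pandas as pd

index = pd.Index([10, 20, 20, 30, 40])  # sorted, with a duplicate 20

left = int(index.searchsorted(20))                 # position of the first 20
right = int(index.searchsorted(20, side="right"))  # position just past the last 20
missing = int(index.searchsorted(25))              # where 25 would be inserted


def is_present(sorted_index, value):
    """Membership test built on the left insertion position."""
    pos = int(sorted_index.searchsorted(value))
    return pos < len(sorted_index) and bool(sorted_index[pos] == value)


print(left, right, missing, is_present(index, 20), is_present(index, 25))
```

The bounds check in `is_present` matters: for a value larger than everything in the index, the insertion position equals the length of the index and cannot be looked up.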

The pandas implementation is basically the numpy.searchsorted() function, which uses generated C code to optimise this search for different object types, squeezing out every last drop of speed.
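Applied to the call from the question, a sketch might look like this. The contents of `df1` and `df2` are invented for illustration, and `df1.index` is assumed to be sorted; the comparison with `numpy.searchsorted` only shows that both give the same positions for this data, not how pandas dispatches internally.

```python
import numpy as np
import pandas as pd

# Hypothetical frames; only their indexes matter here.
df1 = pd.DataFrame({"a": [1, 2, 3, 4]}, index=[10, 20, 30, 40])
df2 = pd.DataFrame({"b": [5, 6, 7]}, index=[5, 20, 35])

# For each label in df2.index, the position in df1.index where it would be inserted.
positions = df1.index.searchsorted(df2.index)

# The same lookup done directly with numpy on the underlying arrays.
numpy_positions = np.searchsorted(df1.index.to_numpy(), df2.index.to_numpy())

print(positions, numpy_positions)
```

Here 5 sorts before everything in `df1.index`, 20 matches the label at position 1, and 35 falls between 30 and 40.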

Pandas uses the method in various index implementations to ensure fast operations. You usually wouldn't use this method yourself to test if a value is present in the index, for example, because Pandas indexes already implement an efficient __contains__ method for you, usually based on searchsorted() where that makes sense. See DatetimeEngine.__contains__() for such an example.
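In everyday code that means the `in` operator and `Index.get_loc` are usually enough. The sketch below uses a made-up integer index; which internal engine handles the lookup is not shown and depends on the index type.

```python
import pandas as pd

index = pd.Index([10, 20, 30, 40])

has_30 = 30 in index         # membership test via __contains__
has_35 = 35 in index
loc_30 = index.get_loc(30)   # integer position of an existing label

try:
    index.get_loc(35)        # a missing label raises KeyError
    missing_raises = False
except KeyError:
    missing_raises = True

print(has_30, has_35, loc_30, missing_raises)
```

Unlike `searchsorted`, `get_loc` does not return an insertion point for a missing label; it raises instead.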

Martijn Pieters
  • Is `index.get_loc` smart enough to use searchsorted or some other log-time method when `index` is sorted? – JoseOrtiz3 Mar 26 '19 at 19:57
  • @OrangeSherbet: yes, `index.get_loc` will use `searchsorted` where appropriate. The [`DateTimeEngine.get_loc` implementation](https://github.com/pandas-dev/pandas/blob/v0.23.0.dev0/pandas/_libs/index.pyx#L443-L482) does just that, for example. – Martijn Pieters Mar 26 '19 at 20:21