
Suppose I have

>>> v
array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])

Is there an efficient numpy way to find each index where the value changes? For instance, I would want some result like,

>>> index_of_changed_values(v)
[0, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16]

If this is not possible with some numpy routine, what is a fast way to do it in python? It would also be useful to me to be referred to some good numpy tutorials since I am a numpy beginner.

Aaron

4 Answers


You can get this functionality in numpy by comparing each element with its neighbor:

v[:-1] != v[1:]


array([False, False, False, False,  True, False, False,  True,  True,
    True,  True,  True,  True,  True,  True,  True, False, False], dtype=bool)

To get the indices, use the "where" function:

np.where(v[:-1] != v[1:])[0]

array([ 4,  7,  8,  9, 10, 11, 12, 13, 14, 15])

From here you can add one to each index and prepend the index 0 to get the same indexing scheme you have in your question.
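Putting those two steps together (a sketch; the variable names are mine):

```python
import numpy as np

v = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])

changed = np.where(v[:-1] != v[1:])[0] + 1  # shift right: mark the start of each new run
idx = np.concatenate(([0], changed))        # index 0 always starts the first run
print(idx)  # [ 0  5  8  9 10 11 12 13 14 15 16]
```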

kith

Similar to @kith's answer, but requires less massaging of the result:

np.where(np.roll(v,1)!=v)[0]

No need to prepend 0 or add 1. Example:

>>> v=np.array([1, 1, 1, 2, 2, 3, 3, 4, 4, 4])
>>> np.where(np.roll(v,1)!=v)[0]
array([0, 3, 5, 7])

EDIT: as @Praveen mentioned, this fails when the last and the first elements are equal.
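A quick sketch of the failure case @Praveen describes:

```python
import numpy as np

v = np.array([1, 1, 1, 2, 2, 2, 1, 1, 1])
# np.roll wraps the last element around to the front, so v[0] is compared
# against v[-1]; when they are equal, index 0 is silently dropped.
print(np.where(np.roll(v, 1) != v)[0])  # [3 6] -- missing the expected 0
```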

    This won't work if the array looks something like `[1, 1, 1, 2, 2, 2, 1, 1, 1]`. i.e., the first and last values are the same, so you won't get the index '0' like you expect... – Praveen Oct 12 '16 at 07:56
  • @Praveen It worked for my example: `('(Hashimotos)', 2150, 2161) : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]` Output `[0, 1, 11]` – PinkBanter Nov 20 '22 at 16:10

Almost ten years later, but I came across this one today.

@kith's answer is good, but may not be as neat as we'd like (it also leaves a couple of steps implicit).

That answer in its complete form would be:

v = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 4, 3, 4, 3, 4, 5, 5, 5])
np.concatenate((np.array([0]),np.where(v[:-1] != v[1:])[0]+1),axis=0)

An alternative I like more is,

np.where(np.diff(v,prepend=np.nan))[0]

which also returns

array([ 0,  5,  8,  9, 10, 11, 12, 13, 14, 15, 16], dtype=int64)

As I said, the idea is the same as @kith's, but:

  • I replaced v[:-1] != v[1:] with np.diff(); inside np.where the array is then cast to boolean. This doesn't change much, but seems neater.
  • I removed the extra steps of adding 1 and prepending 0. This is done by prepending np.nan before calling np.diff(): the first element of the diff output is then np.nan, which always evaluates as truthy. (Note that the prepend kwarg requires NumPy 1.16.0 or newer.)
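A small sketch of why the np.nan trick works:

```python
import numpy as np

v = np.array([1, 1, 2, 2, 3])
d = np.diff(v, prepend=np.nan)  # prepend forces a float result; d[0] is nan
print(d)                        # [nan  0.  1.  0.  1.]
# np.where treats nan as truthy and 0.0 as falsy, so index 0 is always included
print(np.where(d)[0])           # [0 2 4]
```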
myradio
    hi, OP here. This is giving me some serious nostalgia for my first developer role in industry. I found out a few days ago that the team I was on at the time dissolved. Thank you guys for your answers over the years. – Aaron Jan 11 '21 at 20:06
    I'm glad it helped in any way ^_^. Stay safe. – myradio Jan 12 '21 at 07:55
    You must have NumPy version 1.16.0 or newer installed for `np.where(np.diff(v,prepend=np.nan))[0]` to work. I'm posting for those that don't know. – David M. Helmuth Mar 14 '21 at 08:08
  • @DavidM.Helmuth Good point, I honestly hadn't checked it. I will add it to the answer. – myradio Mar 14 '21 at 08:12

Great question and answers!

I am working with a vector with about 1 million monotonically non-decreasing integers running from 1 to 100,000 (e.g. [1, 1, 1, 2, 3, 3, 4, ..., 100000]). For this data set, it seems there's a noticeable performance difference between the 2 idioms discussed above and whether or not the prepend kwarg is used:

%timeit np.where(np.diff(v, prepend=np.nan))                                                                                                                             
15.3 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.where(np.diff(v))[0] + 1                                                                                                                                     
7.41 ms ± 72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.where(v[:-1] != v[1:])[0] + 1                                                                                                                                       
2.85 ms ± 41.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the slice-comparison idiom is about 5 times faster than diff() with the prepend kwarg, and over twice as fast as diff() without prepend (on my ancient MacBook Air, anyway). For most use cases this performance difference won't matter, but I'm working with thousands of data sets like this (billions of rows total), so I need to keep performance in mind.
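For anyone wanting to reproduce this, here is a minimal benchmark sketch (the data generation is my assumption based on the description above, not the actual data set):

```python
import numpy as np
from timeit import timeit

# ~1M monotonically non-decreasing integers in [1, 100000]
rng = np.random.default_rng(0)
v = np.sort(rng.integers(1, 100_001, size=1_000_000))

candidates = {
    "diff + prepend": lambda: np.where(np.diff(v, prepend=np.nan))[0],
    "diff, then + 1": lambda: np.where(np.diff(v))[0] + 1,
    "slice compare":  lambda: np.where(v[:-1] != v[1:])[0] + 1,
}
for label, fn in candidates.items():
    print(f"{label}: {timeit(fn, number=20) / 20 * 1e3:.2f} ms per loop")
```

All three agree on the change indices; they differ only in whether index 0 is included, which the prepend variant handles for free.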

Karl Stanley