
I have implemented a naive merge sort algorithm in Python. The algorithm and test code are below:

import time
import random
import matplotlib.pyplot as plt
import math
from collections import deque

def sort(unsorted):
    if len(unsorted) <= 1:
        return unsorted
    to_merge = deque(deque([elem]) for elem in unsorted)
    while len(to_merge) > 1:
        left = to_merge.popleft()
        right = to_merge.popleft()
        to_merge.append(merge(left, right))
    return to_merge.pop()

def merge(left, right):
    result = deque()
    while left or right:
        if left and right:
            elem = left.popleft() if left[0] > right[0] else right.popleft()
        elif not left and right:
            elem = right.popleft()
        elif not right and left:
            elem = left.popleft()
        result.append(elem)
    return result

LOOP_COUNT = 100
START_N = 1
END_N = 1000

def test(fun, test_data):
    start = time.clock()
    for _ in xrange(LOOP_COUNT):
        fun(test_data)
    return time.clock() - start

def run_test():
    timings, elem_nums = [], []
    test_data = random.sample(xrange(100000), END_N)
    for i in xrange(START_N, END_N):
        loop_test_data = test_data[:i]
        elapsed = test(sort, loop_test_data)
        timings.append(elapsed)
        elem_nums.append(len(loop_test_data))
        print "%f s --- %d elems" % (elapsed, len(loop_test_data))
    plt.plot(elem_nums, timings)
    plt.show()

run_test()

As far as I can see, everything is OK and I should get a nice N*log(N) curve as a result. But the picture differs a bit:

[Plot of the measured timings vs. number of elements: instead of a smooth curve there are large, irregular spikes]

Things I've tried to investigate the issue:

  1. PyPy. The curve is ok.
  2. Disabled the GC using the gc module. Wrong guess. Debug output showed that it doesn't even run until the end of the test.
  3. Memory profiling using meliae - nothing special or suspicious.

I had another implementation (a recursive one using the same merge function), and it behaves the same way. The more full test cycles I run, the more "jumps" there are in the curve.

So how can this behaviour be explained and - hopefully - fixed?

UPD: changed lists to collections.deque

UPD2: added the full test code

UPD3: I use Python 2.7.1 on Ubuntu 11.04, on a quad-core 2 GHz notebook. I tried to turn off most other processes: the number of spikes went down, but at least one of them was still there.

Vlad
    You're using `.pop(0)` on a list. While I'm not sure if this is the reason for this particular runtime problem, it's extremely suboptimal: lists are implemented as arrays in CPython and if you remove the first element, the whole thing has to be shifted (it's an `O(n)` operation). You should pop from the end or use a linked list like `collections.deque` – Niklas B. Apr 03 '12 at 16:45
  • You are looking at extremely few elements. To get a useful estimation of the asymptotic runtime, you'll need bigger numbers. – Sven Marnach Apr 03 '12 at 16:47
  • @NiklasB. I think this comment could be an answer to this question – jsbueno Apr 03 '12 at 16:54
  • @jsbueno: I know, but I don't have the means to prove it :) I'd like to see a proper explanation of this (especially regarding the anomaly in the range from 200 to 350), so if anyones feels like extending on this, go ahead! – Niklas B. Apr 03 '12 at 16:57
  • @NiklasB. Ok, one more thing to learn about Python: a list is not a list, a list is an array. :) Will fix. – Vlad Apr 03 '12 at 17:01
  • @SvenMarnach. The numbers on the plot are somewhat fake. Every test data list is N*10 elements long, and each data set was sorted 100 times to get a single timing. – Vlad Apr 03 '12 at 17:02
  • @vkazanov: "list" is a general term describing certain types of collections: arrays, singly linked lists, doubly linked lists, ... But you are right in that Python lists are not linked lists, as one could assume. – Niklas B. Apr 03 '12 at 17:04
  • @NiklasB. I changed all the lists to deques (actually, a minute later I realized that lists could still be used :) ). The sorting became faster, but the "jumps" are still there: [jumps](http://s8.postimage.org/o8ysrafat/deque_long_run_2.png) – Vlad Apr 03 '12 at 17:29
  • @vkazanov: Please include the new code, I will then edit in the new image for you. Also, it would be helpful if you included the test code you use, so we don't have to rewrite it for ourselves. – Niklas B. Apr 03 '12 at 17:30
  • @NiklasB. et al., I tried to reproduce these jumps, and I couldn't. – senderle Apr 03 '12 at 18:11
  • @NiklasB. I also cannot reproduce these jumps, with either the original list-based code or with the deque-based code. – max Apr 03 '12 at 18:17
  • Also, for what it's worth, my test code was different; I wrote my own before I saw that vkazanov had posted his test code. – senderle Apr 03 '12 at 18:19
  • By the way, why the heck did I get better performance for lists than for deques? Are deques so much slower than lists that they perform worse even in the very case for which they are designed - removing the first element? – max Apr 03 '12 at 18:55
  • btw, your `sort()` produces descending order (`sort([3,-1, 4, 0, 5]) == deque([5, 4, 3, 0, -1])`) – jfs Apr 03 '12 at 22:05
  • could you run the test several times using the same input array? Are the spikes always in the same place? (you could use `pickle.dump/load` to save/restore `test_data`). – jfs Apr 04 '12 at 09:05
  • could you set CPU affinity of the python process (to stick it to a single core) e.g., `taskset 1 python yourscript.py`. It shouldn't be necessary but just in case. – jfs Apr 04 '12 at 09:21
  • @J.F.Sebastian I've set the affinity mask, ran the code multiple times using the same data (no pickling - within the same process). The spikes were around all the time, at different places. – Vlad Apr 04 '12 at 14:54
  • @J.F.Sebastian It seems that removing random data generation removed the spikes too. Saving the data and using it stabilised the curve. – Vlad Apr 04 '12 at 15:52
  • Is it possible that Python random data generator precomputes a bunch of numbers in advance, and does it effectively "in parallel" with the main computation loop, slowing it down? A wild guess, really. How long do the spikes last, approximately (in seconds)? – max Apr 04 '12 at 15:59
  • @max Nope. Here goes again. Yet, only one. I give up. – Vlad Apr 04 '12 at 16:21
  • Could you try running the code I posted in my answer? The only thing you'd need to change is to rearrange your code slightly to match the functions I used (`get_execution_time` and `generate_random_input`). – max Apr 04 '12 at 16:57
  • Does `make-figures.py` produce spikes? Does your CPU overheat? Could you fix its frequency (e.g., see `indicator-cpufreq`)? – jfs Apr 04 '12 at 22:49
  • @J.F.Sebastian I think cpu frequency is a great guess. – max Apr 06 '12 at 08:46

2 Answers


You are simply picking up the impact of other processes on your machine.

You run your sort function 100 times for input size 1 and record the total time spent on this. Then you run it 100 times for input size 2, and record the total time spent. You continue doing so until you reach input size 1000.

Let's say once in a while your OS (or you yourself) start doing something CPU-intensive. Let's say this "spike" lasts as long as it takes you to run your sort function 5000 times. This means that the execution times would look slow for 5000 / 100 = 50 consecutive input sizes. A while later, another spike happens, and another range of input sizes look slow. This is precisely what you see in your chart.

I can think of one way to avoid this problem. Run your sort function just once for each input size: 1, 2, 3, ..., 1000. Repeat this process 100 times, using the same 1000 inputs (this is important; see the explanation at the end). Now take the minimum time spent for each input size as your final data point for the chart.

That way, the spikes should affect each input size only a few times out of 100 runs; and since you're taking the minimum, they will likely have no impact on the final chart at all.

If the spikes are really long and frequent, you might of course want to increase the number of repetitions beyond the current 100 per input size.
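
A minimal sketch of this idea (assuming Python 2.7 as in the question, the sort() function posted above, and a precomputed test_data list of at least MAX_N random elements; the helper name and constants here are illustrative, not part of the original code):

import time

REPEATS = 100
MAX_N = 1000

def sweep_minimum(test_data):
    # best[n] holds the fastest time observed so far for input size n
    best = [float('inf')] * (MAX_N + 1)
    for _ in xrange(REPEATS):
        # one pass over all input sizes per repetition, so a system-load
        # spike hits each size at most once per pass
        for n in xrange(1, MAX_N + 1):
            chunk = test_data[:n]
            start = time.clock()
            sort(chunk)
            elapsed = time.clock() - start
            if elapsed < best[n]:
                best[n] = elapsed
    return best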

Looking at your spikes, I notice the execution slows down by exactly a factor of 3 during a spike. I'm guessing the OS gives your Python process one slot out of three during high load. Whether my guess is correct or not, the approach I recommend should resolve the issue.

EDIT:

I realized that I didn't clarify one point in my proposed solution to your problem.

Should you use the same input in each of your 100 runs for the given input size? Or should you use 100 different (random) inputs?

Since I recommended taking the minimum of the execution times, the inputs should be the same (otherwise you'll get an incorrect result, as you'll be measuring the best-case complexity instead of the average-case complexity!).

But if you use the same single input per input size, you introduce some noise into your chart, since some inputs are simply faster to sort than others.

So a better solution is to resolve the system-load problem without creating the new problem of having only one input per input size (this is still pseudocode -- generate_random_input and get_execution_time are placeholders):

import random
from collections import defaultdict

seed = 'choose whatever you like'
repeats = 4
inputs_per_size = 25

# best (minimum) time seen so far for (input size n, input index i)
runtimes = defaultdict(lambda: float('inf'))

for r in range(repeats):
    # re-seeding here makes every repeat regenerate exactly the same inputs
    random.seed(seed)
    for i in range(inputs_per_size):
        for n in range(1000):
            data = generate_random_input(size=n)
            execution_time = get_execution_time(data)
            if execution_time < runtimes[(n, i)]:
                runtimes[(n, i)] = execution_time

# average the per-input minimums to get one value per input size
for n in range(1000):
    runtimes[n] = sum(runtimes[(n, i)] for i in range(inputs_per_size)) / inputs_per_size

Now you can use runtimes[n] to build your plot.

Of course, depending on how noisy your system is, you might change (repeats, inputs_per_size) from (4, 25) to, say, (10, 10), or even (25, 4).
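
For completeness, plotting the averaged values could then look roughly like this (a sketch assuming matplotlib, as already imported in the question's code, and the runtimes dict filled in above):

import matplotlib.pyplot as plt

sizes = range(1000)
plt.plot(sizes, [runtimes[n] for n in sizes])
plt.xlabel('number of elements')
plt.ylabel('execution time (s)')
plt.show()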

max
  • While using `time.clock()` is not the optimal way of timing Python code, it only measures the processor time spent on the process itself, not the wall time. Other processes shouldn't have too much of an impact. – Sven Marnach Apr 03 '12 at 21:49
  • The spikes observed by OP, however, seem to clearly suggest that for a while his program runs 3 times slower, and then goes back to "normal". I don't know what system the OP uses and whether multiple cores or hyperthreads affect this function. My point is that, regardless of the cause, the problem is that he clustered all the runs for the same input size together; and the solution is to spread them across time. – max Apr 03 '12 at 21:55
  • I can even see that the system spikes last around the same amount of time every time, since the horizontal width of the spikes falls off as input size increases in a way that keeps the total length of the spike (in seconds) roughly unchanged. – max Apr 03 '12 at 22:14
  • @max, I was thinking the same thing -- the layout of the spikes is very striking and strongly suggests that they are occurring for a fixed period and at fixed intervals. – senderle Apr 04 '12 at 02:21
  • @max I believe this has something to do with the OS scheduler, and other processes interfering with Python. But how come? All the other cores are free, average CPU load is 1-3% (with the Python process taking 100%), nothing's happening in town. Is there a way to debug such situations? – Vlad Apr 04 '12 at 05:29
  • What is OS/CPU? Can you output the time each test starts, so we can find out the time each spike begins and ends? If, like I think, spikes happen at equal intervals, and have the same duration, that might give us a hint. – max Apr 04 '12 at 05:41
  • @max I've added a comment on OS/CPU into the original question (UPD3). Will update the code later. – Vlad Apr 04 '12 at 05:47

I can reproduce the spikes using your code:

[Plot: spikes in measured time performance]

You should choose an appropriate timing function (time.time() vs. time.clock() -- `from timeit import default_timer` picks the right one for your platform), the number of repetitions in a test (how long each test takes), and the number of tests to choose the minimal time from. This gives you better precision and less external influence on the results. Read the note from the timeit.Timer.repeat() docs:

It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.

The timeit module can choose appropriate parameters for you:

$ python -mtimeit -s 'from m import testdata, sort; a = testdata[:500]' 'sort(a)'
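
The same kind of measurement can be sketched programmatically; this is only an illustration mirroring the command line above, with the module m, its testdata, and the question's sort() assumed exactly as in that command:

import timeit

timer = timeit.Timer('sort(a)',
                     setup='from m import testdata, sort; a = testdata[:500]')
# repeat() returns one total per run; the minimum is the most reliable figure
results = timer.repeat(repeat=5, number=100)
print 'best time per sort() call: %g s' % (min(results) / 100.0)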

Here's the timeit-based performance curve:

[Plot: time performance of sorting functions]

The figure shows that sort() behavior is consistent with O(n*log(n)):

|------------------------------+-------------------|
| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 1.00  log2(N)   +  1.25e-015 | N                 |
| 2.00  log2(N)   +  5.31e-018 | N*N               |
| 1.19  log2(N)   +      1.116 | N*log2(N)         |
| 1.37  log2(N)   +      2.232 | N*log2(N)*log2(N) |

To generate the figure I've used make-figures.py:

$ python make-figures.py --nsublists 1 --maxn=0x100000 -s vkazanov.msort -s vkazanov.msort_builtin 

where:

# adapt sorting functions for make-figures.py
def msort(lists):
    assert len(lists) == 1
    return sort(lists[0]) # `sort()` from the question

def msort_builtin(lists):
    assert len(lists) == 1
    return sorted(lists[0]) # builtin

Input lists are described here (note: the input is sorted, so the builtin sorted() function shows the expected O(N) performance).
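
As a rough sanity check of the growth rate that doesn't depend on make-figures.py, one could fit log2(time) against log2(N) and look at the slope (a sketch; numpy is assumed, and ns/times are placeholder lists of your measured input sizes and timings):

import math
import numpy as np

def growth_slope(ns, times):
    # least-squares slope of log2(time) vs. log2(n): ~1 suggests N,
    # ~2 suggests N*N; N*log(N) comes out a bit above 1, as in the table above
    xs = [math.log(n, 2) for n in ns]
    ys = [math.log(t, 2) for t in times]
    slope, intercept = np.polyfit(xs, ys, 1)
    return slope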

jfs
  • The spikes you observe are inherently different from what the OP observed. Yours are normal random noise as you vary the inputs. It is not persistent; in fact, it looks like white noise (no auto-correlation). The OP observes persistent spikes that stay for a whole range of input sizes, and then disappear at once. Using the minimum would only help if he separates his runs in time. If he takes the minimum over 100 runs, but all 100 runs are done one after another, his spikes will not go away. Luckily, `timeit` implements a `repeat` method so that all repeats don't happen one after another. – max Apr 03 '12 at 22:09
  • Yes, the nature of the spikes is different; the OP might be using a single-CPU computer and/or Windows (`time.clock()`), but the cure is the same: use the minimum as described in the paragraph with `timeit.Timer.repeat()` in it. – jfs Apr 03 '12 at 22:35
  • I respectfully disagree. Using timeit as prescribed won't fix the problem if the OP iterates through his input sizes outside of the timeit function. His spikes are clearly in the range of thousands of individual sort runtimes, so he will still see roughly the same picture that he saw originally. Taking the minimum won't help if the spike lasts dozens of times longer than the entire timeit.repeat call. And from his plot it's obvious that this is indeed the case. – max Apr 03 '12 at 22:45
  • There is no need to guess. @vkazanov could just run the `timeit`-based `make-figures.py` and see the results. Given that he uses a multicore CPU and Ubuntu, and the computer is not under heavy load, there is nothing obvious about the plot (reread your 1st comment about the 1st figure in the answer, which was produced in a similar environment by the exact same code). – jfs Apr 04 '12 at 09:03