
So I'm using insort from bisect to insert strings into a sorted list. It's somehow faster if I use the built-in one. By faster I mean, on average, twice as fast (1 millisecond vs 2 milliseconds on a 10,000-word list). I time it from the Unix command line, on a bigger list than the one I have below, with:

time script.py < wordList.txt

I'm thinking it has to do with C, but I don't understand how or why. I want to make mine as fast, but without using the built-in one.

Here it is, straight from the bisect source code:

def insort_left(a, x, lo=0, hi=None):
    """Insert item x in list a, and keep it sorted assuming a is sorted.

    If x is already in a, insert it to the left of the leftmost x.

    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """

    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x: lo = mid+1
        else: hi = mid
    a.insert(lo, x)

Bisect Source Code

This is the part that I think makes it different:

# Overwrite above definitions with a fast C implementation
try:
    from _bisect import *
except ImportError:
    pass
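
To double-check which version you actually get, this probe (my own addition, not part of bisect itself) prints the defining module - on CPython it typically says _bisect when the C import succeeded:

import bisect

# The C functions live in _bisect; the pure-Python fallbacks live in bisect.
# If this prints '_bisect', the fast C implementation is in use; if the
# import at the top of bisect.py had failed, it would print 'bisect'.
print(bisect.insort_left.__module__)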

Here is a list of input:

wordlist = ["hello", "jupiter", "albacore", "shrimp", "axe", "china", "lance", "peter", "sam", "uncle", "jeniffer", "alex", "fer", "joseph"]

Some code to make it work:

sorted_list = []
for line in wordlist:
    insort_left(sorted_list, line)
print(sorted_list)
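
For anyone who wants to reproduce the timing gap, here is a rough harness along the lines of what I did (it assumes the pure-Python insort_left defined above; the random words are just a stand-in for my actual word list):

import random
import string
import timeit
import bisect

# Stand-in for wordList.txt: 10,000 random 8-letter "words".
words = [''.join(random.choice(string.ascii_lowercase) for _ in range(8))
         for _ in range(10000)]

def build(insort):
    sorted_list = []
    for word in words:
        insort(sorted_list, word)
    return sorted_list

# Pure-Python insort_left defined above vs. the C-backed bisect.insort_left.
print("pure Python:", timeit.timeit(lambda: build(insort_left), number=3))
print("C built-in: ", timeit.timeit(lambda: build(bisect.insort_left), number=3))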

So what I'm after is implementing an insort in pure Python, without using modules, that is as fast as the C-based one. How can I do this?

user1831680
  • The question was "how do I make it as fast?". It doesn't show the C implementation. And I looked up "_bisect python" and couldn't find the implementation. – user1831680 Sep 03 '13 at 23:38
  • You can't make it *as fast*, because the Python code has to do more work (mainly, the Python VM/interpreter has to do a lot of work to run the Python code). The algorithm itself is trivial to port to C here (and the code should be located in the native library source code). Now, something like RPython can help, but *it simply takes more work to run Python* than it does to run the (already compiled) C. This constant factor will favor C (but will not change the complexity of the algorithm). – user2246674 Sep 03 '13 at 23:39

1 Answer


Python executing on a VM will never be as fast as native code in a case like this.

The reason is that the Python VM simply has to do more work to execute the Python code. The C code is already compiled and is executed directly. While the C extension code still needs to access the Python VM/runtime, it has the additional benefit of being able to perform certain operations directly through the exposed C API. There is the RPython project, which can eliminate some of the additional runtime overhead of executing Python code due to typing, but it has restrictions (no pun intended).

Instead of trying to make the Python code "faster" than C (which will not happen for the same algorithm), make the Python code "smarter" than C.

In this case, the algorithm used has a poor complexity of ~O(n^2), as it is effectively a pair of nested loops - one loop reading each line, and one inside each insort call (for the insert, at least). This complexity can, and likely should, be improved to ~O(n lg n), and that will make a difference in performance for n above some (relatively small) value.

Assume that `lines` is a list containing all the lines in the file, and consider the following approaches. Do not re-run these sort proposals inside a loop!

  1. lines.sort() - the built-in list.sort is a good hybrid sort and has much better bounds than the above use of repeated calls to insort. This is possibly cheating, because it still uses the "native C" implementation provided by Python. In any case, it should absolutely crush the insort implementation for more than a couple [dozen] lines.

  2. largeSort(lines) - where largeSort is a "pure Python" implementation of a merge sort or a comb sort (one possible merge sort is sketched after this list). For a large enough n (across unsorted data) this will be faster than the insort C code. Due to the additional constant performance overhead of executing Python code, tests will need to be conducted to find the value of n at which this starts to dominate.

  3. smallSort(lines) - where smallSort is a "pure Python" insertion sort implementation (also sketched below). This may be better for really small lists, and it can be performed online/streaming like the insort approach. Performance tests would need to be done, but it may be faster than a "pure Python" insort in this particular case.
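
For concreteness, here is one way the largeSort and smallSort names above might be filled in. These are illustrative sketches only - the names come from the list above, not from any library - and they are not tuned:

def largeSort(lines):
    # Pure-Python top-down merge sort: ~O(n lg n) comparisons.
    if len(lines) <= 1:
        return lines
    mid = len(lines) // 2
    left = largeSort(lines[:mid])
    right = largeSort(lines[mid:])
    # Merge the two sorted halves; <= keeps the sort stable.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def smallSort(lines):
    # Pure-Python insertion sort: ~O(n^2) worst case, but low overhead
    # for tiny inputs, and usable online like the insort approach.
    result = list(lines)
    for k in range(1, len(result)):
        item = result[k]
        pos = k
        while pos > 0 and result[pos - 1] > item:
            result[pos] = result[pos - 1]
            pos -= 1
        result[pos] = item
    return result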

In any case, the precise constant-factor overheads and the datasets being sorted are relevant. I would suggest looking at (i.e. graphing) the performance of several approaches over anticipated data, including worst-case scenarios. Even something such as pre-splitting the data may (relatively) benefit one approach over another, or yield misleading/non-ideal gains. Another possible consideration is that timings ("benchmarks") of the entire Python program may be dominated by overhead not even associated with the sorting mechanism employed.
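
As a starting point, a minimal timing sweep along those lines might look like this (the data generator and sizes are arbitrary placeholders, and largeSort refers to the sketch above; real measurements should also cover already-sorted and worst-case inputs):

import random
import string
import timeit

def random_words(n):
    # Arbitrary synthetic data; swap in real word lists when measuring.
    return [''.join(random.choice(string.ascii_lowercase) for _ in range(8))
            for _ in range(n)]

for n in (100, 1000, 10000):
    data = random_words(n)
    builtin = timeit.timeit(lambda: sorted(data), number=10)
    pure = timeit.timeit(lambda: largeSort(data), number=10)
    print(n, builtin, pure)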

user2246674
  • @user1831680 I don't understand what the splitting of the lists is supposed to achieve, or how it will affect the insort bounds (is it a hybrid one-level mergesort?). – user2246674 Sep 04 '13 at 00:06
  • As I have repeatedly explained: not possible. And no, the C insort *won't be faster* after some `n`. Pre-splitting the files is not needed. Also, make sure that you are *using the merge sort correctly*: mergesort or combsort should *not* be used in a loop the way the insort is being used. Graph the bounds of the different approaches and, as the size of `n` increases, the graphs should reveal trends. I would expect a mergesort in "pure Python" to beat the proposed insort solution after a few hundred n - tops. – user2246674 Sep 04 '13 at 00:15
  • I understand O-notation. Listening is a charm. Thanks anyway, – user1831680 Sep 04 '13 at 00:23
  • I am slightly curious about time graphs of the current performance, to know for which values of n the loop+insort exhibits better timings (I have no doubt it will hold for smallish n). I'm especially curious because the graphs should look quite different. Also, the data can make a huge difference, as the bounds mentioned are Big-O - is the data random or already sorted, etc. – user2246674 Sep 04 '13 at 00:29
  • The data can be sorted or unsorted. I was looking to attack both at the same time. I have no graphs; I just compare the times I get (experimentation) with those others have gotten. – user1831680 Sep 04 '13 at 00:39