
I'm trying to do a pretty simple task in Python that I have already done in Julia: take a list of 3-element (3D) values and build a dictionary of the indices of each unique value in that list (note the list is 6,000,000 elements long). I have done this in Julia and it is reasonably fast (6 seconds). Here is the code:

function unique_ids(itr)
  # create dictionary whose keys have the type of whatever itr contains
  d = Dict{eltype(itr), Vector}()
  # iterate through the values in itr
  for (index, val) in enumerate(itr)
    # check if the dictionary already has this value as a key
    if haskey(d, val)
      push!(d[val], index)
    else
      # add the value of itr if it's not in d yet
      d[val] = [index]
    end
  end
  return collect(values(d))
end

So far so good. However, when I try doing this in Python, it takes so long that I can't even tell you how long. So the question is: am I doing something dumb here, or is this just the reality of the differences between these two languages? Here is my Python code, a direct translation of the Julia code.

from tqdm import tqdm

def unique_ids(row_list):
    d = {}
    for (index, val) in tqdm(enumerate(row_list)):
        if str(val) in d:
            d[str(val)].extend([index])
        else:
            d[str(val)] = [index]
    return list(d.values())

Note that I use strings for the dict keys in Python, as I understand it is not possible to use an array as a dictionary key.
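
(As the comments below point out, the actual requirement for a dict key is hashability, not being a scalar: a tuple works fine as a key, while a list does not. A minimal sketch:)

d = {}
d[(1, 2, 3)] = [0]      # tuples are hashable, so this works as a key
d[(1, 2, 3)].append(5)

try:
    d[[1, 2, 3]] = [0]  # lists are unhashable, so this raises
except TypeError as err:
    print(err)          # unhashable type: 'list'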

Ma boi
  • It is possible to have a tuple as a key; that will cut down the time greatly. – Ahmed AEK Oct 15 '22 at 16:05
  • If the keys in your Julia dict are `String`s, quite likely you can still speed up the Julia code by using `Symbol`s or `ShortStrings.jl` instead of plain `String`s (depends on the particular use case, but the speedup could be significant) – Przemyslaw Szufel Oct 15 '22 at 16:07
  • Can you provide a minimal reproducible example that people can copy and paste, or an example of what `row_list` is like? People will explain what you are doing wrong in Python. – Ahmed AEK Oct 15 '22 at 16:08
  • It's not uncommon for Julia to have a 100x-1000x speedup compared to pure Python for loops – jling Oct 15 '22 at 16:14
  • My suspicion is that the `tqdm` call causes the whole list to be read into memory at once, where you'd like to keep a generator. Does it help if you take it out? It's definitely not necessary, and your Julia code does not seem to be doing anything like this. Perhaps see also https://stackoverflow.com/questions/49320007/how-to-use-tqdm-to-iterate-over-a-list – tripleee Oct 15 '22 at 16:17
  • I would expect it to be faster, since Julia does JIT compilation. That said, if it takes so long that it does not complete, perhaps it has a bug - have you executed it in a debugger to ensure it is progressing? You are probably not stupid. I am not familiar with either language, but are they even equivalent implementations? – Clifford Oct 15 '22 at 16:19
  • See https://julialang.org/benchmarks/. Julia is certainly _expected_ to be far faster than idiomatic Python. – Charles Duffy Oct 15 '22 at 16:25
  • Your Julia dictionary is not concretely typed; that may cause some loss of performance. `Vector` by itself is not fully specified. You presumably mean `Dict{eltype(itr), Vector{Int}}()` – DNF Oct 15 '22 at 16:38
  • I checked, and fixing the type signature of the dictionary gave me a 2.5x speedup. – DNF Oct 15 '22 at 16:54
  • Minor note: `d[str(val)].extend([index])` is silly. Just use `d[str(val)].append(index)`. Also, if `val` is already a `str`, stop calling `str` on it; if it's not a `str`, call it once upfront and don't convert it multiple times. And `collections.defaultdict(list)` *exists* for avoiding the needlessly complex code you've got checking for the existence of a key each time. – ShadowRanger Oct 15 '22 at 21:22
  • Note that the Julia code should use `pairs(itr)` rather than `enumerate(itr)` in the loop header, to be properly generic. Using `pairs` is just as performant too (it actually improves performance very slightly, by 1-2%), so it's a good practice to get into. – Sundar R Oct 16 '22 at 03:33
  • I made a simple [multi-threaded version of the Julia code](https://gist.github.com/digital-carver/1579a6321b4305a02f32627b52363b85), feedback welcome. On my machine, `julia --threads=auto` starts with 4 threads available, and with that, this code runs about 2.5x faster than the serial version. – Sundar R Oct 16 '22 at 04:43
  • (That's 2.5x compared to the code with the concretely typed `Dict{eltype(itr), Vector{Int}}()`. Compared to the code in the question, it's 3.5x faster for me.) – Sundar R Oct 16 '22 at 05:13
  • Thanks a lot for all this great information! I'm learning a lot about small coding details and I love it. It seems that the second comment has a properly coded function that works well, so I will use that! Thank you all – Ma boi Oct 16 '22 at 19:42
  • @Maboi Make sure to also use `Vector{Int}` in the dictionary signature; that's a fundamental concern in Julia. – DNF Oct 17 '22 at 07:18

2 Answers


I think the bottom line is that this type of function can definitely run in Python in less than 6 seconds.

I think the main issues, as many people have pointed out, are the tqdm wrapper and the str() conversion for the dictionary keys. If you take these out it gets a lot faster. Interestingly, swapping to collections.defaultdict really helps as well. If I get a moment I will write the equivalent function in C++ using the Python C API and append those results.

I have included the test code below. For 1 test iteration with 6,000,000 elements I get 4.9 secs at best in Python; this is with an i9-9980XE processor, and I don't know what your test is on. Note that the key type is critical: if I swap the tuple for an int, the time drops to 1.48 secs, so how the input data is presented makes a huge difference.

| Method | Time (s) | Relative |
|--------|----------|----------|
| Original | 16.01739319995977 | 10.804717667865178 |
| Removing tqdm | 12.82462279999163 | 8.650997501336496 |
| Removing to string and just using tuple | 5.3935559000237845 | 3.6382854561971936 |
| Removing double dictionary lookup | 4.682285099988803 | 3.1584895191283664 |
| Using collections defaultdict | 4.493273599946406 | 3.0309896277014063 |
| Using defaultdict with int key | 1.4824443999677896 | 1.0 |

Looking at a smaller dataset (1,000,000 elements) but running more iterations (100), the gap narrows:

| Method | Time (s) | Relative |
|--------|----------|----------|
| Original | 253.63316379999742 | 4.078280268213264 |
| Removing tqdm | 195.89607029996114 | 3.1498999032904 |
| Removing to string and just using tuple | 69.18050129996845 | 1.1123840004584065 |
| Removing double dictionary lookup | 68.65376710001146 | 1.1039144073575153 |
| Using collections defaultdict | 62.19120489998022 | 1.0 |

The Julia benchmarks do look very interesting. I haven't had a chance to look at them in detail, but I do wonder how much the Python benchmarks leverage libraries like numpy and scipy.
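
On that note, a vectorized numpy sketch of the same grouping (my own addition, not part of the benchmark above; it assumes the 3-tuples are packed into an (N, 3) integer array):

import numpy as np

def unique_ids_np(arr):
    # arr: (N, 3) integer array; returns one index array per unique row
    uniq, inverse = np.unique(arr, axis=0, return_inverse=True)
    inverse = inverse.ravel()                          # 1-D group id per input row
    order = np.argsort(inverse, kind="stable")         # row indices grouped by id
    boundaries = np.cumsum(np.bincount(inverse))[:-1]  # offsets between groups
    return np.split(order, boundaries)

groups = unique_ids_np(np.asarray(data))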

With this test code:

from tqdm import tqdm
import timeit, random
from collections import defaultdict

random.seed(1)

rand_max = 100
data_size = 1000000
iterations = 100

data = [tuple(random.randint(0, rand_max) for i in range(3)) for j in range(data_size)]
data2 = [t[0] for t in data]

def method0(row_list):
    d = {}
    for (index,val) in tqdm(enumerate(row_list)):
        if str(val) in d:
            d[str(val)].extend([index])
        else:
            d[str(val)] = [index]
    return list(d.values())

def method1(row_list):
    d = {}
    for index,val in enumerate(row_list):
        if str(val) in d:
            d[str(val)].extend([index])
        else:
            d[str(val)] = [index]
    return list(d.values())

def method2(row_list):
    d = {}
    for index, val in enumerate(row_list):
        if val in d:
            d[val].extend([index])
        else:
            d[val] = [index]
    return list(d.values())

def method3(row_list):
    d = {}
    for index, val in enumerate(row_list):
        if (l := d.get(val)):
            l.append(index)
        else:
            d[val] = [index]
    return d.values()

def method4(row_list):
    d = defaultdict(list)
    for (index,val) in enumerate(row_list):
        d[val].append(index)
    return list(d.values())

assert (m0 := method0(data)) == method1(data)
assert m0 == method2(data)
assert (m0 := sorted(m0)) == sorted(method3(data))
assert m0 == sorted(method4(data))

t0 = timeit.timeit(lambda: method0(data), number=iterations)
t1 = timeit.timeit(lambda: method1(data), number=iterations)
t2 = timeit.timeit(lambda: method2(data), number=iterations)
t3 = timeit.timeit(lambda: method3(data), number=iterations)
t4 = timeit.timeit(lambda: method4(data), number=iterations)

tmin = min((t0, t1, t2, t3, t4))

print(f'| Method                                  | Time | Relative      |')
print(f'|-----------------------------------------|------|---------------|')
print(f'| Original                                | {t0} | {t0 / tmin}   |')
print(f'| Removing tqdm                           | {t1} | {t1 / tmin}   |')
print(f'| Removing to string and just using tuple | {t2} | {t2 / tmin}   |')
print(f'| Removing double dictionary lookup       | {t3} | {t3 / tmin}   |')
print(f'| Using collections defaultdict           | {t4} | {t4 / tmin}   |')
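
The first table's last row (defaultdict with an int key) presumably comes from timing method4 on data2, which is defined above but never used in the snippet; a minimal sketch of that missing timing:

# hypothetical reconstruction: data2 holds only the first element of each
# tuple, so the dictionary keys are plain ints
t5 = timeit.timeit(lambda: method4(data2), number=iterations)
tmin = min(tmin, t5)  # in the tabulated run, the relative column was scaled to this minimum
print(f'| Using defaultdict with int key          | {t5} | {t5 / tmin}   |')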
John M.

Python is slower; it's just not as slow as the author thinks. To show this with optimized code:

| Language | Time (seconds) |
|----------|----------------|
| CPython 3.11 | 4.6 |
| CPython + Numba (JIT) | 3.3 |
| Julia 1.8.5 | 1.9 |

Python is only 2.5 times slower at this, and only 1.5 times slower when Numba is in the mix. So if you need the last drop of performance, use Julia; but Python wins in other aspects, like being compiled to a lightweight executable or having more libraries for general-purpose non-numerical programming.

Optimized Python-only code:

import random
from collections import defaultdict

input_list_len = 6_000_000
rand_max = 100
# build the input: 6,000,000 random 3-tuples
row_list = []
for _ in range(input_list_len):
    row_list.append(tuple(random.randint(0, rand_max) for x in range(3)))

def unique_ids(row_list):
    # defaultdict(list) removes the explicit key-existence check
    d = defaultdict(list)
    for (index, val) in enumerate(row_list):
        d[val].append(index)

    return list(d.values())

import time
t1 = time.time()
output = unique_ids(row_list)
t2 = time.time()
print(f"total time = {t2-t1}")

Optimized Numba (LLVM JIT) code:

import random
import numba
from numba.typed import Dict, List

input_list_len = 6_000_000
rand_max = 100

@numba.njit
def generate_list():
    # build the input as a numba typed list so unique_ids can consume it directly
    row_list = List()
    for _ in range(input_list_len):
        a = random.randint(0,rand_max)
        b = random.randint(0,rand_max)
        c = random.randint(0,rand_max)
        row_list.append((a,b,c))
    return row_list

row_list = generate_list()

@numba.njit("ListType(ListType(int64))(ListType(UniTuple(int64,3)))")
def unique_ids(row_list):
    d = Dict()
    for (index,val) in enumerate(row_list):
        if val in d:
            d[val].append(index)
        else:
            a = List()
            a.append(index)
            d[val] = a

    return List(d.values())

import time
t1 = time.time()
output = unique_ids(row_list)
t2 = time.time()
print(f"total time = {t2-t1}")

Optimized Julia code:

using Random
Random.seed!(3);

input_list_len = 6_000_000
rand_max = 100
data::Vector{Tuple{Int64,Int64,Int64}} = Vector{Tuple{Int64,Int64,Int64}}()
for _ in 1:input_list_len
    push!(data, Tuple{Int64,Int64,Int64}(rand(0:rand_max, 3)))
end

function unique_ids(itr)
    # dictionary with concretely typed values, per the suggestion in the comments
    d = Dict{eltype(itr), Vector{Int}}()
    # iterate through the values in itr
    for (index, val) in enumerate(itr)
        # check if the dictionary already has this value as a key
        if haskey(d, val)
            push!(d[val], index)
        else
            # add the value of itr if it's not in d yet
            d[val] = [index]
        end
    end
    return collect(values(d))
end

# time twice so the second call excludes JIT compilation
@time unique_ids(data);
@time unique_ids(data);
Ahmed AEK
  • Or use pypy, which is also JIT – qwr Oct 15 '22 at 16:28
  • Languages are not interpreted or compiled; *implementations* are. CPython (what everyone thinks of as "Python") compiles Python source to Python byte code, which is then interpreted by a virtual machine. There are other Python implementations, though: PyPy is also a JIT compiler for Python. – chepner Oct 15 '22 at 16:29
  • You guys are correct; I forgot that part for a second as I was trying to mimic the author's question. – Ahmed AEK Oct 15 '22 at 16:31
  • "write this operation in C/C++ and add it to CPython very easily to be faster than julia or any other compiled language"... pardon? C/C++ _are_ compiled languages, and Julia is sometimes faster than them (there are operations where it gets into Fortran territory; and yes, Fortran is still faster than C in some cases). See https://julialang.org/benchmarks/ -- the blue line is C, and you'll see there are several languages that have benchmarks where they come out faster than an idiomatic C implementation. So I'm not sure what your statement means, and _even more_ unsure about its correctness. – Charles Duffy Oct 15 '22 at 20:45
  • ...remember, calling between C and Python means you've got ABI overhead at the interface point, plus whatever parts _are_ native Python are still running at the Python interpreter's pace (and whatever parts of the C module are calling into CPython to inspect Python-native structures are likewise bound by the Python runtime implementation's performance decisions and tradeoffs). It'll never be faster than 100% C, unless you're using the word to mean "faster to develop", which is an entirely different comparison. – Charles Duffy Oct 15 '22 at 20:47
  • To be honest, it's the best of both worlds: applications are usually faster to develop in Python, and tight loops are rewritten in C to get near-C performance. If someone is developing a large-scale application on a tight time budget, then Python is usually the tool for the job. I'll modify the answer slightly to reflect that. – Ahmed AEK Oct 15 '22 at 20:51
  • I agree that Python often makes excellent tradeoffs, but edit2 doesn't describe things in terms of tradeoffs -- it's making absolute assertions. I have a lot of history with Python and C under my belt, and while I like Julia immensely in theory -- it's a language that's about as readable and easy to write as Python, while compiling to code that's comparable if not faster than what a good C compiler puts out -- Julia also has a history of making changes that break API compatibility across releases more frequently than folks might expect with a language that's been stable longer. – Charles Duffy Oct 15 '22 at 20:53
  • ...which is to say, I would often, personally, choose to use Python+C instead of Julia for compelling and practical reasons (library availability being another of those reasons), but I would never claim that the combination has better performance; Julia's combination of ease-of-use and performance is _exceptional_. – Charles Duffy Oct 15 '22 at 20:54
  • (Folks reviewing the above comments should consider them in the context of the "edit2" section [as it existed as of revision 8 of this answer](https://stackoverflow.com/revisions/74081117/8)) – Charles Duffy Oct 15 '22 at 20:59
  • What history of breaking API compatibility is that, @CharlesDuffy? Are you saying that they are not keeping the guarantees of semantic versioning, which they claim to adhere to? – DNF Oct 15 '22 at 21:56
  • @DNF, to be honest, it's quite possible that I'm still judging them by pre-1.0 churn. I would need to dig into an ex-employer's codebase I no longer have access to to find the examples I have in mind (or nail down a specific timeline). – Charles Duffy Oct 15 '22 at 22:03
  • I would think that is the case. 1.0 was released in 2018. I am not aware of any breaking changes since, and that is supposed to be a guarantee. – DNF Oct 15 '22 at 22:08
  • Pre-2018 is _extremely_ believable. – Charles Duffy Oct 15 '22 at 22:16
  • I ran your code as well as Julia's, and Julia's code is still 2x faster. If I time the whole code including data creation in both languages, Julia is even 7x faster. – AboAmmar Oct 15 '22 at 22:42