
I have a list of lists (which can contain up to 90k elements)

[[1,2,3], [1,2,4], [1,2,3], [1,2,4], [1,2,5]]

I would like to assign an id to each element, where each distinct element gets its own unique id and duplicates share the same id. So for the list above, I'd need this back:

[0,1,0,1,2]

What is the most efficient way to do this?

jamborta
  • Do the ids have to be sequential? you could easily abuse the `index` method of lists if not: `def get_ids(li): return [li.index(i) for i in li];` which returns `[0, 1, 0, 1, 4]` for `[[1,2,3], [1,2,4], [1,2,3], [1,2,4], [1,2,5]]` – DeepSpace Jul 10 '16 at 11:31
  • 1
    @DeepSpace That takes O(N^2) time. It could be improved by computing a sorted copy of the list and use `bisect` to efficiently associate an index with it, making the time O(N log N) which is the lowerbound for solving this problem using comparisons. – Bakuriu Jul 10 '16 at 11:38
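
A minimal sketch of the sort-and-bisect idea from that comment (the function name is mine; as with the `index` approach, the ids come out unique but not consecutive, e.g. `[0, 2, 0, 2, 4]` for the example list):

from bisect import bisect_left

def get_ids_bisect(li):
    # Sort a copy once: O(N log N)
    sorted_li = sorted(li)
    # Binary-search each element's first position in the sorted order: O(log N) per lookup
    return [bisect_left(sorted_li, el) for el in li]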

3 Answers


Keep a map of already seen elements with the associated id.

from itertools import count
from collections import defaultdict


# The example list from the question
my_list = [[1, 2, 3], [1, 2, 4], [1, 2, 3], [1, 2, 4], [1, 2, 5]]

# Unseen keys get the next integer from count(); seen keys reuse their id
mapping = defaultdict(count().__next__)
result = []
for element in my_list:
    result.append(mapping[tuple(element)])

You could also use a list comprehension:

result = [mapping[tuple(element)] for element in my_list]

Unfortunately, lists aren't hashable, so you have to convert them to tuples when storing them as keys of the mapping.
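
For example, indexing the mapping with a list fails, while a tuple works:

mapping[[1, 2, 3]]         # TypeError: unhashable type: 'list'
mapping[tuple([1, 2, 3])]  # fine: tuples are hashable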


Note the trick of using defaultdict and count().__next__ to provide unique increasing ids. On Python 2 you have to replace .__next__ with .next.

The defaultdict will assign a default value when it cannot find a key. The default value is obtained by calling the function provided in the constructor; in this case the __next__ method of the count() iterator yields increasing numbers.
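
A quick illustration of that behaviour (the string keys here are arbitrary):

mapping = defaultdict(count().__next__)
mapping['a']  # 0 -- unseen key, gets the next id from count()
mapping['b']  # 1 -- another unseen key
mapping['a']  # 0 -- already stored, so the existing id is returned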

As a more portable alternative you could do:

from functools import partial

mapping = defaultdict(partial(next, count()))

An alternative solution, as proposed in the comments, is to just use the index as unique id:

result = [my_list.index(el) for el in my_list]

This is simpler; however:

  • It takes O(N^2) time instead of O(N)
  • The ids are unique and increasing, but not consecutive (which may or may not be a problem; see the sketch below for a renumbering pass)
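
If consecutive ids matter, a hypothetical renumbering pass over the index-based result fixes that in O(N) extra time:

raw = [my_list.index(el) for el in my_list]  # e.g. [0, 1, 0, 1, 4]

# Map each distinct raw id to 0, 1, 2, ... in order of first appearance
renumber = {}
result = [renumber.setdefault(r, len(renumber)) for r in raw]  # [0, 1, 0, 1, 2]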

For comparison of the two solutions see:

In [1]: from itertools import count
   ...: from collections import defaultdict

In [2]: def hashing(seq):
   ...:         mapping = defaultdict(count().__next__)
   ...:         return [mapping[tuple(el)] for el in seq]
   ...: 

In [3]: def indexing(seq):
   ...:    return [seq.index(i) for i in seq]
   ...: 

In [4]: from random import randint

In [5]: seq = [[randint(1, 20), randint(1, 20), randint(1, 20)] for _ in range(90000)]

In [6]: %timeit hashing(seq)
10 loops, best of 3: 37.7 ms per loop

In [7]: %timeit indexing(seq)
1 loop, best of 3: 26 s per loop

Note how, for a 90k element list, the mapping solution takes less than 40 milliseconds while the indexing solution takes 26 seconds.

Bakuriu
    As an alternative functional based approach for first solution `operator.itemgetter(*map(tuple, my_list))(mapping)` – Mazdak Jul 10 '16 at 11:51
  • To make `defaultdict` 2.6+ compatible, you can use `defaultdict(lambda c=count(): next(c))` instead of having to rely on the actual method name or using `functools.partial`... – Jon Clements Jul 10 '16 at 13:43
  • @JonClements Do you mean compatible with python 2.5? Because both `partial` and the `next` built-in functions are available in python2.6 so that's already python2.6 compatible. – Bakuriu Jul 10 '16 at 14:13
  • @Kasramvd Interestingly your functional approach has a much smaller memory footprint, with the same performance. – jamborta Jul 12 '16 at 21:15
  • @jamborta I think you've tested it in Python 3.x; if so, it's because `map()` returns an iterator and consumes much less memory than a list. The unpacking just passes the items to `itemgetter()`. – Mazdak Jul 12 '16 at 21:28
  • @jamborta How are you computing the memory footprint of the two solutions? AFAIK they should be about the same. Contrary to what Kasramvd says, unpacking *will* create a tuple for the arguments, so the fact that `map` returns an iterator doesn't reduce the memory used in any way. Also `itemgetter` returns a tuple, not a lazy iterator, so it won't save memory there either. The difference in memory may be because `itemgetter` is able to create a tuple of the precise size it needs, while `list`s have to keep some empty slots for the amortized operation costs. – Bakuriu Jul 13 '16 at 07:56
  • @jamborta I've just tried to see how memory is used with a 5-million element list. As I expected, *the final result* of using `itemgetter` instead of a list comprehension uses *slightly* less memory (about 10 *kilobytes* less). **However**, during the computation the functional approach needs almost twice the memory (memory goes from 550MB to 810MB then back to ~580MB; using the `hashing` function it grows steadily from 550MB to 590MB). That's due to the tuple created to perform the unpacking. – Bakuriu Jul 13 '16 at 09:24
  • @Bakuriu: I have an array: `x = np.random.rand(50000, 2000)`; with the approach above, memory rises 2660MB above the current level during computation, whereas with the functional approach it only rises about 107MB. This is in Python 2.7 by the way. (I used this tool to benchmark memory: https://github.com/ianozsvald/ipython_memory_usage) – jamborta Jul 13 '16 at 12:00
  • @jamborta Do you mean you did: `x=np.random.rand(50000, 2000); mapping = defaultdict(count().__next__);result = itemgetter(*map(tuple, x))(mapping)`? Doing this eats a lot of RAM for me. In fact this should more than double the memory used because you are converting numpy arrays which are very memory efficient to python tuples which aren't as efficient in terms of space... – Bakuriu Jul 13 '16 at 13:24
  • @Bakuriu: It's interesting, the non-functional approach has the same memory footprint in python2 and 3 (about 2GB extra memory), however, the functional one uses about 3GB in python 3, but only 100MB in python 2. – jamborta Jul 13 '16 at 19:44

This is how I approached it:

from itertools import product
from random import randint
import time

t0 = time.time()

def id_list(lst):
    # Collect the distinct sublists, then order them by first appearance
    unique_set = set(tuple(x) for x in lst)
    unique = [list(x) for x in unique_set]
    unique.sort(key=lambda x: lst.index(x))

    # For each (element, unique) pair that matches, record the unique's position
    result = [unique.index(i[1]) for i in product(lst, unique) if i[0] == i[1]]

    return result

seq = [[randint(1, 5), randint(1, 5), randint(1, 5)] for i in range(90000)]

print(id_list(seq))

t1 = time.time()

print("Time: %.4f seconds" % (t1-t0))

This prints out the sequence of ids, along with the approximate time it took to compute them for a list of 90000 sublists of random integers between 1 and 5.

Time: 2.3397 seconds  # Will slightly differ from computation to computation

The measured time also includes the final print statement, so the computation itself is slightly faster than reported, but it should not be too much of a difference.

I also used the time module to measure the interval between the start and the end of the code block:

import time

t0 = time.time()

# code block here

t1 = time.time()

# Difference in time: t1 - t0 

The product function from the itertools library generates the (element, unique) pairs lazily in C, which avoids an explicit nested Python loop, though the overall approach is still quadratic in the worst case.
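
For reference, product simply yields every pairing of its two inputs, and the i[0] == i[1] filter keeps only the matching pairs:

from itertools import product

list(product([1, 2], ['a', 'b']))
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]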

RoadRunner

A slight modification of Bakuriu's solution that works only with numpy arrays. It does better in terms of memory footprint and computation time, as it does not need to cast the arrays to tuples:

from itertools import count
from collections import defaultdict
from functools import partial
import numpy as np

def hashing_v1(seq):
    mapping = defaultdict(partial(next, count()))
    return [mapping[tuple(el)] for el in seq]

def hashing_v2(seq):
    mapping = defaultdict(partial(next, count()))
    result = []
    for le in seq:
        le.flags.writeable = False
        result.append(mapping[le.data])
    return result

In [4]: seq = np.random.rand(50000, 2000)

In [5]: %timeit hashing_v1(seq)
1 loop, best of 3: 14.1 s per loop

In [6]: %timeit hashing_v2(seq)
1 loop, best of 3: 1.2 s per loop
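
One caveat: in Python 3, hashing a memoryview is restricted to read-only byte-like formats, so using le.data as a dictionary key may fail for float arrays there. A hypothetical Python 3 variant (hashing_v3 is my name) could key on each row's raw bytes instead:

def hashing_v3(seq):
    mapping = defaultdict(partial(next, count()))
    # tobytes() copies each row's buffer into an immutable, hashable bytes object
    return [mapping[el.tobytes()] for el in seq]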
jamborta