Assign a number to each unique value in a list

Question

I have a list of strings. I want to assign a unique number to each string (the exact number is not important), and create a list of the same length using these numbers, in order. Below is my best attempt at it, but I am not happy for two reasons:

It assumes that the same values are next to each other
I had to start the list with a 0, otherwise the output would be incorrect

My code:

names = ['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL']
numbers = [0]
num = 0
for item in range(len(names)):
    if item == len(names) - 1:
      break
    elif names[item] == names[item+1]:
        numbers.append(num)
    else:
        num = num + 1
        numbers.append(num)
print(numbers)

I want to make the code more generic, so it will work with an unknown list. Any ideas?

How about sorting the list before applying the algorithm? – Piotr Kamoda, Feb 20 '17 at 16:51

Cleb · Accepted Answer · 2018-04-20T06:39:31.947

Without using an external library (see the EDIT for a Pandas solution), you can do it as follows:

d = {ni: indi for indi, ni in enumerate(set(names))}
numbers = [d[ni] for ni in names]

Brief explanation:

In the first line, you assign a number to each unique element in your list (stored in the dictionary d; you can easily create it using a dictionary comprehension; set returns the unique elements of names).

Then, in the second line, you do a list comprehension and store the actual numbers in the list numbers.

One example to illustrate that it also works fine for unsorted lists:

# 'll' appears all over the place
names = ['ll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'll', 'LL', 'HL', 'HL', 'HL', 'll']

This is one possible output for numbers (set ordering is arbitrary, so the exact numbers may differ from run to run):

[1, 1, 3, 3, 3, 2, 2, 1, 2, 0, 0, 0, 1]

As you can see, the number 1 associated with ll appears at the correct places.
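
If the numbering should be the same on every run, one possible variation (not part of the snippet above) is to sort the unique values before enumerating them. A minimal sketch of that idea:

```python
names = ['ll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'll', 'LL', 'HL', 'HL', 'HL', 'll']

# Sorting the unique values fixes the numbering:
# 'HL' -> 0, 'LL' -> 1, 'hl' -> 2, 'll' -> 3 (uppercase sorts before lowercase)
d = {name: index for index, name in enumerate(sorted(set(names)))}
numbers = [d[name] for name in names]
print(numbers)
# [3, 3, 2, 2, 2, 1, 1, 3, 1, 0, 0, 0, 3]
```

The price is an extra sort over the unique values, which only matters if there are very many of them.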

EDIT

If you have Pandas available, you can also use pandas.factorize (which seems to be quite efficient for huge lists and also works fine for lists of tuples as explained here):

import pandas as pd

pd.factorize(names)

will then return

(array([0, 0, 1, 1, 1, 2, 2, 0, 2, 3, 3, 3, 0]),
 array(['ll', 'hl', 'LL', 'HL'], dtype=object))

Therefore,

numbers = pd.factorize(names)[0]

score 7 · Answer 2 · edited Jan 20 '18 at 01:53

If the condition is that the numbers are unique and the exact number is not important, then you can build a mapping relating each item in the list to a unique number on the fly, assigning values from a count object:

from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

d = {}
c = count()
numbers = [d.setdefault(i, next(c)) for i in names]
print(numbers)
# [0, 0, 2, 2, 4, 4, 4, 7, 0]

You could do away with the extra names by using map on the list and a count object, and setting the map function as {}.setdefault (see @StefanPochmann's comment):

from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']
numbers = list(map({}.setdefault, names, count()))  # list() is needed on Python 3, where map is lazy
print(numbers)
# [0, 0, 2, 2, 4, 4, 4, 7, 0]
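
The numbers in both variants are unique but not consecutive (0, 2, 4, 7). The counter advances once per list element rather than once per new value: in the list comprehension, next(c) is evaluated before setdefault is even called, and in the map version count() is consumed in step with names. If consecutive numbers are preferred, the len(d) variant suggested in the comments avoids the gaps. A small sketch comparing the two (the names with_gaps and consecutive are only illustrative):

```python
from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

# next(c) runs for every element, so repeated values still use up a number
d, c = {}, count()
with_gaps = [d.setdefault(name, next(c)) for name in names]

# len(d) only grows when a new key is stored, so the numbers stay consecutive
d = {}
consecutive = [d.setdefault(name, len(d)) for name in names]

print(with_gaps)    # [0, 0, 2, 2, 4, 4, 4, 7, 0]
print(consecutive)  # [0, 0, 1, 1, 2, 2, 2, 3, 0]
```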

As an extra, you could also use np.unique, in case you already have numpy installed:

import numpy as np

_, numbers = np.unique(names, return_inverse=True)
print(numbers)
# [3 3 2 2 1 1 1 0 3]

No need for the extra variables if you do `list(map({}.setdefault, names, count()))`. – Stefan Pochmann, Feb 20 '17 at 17:24
In the first solution, you can use `len(d)` instead of `next(c)`, a la: `numbers = [d.setdefault(i, len(d)) for i in names]` – RootTwo, Feb 25 '17 at 17:09

score 5 · Answer 3 · edited Feb 20 '17 at 19:43

If you have k different values, this maps them to integers 0 to k-1 in order of first appearance:

>>> names = ['b', 'c', 'd', 'c', 'b', 'a', 'b']
>>> tmp = {}
>>> [tmp.setdefault(name, len(tmp)) for name in names]
[0, 1, 2, 1, 0, 3, 0]
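
Since the numbers are handed out in order of first appearance, the dictionary's keys double as a lookup table in the other direction. A minimal sketch, assuming Python 3.7+ (where dicts keep insertion order); the names codes, uniques and decoded are just illustrative:

```python
names = ['b', 'c', 'd', 'c', 'b', 'a', 'b']

tmp = {}
codes = [tmp.setdefault(name, len(tmp)) for name in names]

# Keys were inserted in order of first appearance, so position == assigned number
uniques = list(tmp)
decoded = [uniques[code] for code in codes]

print(uniques)           # ['b', 'c', 'd', 'a']
print(decoded == names)  # True
```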

edited Feb 20 '17 at 19:43

Mazdak


answered Feb 20 '17 at 17:34

Stefan Pochmann


MSeifert · Answer 4 · 2017-02-20T17:10:53.420

To make it more generic you can wrap it in a function, so these hard-coded values don't do any harm, because they are local.

If you use an efficient lookup container (I'll use a plain dictionary), you can keep the first index of each string without losing too much performance:

def your_function(list_of_strings):

    encountered_strings = {}
    result = []

    idx = 0
    for astring in list_of_strings:
        if astring in encountered_strings:  # check whether this string was seen before
            result.append(encountered_strings[astring])
        else:
            encountered_strings[astring] = idx
            result.append(idx)
            idx += 1
    return result

And this will assign the indices in order of first appearance (even if that's not important):

>>> your_function(['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL'])
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

This needs only one iteration over your list of strings, which makes it possible to even process generators and similar.
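
As a rough illustration of that last point, the same loop can be written as a generator, so the numbers are produced lazily from any iterable. This is only a sketch, not a replacement for the function above, and iter_codes is a made-up name:

```python
def iter_codes(items):
    """Yield a number for each item, numbering values in order of first appearance."""
    seen = {}
    for item in items:
        if item not in seen:
            seen[item] = len(seen)  # next unused number
        yield seen[item]


# Works on a generator expression, which can only be traversed once
stream = (word.strip() for word in ['ll ', 'hl', ' ll', 'HL', 'hl'])
print(list(iter_codes(stream)))
# [0, 1, 0, 2, 1]
```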

score 2 · Answer 5 · answered Feb 20 '17 at 17:02

I managed to modify your script very slightly and it looks ok:

names = ['ll', 'hl', 'll', 'hl', 'LL', 'll', 'LL', 'HL', 'hl', 'HL', 'LL', 'HL', 'zzz']
names.sort()
print(names)
numbers = []
num = 0
for item in range(len(names)):
    if item == len(names) - 1:
      break
    elif names[item] == names[item+1]:
        numbers.append(num)
    else:
        numbers.append(num)
        num = num + 1
numbers.append(num)
print(numbers)

You can see it is very similar; the only difference is that instead of adding the number for the NEXT element, I add the number for the CURRENT element. That's all. Oh, and sorting. Note that the numbers then line up with the sorted list, not with the original order. In this example it sorts capitals first, then lowercase; you can play with sort(key=lambda x: ...) if you wish to change that. (Perhaps like this: names.sort(key=lambda x: (x.upper() if x.lower() == x else x.lower())))
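
To illustrate the sort order: by default Python compares strings by code point, so uppercase letters sort before lowercase ones. The key from the parenthetical flips that for all-lowercase and all-uppercase names. A short sketch with a reduced list (default_order and flipped are illustrative names):

```python
names = ['ll', 'hl', 'LL', 'HL', 'zzz']

# Default ordering: uppercase letters come before lowercase ones
default_order = sorted(names)
print(default_order)
# ['HL', 'LL', 'hl', 'll', 'zzz']

# All-lowercase names are compared as uppercase; everything else is compared as lowercase
flipped = sorted(names, key=lambda x: x.upper() if x.lower() == x else x.lower())
print(flipped)
# ['hl', 'll', 'zzz', 'HL', 'LL']
```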

score 0 · Answer 6 · answered Feb 20 '17 at 16:54

Since you are mapping strings to integers, that suggests using a dict. So you can do the following:

d = dict()

counter = 0

for name in names:
    if name in d:
        continue
    d[name] = counter
    counter += 1

numbers = [d[name] for name in names]

score 0 · Answer 7 · answered Sep 29 '17 at 22:00

Here is a similar factorizing solution with collections.defaultdict and itertools.count:

import itertools as it
import collections as ct


names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

dd = ct.defaultdict(it.count().__next__)
[dd[i] for i in names]
# [0, 0, 1, 1, 2, 2, 2, 3, 0]

Every new key calls next on the itertools.count iterator and adds a new entry to dd.
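
One thing to keep in mind with this approach: merely looking up an unseen key in a defaultdict creates an entry for it. Below is a sketch of how the mapping could be frozen once the list has been processed. Converting to a plain dict is an addition here, not part of the code above, and mapping is an illustrative name:

```python
import collections as ct
import itertools as it

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

# The factory is only called for missing keys, so the numbers are consecutive
dd = ct.defaultdict(it.count().__next__)
numbers = [dd[name] for name in names]
print(numbers)
# [0, 0, 1, 1, 2, 2, 2, 3, 0]

# A plain dict no longer numbers unknown names silently
mapping = dict(dd)
print(mapping)
# {'ll': 0, 'hl': 1, 'LL': 2, 'HL': 3}
print(mapping.get('xx'))
# None
```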

iacob · Answer 8 · 2021-03-27T10:53:44.650

Pandas' factorize can simply factorize unique strings:

import pandas as pd

codes, uniques = pd.factorize(names)
codes
# array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  (names from the question; codes follow order of first appearance)

This can also be done in Scikit-learn with LabelEncoder():

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
codes = le.fit_transform(names)
codes
# array([3, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0])

edited Mar 27 '21 at 10:53

answered Mar 21 '21 at 22:13

iacob


score -1 · Answer 9 · answered Feb 20 '17 at 16:55

You can also try this:

names = ['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL']

indexList = list(set(names))

print(list(map(lambda name: indexList.index(name), names)))  # list() and print() for Python 3

answered Feb 20 '17 at 16:55

Rakesh Kumar


@StefanPochmann, yes, you can also write this as map(indexList.index, names) if you don't want to write the lambda – Rakesh Kumar, Feb 21 '17 at 07:27
