19

I have a list of strings. I want to assign a unique number to each string (the exact number is not important), and create a list of the same length using these numbers, in order. Below is my best attempt at it, but I am not happy for two reasons:

  1. It assumes that the same values are next to each other

  2. I had to start the list with a 0, otherwise the output would be incorrect

My code:

names = ['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL']
numbers = [0]
num = 0
for item in range(len(names)):
    if item == len(names) - 1:
        break
    elif names[item] == names[item+1]:
        numbers.append(num)
    else:
        num = num + 1
        numbers.append(num)
print(numbers)

I want to make the code more generic, so it will work with an unknown list. Any ideas?

Cleb
millsy

9 Answers

25

Without using an external library (see the EDIT for a Pandas solution), you can do it as follows:

d = {ni: indi for indi, ni in enumerate(set(names))}
numbers = [d[ni] for ni in names]

Brief explanation:

In the first line, you assign a number to each unique element in your list (stored in the dictionary d; you can easily create it using a dictionary comprehension; set returns the unique elements of names).

Then, in the second line, you do a list comprehension and store the actual numbers in the list numbers.

One example to illustrate that it also works fine for unsorted lists:

# 'll' appears all over the place
names = ['ll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'll', 'LL', 'HL', 'HL', 'HL', 'll']

This is one possible output for numbers (the exact values depend on the iteration order of the set, which is arbitrary):

[1, 1, 3, 3, 3, 2, 2, 1, 2, 0, 0, 0, 1]

As you can see, the number 1 associated with ll appears at the correct places.
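
If you would rather have reproducible numbers than whatever order set happens to give, here is a minimal sketch that numbers the strings by first appearance. It assumes Python 3.7+, where dict preserves insertion order; dict.fromkeys is only used here as an order-preserving way to deduplicate.

```python
names = ['ll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'll', 'LL', 'HL', 'HL', 'HL', 'll']

# dict.fromkeys keeps the first occurrence of each name, in order (Python 3.7+)
d = {name: i for i, name in enumerate(dict.fromkeys(names))}
numbers = [d[name] for name in names]

print(numbers)  # [0, 0, 1, 1, 1, 2, 2, 0, 2, 3, 3, 3, 0]
```

The result no longer changes between runs, at the cost of relying on dict ordering.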

EDIT

If you have Pandas available, you can also use pandas.factorize (which seems to be quite efficient for huge lists and also works fine for lists of tuples as explained here):

import pandas as pd

pd.factorize(names)

will then return

(array([0, 0, 1, 1, 1, 2, 2, 0, 2, 3, 3, 3, 0]),
 array(['ll', 'hl', 'LL', 'HL'], dtype=object))

Therefore,

numbers = pd.factorize(names)[0]
Cleb
7

If the condition is that the numbers are unique and the exact number is not important, then you can build a mapping relating each item in the list to a unique number on the fly, assigning values from a count object:

from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

d = {}
c = count()
numbers = [d.setdefault(i, next(c)) for i in names]
print(numbers)
# [0, 0, 2, 2, 4, 4, 4, 7, 0]
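
The gaps in that output (0, 2, 4, 7) are worth a note: the argument next(c) is evaluated for every element before setdefault decides whether to use it, so the counter advances even for names that are already in d. A small sketch to illustrate, reusing the same list:

```python
from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

d = {}
c = count()
# next(c) runs once per element, whether or not the name is new
numbers = [d.setdefault(name, next(c)) for name in names]
consumed = next(c)  # the counter has already handed out 0..8

print(numbers)   # [0, 0, 2, 2, 4, 4, 4, 7, 0]
print(consumed)  # 9
```

The numbers are still unique per string, which is all the question asks for; they are just not consecutive.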

You could do away with the extra names by using map on the list and a count object, and setting the map function as {}.setdefault (see @StefanPochmann's comment):

from itertools import count

names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']
numbers = list(map({}.setdefault, names, count()))  # list() is needed on Py3, where map is lazy
print(numbers)
# [0, 0, 2, 2, 4, 4, 4, 7, 0]

As an extra, you could also use np.unique, in case you already have numpy installed:

import numpy as np

_, numbers = np.unique(names, return_inverse=True)
print(numbers)
# [3 3 2 2 1 1 1 0 3]
pylang
Moses Koledoye
5

If you have k different values, this maps them to integers 0 to k-1 in order of first appearance:

>>> names = ['b', 'c', 'd', 'c', 'b', 'a', 'b']
>>> tmp = {}
>>> [tmp.setdefault(name, len(tmp)) for name in names]
[0, 1, 2, 1, 0, 3, 0]
Mazdak
Stefan Pochmann
3

To make it more generic you can wrap it in a function, so these hard-coded values don't do any harm, because they are local.

If you use efficient lookup containers (I'll use a plain dictionary) you can keep the first index of each string without losing too much performance:

def your_function(list_of_strings):

    encountered_strings = {}
    result = []

    idx = 0
    for astring in list_of_strings:
        if astring in encountered_strings:  # check if you have already seen this string
            result.append(encountered_strings[astring])
        else:
            encountered_strings[astring] = idx
            result.append(idx)
            idx += 1
    return result

And this will assign the indices in order (even if that's not important):

>>> your_function(['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL'])
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

This needs only one iteration over your list of strings, which makes it possible to even process generators and similar.
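
To illustrate the point about generators, here is a rough sketch of a lazy variant of the same single-pass idea. The names iter_codes and read_names are made up for this example, and read_names just stands in for any one-shot source such as a file or stream.

```python
def iter_codes(strings):
    """Yield one integer per string, numbered by first appearance."""
    seen = {}
    for s in strings:
        # len(seen) is only stored when s is new, so the codes have no gaps
        yield seen.setdefault(s, len(seen))


def read_names():
    # Hypothetical one-shot source; it can only be iterated once
    yield from ['ll', 'hl', 'll', 'HL', 'hl']


codes = list(iter_codes(read_names()))
print(codes)  # [0, 1, 0, 2, 1]
```

Since the input is consumed exactly once, it never needs to be held in memory as a list.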

MSeifert
2

I managed to modify your script very slightly and it looks ok:

names = ['ll', 'hl', 'll', 'hl', 'LL', 'll', 'LL', 'HL', 'hl', 'HL', 'LL', 'HL', 'zzz']
names.sort()
print(names)
numbers = []
num = 0
for item in range(len(names)):
    if item == len(names) - 1:
        break
    elif names[item] == names[item+1]:
        numbers.append(num)
    else:
        numbers.append(num)
        num = num + 1
numbers.append(num)
print(numbers)

You can see it is very similar; the only difference is that instead of adding the number for the NEXT element, I add the number for the CURRENT element. That's all. Oh, and sorting: names.sort() sorts the list in place, so the numbers line up with the sorted list, not with the original order. It sorts uppercase first, then lowercase in this example; you can play with sort(key=lambda x: ...) if you wish to change that. (Perhaps like this: names.sort(key=lambda x: (x.upper() if x.lower() == x else x.lower())) )
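
Because names.sort() reorders the list, the numbers above no longer match the original positions. If you want the sorted numbering but need to keep the original order, one possible sketch is to sort only the unique values and look each name up afterwards (the variable rank is just a name chosen for this example):

```python
names = ['ll', 'hl', 'll', 'hl', 'LL', 'll', 'LL', 'HL', 'hl', 'HL', 'LL', 'HL', 'zzz']

# Number the unique names by sorted position; names itself is left untouched
rank = {name: i for i, name in enumerate(sorted(set(names)))}
numbers = [rank[name] for name in names]

print(numbers)  # [3, 2, 3, 2, 1, 3, 1, 0, 2, 0, 1, 0, 4]
```

sorted accepts the same key argument as list.sort, so the custom ordering mentioned above can be applied here as well.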

Piotr Kamoda
0

Since you are mapping strings to integers, that suggests using a dict. So you can do the following:

d = dict()

counter = 0

for name in names:
    if name in d:
        continue
    d[name] = counter
    counter += 1

numbers = [d[name] for name in names]
Nir Friedman
0

Here is a similar factorizing solution with collections.defaultdict and itertools.count:

import itertools as it
import collections as ct


names = ['ll', 'll', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'll']

dd = ct.defaultdict(it.count().__next__)
[dd[i] for i in names]
# [0, 0, 1, 1, 2, 2, 2, 3, 0]

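Each lookup of a key that is not yet in dd calls next() on the itertools.count iterator and stores the result as a new entry in dd; keys that are already present return their stored number.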

pylang
0

Pandas' factorize can simply factorize unique strings:

import pandas as pd

codes, uniques = pd.factorize(names)
codes
# array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  (by default, codes follow the order of first appearance)

This can also be done in Scikit-learn with LabelEncoder():

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
codes = le.fit_transform(names)
codes
# array([3, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0])  (LabelEncoder numbers the classes in sorted order)
iacob
-1

You can also try this:

names = ['ll', 'll', 'll', 'hl', 'hl', 'hl', 'LL', 'LL', 'LL', 'HL', 'HL', 'HL']

indexList = list(set(names))

print(list(map(lambda name: indexList.index(name), names)))  # list() is needed on Python 3
Rakesh Kumar