I have multiple lists of string features that I want to analyze, for example:

[["0.5", "0.4", "disabled", "0.7", "disabled"], ["feature1", "feature2", "feature4", "feature1", "feature3"]]

I know how to convert strings like "0.5" to floats, but is there a way to "normalize" such lists to integer or float values (each list independently in my case)? I would like to get something like this:

[[2, 1, 0, 3, 0], [0, 1, 3, 0, 2]]

Does anyone know how to achieve this? Unfortunately, I haven't been able to find anything related to this problem yet.

martineau
Leado
  • that's not a normalization, that's a ranking – Walter Tross Aug 25 '20 at 13:40
  • @WalterTross That might be, but I want to emphasize that the order in ranking is not important to me. The only thing necessary is that identical strings get identical integers. – Leado Aug 25 '20 at 13:45

2 Answers

A bit messy, but this should do what you want: for each inner list, use a dictionary to keep track of the items already seen and the integer assigned to each. You could replace the for loops with comprehensions to make this less verbose.

def track_items_in_list(test_list):
    outer_list = []
    # iterate through outer list
    for _list in test_list:
        # unique_count is an integer that corresponds to an item in your list
        unique_count = 0
        # used_tracker matches the unique_count with an item in your list
        used_tracker = {}
        inner_list = []
        # iterate through inner list
        for _item in _list:
            # if the item has been seen before, reuse the integer already assigned to it
            if _item in used_tracker:
                inner_list.append(used_tracker[_item])
            else:
                # if not, add the count to the tracker
                inner_list.append(unique_count)
                used_tracker[_item] = unique_count
                unique_count += 1
        outer_list.append(inner_list)
    return outer_list

print(track_items_in_list([["0.5", "0.4", "disabled", "0.7", "disabled"], ["feature1", "feature2", "feature4", "feature1", "feature3"]]))
# [[0, 1, 2, 3, 2], [0, 1, 2, 0, 3]]
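
The less verbose variant mentioned above could look roughly like this. It is only a sketch of the same first-seen numbering; the name `track_items_compact` is made up for illustration, and it relies on `dict.setdefault` returning the stored ID when the key already exists.

```python
def track_items_compact(test_list):
    result = []
    for inner in test_list:
        # ids maps each distinct string to the integer it got on first sight
        ids = {}
        # len(ids) is evaluated before the call, so a new item receives the
        # next free integer; a known item gets its stored integer back
        result.append([ids.setdefault(item, len(ids)) for item in inner])
    return result


data = [["0.5", "0.4", "disabled", "0.7", "disabled"],
        ["feature1", "feature2", "feature4", "feature1", "feature3"]]
print(track_items_compact(data))
# [[0, 1, 2, 3, 2], [0, 1, 2, 0, 3]]
```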
bm13563

Use a dictionary and a counter to give IDs to new values and remember past IDs:

import itertools, collections

def norm(lst):
    d = collections.defaultdict(itertools.count().__next__)
    return [d[s] for s in lst]

lst = [["0.5", "0.4", "disabled", "0.7", "disabled"],
       ["feature1", "feature2", "feature4", "feature1", "feature3"]]
print(list(map(norm, lst)))
# [[0, 1, 2, 3, 2], [0, 1, 2, 0, 3]]
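
To see why this works, here is a small stand-alone sketch of the `defaultdict` trick on its own (the keys `"a"` and `"b"` are arbitrary placeholders). A missing key triggers the default factory, which here is the `__next__` method of a fresh counter, so every new key receives the next integer while known keys keep theirs.

```python
import collections
import itertools

# the factory is called only when a key is looked up for the first time
ids = collections.defaultdict(itertools.count().__next__)

first = ids["a"]   # new key: counter yields 0
second = ids["b"]  # new key: counter yields 1
again = ids["a"]   # known key: stored value is returned, counter untouched

print(first, second, again)
# 0 1 0
print(dict(ids))
# {'a': 0, 'b': 1}
```

Because `norm` builds a new dictionary and a new counter on every call, each inner list is numbered independently.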

Or enumerate the sorted unique values; note, however, that "disabled" then sorts after the numeric strings:

def norm_sort(lst):
    d = {x: i for i, x in enumerate(sorted(set(lst)))}
    return [d[s] for s in lst]

print(list(map(norm_sort, lst)))
# [[1, 0, 3, 2, 3], [0, 1, 3, 0, 2]]
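
If the exact numbering from the question matters (with "disabled" mapped to 0), one possible approach is a custom sort key that places non-numeric strings first and orders numeric strings by value. This is only a sketch: `sort_key` and `norm_sort_numeric` are hypothetical names, and it assumes that any string `float()` can parse should count as numeric (which also includes strings such as "inf" and "nan").

```python
def sort_key(s):
    # non-numeric strings sort first (group 0), numeric strings after them
    # (group 1), ordered by their float value rather than alphabetically
    try:
        return (1, float(s), s)
    except ValueError:
        return (0, 0.0, s)


def norm_sort_numeric(lst):
    d = {x: i for i, x in enumerate(sorted(set(lst), key=sort_key))}
    return [d[s] for s in lst]


lst = [["0.5", "0.4", "disabled", "0.7", "disabled"],
       ["feature1", "feature2", "feature4", "feature1", "feature3"]]
print(list(map(norm_sort_numeric, lst)))
# [[2, 1, 0, 3, 0], [0, 1, 3, 0, 2]]
```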
tobias_k