I am trying to translate a column of a pandas DataFrame into integer values using a mapping like this (assuming a given DataFrame my_dataframe and a column target_column):

targets = my_dataframe[target_column].unique()
map_to_int = {name: n for n, name in enumerate(targets)}

Using Python 3.6 with pandas, I wonder why

A)

my_dataframe['Integer-Column'] = map_to_int[my_dataframe[target_column]]

causes a

TypeError: 'Series' objects are mutable, thus they cannot be hashed

whereas

B)

my_dataframe['Integer-Column'] = my_dataframe[target_column].replace(map_to_int)

works fine.

I would like to understand why this happens. Is there some magic in replace that prevents the TypeError, or am I missing something else? I already understand that dict keys must not be mutable. But I still have a hard time really understanding this, since:

    words = my_dataframe[target_column].unique()
    # words = ['car' 'bike' 'plain']

    foo = 'car'
    map_to_int[foo] = 0
    foo = 'bike'
    map_to_int["bike"] = 1

Any attempt to help me understand why B) works without the trouble of A) would be appreciated.

jpp
Simeon
  • I found an explanation for the confusing part about strings here: https://stackoverflow.com/questions/9097994/arent-python-strings-immutable-then-why-does-a-b-work The example mapping with foo works because the strings 'car' and 'bike' that the name foo refers to are immutable, even though the name foo itself can be rebound to different immutable objects. – Simeon Jul 05 '18 at 08:05

2 Answers


Your solution does not work because with map_to_int[my_dataframe[target_column]] you are trying to use a pd.Series object as a dictionary key.
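
To make the failure concrete, here is a minimal sketch. The DataFrame contents and the column name 'vehicle' are made up for illustration and are not from the question. A dictionary lookup has to hash its key, and a Series refuses to be hashed, so the lookup fails before the dictionary's contents are ever consulted. The exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical data standing in for my_dataframe[target_column]
my_dataframe = pd.DataFrame({'vehicle': ['car', 'bike', 'car', 'plane']})
target_column = 'vehicle'

targets = my_dataframe[target_column].unique()
map_to_int = {name: n for n, name in enumerate(targets)}

# A single string is hashable, so it works as a dictionary key
single_lookup = map_to_int['bike']

# The whole Series is not hashable, so using it as a key raises TypeError
try:
    map_to_int[my_dataframe[target_column]]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(single_lookup)
print(error_message)
```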

Furthermore, I recommend you use replace only in specific circumstances; for a dictionary mapping you should typically use pd.Series.map, i.e. my_dataframe[target_column].map(map_to_int). See "Replace values in a pandas series via dictionary efficiently" for more details.
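
One practical difference between the two methods shows up when the dictionary does not cover every value. The sketch below uses made-up data and deliberately leaves one value out of the mapping. The explicit dtype=object is an assumption chosen to keep the behavior the same across pandas versions.

```python
import pandas as pd

# Hypothetical data; dtype=object keeps the example version-independent
s = pd.Series(['car', 'bike', 'car', 'plane'], dtype=object)
map_to_int = {'car': 0, 'bike': 1}  # 'plane' is deliberately left out

# map: values without an entry in the dictionary become NaN
mapped = s.map(map_to_int)

# replace: values without an entry are left unchanged
replaced = s.replace(map_to_int)

print(mapped)
print(replaced)
```

If every value is guaranteed to have a key, both calls produce the same integers; map makes gaps in the mapping visible as NaN, while replace passes unmatched values through silently.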

That said, this functionality is already built into pandas as categorical data. I recommend you use categorical data as an efficient and syntactically clean way of mapping items in a series to integers.

Here's an example:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a', 'b', 'a', 'd']})

df['col1'] = df['col1'].astype('category').cat.codes

print(df)

   col1
0     0
1     1
2     2
3     0
4     1
5     0
6     3
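
If you need to get from the integer codes back to the original values, the categories can be kept around before the column is overwritten. The following is a small sketch with made-up data; the names cat, codes and code_to_label are illustrative only. Note that the categories inferred by astype('category') are sorted when the values are sortable, so the codes do not necessarily follow the order of first appearance that enumerate over unique() would give.

```python
import pandas as pd

# Hypothetical data where the first value is not the smallest one
df = pd.DataFrame({'col1': ['b', 'a', 'c', 'a']})

cat = df['col1'].astype('category')

# One integer code per row
codes = cat.cat.codes

# Mapping from each code back to the original value
code_to_label = dict(enumerate(cat.cat.categories))

print(codes.tolist())
print(code_to_label)
```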
jpp

my_dataframe[target_column] is a pd.Series, which Python (3.6) treats as mutable and therefore unhashable. Dictionary keys must be hashable, so using a mutable object as a key raises the TypeError mentioned above. Hence indexing a dictionary such as map_to_int with the Series raises the error.

In version B) the dictionary map_to_int is still used, but it is never indexed with the Series itself. Its keys are the immutable (hashable) values held in targets. So when the replace function (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) makes use of the dictionary, it works with those immutable keys and matches them against the individual values in the column. Therefore there is no reason for the TypeError to be raised, and that is what has been observed.
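
The difference can be sketched by doing the lookup by hand, one element at a time. This is only a conceptual illustration with made-up data, not how pandas implements replace internally: each individual value is a hashable string, so every lookup succeeds, whereas A) tries a single lookup with the whole Series as the key.

```python
import pandas as pd

# Hypothetical data; dtype=object keeps the example version-independent
s = pd.Series(['car', 'bike', 'car'], dtype=object)
map_to_int = {'car': 0, 'bike': 1}

# Look up each element separately; fall back to the value itself if it has no key
manual = pd.Series([map_to_int.get(value, value) for value in s], index=s.index)

print(manual.tolist())
```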

Simeon