
I have a DataFrame that looks like this:

df = pd.DataFrame([
    ['a', 1], 
    ['b', 1],
    ['c', 1],
    ['a', 2], 
    ['c', 3], 
    ['b', 4], 
    ['c', 4]
], columns=['item', 'user'])

Each user is repeated across multiple rows (with different items).

I would like to perform a LabelEncoder/LabelBinarizer-like transform to convert the DataFrame into something that looks like this:

pd.DataFrame([
    [1, 1, 1], #user 1
    [1, 0, 0], #user 2
    [0, 0, 1], #user 3
    [0, 1, 1]  #user 4
], columns=['a', 'b', 'c'])

I would rather not use pandas directly (pivot, get_dummies, crosstab), because I want to be able to pass a new user to the fitted transformer:

new_user = pd.DataFrame([
    ['c', 5], 
    ['d', 5]
], columns=['item', 'user'])

And get back something like this:

[0, 0, 1]

Important: the solution must handle the new-user case (dropping the unseen 'd' item) and preserve both the column order and the dimensions.

emehex
  • Can you define the list of items beforehand? – Dani Mesejo Oct 16 '19 at 20:46
  • @DanielMesejo sure! I'm imagining a `fit()` step that will memorize a list of all the possible items (sort of like LabelEncoder) – emehex Oct 16 '19 at 20:48
  • @emehex Check out my answer. I fixed a few more errors; sorry, Jupyter was still using old variables and I only found the bugs after restarting. BTW, good question. – Poojan Oct 16 '19 at 21:06

3 Answers

  • Here is what I came up with.
  • It is one long method chain, so the comments in the code break it down.
import pandas as pd

def encode(l):
    # "fit" step: remember the distinct items in first-seen order
    return pd.DataFrame(l, columns=['item', 'user'])['item'].unique()

def add_user(l, _key_):
    # 1. create a DataFrame from the [item, user] pairs
    # 2. group by user and get dummies for each user's items
    # 3. keep only the columns in _key_ (unseen items are dropped,
    #    missing ones are filled with 0)
    # 4. apply list to turn each row into a list
    return (
        pd.DataFrame(l, columns=['item', 'user'])
        .groupby('user')['item'].apply('|'.join).str.get_dummies()
        .reindex(columns=_key_, fill_value=0).astype('int')
        .apply(list, axis=1)
    )

_key_ = encode ([
    ['a', 1], 
    ['b', 1],
    ['c', 1],
    ['a', 2], 
    ['c', 3], 
    ['b', 4], 
    ['c', 4]
])
add_user([
    ['a', 1], 
    ['b', 1],
    ['c', 1],
    ['a', 2], 
    ['c', 3], 
    ['b', 4], 
    ['c', 4]
], _key_)

Output:

user
1    [1, 1, 1]
2    [1, 0, 0]
3    [0, 0, 1]
4    [0, 1, 1]
add_user([['b',5],['d', 5]], _key_)

Output:

user
5    [0, 1, 0]
  • encode generates the initial keys (the known items) for your encoder.
  • add_user can be called for each new user; items that are not among the keys are dropped.
  • Note that you can call reset_index on the result to get user back as a column.
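To make the chain in add_user easier to follow, here is a minimal step-by-step sketch of the same pipeline with every intermediate result named. The variable names (rows, frame, keys, joined, dummies, aligned, result) are only for illustration and are not part of the code above; the sketch also assumes that reindex with fill_value=0 is an acceptable stand-in for the fillna(0) call.

```python
import pandas as pd

rows = [['a', 1], ['b', 1], ['c', 1], ['a', 2], ['c', 3], ['b', 4], ['c', 4]]
frame = pd.DataFrame(rows, columns=['item', 'user'])

# "Fit": remember the distinct items in first-seen order.
keys = frame['item'].unique()

# Step 1: one '|'-joined string of items per user, e.g. user 1 -> 'a|b|c'.
joined = frame.groupby('user')['item'].apply('|'.join)

# Step 2: split each string on '|' into 0/1 indicator columns.
dummies = joined.str.get_dummies(sep='|')

# Step 3: align the columns to the remembered keys. Unseen items are
# dropped and items a batch never mentions become all-zero columns.
aligned = dummies.reindex(columns=keys, fill_value=0)

# Step 4: move the user ids from the index back into a regular column.
result = aligned.reset_index()
print(result)
```

Because the items are joined with '|', an item whose name contains that character would be split into two columns, so this approach assumes item names never contain it.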

Solution 2:

  • Inspired by @WeNYoBen's answer.
import pandas as pd
df = pd.DataFrame([
    ['a', 1], 
    ['b', 1],
    ['c', 1],
    ['a', 2], 
    ['c', 3], 
    ['b', 4], 
    ['c', 4]
], columns=['item', 'user'])
_key_ = df['item'].unique()

def add_user(l, _key_):
    df = pd.DataFrame(l, columns=['item', 'user'])
    return (
        pd.crosstab(df['user'], df['item'])
        .reindex(columns=_key_.tolist(), fill_value=0)
        .astype('int')
        .apply(list, axis=1)
    )

add_user([['b',5],['d', 5]], _key_)
  • A less readable, single-expression version of the add_user function:
def add_user(l, _key_):
    return pd.crosstab(*[[list(x)] for x in list(zip(*l))[::-1]]).reindex(columns=_key_.tolist()).fillna(0).astype('int').apply(list, axis=1)
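One thing to keep in mind with the crosstab variants: crosstab counts occurrences, so if the same (item, user) pair appears more than once, the cell holds a count greater than 1 instead of a 0/1 flag. Below is a minimal sketch of one way to force binary output, assuming duplicate rows can occur in your data. The sample rows and the hard-coded keys list are made up for illustration.

```python
import pandas as pd

# The pair ('a', 1) appears twice on purpose.
rows = [['a', 1], ['a', 1], ['b', 1], ['c', 2]]
frame = pd.DataFrame(rows, columns=['item', 'user'])
keys = ['a', 'b', 'c']  # stands in for the fitted _key_ list

# crosstab counts how often each (user, item) pair occurs.
counts = pd.crosstab(frame['user'], frame['item']).reindex(columns=keys, fill_value=0)

# Cap every count at 1 to get a 0/1 indicator matrix.
binary = counts.clip(upper=1)

print(counts)
print(binary)
```

The get_dummies-based version does not need this step, since it only records whether an item is present.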
Poojan
  • Hey @Poojan, thanks for your work! I've been mulling over the problem a bunch more and just posted the solution I landed on. Might perhaps be of some interest! – emehex Oct 18 '19 at 20:44
  • @emehex That's a good solution. But since you didn't mention the use of sklearn, I provided one with pandas. – Poojan Oct 19 '19 at 20:58

For this problem I would create a class Encoder, like the following:

import numpy as np
import pandas as pd


class Encoder:

    def __init__(self):
        self.items = None

    def transform(self, lst):
        """Return a dictionary mapping each user id to its encoded items.

        The first call also memorizes the list of known items.
        """
        if self.items is None:
            self.items = self.__items(lst)

        users = {}
        for item, user in lst:
            users.setdefault(user, set()).add(item)

        return {user: np.array([item in basket for item in self.items], dtype=np.uint8) for user, basket in users.items()}

    def reset(self):
        self.items = None

    @staticmethod
    def __items(lst):
        seen = set()
        items = []
        for item, _ in lst:
            if item not in seen:
                items.append(item)
                seen.add(item)
        return items

Then, you could use it like this:

encoder = Encoder()
result = encoder.transform(df.values.tolist())  # here df is your original DataFrame
df_result = pd.DataFrame(data=list(result.values()), columns=encoder.items, index=list(result.keys()))
print(df_result)

Output

   a  b  c
1  1  1  1
2  1  0  0
3  0  0  1
4  0  1  1

Notice that the index of df_result holds the users. The new case can then be handled like this:

new_user = pd.DataFrame([
    ['c', 5],
    ['d', 5]
], columns=['item', 'user'])
new_user_result = encoder.transform(new_user.values.tolist())
print(pd.DataFrame(data=list(new_user_result.values()), columns=encoder.items, index=list(new_user_result.keys())))

Output

   a  b  c
5  0  0  1

Receiving a list and returning a dictionary is, in my opinion, the more flexible approach. Returning a dictionary also handles the case where the users are not consecutive integers (they can be UUIDs, for example). Finally, the Encoder class also has a reset method, which makes it forget the memorized items.
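Note that transform memorizes the item list on its first call, so the fitting step is implicit. If an explicit fit() step is preferred, as discussed in the comments on the question, a minimal sketch could look like the following. The class name FittedEncoder and the attribute items_ are hypothetical and not part of the Encoder class above.

```python
import numpy as np


class FittedEncoder:
    """Hypothetical variant of Encoder with explicit fit and transform steps."""

    def fit(self, pairs):
        # dict.fromkeys keeps first-seen order and drops duplicates.
        self.items_ = list(dict.fromkeys(item for item, _ in pairs))
        return self

    def transform(self, pairs):
        if not hasattr(self, "items_"):
            raise RuntimeError("call fit() before transform()")
        baskets = {}
        for item, user in pairs:
            baskets.setdefault(user, set()).add(item)
        # Items that were not seen during fit are silently ignored.
        return {
            user: np.array([item in basket for item in self.items_], dtype=np.uint8)
            for user, basket in baskets.items()
        }


train = [["a", 1], ["b", 1], ["c", 1], ["a", 2], ["c", 3], ["b", 4], ["c", 4]]
encoder = FittedEncoder().fit(train)
fitted = encoder.transform(train)
new_user_encoded = encoder.transform([["c", 5], ["d", 5]])
print(new_user_encoded)
```

With this split, calling transform on a new user can never change the memorized items by accident, which is the behavior the question asks for.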

Dani Mesejo
  • Hey @Daniel Mesejo, I've been mulling over the problem a bunch more and just posted the solution I landed on. Might be of some interest! – emehex Oct 18 '19 at 20:44

A solution with some standard scikit-learn:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def squish(df, user='user', item='item'):
    # collapse each user's items into one comma-separated string
    return df.groupby(user)[item].apply(','.join)

cv = CountVectorizer(tokenizer=lambda x: x.split(','))
X = squish(df)
cv.fit_transform(X).todense()

This produces:

# matrix([[1, 1, 1],
#         [1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 1]], dtype=int64)

It also handles the new-user case:

new_user = pd.DataFrame([
    ['c', 5],
    ['d', 5]
], columns=['item', 'user'])

X_new = squish(new_user)
cv.transform(X_new).todense()

This correctly yields:

# matrix([[0, 0, 1]])
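The matrix above carries no column or row labels. A minimal sketch of how they could be recovered, assuming the vocabulary_ attribute of the fitted CountVectorizer is used to look up the column order. The helper split_items and the variables baskets, columns and labeled are names introduced here for illustration; lowercase=False and binary=True are optional settings that keep item names unchanged and cap repeated items at 1.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame(
    [['a', 1], ['b', 1], ['c', 1], ['a', 2], ['c', 3], ['b', 4], ['c', 4]],
    columns=['item', 'user'],
)


def split_items(text):
    # Each "document" is one user's comma-separated items.
    return text.split(',')


baskets = df.groupby('user')['item'].apply(','.join)

cv = CountVectorizer(tokenizer=split_items, lowercase=False, binary=True)
encoded = cv.fit_transform(baskets).toarray()

# vocabulary_ maps each item to its column index; order the items by that index.
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
labeled = pd.DataFrame(encoded, index=baskets.index, columns=columns)
print(labeled)

# New user: the unseen item 'd' is ignored by transform.
new_user = pd.DataFrame([['c', 5], ['d', 5]], columns=['item', 'user'])
new_baskets = new_user.groupby('user')['item'].apply(','.join)
new_encoded = cv.transform(new_baskets).toarray()
print(new_encoded)
```

As far as I know, CountVectorizer sorts its learned vocabulary, so the columns come out in alphabetical order, not first-seen order. The two happen to coincide for a, b, c here, but that is worth checking if column order matters for your data.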
emehex