User based encoding/convert with its interaction in pandas

Question

I have this dataframe which looks like this:

user_id : Represents user

question_id : Represent question number

user_answer : which option user has opted for the specific question from (A,B,C,D)

correct_answer: What is correct answer for that specific question

correct : 1.0 it means user answer is right

elapsed_time : it represents time in minutes user took to answer that question

timestamp : UNIX TIMESTAMP OF EACH INTERACTION

real_date : I have added this column and converted timestamp to human date & time

** user_iD **	** question_id **	** user_answer **	correct_answer	correct	** elapsed_time **	** solving_id **	** bundle_id **	timestamp	real_date
1	1	A	A	1.0	5.00	1	b1	1547794902000	Friday, January 18, 2019 7:01:42 AM
1	2	D	D	1.0	3.00	2	b2	1547795130000	Friday, January 18, 2019 7:05:30 AM
1	5	C	C	1.0	7.00	5	b5	1547795370000	Friday, January 18, 2019 7:09:30 AM
2	10	C	C	1.0	5.00	10	b10	1547806170000	Friday, January 18, 2019 10:09:30 AM
2	1	B	B	1.0	15.0	1	b1	1547802150000	Friday, January 18, 2019 9:02:30 AM
2	15	A	A	1.0	2.00	15	b15	1547803230000	Friday, January 18, 2019 9:20:30 AM
2	7	C	C	1.0	5.00	7	b7	1547802730000	Friday, January 18, 2019 9:12:10 AM
3	12	A	A	1.0	1.00	25	b12	1547771110000	Friday, January 18, 2019 12:25:10 AM
3	10	C	C	1.0	2.00	10	b10	1547770810000	Friday, January 18, 2019 12:20:10 AM
3	3	D	D	1.0	5.00	3	b3	1547770390000	Friday, January 18, 2019 12:13:10 AM
104	6	C	C	1.0	6.00	6	b6	1553040610000	Wednesday, March 20, 2019 12:10:10 AM
104	4	A	A	1.0	5.00	4	b4	1553040547000	Wednesday, March 20, 2019 12:09:07 AM
104	1	A	A	1.0	2.00	1	b1	1553040285000	Wednesday, March 20, 2019 12:04:45 AM

I need to do some encoding , I don't know which encoding should I do and how?

What i need a next dataframe to look like this :

user_id	b1	b2	b3	b4	b5	b6	b10	b12	b15
1	1	2	0	0	3	0	0	0	0
2	1	0	0	0	0	0	2	0	3
3	0	0	1	0	0	0	2	3	0
104	1	0	0	2	0	3	0	0	0

As you can see with the help of timestamp and real_date ; the question_id of each user is not sorted, The new dataframe should contain which of the bundles user has interacted with, time-based sorted.

Your question is not totally clear. In any case, you are discussing about format of a table, not the encoding. I assume you care only the first 3 columns, right? Map the second column to add the 'b'. Set index on the first two columns (they are the index, the 3rd column is the value). Then unstack() the second index, and you are nearly done. Now you should map the values (letter -> digit). From format I assume there are multiple choice questions, in such case your format will be more tricky — Giacomo Catenazzi, Jan 18 '22 at 07:47
@GiacomoCatenazzi : It's not about formatting a table. lemme clear this : As you can see each user has attempted different questions and in a different order. I need to encode this that each user should show which bundles he/she attempted in an orderly manner. This is just a sample dataframe. I have a huge dataset that contains 30 thousand users and around 9 million interactions. — Haseeb Sultan, Jan 18 '22 at 07:51
*Format* as *how various data are put together*. The date 2022-01-18 is a format, not an encoding. Encoding is usually meant for a transformation on every element. Every `encode` requires a `decode`; but a format it is just a different way to show the same data. *Formatting* is a different thing: we interpret it as nice format for humans (possibly losing precision [less digits]). For this reason we tend to use *change format* (or convert) and not formatting for the first meaning. — Giacomo Catenazzi, Jan 18 '22 at 07:59
Please [edit] your question to share a [mcve]. Questions that only ask for code are too broad and are likely to be [put on hold or closed](https://stackoverflow.com/help/closed-questions). — JosefZ, Jan 20 '22 at 20:39

Corralien · Accepted Answer · 2022-01-23T19:54:47.343

First create the final value for each bundle element using groupby and cumcount then pivot your dataframe. Finally reindex it to get all columns:

bundle = [f'b{i}' for i in range(1, 16)]

values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)

out = (
  df.assign(value=values).pivot_table('value', 'user_iD', 'bundle_id', fill_value=0)
    .reindex(bundle, axis=1, fill_value=0)
)

Output:

>>> out
bundle_id  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
user_iD                                                                    
1           1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
2           1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
3           0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
104         1   0   0   2   0   3   0   0   0    0    0    0    0    0    0

>>> out.reset_index().rename_axis(columns=None)
   user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
0        1   1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
1        2   1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
2        3   0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
3      104   1   0   0   2   0   3   0   0   0    0    0    0    0    0    0

I have used this, I'm getting a problem some of the entries became floating points. like some entries become 2.5 , 3.5, 15.5 etc it should be only integers — Haseeb Sultan, Mar 18 '22 at 05:48

JosefZ · Answer 2 · 2022-01-23T18:23:33.657

Lacking more Pythonish experience, I'm proposing the following (partially commented) code snippet which is not optimized in any way, being based merely on elementary pandas.DataFrame API reference.

import pandas as pd
import io
import sys

data_string = '''
user_iD;question_id;user_answer;correct_answer;correct;elapsed_time;solving_id;bundle_id;timestamp
1;1;A;A;1.0;5.00;1;b1;1547794902000
1;2;D;D;1.0;3.00;2;b2;1547795130000
1;5;C;C;1.0;7.00;5;b5;1547795370000
2;10;C;C;1.0;5.00;10;b10;1547806170000
2;1;B;B;1.0;15.0;1;b1;1547802150000
2;15;A;A;1.0;2.00;15;b15;1547803230000
2;7;C;C;1.0;5.00;7;b7;1547802730000
3;12;A;A;1.0;1.00;25;b12;1547771110000
3;10;C;C;1.0;2.00;10;b10;1547770810000
3;3;D;D;1.0;5.00;3;b3;1547770390000
104;6;C;C;1.0;6.00;6;b6;1553040610000
104;4;A;A;1.0;5.00;4;b4;1553040547000
104;1;A;A;1.0;2.00;1;b1;1553040285000
'''

df = pd.read_csv( io.StringIO(data_string), sep=";", encoding='utf-8')
# get only necessary columns ordered by timestamp
df_aux = df[['user_iD','bundle_id','correct', 'timestamp']].sort_values(by=['timestamp']) 

# hard coded new headers (possible to build from real 'bundle_id's)
df_new_headers = ['b{}'.format(x+1) for x in range(15)]
df_new_headers.insert(0, 'user_iD')

dict_answered = {}
# create a new dataframe (I'm sure that there is a more Pythonish solution)
df_new_data = []
user_ids = sorted(set( [x for label, x in df_aux.user_iD.items()]))
for user_id in user_ids:
    dict_answered[user_id] = 0
    if len( sys.argv) > 1 and sys.argv[1]:
        # supplied arg in the next line for better result readability
        df_new_values = [sys.argv[1].strip('"').strip("'")
            for x in range(len(df_new_headers)-1)]
    else:
        # zeroes (original assignment)
        df_new_values = [0 for x in range(len(df_new_headers)-1)]
    
    df_new_values.insert(0, user_id)
    df_new_data.append(df_new_values)

df_new = pd.DataFrame(data=df_new_data, columns=df_new_headers)

# fill the new dataframe using values from the original one
for aux in df_aux.itertuples(index=True, name=None):
    if aux[3] == 1.0:
        # add 1 to number of already answered questions for current user 
        dict_answered[aux[1]] += 1
        df_new.loc[ df_new["user_iD"] == aux[1], aux[2]] = dict_answered[aux[1]]    

print( df_new)

Output examples

Example: .\SO\70751715.py

   user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
0        1   1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
1        2   1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
2        3   0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
3      104   1   0   0   2   0   3   0   0   0    0    0    0    0    0    0

Example: .\SO\70751715.py .

   user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0        1  1  2  .  .  3  .  .  .  .   .   .   .   .   .   .
1        2  1  .  .  .  .  .  2  .  .   4   .   .   .   .   3
2        3  .  .  1  .  .  .  .  .  .   2   .   3   .   .   .
3      104  1  .  .  2  .  3  .  .  .   .   .   .   .   .   .

Example: .\SO\70751715.py ''

   user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0        1  1  2        3
1        2  1                 2         4                   3
2        3        1                     2       3
3      104  1        2     3

score 0 · Answer 3 · answered Jan 21 '22 at 10:51

I think you are looking for LabelEncoder. First import the library:

#Common Model Helpers
from sklearn.preprocessing import LabelEncoder

Then you should be able to convert objects to category:

    #CONVERT: convert objects to category 
    
    #code categorical data
    label = LabelEncoder()   
    dataset['question_id'] = label.fit_transform(dataset['question_id']
    dataset['user_answer'] = label.fit_transform(dataset['user_answer'])
    dataset['correct_answer'] = label.fit_transform(dataset['correct_answer'])

Or just use below:

dataset.apply(LabelEncoder().fit_transform)

user_id	b1	b2	b3	b4	b5	b6	b10	b12	b15
1	1	2	0	0	3	0	0	0	0
2	1	0	0	0	0	0	2	0	3
3	0	0	1	0	0	0	2	3	0
104	1	0	0	2	0	3	0	0	0

user_id	b1	b2	b3	b4	b5	b6	b10	b12	b15
1	1	2	0	0	3	0	0	0	0
2	1	0	0	0	0	0	2	0	3
3	0	0	1	0	0	0	2	3	0
104	1	0	0	2	0	3	0	0	0

User based encoding/convert with its interaction in pandas

3 Answers3

user_id	b1	b2	b3	b4	b5	b6	b10	b12	b15
1	1	2	0	0	3	0	0	0	0
2	1	0	0	0	0	0	2	0	3
3	0	0	1	0	0	0	2	3	0
104	1	0	0	2	0	3	0	0	0