1

I have this dataframe which looks like this:

user_id : Represents user

question_id : Represent question number

user_answer : which option user has opted for the specific question from (A,B,C,D)

correct_answer: What is correct answer for that specific question

correct : 1.0 it means user answer is right

elapsed_time : it represents time in minutes user took to answer that question

timestamp : UNIX TIMESTAMP OF EACH INTERACTION

real_date : I have added this column and converted timestamp to human date & time

** user_*iD *** ** question_*id *** ** user_*answer *** ** correct_answer ** ** correct ** ** elapsed_*time *** ** solving_*id *** ** bundle_*id *** timestamp real_date
1 1 A A 1.0 5.00 1 b1 1547794902000 Friday, January 18, 2019 7:01:42 AM
1 2 D D 1.0 3.00 2 b2 1547795130000 Friday, January 18, 2019 7:05:30 AM
1 5 C C 1.0 7.00 5 b5 1547795370000 Friday, January 18, 2019 7:09:30 AM
2 10 C C 1.0 5.00 10 b10 1547806170000 Friday, January 18, 2019 10:09:30 AM
2 1 B B 1.0 15.0 1 b1 1547802150000 Friday, January 18, 2019 9:02:30 AM
2 15 A A 1.0 2.00 15 b15 1547803230000 Friday, January 18, 2019 9:20:30 AM
2 7 C C 1.0 5.00 7 b7 1547802730000 Friday, January 18, 2019 9:12:10 AM
3 12 A A 1.0 1.00 25 b12 1547771110000 Friday, January 18, 2019 12:25:10 AM
3 10 C C 1.0 2.00 10 b10 1547770810000 Friday, January 18, 2019 12:20:10 AM
3 3 D D 1.0 5.00 3 b3 1547770390000 Friday, January 18, 2019 12:13:10 AM
104 6 C C 1.0 6.00 6 b6 1553040610000 Wednesday, March 20, 2019 12:10:10 AM
104 4 A A 1.0 5.00 4 b4 1553040547000 Wednesday, March 20, 2019 12:09:07 AM
104 1 A A 1.0 2.00 1 b1 1553040285000 Wednesday, March 20, 2019 12:04:45 AM

I need to do some encoding , I don't know which encoding should I do and how?

What i need a next dataframe to look like this :

user_id b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 3
3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0

As you can see with the help of timestamp and real_date ; the question_id of each user is not sorted, The new dataframe should contain which of the bundles user has interacted with, time-based sorted.

Haseeb Sultan
  • 91
  • 3
  • 14
  • Your question is not totally clear. In any case, you are discussing about format of a table, not the encoding. I assume you care only the first 3 columns, right? Map the second column to add the 'b'. Set index on the first two columns (they are the index, the 3rd column is the value). Then unstack() the second index, and you are nearly done. Now you should map the values (letter -> digit). From format I assume there are multiple choice questions, in such case your format will be more tricky – Giacomo Catenazzi Jan 18 '22 at 07:47
  • @GiacomoCatenazzi : It's not about formatting a table. lemme clear this : As you can see each user has attempted different questions and in a different order. I need to encode this that each user should show which bundles he/she attempted in an orderly manner. This is just a sample dataframe. I have a huge dataset that contains 30 thousand users and around 9 million interactions. – Haseeb Sultan Jan 18 '22 at 07:51
  • *Format* as *how various data are put together*. The date 2022-01-18 is a format, not an encoding. Encoding is usually meant for a transformation on every element. Every `encode` requires a `decode`; but a format it is just a different way to show the same data. *Formatting* is a different thing: we interpret it as nice format for humans (possibly losing precision [less digits]). For this reason we tend to use *change format* (or convert) and not formatting for the first meaning. – Giacomo Catenazzi Jan 18 '22 at 07:59
  • Please [edit] your question to share a [mcve]. Questions that only ask for code are too broad and are likely to be [put on hold or closed](https://stackoverflow.com/help/closed-questions). – JosefZ Jan 20 '22 at 20:39

3 Answers3

1

First create the final value for each bundle element using groupby and cumcount then pivot your dataframe. Finally reindex it to get all columns:

bundle = [f'b{i}' for i in range(1, 16)]

values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)

out = (
  df.assign(value=values).pivot_table('value', 'user_iD', 'bundle_id', fill_value=0)
    .reindex(bundle, axis=1, fill_value=0)
)

Output:

>>> out
bundle_id  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
user_iD                                                                    
1           1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
2           1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
3           0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
104         1   0   0   2   0   3   0   0   0    0    0    0    0    0    0

>>> out.reset_index().rename_axis(columns=None)
   user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
0        1   1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
1        2   1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
2        3   0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
3      104   1   0   0   2   0   3   0   0   0    0    0    0    0    0    0
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I have used this, I'm getting a problem some of the entries became floating points. like some entries become 2.5 , 3.5, 15.5 etc it should be only integers – Haseeb Sultan Mar 18 '22 at 05:48
1

Lacking more Pythonish experience, I'm proposing the following (partially commented) code snippet which is not optimized in any way, being based merely on elementary pandas.DataFrame API reference.

import pandas as pd
import io
import sys

data_string = '''
user_iD;question_id;user_answer;correct_answer;correct;elapsed_time;solving_id;bundle_id;timestamp
1;1;A;A;1.0;5.00;1;b1;1547794902000
1;2;D;D;1.0;3.00;2;b2;1547795130000
1;5;C;C;1.0;7.00;5;b5;1547795370000
2;10;C;C;1.0;5.00;10;b10;1547806170000
2;1;B;B;1.0;15.0;1;b1;1547802150000
2;15;A;A;1.0;2.00;15;b15;1547803230000
2;7;C;C;1.0;5.00;7;b7;1547802730000
3;12;A;A;1.0;1.00;25;b12;1547771110000
3;10;C;C;1.0;2.00;10;b10;1547770810000
3;3;D;D;1.0;5.00;3;b3;1547770390000
104;6;C;C;1.0;6.00;6;b6;1553040610000
104;4;A;A;1.0;5.00;4;b4;1553040547000
104;1;A;A;1.0;2.00;1;b1;1553040285000
'''

df = pd.read_csv( io.StringIO(data_string), sep=";", encoding='utf-8')
# get only necessary columns ordered by timestamp
df_aux = df[['user_iD','bundle_id','correct', 'timestamp']].sort_values(by=['timestamp']) 

# hard coded new headers (possible to build from real 'bundle_id's)
df_new_headers = ['b{}'.format(x+1) for x in range(15)]
df_new_headers.insert(0, 'user_iD')

dict_answered = {}
# create a new dataframe (I'm sure that there is a more Pythonish solution)
df_new_data = []
user_ids = sorted(set( [x for label, x in df_aux.user_iD.items()]))
for user_id in user_ids:
    dict_answered[user_id] = 0
    if len( sys.argv) > 1 and sys.argv[1]:
        # supplied arg in the next line for better result readability
        df_new_values = [sys.argv[1].strip('"').strip("'")
            for x in range(len(df_new_headers)-1)]
    else:
        # zeroes (original assignment)
        df_new_values = [0 for x in range(len(df_new_headers)-1)]
    
    df_new_values.insert(0, user_id)
    df_new_data.append(df_new_values)

df_new = pd.DataFrame(data=df_new_data, columns=df_new_headers)

# fill the new dataframe using values from the original one
for aux in df_aux.itertuples(index=True, name=None):
    if aux[3] == 1.0:
        # add 1 to number of already answered questions for current user 
        dict_answered[aux[1]] += 1
        df_new.loc[ df_new["user_iD"] == aux[1], aux[2]] = dict_answered[aux[1]]    

print( df_new)

Output examples

Example: .\SO\70751715.py

   user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
0        1   1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
1        2   1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
2        3   0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
3      104   1   0   0   2   0   3   0   0   0    0    0    0    0    0    0

Example: .\SO\70751715.py .

   user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0        1  1  2  .  .  3  .  .  .  .   .   .   .   .   .   .
1        2  1  .  .  .  .  .  2  .  .   4   .   .   .   .   3
2        3  .  .  1  .  .  .  .  .  .   2   .   3   .   .   .
3      104  1  .  .  2  .  3  .  .  .   .   .   .   .   .   .

Example: .\SO\70751715.py ''

   user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0        1  1  2        3
1        2  1                 2         4                   3
2        3        1                     2       3
3      104  1        2     3
JosefZ
  • 28,460
  • 5
  • 44
  • 83
0

I think you are looking for LabelEncoder. First import the library:

#Common Model Helpers
from sklearn.preprocessing import LabelEncoder

Then you should be able to convert objects to category:

    #CONVERT: convert objects to category 
    
    #code categorical data
    label = LabelEncoder()   
    dataset['question_id'] = label.fit_transform(dataset['question_id']
    dataset['user_answer'] = label.fit_transform(dataset['user_answer'])
    dataset['correct_answer'] = label.fit_transform(dataset['correct_answer'])

Or just use below:

dataset.apply(LabelEncoder().fit_transform)
yakutsa
  • 642
  • 3
  • 13