I have a dataset on which I am trying to determine association rules. The data after the merging and mapping is as follows:

Transaction data snapshot

Following this reference, Market Basket Analysis in Python, I see that I can use the groupby method to group the data by order ID with this command:

basket = df_order_mapped.groupby(['order_id']).sum().unstack()

This groups everything by order_id, with no spaces between the individual products bought. However, I am clueless from here on as to how to perform one-hot encoding as done in the reference. The reference uses the command:

basket = (df[df['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
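
As a rough illustration of what that chain does step by step, here is a minimal sketch on a made-up toy frame. The values are hypothetical and only the column names follow the reference command. The final 0/1 conversion is an assumption about how the quantities would then be turned into one-hot flags; it is not part of the command quoted above.

```python
import pandas as pd

# Hypothetical toy data standing in for the reference's dataset.
df = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 2, 2],
    "Description": ["tea", "mug", "tea", "tea", "jam"],
    "Quantity": [2, 1, 1, 3, 4],
})

# 1. One value per (invoice, item) pair: the summed quantity.
grouped = df.groupby(["InvoiceNo", "Description"])["Quantity"].sum()

# 2. Move the inner index level (Description) into columns.
#    Pairs that never occur become NaN.
wide = grouped.unstack()

# 3. Fill the gaps with 0 and keep InvoiceNo as the index, as the
#    reset_index().fillna(0).set_index('InvoiceNo') part of the chain does.
basket = wide.reset_index().fillna(0).set_index("InvoiceNo")

# 4. Assumed follow-up step: reduce quantities to 0/1 flags.
basket_flags = (basket > 0).astype(int)
print(basket_flags)
```

Each row of `basket_flags` is one invoice and each column is one item, so the table is the one-hot layout with all items of an invoice merged into a single row.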

I have tried to understand each individual command one by one, but I can't seem to get my head around it. Just as a test, I tried to use groupby with both order_id and product_id, but I get the error:

IndexError: index 838323453 is out of bounds for axis 0 with size 838322411

The data has 3 million rows and the total number of potential products is 25,000.
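
With that many rows and products, a dense order-by-product table has one cell for every possible pair, which may be what the unstack step is running into. One way to sidestep it, sketched here as an assumption rather than a confirmed fix, is to build the one-hot table as a sparse matrix instead. The toy frame below is hypothetical and only borrows the names `df_order_mapped`, `order_id` and `product_id` from above.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-in for df_order_mapped.
df_order_mapped = pd.DataFrame({
    "order_id": [10, 10, 11, 12, 12, 12],
    "product_id": [501, 502, 501, 503, 503, 504],
})

# Map each id to a compact integer position (0, 1, 2, ...).
row_codes, order_ids = pd.factorize(df_order_mapped["order_id"])
col_codes, product_ids = pd.factorize(df_order_mapped["product_id"])

# Build an orders x products matrix that stores only the pairs that occur.
# Repeated (order, product) pairs are summed, so this holds purchase counts.
counts = csr_matrix(
    (np.ones(len(df_order_mapped), dtype=np.int32), (row_codes, col_codes)),
    shape=(len(order_ids), len(product_ids)),
)

# Collapse the counts to 0/1 flags for the one-hot encoding.
flags = counts.copy()
flags.data[:] = 1

# Optional: wrap the result in a sparse-backed DataFrame with readable labels.
basket = pd.DataFrame.sparse.from_spmatrix(
    flags, index=order_ids, columns=product_ids
)
print(basket)
```

Only the observed pairs are stored, so memory use scales with the number of rows in the data rather than with the number of orders times the number of products.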

I would be grateful if someone could help me with this.

Thanks in advance.

Syeman
  • instead you should use `from sklearn.preprocessing import OneHotEncoder` read more here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html – YOLO Dec 01 '18 at 18:57
  • What are you trying to obtain? A oneHot vector for each combination order_id/product_id? – yatu Dec 01 '18 at 19:14
  • Yes, I have tried the OneHotEncoder. What I am confused about is this: if I one-hot encode the data before using groupby, how will I merge it into one row per order while still keeping the columns? – Syeman Dec 01 '18 at 21:00
  • @Nixon yes. I am trying to obtain a file where each value in product_id is one hot encoded and they are merged according to order_id – Syeman Dec 01 '18 at 21:51
