1

I have a dataset of restaurant invoices containing the products ordered by each client.

I've already processed the data and have the following matrix in a CSV file:

InvoiceID, product 1, product 2, product 3, product 4, product 5.....
123,       0,         1,         0,         1,         0,       .....
124,       0,         1,         1,         1,         0,       .....
...

Each invoice has one row in the CSV. Each product column holds 1 if the client ordered that product and 0 if not.

How do I process this data with sklearn so I can cluster the invoices and get the centroids so I can see what products are in each cluster center?

Thank you!

EDIT: I have 957 products, and many of them were never ordered, so I can reduce the matrix (I don't know the best way to do it).

andrepcg

2 Answers

1

Are you sure clustering is what you need?

It sounds as if market basket analysis (and frequent itemset mining) are the way to go.

Most clustering algorithms will assign every customer to exactly one type, whereas FIM will also detect subsets and overlapping patterns.
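To illustrate the idea, here is a minimal sketch of frequent itemset mining restricted to product pairs, using only the standard library. Invoices 123 and 124 follow the sample rows in the question; invoice 125, the `frequent_pairs` helper, and the `min_support` threshold are hypothetical and only there for illustration. A dedicated library would be a better fit for 957 products.

```python
from collections import Counter
from itertools import combinations

# Each invoice is the set of products ordered on it.
# 123 and 124 mirror the question's sample rows; 125 is made up.
invoices = {
    123: {"product 2", "product 4"},
    124: {"product 2", "product 3", "product 4"},
    125: {"product 1", "product 3"},
}

def frequent_pairs(baskets, min_support):
    """Return product pairs whose support (share of invoices) is >= min_support."""
    counts = Counter()
    for items in baskets.values():
        # Sort so that each pair is always counted under the same key.
        counts.update(combinations(sorted(items), 2))
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(invoices, min_support=0.5)
print(pairs)
```

Unlike a cluster label, one invoice can contribute to several frequent pairs at once, which is how overlapping patterns show up.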

Has QUIT--Anony-Mousse
0

You can use any of the clustering algorithms in scikit-learn. Take care not to pass the ID column to the algorithm. You can mask the always-zero columns using numpy or pandas. A good introduction to the clustering methods in scikit-learn can be found in the user guide.
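As a rough sketch of those steps, assuming pandas and `KMeans` as one possible choice of algorithm: the inline CSV below is a hypothetical stand-in for the real file (rows 125 and 126 are made up), and `n_clusters=2` is an arbitrary value picked for the toy data.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the real file; in practice call pd.read_csv on its path.
csv_text = """InvoiceID,product 1,product 2,product 3,product 4,product 5
123,0,1,0,1,0
124,0,1,1,1,0
125,1,0,0,0,0
126,1,0,1,0,0
"""

# Using InvoiceID as the index keeps it out of the feature matrix.
df = pd.read_csv(io.StringIO(csv_text), index_col="InvoiceID")

# Mask the always-zero columns (products that were never ordered).
X = df.loc[:, df.sum(axis=0) > 0]

# n_clusters is an assumption; it has to be tuned for the real data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One row per cluster center, one column per remaining product.
centroids = pd.DataFrame(km.cluster_centers_, columns=X.columns)
print(centroids.round(2))
print(dict(zip(X.index, km.labels_)))
```

Because the data is 0/1, each centroid entry is the fraction of invoices in that cluster that ordered the product, so sorting a centroid row in descending order lists the products that characterize the cluster.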

Andreas Mueller