I'm preparing data to run k-means from GraphLab, and am running into the following error:

 tmp = data.select_columns(['a.item_id'])
 tmp['sku'] = tmp['a.item_id'].apply(lambda x: x.split(','))
 tmp = tmp.unpack('sku')

 kmeans_model = gl.kmeans.create(tmp, num_clusters=K)

 Feature 'sku.0' excluded because of its type. Kmeans features must be int, float, dict, or array.array type.
 Feature 'sku.1' excluded because of its type. Kmeans features must be int, float, dict, or array.array type.

Here are the current datatypes of each column:

a.item_id   str
sku.0   str
sku.1   str

If I can convert the datatype from str to int, I think it should work. However, working with SFrames is a bit trickier than with standard Python libraries. Any help getting there is appreciated.

jKraut

1 Answer

The kmeans model does allow features in dictionary form, just not in list form. This is slightly different from what you've got now, because the dictionary loses the order of your SKUs, but in terms of model quality I suspect it actually makes more sense. The key function is count_words, in the text analytics toolkit.

https://dato.com/products/create/docs/generated/graphlab.text_analytics.count_words.html
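To make the dictionary form concrete, here is a minimal plain-Python sketch of the kind of per-row token counts involved. It does not use GraphLab, and `count_tokens` is a hypothetical helper written for illustration, not the library's implementation (it ignores things like case handling that count_words may apply).

```python
from collections import Counter

def count_tokens(text, delimiter=","):
    """Split text on the delimiter and count each non-empty token."""
    tokens = [t.strip() for t in text.split(delimiter)]
    return dict(Counter(t for t in tokens if t))

item_ids = ["abc,xyz,cat", "rst", "abc,dog"]

# One dictionary per row: keys are SKUs, values are how often each appears.
sku_counts = [count_tokens(s) for s in item_ids]
for row in sku_counts:
    print(row)
```

Because each row becomes a dictionary, "abc,dog" and "dog,abc" produce the same feature, which is the loss of ordering mentioned above.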

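import graphlab as gl

# Toy data: each row holds a comma-separated string of SKUs.
sf = gl.SFrame({'item_id': ['abc,xyz,cat', 'rst', 'abc,dog']})

# Turn each string into a dictionary of SKU counts, splitting on commas only.
sf['sku_count'] = gl.text_analytics.count_words(sf['item_id'], delimiters=[','])

# Cluster on the dictionary column alone.
model = gl.kmeans.create(sf, num_clusters=2, features=['sku_count'])
print(model.cluster_id)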

+--------+------------+----------------+
| row_id | cluster_id |    distance    |
+--------+------------+----------------+
|   0    |     1      | 0.866025388241 |
|   1    |     0      |      0.0       |
|   2    |     1      | 0.866025388241 |
+--------+------------+----------------+
[3 rows x 3 columns]
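As a sanity check on the distance column, the sketch below recomputes the values for cluster 1 in plain Python. It assumes each dictionary is treated as a sparse vector (missing keys count as 0) and that the reported distance is the Euclidean distance to the cluster mean; that is my reading of the numbers, not something taken from the GraphLab documentation. The cluster membership is copied from the table above.

```python
import math

# Dictionary features for the three rows, written out by hand.
rows = [
    {"abc": 1, "xyz": 1, "cat": 1},
    {"rst": 1},
    {"abc": 1, "dog": 1},
]

# Rows 0 and 2 share cluster 1 according to the output table.
cluster_members = [rows[0], rows[2]]

def centroid(members):
    """Mean of sparse vectors, treating missing keys as 0."""
    keys = set().union(*members)
    return {k: sum(m.get(k, 0) for m in members) / float(len(members)) for k in keys}

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

center = centroid(cluster_members)
distances = [euclidean(m, center) for m in cluster_members]
print(distances)
```

Both values come out at about 0.866025, in line with the table. Row 1 is alone in cluster 0, so it sits exactly on its own center and its distance is 0.0.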
papayawarrior