kmeans implementation with mixed variable at database level

Question

I have a table with columns of different datatypes (columns like: ProductId, Name, size, color, class, dept etc) as not all columns are numeric, how can I cluster similar products together. The data sits in Netezza and for the purpose of fast processing, I want to do it on DB side only as data volume is huge (around 2 million rows).

I tried to implement Gower's similarity in R but it takes a lot of time. Is there a UDF I can use on netezza side?

dput(head(prod))

structure(list(Product_key = c("136220083", "134520094", "137520230", "133420231", "137420204", "136520284"), SRO_score = c(2, 2, 2, 3, 3, 1), PRDF_SKU_NAME = c("1496533", "1496534", "1496537", "1496540", "1496541", "1496542"), ATTRIB_VAL1 = c("Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas"), ATTRIB_VAL2 = c("Navy Canvas", "Navy Canvas", "Red", "Red", "Red", "Red"), ATTRIB_VAL3 = c("9.5W", "10W", "7W", "8.5W", "9W", "9.5W"), ATTRIB_VAL4 = c("Keds", "Keds", "Keds", "Keds", "Keds", "Keds"), ATTRIB_VAL5 = c("VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR"), ATTRIB_VAL6 = c("WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL"), ATTRIB_VAL7 = c("1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb"), ATTRIB_VAL8 = c("SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT"), ATTRIB_VAL9 = c("WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES")), .Names = c("Product_key", "SRO_score", "PRDF_SKU_NAME", "ATTRIB_VAL1", "ATTRIB_VAL2", "ATTRIB_VAL3", "ATTRIB_VAL4", "ATTRIB_VAL5", "ATTRIB_VAL6", "ATTRIB_VAL7", "ATTRIB_VAL8", "ATTRIB_VAL9"), row.names = c(4107L, 3927L, 4260L, 3794L, 4246L, 4140L), class = "data.frame")

You should share a sample of the data. Something like `dput(head(data))`. — tushaR, Aug 31 '17 at 12:34

score 0 · Answer 1 · answered Aug 31 '17 at 19:00

0

You can't just use k-means with Gower similarity.

K-means needs to compute the means, too.

The usual choice would be PAM, but that scales horribly. You don't want to use that on your full data set.

Rather than scaling to your entire data set, use a sample first to learn what to do. Clustering right is difficult. You need to spend 90% of your time on preprocessing.

First figure out what works. Then scale. Not the other way around.

answered Aug 31 '17 at 19:00

Has QUIT--Anony-Mousse

76,138
12
138
194

can u please elaborate a bit on PAM or guide me to some document about it? I have no idea about it. Thanks a lot. – Abhishek Gupta Aug 31 '17 at 22:17
Also, the custom distance metric PAM uses are Euclidean or Manhatten, none of them deal with mixed variables. – Abhishek Gupta Sep 01 '17 at 08:56
You can use PAM with Gower. Although I doubt that Gower is useful on your data either. – Has QUIT--Anony-Mousse Sep 02 '17 at 00:19

kmeans implementation with mixed variable at database level

1 Answers1