-2

I have a table with columns of different datatypes (columns like: ProductId, Name, size, color, class, dept etc) as not all columns are numeric, how can I cluster similar products together. The data sits in Netezza and for the purpose of fast processing, I want to do it on DB side only as data volume is huge (around 2 million rows).

I tried to implement Gower's similarity in R but it takes a lot of time. Is there a UDF I can use on netezza side?

dput(head(prod))

structure(list(Product_key = c("136220083", "134520094", "137520230", "133420231", "137420204", "136520284"), SRO_score = c(2, 2, 2, 3, 3, 1), PRDF_SKU_NAME = c("1496533", "1496534", "1496537", "1496540", "1496541", "1496542"), ATTRIB_VAL1 = c("Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas"), ATTRIB_VAL2 = c("Navy Canvas", "Navy Canvas", "Red", "Red", "Red", "Red"), ATTRIB_VAL3 = c("9.5W", "10W", "7W", "8.5W", "9W", "9.5W"), ATTRIB_VAL4 = c("Keds", "Keds", "Keds", "Keds", "Keds", "Keds"), ATTRIB_VAL5 = c("VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR"), ATTRIB_VAL6 = c("WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL"), ATTRIB_VAL7 = c("1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb"), ATTRIB_VAL8 = c("SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT"), ATTRIB_VAL9 = c("WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES")), .Names = c("Product_key", "SRO_score", "PRDF_SKU_NAME", "ATTRIB_VAL1", "ATTRIB_VAL2", "ATTRIB_VAL3", "ATTRIB_VAL4", "ATTRIB_VAL5", "ATTRIB_VAL6", "ATTRIB_VAL7", "ATTRIB_VAL8", "ATTRIB_VAL9"), row.names = c(4107L, 3927L, 4260L, 3794L, 4246L, 4140L), class = "data.frame")

Abhishek Gupta
  • 77
  • 1
  • 2
  • 9

1 Answers1

0

You can't just use k-means with Gower similarity.

K-means needs to compute the means, too.

The usual choice would be PAM, but that scales horribly. You don't want to use that on your full data set.

Rather than scaling to your entire data set, use a sample first to learn what to do. Clustering right is difficult. You need to spend 90% of your time on preprocessing.

First figure out what works. Then scale. Not the other way around.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194