I have a table with columns of different datatypes (columns like: ProductId, Name, size, color, class, dept etc) as not all columns are numeric, how can I cluster similar products together. The data sits in Netezza and for the purpose of fast processing, I want to do it on DB side only as data volume is huge (around 2 million rows).
I tried to implement Gower's similarity in R but it takes a lot of time. Is there a UDF I can use on netezza side?
dput(head(prod))
structure(list(Product_key = c("136220083", "134520094", "137520230", "133420231", "137420204", "136520284"), SRO_score = c(2, 2, 2, 3, 3, 1), PRDF_SKU_NAME = c("1496533", "1496534", "1496537", "1496540", "1496541", "1496542"), ATTRIB_VAL1 = c("Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas"), ATTRIB_VAL2 = c("Navy Canvas", "Navy Canvas", "Red", "Red", "Red", "Red"), ATTRIB_VAL3 = c("9.5W", "10W", "7W", "8.5W", "9W", "9.5W"), ATTRIB_VAL4 = c("Keds", "Keds", "Keds", "Keds", "Keds", "Keds"), ATTRIB_VAL5 = c("VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR"), ATTRIB_VAL6 = c("WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL"), ATTRIB_VAL7 = c("1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb"), ATTRIB_VAL8 = c("SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT"), ATTRIB_VAL9 = c("WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES")), .Names = c("Product_key", "SRO_score", "PRDF_SKU_NAME", "ATTRIB_VAL1", "ATTRIB_VAL2", "ATTRIB_VAL3", "ATTRIB_VAL4", "ATTRIB_VAL5", "ATTRIB_VAL6", "ATTRIB_VAL7", "ATTRIB_VAL8", "ATTRIB_VAL9"), row.names = c(4107L, 3927L, 4260L, 3794L, 4246L, 4140L), class = "data.frame")