I am trying to use kmeans function in PL/R. I have created following table
CREATE TABLE EMP (NAME1 TEXT, AGE SMALLINT, SALARY NUMERIC );
INSERT INTO EMP VALUES
('Joe', 41, 55000),
('Jill', 27, 25000),
('Jack', 31, 45000),
('Joker', 65, 20000),
('Joy', 22, 31000),
('Jane', 72, 35000),
('Jackson', 42, 65000),
('Jessica', 23, 37000);
My clustering function is given below
CREATE OR REPLACE FUNCTION CLUS(sal NUMERIC[])
RETURNS INTEGER[] AS
$BODY$
a <- NA
a = kmeans(sal, 3)$cluster
return(a)
$BODY$
LANGUAGE 'plr' ;
When I am executing the following query
SELECT * , CLUS(ARRAY (SELECT SALARY FROM EMP ORDER BY SALARY)) AS CLUSTER_ID from emp order by salary;
My output is
name1 age salary cluster_id
Joker 65 20000 {3,3,2,2,2,2,1,1}
Jill 27 25000 {2,2,2,2,2,3,3,1}
Joy 22 31000 {3,3,2,2,2,2,1,1}
Jane 72 35000 {1,1,1,1,1,2,2,3}
Jessica 23 37000 {1,1,2,2,2,2,3,3}
Jack 31 45000 {2,2,3,3,3,3,1,1}
Joe 41 55000 {1,1,3,3,3,3,2,2}
Jackson 42 65000 {2,2,3,3,3,3,1,1}
Problems I am facing are, every time a row is evaluated clusters are changing and I want the following output …
name1 age salary cluster_id
Joker 65 20000 3
Jill 27 25000 3
Joy 22 31000 2
Jane 72 35000 2
Jessica 23 37000 2
Jack 31 45000 2
Joe 41 55000 1
Jackson 42 65000 1
please let me know if it can be done in a better and simple way