I am cleaning records that have poorly recorded and inconsistent socio-demographic information over time, for the same person. I want to take the most commonly occuring value (the mode) for each person.
One way to do that is to partition by id and then count how many times each value occurs, retaining the highest count for each id:
DROP TABLE dbo.table
SELECT DISTINCT [id], [ethnic_group] AS [ethnic_mode], ct INTO dbo.table
FROM (
SELECT row_number() OVER (PARTITION BY [id] ORDER BY count([ethnic_group]) DESC) as rn, count([ethnic_group]) as ct, [ethnic_group], [id]
FROM
dbo.mytable GROUP BY [id], [ethnic_group]) ranked
where rn = 1
ORDER BY ct DESC
But I want to do this for several variables (ethnic group, income group and several more).
How can I select the mode for several variables within one statement and insert into one table (rather than creating separate tables for each variable)?
The table below illustrates an example of what I want to do:
DROP TABLE mytable;
CREATE TABLE mytable(
id VARCHAR(2) NOT NULL PRIMARY KEY
,ethnic_group VARCHAR(12) NOT NULL
,ethnic_mode VARCHAR(11) NOT NULL
,income VARCHAR(6) NOT NULL
,income_mode VARCHAR(11) NOT NULL
);
INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('id','ethnic_group','ethnic_mode','income','income_mode');
INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','white','white','middle','middle');
INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','white','white','middle','middle');
INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','mixed','white','high','middle');
INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','asian','asian','middle','middle');
INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','mixed','asian','middle','middle');
INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','asian','asian','middle','middle');