0

I am cleaning records that have poorly recorded and inconsistent socio-demographic information over time, for the same person. I want to take the most commonly occuring value (the mode) for each person.

One way to do that is to partition by id and then count how many times each value occurs, retaining the highest count for each id:

 DROP TABLE dbo.table
 SELECT DISTINCT [id], [ethnic_group] AS [ethnic_mode], ct INTO dbo.table
 FROM (
     SELECT row_number() OVER (PARTITION BY [id] ORDER BY count([ethnic_group]) DESC) as rn, count([ethnic_group]) as ct, [ethnic_group], [id]
     FROM 
     dbo.mytable GROUP BY [id], [ethnic_group]) ranked
     where rn = 1
 ORDER BY ct DESC

But I want to do this for several variables (ethnic group, income group and several more).

How can I select the mode for several variables within one statement and insert into one table (rather than creating separate tables for each variable)?

The table below illustrates an example of what I want to do:

 DROP TABLE mytable;
 CREATE TABLE mytable(
    id     VARCHAR(2) NOT NULL PRIMARY KEY
   ,ethnic_group VARCHAR(12) NOT NULL
   ,ethnic_mode VARCHAR(11) NOT NULL
   ,income VARCHAR(6) NOT NULL
   ,income_mode VARCHAR(11) NOT NULL
 );
 INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('id','ethnic_group','ethnic_mode','income','income_mode');
 INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','white','white','middle','middle');
 INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','white','white','middle','middle');
 INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','mixed','white','high','middle');
 INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','asian','asian','middle','middle');
 INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','mixed','asian','middle','middle');
 INSERT INTO mytable(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','asian','asian','middle','middle');
user2964644
  • 189
  • 9

1 Answers1

0

I would use sub-queries to accomplish this in 1 insert statement.

Here is an example based on the table structure from your illustration:

/* This is the original table and contains duplicate ID's */
DECLARE @source_table TABLE(
    id     VARCHAR(2) NOT NULL
   ,ethnic_group VARCHAR(12) NULL
   ,ethnic_mode VARCHAR(11) NULL
   ,income VARCHAR(6) NULL
   ,income_mode VARCHAR(11) NULL
 );

/* This is the destination table and will not contain duplicate ID's */
DECLARE @destination_table TABLE(
    id     VARCHAR(2) NOT NULL PRIMARY KEY
   ,ethnic_group VARCHAR(12) NULL
   ,income VARCHAR(6) NULL
 );

/* Populate the source table with data */
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','white','white','middle','middle');
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','white','white','middle','middle');
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('1','mixed','white','high','middle');
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','asian','asian','middle','middle');
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','mixed','asian','middle','middle');
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('2','asian','asian','middle','middle');
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('3','asian', NULL, NULL, NULL);
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('3',NULL, NULL,'middle', NULL);
 INSERT INTO @source_table(id,ethnic_group,ethnic_mode,income,income_mode) VALUES ('3',NULL, NULL, NULL, NULL);

/* Insert from source into destination (removing duplicates) */
INSERT INTO @destination_table
        (
          id
        , ethnic_group
        , income
        )
SELECT st.id
    , (
        SELECT TOP 1 ethnic_group
        FROM @source_table sub_st
        WHERE sub_st.id = st.id
        GROUP BY ethnic_group
        ORDER BY COUNT(sub_st.id) DESC
    ) 
    , (
        SELECT TOP 1 income
        FROM @source_table sub_st
        WHERE sub_st.id = st.id
        GROUP BY income
        ORDER BY COUNT(sub_st.id) DESC
    ) 
FROM @source_table st
GROUP BY st.id


/* View the destination to see there are no duplicates */
SELECT  id
      , ethnic_group
      , income
FROM @destination_table
Jesse Potter
  • 827
  • 5
  • 20
  • I want to include NULL values (so if they are mostly null, the mode is still NULL). When I tried this on real data, NULL was ignored and the mode was assigned based on non-NULLs only. – user2964644 May 23 '16 at 13:15
  • @user2964644 I have updated my answer to handle NULL values. – Jesse Potter May 23 '16 at 17:55