How do I clean up my join_table and remove duplicate entries?

Question

I have 2 models - Question and Tag - which have a HABTM between them, and they share a join table questions_tags.

Feast your eyes on this badboy:

1.9.3p392 :011 > Question.count
   (852.1ms)  SELECT COUNT(*) FROM "questions" 
 => 417 
1.9.3p392 :012 > Tag.count
   (197.8ms)  SELECT COUNT(*) FROM "tags" 
 => 601 
1.9.3p392 :013 > Question.connection.execute("select count(*) from questions_tags").first["count"].to_i
   (648978.7ms)  select count(*) from questions_tags
 => 39919778

I am assuming that the questions_tags join table contains a bunch of duplicate records - otherwise, I have no idea why it would be so large.

How do I clean up that join table so that it only has uniq content? Or how do I even check to see if there are duplicate records in there?

Edit 1

I am using PostgreSQL, this is the schema for the join_table questions_tags

  create_table "questions_tags", :id => false, :force => true do |t|
    t.integer "question_id"
    t.integer "tag_id"
  end

  add_index "questions_tags", ["question_id"], :name => "index_questions_tags_on_question_id"
  add_index "questions_tags", ["tag_id"], :name => "index_questions_tags_on_tag_id"

score 2 · Accepted Answer · answered Mar 13 '13 at 00:51

I'm adding this as a new answer since it's a lot different from my last. This one doesn't assume that you have an id column on the join table. This creates a new table, selects unique rows into it, then drops the old table and renames the new one. This will be much faster than anything involving a subselect.

foo=# select * from questions_tags;
 question_id | tag_id
-------------+--------
           1 |      2
           2 |      1
           2 |      2
           1 |      1
           1 |      1
(5 rows)

foo=# select distinct question_id, tag_id into questions_tags_tmp from questions_tags;
SELECT 4
foo=# select * from questions_tags_tmp;
 question_id | tag_id
-------------+--------
           2 |      2
           1 |      2
           2 |      1
           1 |      1
(4 rows)

foo=# drop table questions_tags;
DROP TABLE
foo=# alter table questions_tags_tmp rename to questions_tags;
ALTER TABLE
foo=# select * from questions_tags;
 question_id | tag_id
-------------+--------
           2 |      2
           1 |      2
           2 |      1
           1 |      1
(4 rows)

Note that you'll probably have to manually recreate any indices that the old table had. — Jim Stewart, Mar 13 '13 at 00:53
Perfect...this works. Now I am down to just 148K records. Thanks much meng! Now...the question is...how do I prevent this in the future. — marcamillion, Mar 13 '13 at 01:18
Prevent it in the future with a unique index (from a migration): add_index :questions_tags, [:question_id,:tag_id], :unique => true. You can also add a uniqueness validation in your rails model, but the index in the database is the safest way and should be there whether or not you have the uniqueness validation in your model — John Naegle, Mar 13 '13 at 01:45

Harish Shetty · Answer 2 · 2013-03-13T01:05:02.897

Delete tag associations with bad tag reference

DELETE  FROM questions_tags
WHERE   NOT EXISTS ( SELECT  1 
                 FROM    tags
                 WHERE   tags.id = questions_tags.tag_id);

Delete tag associations with bad question reference

DELETE  FROM questions_tags
WHERE   NOT EXISTS ( SELECT  1 
                 FROM    questions
                 WHERE   questions.id = questions_tags.question_id);

Delete duplicate tag associations

DELETE  FROM questions_tags
USING   ( SELECT qt3.user_id, qt3.question_id, MIN(qt3.id) id
          FROM   questions_tags qt3
          GROUP BY qt3.user_id, qt3.question_id
        ) qt2
WHERE   questions_tags.user_id=qt2.user_id AND 
        questions_tags.question_id=qt2.question_id AND
        questions_tags.id != qt2.id

Note:

Please test the SQL's in your development environment before trying them on your production environment.

By the way...the first 2 execute fine, and doesn't return any records. But when I run the last one, it doesn't even seem to run. I tried it with and without a trailing `;`. What could be causing this? — marcamillion, Mar 12 '13 at 23:43
In fact, when I add a `;` to the end of the query, I get a syntax error - https://gist.github.com/marcamillion/fc81b053c6c5928230c3 What else can I try? — marcamillion, Mar 13 '13 at 00:10

How do I clean up my join_table and remove duplicate entries?

2 Answers2