7

I have Post and User models. Every post belongs to a user. However, during database import, some erroneous user_ids have been entered on some posts. What would be the query to get posts whose user_ids don't refer to any user? Thanks.

Sanjay Singh
  • 367
  • 1
  • 11

7 Answers7

18

As @user2553863 already mentioned, Rails 5 added support for left_outer_joins, meaning you can now do this in an efficient manner and without writing any SQL like this:

Post.left_outer_joins(:user).where(users: {id: nil}).delete_all

This will find any orphaned posts (those without users) and delete them. Here, user is the association name and users is the name of the joined table. You do not have to fire an additional SELECT to query all the user ids, which might break when you have many users.

amoebe
  • 4,857
  • 5
  • 37
  • 42
7

Sanjay,

all the solutions proposed works fine for small tables, but depending on the size of the tables involved, and the amount of memory and processing power available, for performance reasons, you might want to use a LEFT OUTER JOIN, like this:

Post.joins("LEFT OUTER JOIN users ON posts.user_id = user.id")
    .where("user.id IS NULL")

In Rails 5, there is support for LEFT OUTER JOIN in ActiveRecord.

Regards

user2553863
  • 682
  • 1
  • 8
  • 17
7

RAILS 6.1+

You can make use of the 'missing' method to fetch orphaned records. eg

Class User
end

Class Post
  belongs_to :user
end

Here is when missing method is used

Post.where.missing(:user)

This will fetch the all Post records which have a user_id but the corresponding user being deleted.

JonesGladston
  • 295
  • 1
  • 3
  • 11
4

I would do the following, which will result in one SELECT and one DELETE statement (so 2 queries in total)

Post.where('`posts`.`user_id` NOT IN (?)', User.pluck(:id)).delete_all
Manuel van Rijn
  • 10,170
  • 1
  • 29
  • 52
  • Wouldn't it be TWO selects? One for pluck, and one for the "where" statement? – user2553863 Mar 31 '17 at 12:26
  • Found an interesting caveat to this one- this works only if there exists some non-orphan records. If they are ALL orphans, ActiveRecord ends up doing a query with the clause `posts.user_id NOT IN (NULL)`, which actually returns nothing even if records exist. The join recommendation @user2553863 states below has been the most reliable. – abegosum Sep 10 '19 at 21:34
4

Thanks to both of you. My solution was similar to Manuel's

all_user_ids = User.all.pluck(:id)
unwanted_posts = Post.where.not(:user_id => all_user_ids)

Then I could destroy all unwanted_posts. Of course, the other solution would work as well.

Sanjay Singh
  • 367
  • 1
  • 11
3

Note: answers below are valid for Rails 5.0

Many of these answers work fine for a few records or on small tables, but don't scale well at all to having a large amount of orphaned records or when dealing with large tables.

For instance, dealing with two larger tables where ModelOne has 707,891 orphaned records:

irb(main):032:0> ModelOne.count
=> 2,265,216
irb(main):033:0> ModelTwo.count
=> 5,109,186

Trying to execute a query with NOT IN will fail as it's too large:

irb(main):029:0> ModelOne.where.not(model_two_id: ModelTwo.pluck(:id))
ActiveRecord::StatementInvalid (Mysql2::Error: MySQL server has gone away: SELECT `model_ones`.* FROM `model_ones` WHERE (`model_ones`.`model_two_id` NOT IN (12068663, 12076647, 12076648, 12082392, 12082393, 12082394, <repeat for the other 5 million ModelTwo records>))

Additionally, trying to call .delete_all on a query that uses left_outer_joins doesn't quite work as expected.

This is the SQL rails generates for ModelOne.left_outer_joins(:model_two).where(model_twos: {id: nil}):

SELECT `model_ones`.* FROM `model_ones`
LEFT OUTER JOIN `model_twos` ON `model_twos`.`id` = `model_ones`.`model_two_id`
WHERE `model_twos`.`id` IS NULL

But chaining .delete_all to the end (ModelOne.left_outer_joins(:model_two).where(model_twos: {id: nil}).delete_all) generates:

DELETE FROM `model_ones` WHERE `model_twos`.`id` IS NULL

Which will throw an error.

The most performant way I've found to delete orphaned records comes from this answer and uses SQL EXISTS and a nested query to efficiently find and delete orphaned records.

ModelOne.where.not(
  ModelTwo.where('model_twos.id = model_ones.model_two_id').exists
)

Which generates:

SELECT `model_ones`.* FROM `model_ones`
WHERE (
  NOT (
    EXISTS (
      SELECT `model_twos`.* FROM `model_twos` WHERE (model_twos.id = model_ones.model_two_id)
    )
  )
)

Using this query to load the 707,891 orphaned records takes just under a minute:

irb(main):040:0> Benchmark.measure { ModelOne.where.not(ModelTwo.where('model_twos.id = model_ones.model_two_id').exists).load }
=> #<Benchmark::Tms:0x0000563cfa227580 @label="", @real=59.61208474007435, @cstime=0.0, @cutime=0.0, @stime=0.23068100000000014, @utime=49.025859000000025, @total=49.25654000000002>

Chaining .delete_all to this query will work as expected and delete all the orphaned records

ModelOne.where.not(ModelTwo.where('model_twos.id = model_ones.model_two_id').exists).delete_all

generates the SQL:

  DELETE FROM `model_ones` WHERE (NOT (EXISTS (SELECT `model_twos`.* FROM `model_twos` WHERE (model_twos.id = model_ones.model_two_id))))
1

I don't think you can do this with straight AR, but it is fairly easy to fix with a little Ruby:

Post.find_each { |p| p.delete if p.user.nil? }

EDIT: forgot .all doesn't return an ActiveRecord::Relation

tagCincy
  • 1,589
  • 10
  • 20