7

I have a large number of items (1M+) that I want to delete from a database. I fork a background job to take care of that, so the user won't have to wait for it to finish before carrying on with whatever he/she was doing. The problem is that the app becomes unresponsive while the items are being deleted, so I thought I would process the items chunk by chunk and sleep for a couple of seconds between chunks.

Here is the code that handles the delete:

// laravel job class
// ...
public function handle()
{
    $posts_archive = PostArchive::find(1); // just for the purpose of testing ;)
    Post::where('arch_id', $posts_archive->id)->chunk(1000, function ($posts) {
        //go through the collection and delete every post.
        foreach($posts as $post) {
            $post->delete();
        }
        // throttle
        sleep(2);
    });
}

Expected result: the posts are chunked, each chunk is processed, the job idles for 2 seconds, and this repeats until all the items are deleted.

Actual result: a seemingly random number of items is deleted once, then the process ends. No errors, no indicators, no clue.

Is there a better way to implement this?

xanadev
  • I don't understand why you did it this way. It's faster to run one delete command than to run 1,000 commands. – Felippe Duarte Sep 24 '18 at 16:13
  • Yeah, you don't typically need to chunk a deletion, unless you're looking to do some sort of additional processing on each item as you do so. – ceejayoz Sep 24 '18 at 16:14
  • Deleting a large number of items in one single operation makes the UI unresponsive. – xanadev Sep 24 '18 at 16:15
  • Deleting them in chunks will take just as long (longer!), and the `sleep(2)` every thousand of a million items is going to mean the UI is unresponsive for ~30 minutes *literally doing nothing* on top of that. – ceejayoz Sep 24 '18 at 16:16
  • Actually, the `sleep(2)` is intentionally doing nothing, which means other processes can use the CPU. The UI is unresponsive because the database engine is busy processing a large number of items in one operation, so if you try to access another part of the app that uses the database, that part of the app stops responding. – xanadev Sep 24 '18 at 16:19
  • Is there some reason you need to call delete on each model? – Devon Bessemer Sep 24 '18 at 16:20
  • I can't call delete on the collection, can I? – xanadev Sep 24 '18 at 16:20
  • Sure you can. `Post::where('arch_id', $posts_archive->id)->delete()`. – ceejayoz Sep 24 '18 at 16:21
  • No, you'd use the delete method on the query, not the collection – Devon Bessemer Sep 24 '18 at 16:21
  • I hope my question is clear. The problem here is not the running time of the job itself; it's about optimizing the deletion of 1M+ items in the database. That operation is time-consuming, so I thought I would throttle it, deleting only a bunch of records at a time and sleeping in between. – xanadev Sep 24 '18 at 16:22
  • The way you're currently doing it does a `SELECT *`, fetches the full contents, then individually runs a `DELETE FROM table WHERE id=123` on every row. You're issuing at least a million (plus another thousand or so) queries to do what a single `Post::where('arch_id', $posts_archive->id)->delete()` (`DELETE FROM foo WHERE arch_id=bar`) would do. – ceejayoz Sep 24 '18 at 16:24
  • Your "optimization" would increase load substantially by calling 1,000,000 queries instead of 1, but spread it out over a very long period of time. I wouldn't really call that optimizing. – Devon Bessemer Sep 24 '18 at 16:24
  • @ceejayoz Well, actually that was the first thing I did; the problem is that it takes a while to finish, which makes the app unresponsive. – xanadev Sep 24 '18 at 16:24
  • For the record, I already used `Post::where('arch_id', $posts_archive->id)->delete()`; that doesn't solve the problem. – xanadev Sep 24 '18 at 16:27
  • I've looked for a way to optimize deletion of a large number of records, but didn't find anything for Laravel. – xanadev Sep 24 '18 at 16:29

4 Answers

11

There is nothing Laravel-specific about the way you'd handle this. It sounds like your database server needs review or optimization if a delete query in a job is freezing the rest of the UI.

Retrieving each model and running a delete query individually definitely isn't a good way to optimize this, as you'd be executing millions of queries. If you'd rather limit the load per second in your application than optimize your database server to handle a single large query, you could use a loop with a delete limit:

do {
    // delete up to 1,000 matching rows per query, then pause to ease the load on the database
    $deleted = Post::where('arch_id', $posts_archive->id)->limit(1000)->delete();
    sleep(2);
} while ($deleted > 0); // repeat until no matching rows remain

Devon Bessemer
  • That's what I was looking for. Maybe I'll have to optimize the database server to handle these kinds of queries. Thank you. – xanadev Sep 24 '18 at 16:41

10

The reason your actual outcome differs from the expected outcome has to do with how Laravel chunks your dataset.

Laravel paginates through your dataset one page at a time and passes the Collection of Post models to your callback.

Since you're deleting records from the set as you go, Laravel effectively skips a page of data on each iteration, so you end up missing roughly half the data that was in the original query.

Take the following scenario – there are 24 records that you wish to delete in chunks of 10:

Expected

+-------------+--------------------+---------------------------+
|  Iteration  |   Eloquent query   | Rows returned to callback |
+-------------+--------------------+---------------------------+
| Iteration 1 | OFFSET 0 LIMIT 10  |                        10 |
| Iteration 2 | OFFSET 10 LIMIT 10 |                        10 |
| Iteration 3 | OFFSET 20 LIMIT 10 |                         4 |
+-------------+--------------------+---------------------------+

Actual

+-------------+--------------------+----------------------------+
|  Iteration  |   Eloquent query   | Rows returned to callback  |
+-------------+--------------------+----------------------------+
| Iteration 1 | OFFSET 0 LIMIT 10  |                         10 | (« but these are deleted)
| Iteration 2 | OFFSET 10 LIMIT 10 |                          4 |
| Iteration 3 | NONE               |                       NONE |
+-------------+--------------------+----------------------------+

After the 1st iteration, 10 records had been deleted and only 14 were left, so when Laravel fetched page 2 (OFFSET 10), it only found 4 records.

The result is that only 14 records out of 24 were deleted. This feels a bit random, but it makes sense in terms of how Laravel processes the data.

Another solution to the problem would be to use a cursor to process your query. This steps through your DB result set one record at a time, which makes better use of memory.

E.g.

// laravel job class
// ...
public function handle()
{
    $posts_archive = PostArchive::find(1); // just for the purpose of testing ;)
    $query = Post::where('arch_id', $posts_archive->id);

    // cursor() uses a PHP generator to hydrate one model at a time
    foreach ($query->cursor() as $post) {
        $post->delete();
    }
}

NB: The other solutions here are better if you only want to delete the records in the DB. If you have any other processing that needs to occur, then using a cursor would be a better option.
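
If you also want the throttling behaviour from the question while using a cursor, one possible sketch (illustrative only, reusing the `$query` from the snippet above) is to pause after every batch of deletions:

$count = 0;
foreach ($query->cursor() as $post) {
    $post->delete();

    // throttle: pause for 2 seconds after every 1,000 deletions
    if (++$count % 1000 === 0) {
        sleep(2);
    }
}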

Kelvin Jones
  • That is a great explanation of how chunk works :) Well, I wasn't really planning on processing individual records; all I needed to do was delete them in a way that doesn't impact app responsiveness. Thanks anyway. – xanadev Oct 10 '18 at 12:02
  • Any comment on whether `chunkById()` would also solve this problem? https://laravel.com/docs/master/queries#chunking-results – BlueC Dec 11 '18 at 11:04
  • @BlueC Yes, that is exactly what `chunkById` is intended for. – andrewtweber Feb 08 '20 at 06:12
  • Thank you for the explanation. I totally forgot about the offset and limit it uses behind the scenes. – Anuj Shrestha Apr 17 '20 at 14:43

4

As Kelvin Jones points out, the reason the random number of items is being deleted is that you're deleting records as you page through them.

`chunk` simply uses offset & limit to "paginate" through your table. But if you delete 100 records from page 1 (IDs 1-100), then go to page 2, you're actually now skipping IDs 101-200 and jumping to 201-300.

`chunkById` is a way around this:

Post::where('arch_id', $posts_archive->id)->chunkById(1000, function ($posts) {
    //go through the collection and delete every post.
    foreach($posts as $post) {
        $post->delete();
    }
});

Literally just replace the method name. Now, instead of using offset & limit to paginate, it will look at the maximum primary key (100) from the first page, then the next page will query where ID > 100. So page 2 is now correctly giving you IDs 101-200 instead of 201-300.
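
Under the hood, `chunkById` is essentially doing keyset pagination. As a purely illustrative hand-rolled sketch (not Laravel's actual source), assuming an auto-incrementing `id` primary key, it behaves roughly like this:

$lastId = 0;
do {
    // fetch the next chunk of up to 1,000 posts with an ID greater than the last one seen
    $posts = Post::where('arch_id', $posts_archive->id)
        ->where('id', '>', $lastId)
        ->orderBy('id')
        ->limit(1000)
        ->get();

    foreach ($posts as $post) {
        $post->delete();
    }

    // remember the highest ID processed; deleted rows no longer shift the "pages"
    $lastId = $posts->max('id');
} while ($posts->count() === 1000);

Because each chunk is selected by ID rather than by offset, deleting the rows in one chunk doesn't change which rows the next chunk returns.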

andrewtweber
  • But what if a new row is added during the process? Will the max ID still be the one from when the query first ran, or will it change to the latest added row? – Anuj Shrestha Apr 18 '20 at 07:28
  • @AnujShrestha I'm pretty sure it checks the max ID each time the next "chunk" is fetched, so it will include the new row. You could try it and let me know if that's wrong. – andrewtweber Apr 19 '20 at 23:59
  • It looks like behind the scenes it uses `->where($column, '>', $lastId)->orderBy($column, 'asc')`, so it will include rows added since the query started, because it checks the count in a loop. I had a problem with a never-ending scheduled artisan command due to new rows being added while processing with the chunk feature; I had to get the last ID myself before running the main SQL query. – Anuj Shrestha Apr 20 '20 at 08:20
  • @AnujShrestha I thought so. I think all of the Laravel methods (`chunk`, `chunkById`, and `cursor`) would have the same issue. You would need to get the max ID first and then add a `where` filter. It's kind of a different topic from this question, though; I think you should post a separate question about it, self-answer with what you discovered, and see if anybody has other suggestions! – andrewtweber Apr 20 '20 at 14:58
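
A minimal sketch of the workaround discussed in the comments above (illustrative only, assuming an auto-incrementing `id` primary key): snapshot the maximum matching ID before chunking, so rows inserted after the job starts are left alone.

// cap the run at the highest ID that existed when the job started
$maxId = Post::where('arch_id', $posts_archive->id)->max('id');

Post::where('arch_id', $posts_archive->id)
    ->where('id', '<=', $maxId) // rows created after this point are ignored
    ->chunkById(1000, function ($posts) {
        foreach ($posts as $post) {
            $post->delete();
        }
    });
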
0

If I understand correctly, the issue is that deleting a large number of entries takes too many resources, and doing it one post at a time would take too long as well.

Try getting the min and the max of `post.id`, then delete in fixed-size ID ranges between them, like:

$minId = Post::where('arch_id', $posts_archive->id)->min('id');
$maxId = Post::where('arch_id', $posts_archive->id)->max('id');

for ($i = $minId; $i <= $maxId; $i += 1000) {
    // delete one window of 1,000 IDs at a time, then pause
    Post::where('arch_id', $posts_archive->id)->whereBetween('id', [$i, $i + 999])->delete();
    sleep(2);
}

Customize the chunk size and the sleep period to suit your server resources.

N69S