
I'm writing a web application that will show player statistics for an online game, using Django 1.6 and PostgreSQL 9.1. I've created a script using django-extensions' "runscript" which fetches all players that are online and inserts/updates them into my table. This script is executed 4 times per hour using cron. It needs to either insert or update, since a player could already be in the table (and thus should be updated) or not yet be in it.

On to my problem: there are around 25,000 players online at peak hours and I'm not really sure how to optimize this (i.e. minimize disk I/O). This is what I have so far:

from datetime import date
from django.db import transaction
# Player is my model; FetchPlayers() fetches and parses the online player list

@transaction.commit_manually
def run():
    for fetched_player in FetchPlayers():
        defaults = {
            'level': fetched_player['level'],
            'server': fetched_player['server'],
            'last_seen': date.today(),
        }
        player, created = Player.objects.get_or_create(
            name=fetched_player['name'], defaults=defaults)
        if not created:
            player.level = fetched_player['level']
            if player.server != fetched_player['server']:
                pass  # I save this info to another table
            player.server = fetched_player['server']
            player.last_seen = date.today()
            player.save()
    transaction.commit()

Would it be (considerably) faster to bypass Django and access the database using psycopg2 or something similar? Would Django get confused when 'someone else' is modifying the database? Note that Django only reads the database; all writes are done by this script.

What about (using either Django or psycopg2) bulk fetching the players from the database, updating those that were found, and then inserting those that were not found? Is that even possible? The query would get huge: 'SELECT * FROM player WHERE name = name[0] OR name = name[1] OR ... OR name = name[25000]'. :)
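Roughly what I have in mind, as an untested sketch (it assumes name can be used as the lookup key and uses name__in instead of the OR chain):

names = [p['name'] for p in FetchPlayers()]
# one SELECT ... WHERE name IN (...) instead of a chain of ORs
existing = {p.name: p for p in Player.objects.filter(name__in=names)}
# players found in 'existing' would be updated, the rest inserted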

simon
  • Maybe you could make a note when inserting a player for the first time, so `fetched_player` would contain information on whether the player is new and should be inserted, or already exists and should be updated - this would make things much easier. You wouldn't have to fetch data from the database at all. – Ludwik Trammer Dec 29 '13 at 13:26
  • @LudwikTrammer I'm not sure I understand what you mean. FetchPlayers() just retrieves and parses an HTML document. Where should these 'notes' be stored? – simon Dec 29 '13 at 18:37
  • Is Django 'smart' enough to not execute an UPDATE query if none of an object's fields have been modified? What if I set a field to the same value it already has, and then call .save()? I tried to put checks (just like the player.server != fetched_player['server'] one) for all attributes and then only call .save() if any attribute really was changed. This seems to have minimized some I/O, see: http://pastebin.com/RfWbUGYJ and http://imgur.com/cKlsLJ4 Are there any more improvements I can make? – simon Dec 29 '13 at 18:45
  • Oh, ok. `FetchPlayers()` uses a data source that is out of your control - I didn't get this at first. – Ludwik Trammer Dec 29 '13 at 21:01
  • As to your question about improvements, it all depends - if inserts are relatively rare (which sounds reasonable) it may be better to: 1. Check if the object exists with `.exists()` 2. If it does, update it with `.update()` 3. If it doesn't, insert it with `.create()` This way you never have to fetch the whole object from your database (see the sketch after these comments). This **may** be faster, but the only way to make sure it is, is to try and see. – Ludwik Trammer Dec 29 '13 at 21:03
  • As for "Would Django be confused when 'someone else' is modifying the database" - no, it wouldn't. As long as you don't mess with db schema it should be all good. Besides, you can make raw database queries from Django, using [`raw()`](https://docs.djangoproject.com/en/dev/topics/db/sql/#django.db.models.Manager.raw). – Ludwik Trammer Dec 29 '13 at 21:08

1 Answer


If you want to reduce the number of queries, here is what I suggest: call update() directly for each player; it returns the number of rows updated. If the count is 0 (meaning the player is new), put the player's data in a temporary list. When you are done with all the fetched players, use bulk_create() to insert all the new players with a single SQL statement.

Assuming you have M+N players (M new, N updated), the number of queries is:

Before: (M+N) selects + M inserts + N updates

After: (M+N) updates + 1 bulk insert.
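A minimal sketch of this approach (untested; it reuses the model fields and FetchPlayers() from the question):

from datetime import date

def run():
    new_players = []
    for fetched_player in FetchPlayers():
        updated = Player.objects.filter(name=fetched_player['name']).update(
            level=fetched_player['level'],
            server=fetched_player['server'],
            last_seen=date.today(),
        )
        if updated == 0:
            # no existing row matched, so the player is new
            new_players.append(Player(
                name=fetched_player['name'],
                level=fetched_player['level'],
                server=fetched_player['server'],
                last_seen=date.today(),
            ))
    # one INSERT for all new players
    Player.objects.bulk_create(new_players)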

NeoWang