
This is an incredibly slow loop (~1.5 it/s, measured with tqdm).

For context, the objects are models from a Flask-SQLAlchemy-managed Postgres database, which is local - i.e. network transfer speed isn't the cause of the slowness.

for author in tqdm(authors):
    new_score = 0
    for book in author.maintitles:
        new_score = new_score + book.score
        author.score = new_score

Further clarity: there are ~500K books and ~50K authors. Each book can be written by several authors.
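The models look roughly like this (a simplified sketch - Author, Book, score, maintitles and mainauthors are the real names; the association table and remaining columns are illustrative):

# Sketch of the presumed Flask-SQLAlchemy models; only the names used
# elsewhere in this post are known to be real
book_author = db.Table(
    'book_author',
    db.Column('book_id', db.Integer, db.ForeignKey('book.id')),
    db.Column('author_id', db.Integer, db.ForeignKey('author.id')),
)

class Author(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    score = db.Column(db.Integer, default=0)
    maintitles = db.relationship('Book', secondary=book_author,
                                 back_populates='mainauthors')

class Book(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    score = db.Column(db.Integer, default=0)
    mainauthors = db.relationship('Author', secondary=book_author,
                                  back_populates='maintitles')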

I'm not returning a list, but I'm sure this could be improved - can a list comprehension actually improve this?

Something like...

[[(new_score = new_score + book.score,
   author.score = new_score) for book in author.maintitles] for author in tqdm(authors)]
lys
  • List comprehensions don't make things faster. If you don't want to build a list, don't use a list comprehension. – user2357112 Oct 03 '20 at 05:27
  • Whatever is making this loop so astonishingly slow, we can't tell from just the posted code, and we can't optimize it. – user2357112 Oct 03 '20 at 05:30
  • Thanks, added more context in the main question - hope that provides more clarity :) – lys Oct 03 '20 at 05:38
  • So authors *do* just have a ton of books? This setup sounds artificial - is this some sort of challenge problem? If so, you may be taking the wrong approach to some underlying problem you were supposed to solve with a smarter algorithm. – user2357112 Oct 03 '20 at 06:12
  • The mention of a database suggests it's not a challenge problem, but even the most prolific authors in history barely have 1000 books. – user2357112 Oct 03 '20 at 06:14
  • Not a challenge problem at all - I inherited someone else's codebase. The approximate number of books is listed in the original post. – lys Oct 03 '20 at 06:15
  • (The mention of a database also suggests that it might be better to do this aggregation in the database instead of shipping everything over to Python and doing everything in Python - see the sketch after these comments.) – user2357112 Oct 03 '20 at 06:16
  • `author.maintitles` looks like it might be a `relationship` property, and so possibly emits a `SELECT` for each iteration. If so, have you checked how that query performs? – Ilja Everilä Oct 05 '20 at 19:13
  • Thanks @IljaEverilä yes, it is. I might experiment with `app.config['SQLALCHEMY_ECHO'] = True` and see what's actually happening there - thanks :) – lys Oct 06 '20 at 02:52
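To make the database-side aggregation suggested in the comments concrete, here is a sketch. Author, Book, score and maintitles come from the post; everything else, including the choice of bulk_update_mappings for the write-back, is an assumption about how one might apply it here:

from sqlalchemy import func

# Compute every author's total in one aggregated query instead of
# looping over lazily loaded ORM objects in Python
totals = (
    db.session.query(Author.id, func.coalesce(func.sum(Book.score), 0))
    .join(Author.maintitles)
    .group_by(Author.id)
    .all()
)

# Write the scores back in bulk rather than one UPDATE per author
db.session.bulk_update_mappings(
    Author, [{'id': a_id, 'score': total} for a_id, total in totals]
)
db.session.commit()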

2 Answers


No, don't use a list comprehension for side effects. Even if you were going to use the list, comprehensions are only slightly faster than for-loops anyway.

However, you can improve your code with a generator expression, which is similar.

Step 1: Assign to author.score once at the end instead of on every iteration, and use augmented assignment.

for author in tqdm(authors):
    new_score = 0
    for book in author.maintitles:
        new_score += book.score
    author.score = new_score

Step 2: Now it's obvious that new_score is a simple summation, so use sum instead.

for author in tqdm(authors):
    author.score = sum(book.score for book in author.maintitles)

Sidenote: You could also write this with a list comprehension, but that would build the whole list first and then sum it, while the generator expression sums as it goes and avoids materializing the intermediate list.

sum([book.score for book in author.maintitles])
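If you want to see the difference for yourself, here is a rough timeit sketch (the list of integers is stand-in data; real numbers will vary with your objects):

import timeit

data = list(range(500_000))  # stand-in for ~500K book scores

# Generator expression: sums as it goes, no intermediate list
print(timeit.timeit(lambda: sum(x for x in data), number=10))

# List comprehension: builds the full list first, then sums it
print(timeit.timeit(lambda: sum([x for x in data]), number=10))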
wjandrea
  • This isn't going to make things faster either, though. 1.5 iterations per second is astonishingly slow for something as simple as the posted code - maybe there's a rate-limited API or quadratic string concatenation or something involved. – user2357112 Oct 03 '20 at 05:29
  • @user2357112 True. Though we don't know the type of any of the objects involved, so it's hard to say. In any case, this will offer at least a marginal speed boost, and it's definitely much cleaner as a side benefit. – wjandrea Oct 03 '20 at 05:30
  • (It sounds like it's 1.5 iterations per second for the outer loop, not the inner loop, but even then, unless authors have millions of books each, something weird we can't see is probably going on.) – user2357112 Oct 03 '20 at 05:32
  • Thanks, added more context in the main question - hope that provides more clarity :) – lys Oct 03 '20 at 05:39
  • @lysdexic I added more details/clarity about why not to use a list comp. If you want help with optimizing your code, ask on [codereview.se], but make sure to read their [how to ask](https://codereview.stackexchange.com/help/how-to-ask) page first. – wjandrea Oct 03 '20 at 05:53
  • thank you @wjandrea - your refactor is certainly neater too, appreciate the tip – lys Oct 03 '20 at 05:57
  • @wjandrea - sadly your refactor didn't improve the performance even slightly: | 36/44982 [00:23<8:22:06, 1.49it/s] - however I'm still marking your answer as accepted, since it's definitive that list comprehension won't help – lys Oct 03 '20 at 07:17

The refactor offered above only demonstrated that a list comprehension wasn't the solution. I've since discovered the root cause of the problem, so I'm adding the following as an answer.

The code snippet above is part of an operation on a list built from a returned query - as stated earlier, iterating through the deduplicated authors list (~50K authors) in the final step was a 15-hour process at 1.5 it/s:

    # Make the popular books query
    popular_books = db.session.query(Book).filter(Book.score > 0).all()
    
    # Make a list of all authors for each book returned in the query
    authors = []
    for book in popular_books:
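        # NB: rebuilding the list with + on every pass is the quadratic
        # concatenation pattern suspected in the comments - O(n^2) overall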
        authors = authors + book.mainauthors
    
    # Remove duplicates using set()
    authors = list(set(authors))

    for author in tqdm(authors):
        author.score = sum(book.score for book in author.maintitles)
    db.session.commit()

Simply by adjusting the query to return the authors through a joinedload and handling the deduplication with .distinct(), we not only simplify everything above to a couple of lines, but the operation also completes in seconds once the query returns.

    for popular_author in db.session.query(Author).join(Author.maintitles) \
            .options(db.joinedload(Author.maintitles)).filter(Book.score > 0).distinct().all():
        popular_author.score = sum(book.score for book in popular_author.maintitles)

However, I'm still not entirely sure how this method can be several orders of magnitude faster than the older version. Both iterate over lists of authors in the same way and perform the same simple summing operation.

For reference, committing the session after this process takes ~2 hours, where the previous implementation was much faster. Still a dramatic (7.5x) improvement in total. My guess is that by using a more optimised query from the beginning, all of the ORM objects returned are loaded together in RAM and are faster to operate on. Introducing Python list methods on a query result seems to break up / fragment the ORM objects in memory.
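Following Ilja Everilä's comment on the question, the likely mechanism is that author.maintitles is a lazily loaded relationship, so the old loop emitted one SELECT per author (~50K round trips to the database), while the joinedload version fetches everything in a single query. A sketch of how to verify this, assuming the Flask-SQLAlchemy setup described above:

    app.config['SQLALCHEMY_ECHO'] = True  # log every SQL statement

    for author in tqdm(authors):
        # With a lazy relationship, this line triggers a separate SELECT
        # for each author; with joinedload, no extra queries appear
        author.score = sum(book.score for book in author.maintitles)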

lys