Since the refactor offered earlier only demonstrated that a list comprehension wasn't the solution, and I've since discovered the root cause of the problem, I'm adding the following as an answer.
The code snippet above is part of an operation on a list built from a returned query. As stated earlier, iterating through the deduplicated authors list (~50K authors) in the final operation was a 15 hour process at 1.5 it/s:
from tqdm import tqdm

# Make the popular books query
popular_books = \
    db.session.query(Book).filter(Book.score > 0).all()

# Make a list of all authors for each book returned in the query
authors = []
for book in popular_books:
    authors = authors + book.mainauthors

# Remove duplicates using set()
authors = list(set(authors))

# Score each author by summing the scores of their books
for author in tqdm(authors):
    author.score = sum(book.score for book in author.maintitles)

db.session.commit()
Simply by adjusting the query to return the authors through a joinedload, and handling the deduplication with .distinct(), we not only simplify everything above to a couple of lines, but the operation completes in seconds after the query returns.
for popular_author in db.session.query(Author).join(Book, Author.maintitles)\
        .options(db.joinedload(Author.maintitles))\
        .filter(Book.score > 0).distinct().all():
    popular_author.score = sum(book.score for book in popular_author.maintitles)
However, I'm still not entirely sure how this method can be several orders of magnitude faster than the older version: both iterate over a list of authors in the same way and perform the same simple summing operation.
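One way to check where the time actually goes would be to count the statements SQLAlchemy emits while each loop runs. This is only a rough sketch I haven't run against this data, assuming the Flask-SQLAlchemy db handle is in scope; the names query_count and count_queries are just for illustration:

from sqlalchemy import event

query_count = 0

# Count every statement sent to the database. If the slow version is
# lazy-loading author.maintitles, that should show up here as roughly one
# extra SELECT per author, while the joinedload version fetches everything
# in the initial query.
@event.listens_for(db.engine, "before_cursor_execute")
def count_queries(conn, cursor, statement, parameters, context, executemany):
    global query_count
    query_count += 1

# ... run either scoring loop here, then compare ...
print(f"statements issued: {query_count}")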
For reference, committing the session after this process takes ~2 hours, where the previous implementation's commit was much faster; it's still a dramatic (7.5x) improvement in total. My guess is that by using a more optimised query from the beginning, all of the ORM objects returned are loaded together in memory and are faster to operate on, whereas introducing Python list methods on a query result seems to break that up / fragment the ORM objects in memory.
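One thing that might cut the ~2 hour commit down, though I haven't benchmarked it, is to compute the scores in memory first and push them with a single bulk update instead of dirtying ~50K Author objects and letting the unit of work flush them one by one. A rough sketch, assuming Author's primary key column is named id:

popular_authors = db.session.query(Author).join(Book, Author.maintitles)\
    .options(db.joinedload(Author.maintitles))\
    .filter(Book.score > 0).distinct().all()

# Compute the new scores in memory, then push them with one bulk UPDATE pass
# rather than flushing ~50K individually dirtied Author objects
# (assumes Author.id is the primary key)
score_rows = [
    {"id": author.id, "score": sum(book.score for book in author.maintitles)}
    for author in popular_authors
]
db.session.bulk_update_mappings(Author, score_rows)
db.session.commit()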