I am trying to optimize my code by reducing the number of calls to the database. I have the following models:
from sqlalchemy import Column, ForeignKey, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class PageCategory(Base):
    __tablename__ = 'page_category'
    category_id = Column(Text, ForeignKey('category.category_id'), primary_key=True)
    page_id = Column(Text, ForeignKey('page.page_id'), primary_key=True)

class Category(Base):
    __tablename__ = 'category'
    category_id = Column(Text, primary_key=True)
    name = Column(Text, nullable=False)
    pages = relationship('Page', secondary='page_category')

class Page(Base):
    __tablename__ = 'page'
    page_id = Column(Text, primary_key=True)
    name = Column(Text, nullable=False)
    categories = relationship('Category', secondary='page_category')
The code receives a stream of Facebook likes, and each one comes with a Page, a Category, and the obvious relation between them, a PageCategory. I need a way to bulk-create the Pages, Categories, and PageCategory relations that do not already exist. Since the code needs to be fast, I can't afford a round trip to the database for every object I create. Right now I create them one at a time:
page = Page(page_id='1', name='1')
category = Category(category_id='2', name='2')
session.add(page)
session.add(category)
session.commit()
# ...and the same for PageCategory
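To illustrate why simply batching the adds doesn't solve it: everything can go into a single commit, but one duplicate key aborts the whole transaction, not just the offending row. A minimal sketch (page1, page2, etc. are placeholder instances like the ones above):

from sqlalchemy.exc import IntegrityError

# One commit for the whole batch instead of one per object...
session.add_all([page1, page2, category1, category2, page_category1, page_category2])
try:
    session.commit()
except IntegrityError:
    # ...but a single duplicate key rolls back every object in the batch.
    session.rollback()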
Now, given that page_id and category_id are primary keys, the database will raise an IntegrityError if we try to insert duplicates, but catching that still costs a round trip per object. What I need is a utility that takes a list of objects, say session.bulk_save_objects([page1, page2, category1, category2, page_category1, page_category2]), creates the objects that would not raise an IntegrityError, and silently skips the ones that would.
That way I would avoid a database round trip for every triple of objects. I don't know whether this is possible or whether it exceeds SQLAlchemy's capabilities.
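The closest thing I can imagine is leaning on the database itself: PostgreSQL's INSERT ... ON CONFLICT DO NOTHING, which SQLAlchemy exposes through its postgresql dialect. A sketch of the kind of helper I have in mind, assuming a PostgreSQL backend (the function name and the row values are just for illustration):

from sqlalchemy.dialects.postgresql import insert

def bulk_insert_ignore_duplicates(session, model, rows):
    # One multi-row INSERT; rows whose primary key already exists
    # are skipped by the database instead of raising IntegrityError.
    stmt = insert(model).values(rows).on_conflict_do_nothing()
    session.execute(stmt)

# Parents first so the PageCategory foreign keys are satisfied.
bulk_insert_ignore_duplicates(session, Page, [{'page_id': '1', 'name': '1'}])
bulk_insert_ignore_duplicates(session, Category, [{'category_id': '2', 'name': '2'}])
bulk_insert_ignore_duplicates(session, PageCategory, [{'page_id': '1', 'category_id': '2'}])
session.commit()

Is something like this, or a backend-agnostic equivalent, the right approach here?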