
I'm working on a simple HTML scraper in Python 3.4, using peewee as the ORM (great ORM, btw!). My script takes a bunch of sites, extracts the necessary data and saves it to the database; however, every site is scraped in a detached process to improve performance, and the saved data should be unique. Duplicates can occur not only between sites but also within a single site, so I want to store each piece of data only once.

Example: Post and Category, a many-to-many relation. During scraping, the same category appears multiple times in different posts. The first time, I want to save that category to the database (create a new row). If the same category shows up in a different post, I want to bind that post to the already-created row in the db.
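
For reference, a rough sketch of what the models look like (names and fields here are illustrative, not my exact code):

from peewee import *

db = MySQLDatabase('scraper')  # connection details omitted

class BaseModel(Model):
    class Meta:
        database = db

class Post(BaseModel):
    url = CharField(unique=True)

class Category(BaseModel):
    name = CharField(unique=True)

class PostCategory(BaseModel):  # the many-to-many join table
    post = ForeignKeyField(Post)
    category = ForeignKeyField(Category)

    class Meta:
        # each (post, category) pair may appear only once
        indexes = ((('post', 'category'), True),)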

My question is: do I have to use atomic updates/inserts (insert one post, save, get_or_create categories, save, insert new rows into the many-to-many table, save), or can I somehow use a bulk insert? What is the fastest solution to this problem? Maybe some temporary tables shared between the processes, bulk-inserted at the end of the work? I'm using a MySQL db.

Thanks for your answers and your time.

Paweł Stysz

2 Answers


You can rely on the database to enforce unique constraints by adding unique=True to fields, or by defining multi-column unique indexes. Also check the peewee docs on get_or_create and bulk inserts.
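
For example, a sketch of how that might look with models like the ones in the question (post and post_categories here are illustrative):

# the unique constraint on Category.name makes get_or_create safe:
# it returns the existing row instead of inserting a duplicate
category, created = Category.get_or_create(name='python')

# bulk-insert the many-to-many rows in a single query
pairs = [{'post': post, 'category': category} for category in post_categories]
PostCategory.insert_many(pairs).execute()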

coleifer
  • Ok, after a couple of tests I've ended up with a normal save on the post and category tables + a bulk insert on the post_to_category table. Didn't expect help from the peewee guru, thx m8 ;) – Paweł Stysz Jan 04 '15 at 19:37
  • According to the FAQ, you shouldn't just put a bunch of links in an answer. – maugch Aug 13 '16 at 16:00
  • Fail on my part. Multiple processes with sqlite must be handled differently than postgres or MySQL, so that's one consideration. The other would be to use the `atomic` context manager to wrap inserts in transactions, rolling back and/or retrying on failure (see the sketch after these comments). – coleifer Aug 15 '16 at 01:31
  • I'm still not following the answer. If there are multiple records to insert and I need to make sure each record doesn't already exist, which is the best method? As far as I can tell, the only option is to fetch the unique keys that don't match from the database table and skip those records while inserting. Is there a better way? – kewal Oct 24 '16 at 19:30
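
A minimal sketch of the atomic-with-retry idea from the comment above, assuming the models and db handle from the question:

from peewee import IntegrityError

def get_or_create_category(name):
    try:
        # run the insert in its own transaction so a failure rolls back cleanly
        with db.atomic():
            return Category.create(name=name)
    except IntegrityError:
        # another process inserted this name first; fetch the existing row
        return Category.get(Category.name == name)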

Looked for this myself for a while, but found it!

You can use the on_conflict_replace() or on_conflict_ignore() methods to define the behaviour for when a record already exists in a table that has a uniqueness constraint.

PriceData.insert_many(values).on_conflict_replace().execute()

or

PriceData.insert_many(values).on_conflict_ignore().execute()

More info under "Upsert" in the peewee docs.
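
Applied to this question's join table, on MySQL on_conflict_ignore() is executed as INSERT IGNORE, so rows that would violate the unique (post, category) index are silently skipped. A sketch, assuming the PostCategory model from the question (scraped_links is illustrative):

# deduplicated bulk insert of the many-to-many pairs
pairs = [{'post': p, 'category': c} for p, c in scraped_links]
PostCategory.insert_many(pairs).on_conflict_ignore().execute()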