
I am using MonetDB (MDB) for OLAP queries. I am storing source data in PostgreSQL (PGSQL) and syncing it with MonetDB in batches written in Python.

In PGSQL there is a wide table with an ID (non-unique) and a few columns. Every few seconds a Python script takes a batch of 10k records changed in PGSQL and uploads them to MDB.

The process of uploading to MDB is as follows (a minimal code sketch follows the list):

  1. Create a staging table in MDB.
  2. Use the COPY command to upload the 10k records into the staging table.
  3. DELETE from the destination table all IDs that are in the staging table.
  4. INSERT into the destination table all rows from the staging table.
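
To make the four steps concrete, here is a minimal sketch of one batch using pymonetdb. The connection parameters, the table names (`staging`, `destination`) and the CSV path are placeholders for illustration, not my actual setup; `COPY INTO ... FROM 'file'` also assumes the file is readable by the MonetDB server process.

```python
import pymonetdb

# Placeholder connection settings, not the real ones.
conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="olap")
cur = conn.cursor()

# 1. (Re)create the staging table with the same layout as the destination.
cur.execute("DROP TABLE IF EXISTS staging")
cur.execute("CREATE TABLE staging AS SELECT * FROM destination WITH NO DATA")

# 2. Bulk-load the 10k changed rows exported from PostgreSQL.
cur.execute("COPY INTO staging FROM '/tmp/batch.csv' USING DELIMITERS ',', E'\\n'")

# 3. Delete every destination row whose ID appears in the batch.
cur.execute("DELETE FROM destination WHERE id IN (SELECT id FROM staging)")

# 4. Re-insert the full set of rows for those IDs.
cur.execute("INSERT INTO destination SELECT * FROM staging")

conn.commit()
```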

So it is basically a DELETE & INSERT. I cannot use a MERGE statement because I do not have a PK: one ID can map to multiple rows in the destination, so I need to delete and fully re-insert all IDs in the current batch.

Now to the problem: the DELETE is slow.

When I do a DELETE on the destination table, deleting 10k records in a table of 25M rows, it takes ~500ms.

However! If I run a simple SELECT * FROM destination WHERE id = 1 and THEN do the DELETE, the DELETE takes 2ms.

I think it has something to do with the automatic creation of auxiliary indices, but this is where my knowledge ends.

I tried to work around this by "pre-heating": doing the lookup myself before the DELETE. It works, but only for the first DELETE after the pre-heat.

Once I do the DELETE and INSERT, the next DELETE is slow again. And pre-heating before every DELETE does not make sense, because the pre-heat itself takes 500ms.
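
For reference, this is roughly what the pre-heat workaround looks like in the sync script; the point lookup presumably forces MonetDB to (re)build its hash on `id`, which is why only the very next DELETE benefits. Names are placeholders as above.

```python
# "Pre-heat": a point lookup on the id column (presumably triggering the
# auxiliary hash build), after which the very next DELETE drops to ~2ms.
cur.execute("SELECT * FROM destination WHERE id = 1")   # the pre-heat itself takes ~500ms
cur.fetchall()

cur.execute("DELETE FROM destination WHERE id IN (SELECT id FROM staging)")  # fast, but only once
cur.execute("INSERT INTO destination SELECT * FROM staging")  # after this, the next batch is slow again
```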

Is there any way to sync data to MDB without invalidating the auxiliary indices that were already built? Or to make the DELETE faster without the pre-heat? Or should I use a different technique to sync data into MDB without a PK (does MERGE have the same problem)?

Thanks!

  • which MonetDB version are you using? – Jennie Dec 14 '20 at 09:35
  • Great question, sorry for missing it. We are running Oct2020 – Vojtěch Kurka Dec 14 '20 at 11:50
  • Thanks. This version already contains some recent improvements for HASH rebuild and handling deletes. Let me see if someone else has more ideas. PS> Which OS are you using? – Jennie Dec 14 '20 at 13:54
  • PPS> Can you also provide some HW info (e.g. #CPUs, RAM)? Is this on bare-metal HW or a cloud instance? Is the timing repeatable? What's the table schema and the exact DELETE query? Can you provide a TRACE of the slow delete? – Jennie Dec 14 '20 at 14:06
  • We are running it inside a Docker image based on python:3.7, with a Linux server as the host (Ubuntu 20.04 LTS) – Vojtěch Kurka Dec 14 '20 at 14:06
  • 8 cores, 32 GB RAM, a droplet on DigitalOcean. We see the same thing on all the instances (~6) we are running; they hold different volumes of data, but the same thing happens. – Vojtěch Kurka Dec 14 '20 at 14:08
  • I don't have much Docker experience myself, but other than that, it all looks very normal. It would be interesting to see where the time is spent, so a first step is to look at the DELETE query trace. – Jennie Dec 14 '20 at 14:11
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225943/discussion-between-vojtch-kurka-and-jennie). – Vojtěch Kurka Dec 14 '20 at 16:07

0 Answers