
I am trying to backup some data from a table with > 50M rows. The table looks like this

# 
entry_id             : int unsigned auto_increment  # unique entry id
---
time=CURRENT_TIMESTAMP : timestamp                  # current timestamp
room                 : varchar(255)                 # Room
id                   : varchar(255)                 # Sensor ID / NAME
value                : double                       # sensor value

My plan is to

  1. Fetch a limited amount of KEYS (i.e. fetch('KEY', limit=some_large_number))
  2. Retrieve table entries for those KEYS
  3. Save data chunk on disk
  4. Delete table entries belonging to KEYS
  5. Start over ...

Just fetching, say, 1_000_000 entries is reasonably fast (a few seconds). However, once I go through retrieving primary keys -> fetching the data for those keys -> deleting those entries, I get basically stuck at the fetch step (>20 minutes for ~100_000 entries, or possibly a timeout).
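
In code, the loop I have in mind looks roughly like this (just a sketch; `table` is the table above and `save_to_disk` stands in for whatever actually writes the chunk to disk):

chunk_size = 1_000_000
while len(table):
    keys = table.fetch('KEY', limit=chunk_size)   # 1. fetch a limited number of primary keys
    block = (table & keys).fetch()                # 2. retrieve the entries for those keys
    save_to_disk(block)                           # 3. save the chunk to disk (placeholder)
    (table & keys).delete_quick()                 # 4. delete the entries belonging to those keys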

What is the most elegant / time-efficient way to perform this task using DataJoint logic?

(DataJoint Python version 0.13.2)

Horst

1 Answer


If your data are not changing, then you can use the limit and offset keywords to stride through the table, fetching in chunks without deleting or restricting. There is no need to retrieve the primary key first, to restrict the fetch, or to delete the block.

step = 1_000_000
for chunk in range((len(table) + step - 1) // step):
    block = table.fetch(limit=step, offset=step * chunk)
    ... # save block

If this is a simple backup, then deletes are not necessary. If you need to delete the data, you can delete it all at once at the end.
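
For example, once all chunks are on disk, something along these lines (a sketch) clears the table in a single operation:

table.delete()        # cascading delete with a confirmation prompt
# or, to skip cascading and the confirmation prompt:
table.delete_quick()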

  • Thanks, Dimitri, I wasn't aware of the `offset` parameter. I was finally able to archive all the data; however, both (!) a `delete()` and a `delete_quick()` fail on this table with repeated `Lock Wait Timeout` errors (all insert/fetch commands stopped). This might be specific to our setup, but I am wondering if there is an easy solution (this table has no downstream tables). Is a drop() better in these cases? – Horst Jun 01 '21 at 19:23
  • deletes will time out if there is an ongoing transaction on the same range of data in the same table or in downstream tables. You can use `dj.kill()` in a separate Python session to view ongoing transactions to see what might be causing the timeouts. – Dimitri Yatsenko Jun 08 '21 at 14:55
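
For reference, a minimal way to follow that suggestion from a second Python session (assuming the usual connection credentials are configured):

import datajoint as dj

dj.kill()   # lists the server's current processes/queries and prompts for IDs to kill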