I have a lot of blob data stored externally (S3) and recalculate results regularly. I am running DataJoint Python 0.13.8. I have read the documentation at https://docs.datajoint.org/python/admin/5-blob-config.html#cleanup

My question is about best practices for cleanup of "unused" (unlinked) blobs in the object storage.

I am a bit confused. Given an external store called octo_store, which deletes do I need to execute to remove all unused blob data? I am mainly unsure about what schema.external['octo_store'].delete() actually does, and I find the documentation for it confusing: does the delete call (with delete_external_files=True) delete all the files or only the unlinked ones?

Horst

1 Answer

Let's say you have a table Fit that uses an external blob stored in the store called "octo-store", declared as

@schema
class Fit(dj.Computed):
    definition = """
    -> Recording
    -> Model
    ---
    fit : blob@octo-store
    """

octo-store can be configured as an S3 bucket and folder, for example.
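As a rough sketch of what such a configuration might look like in DataJoint 0.13, the store is one entry in dj.config['stores']. The endpoint, bucket, location, and credentials below are placeholders I made up for illustration, not values from the question.

```python
# Hypothetical settings for the "octo-store" store. Every value below is a
# placeholder and must be replaced with your own S3 details.
octo_store_config = {
    "protocol": "s3",
    "endpoint": "s3.amazonaws.com",
    "bucket": "my-bucket",
    "location": "my-project/blobs",
    "access_key": "<ACCESS_KEY>",
    "secret_key": "<SECRET_KEY>",
}

# Store names map to their settings; the key must match the name used in
# the table definition (blob@octo-store).
stores = {"octo-store": octo_store_config}

# With DataJoint installed, the mapping is registered like this:
#   import datajoint as dj
#   dj.config["stores"] = stores
```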

DataJoint will create a hidden table for tracking externally stored blobs. You can access it as schema.external['octo-store'].

When you insert a record into Fit, it is tracked in this external table by the hash of its contents. Fit makes a foreign key reference into the external table, so you cannot delete any entries from the external table that are still in use.
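To see which tables hold such foreign keys, something along these lines may help. It is a minimal sketch that assumes the external table object exposes a references attribute yielding dicts with the keys 'referencing_table' and 'column_name', as I believe it does in DataJoint 0.13. The helper name list_referencing_columns is made up for this example.

```python
def list_referencing_columns(external_table):
    """Return (table, column) pairs that point into an external tracking table.

    `external_table` is expected to be schema.external['octo-store'].
    Assumption: its `references` attribute yields dicts with the keys
    'referencing_table' and 'column_name'.
    """
    return [
        (ref["referencing_table"], ref["column_name"])
        for ref in external_table.references
    ]
```

Calling list_referencing_columns(schema.external['octo-store']) should then include an entry for the fit column of Fit.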

The following command

Fit.delete()

will remove the references from Fit, but not from the external tracking table or the remote storage. This gives you high performance and data integrity at the cost of leaving the unused external data around, at least temporarily.

This means that every once in a while, you need to remove the unused entries in the external table and in the external storage. Since race conditions are not handled as precisely here as in a pure database transaction, it's best to do this in off times when the data are not actively manipulated.

The command

schema.external['octo-store'].delete(delete_external_files=True)

will remove the unused entries in the external table and the corresponding files in the external storage. Entries that are still referenced by Fit or any other table are left untouched. This is the recommended way of clearing the data if you know that the store is used only by this database, which should be the case.
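If you want to check how much would be removed before committing to it, a small wrapper with a dry-run switch is one option. This is only a sketch: cleanup_store and its dry_run flag are names invented here, and it assumes that unused() returns a query whose length is the number of unreferenced entries, as in DataJoint 0.13.

```python
def cleanup_store(external_table, dry_run=True):
    """Report, and optionally remove, the unused entries of one external store.

    `external_table` is expected to be schema.external['octo-store'].
    Returns (number of unused entries found, result of delete or None).
    """
    # Assumption: unused() gives the tracked entries no table refers to.
    n_unused = len(external_table.unused())
    if dry_run or n_unused == 0:
        return n_unused, None
    # Removes the unused tracking rows and the matching files in the store.
    result = external_table.delete(delete_external_files=True)
    return n_unused, result
```

Run it once with the default dry_run=True to see the count, then again with dry_run=False at a quiet time when nothing else is inserting or deleting.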

DataJoint also gives you the option of not deleting the external files:

schema.external['octo-store'].delete(delete_external_files=False)

This will leave files in the remote storage that are not tracked by the database. It will become your responsibility to remove them when you choose.

  • Thanks, very helpful. I would recommend adopting a different command name for clearing unused / unlinked files. `schema.external['octo-store'].delete()` 100% implies to the user that the whole schema.external table is deleted, and not only the unused objects that are not linked in the database. This is very confusing! Also, how can I find which entries are actually hashed in schema.external['octo-store'] (that is, how do I trace them back to entries in the schema)? – Horst May 08 '23 at 18:42