
Hey :) I have a question regarding locking or mutex behavior.

Scenarios:

Let's assume the following scenarios:

  1. The pipeline works with some local files that were placed by CI/CD jobs. After processing I'd like to remove the files. This results in a race condition if the job takes longer than the schedule interval.
  2. Two pipelines are very resource-heavy and therefore cannot be run in parallel.

Possible solutions:

  • Use some kind of mutex or lock in a running service, where pipelines register and are either allowed to execute or not.
  • Duplicate the data so that each workflow can use and clean up its own copy.
  • Create a local lock file and ensure that the file is removed on success.
  • Use a smaller schedule interval and check whether the lock exists; exit cleanly if the condition is not fulfilled.

I know this might not be a typical use case for Dagster, but I'd also like to use Dagster for other workflows such as cleanup tasks and triggering other pipelines.

Thanks

Thobial

3 Answers


I'm not familiar with Dagster, but a mechanism I've used successfully in other contexts is to exploit the fact that on Unix-like systems rename or mv is an atomic operation. For your first requirement of post-run cleanup:

  1. New files are dropped into an input directory. A set of input files could be segregated in a directory of their own.

  2. When a pipeline process starts, its first operation is to select a file (or directory) from the input directory and mv it to a work directory owned by the pipeline instance (see the sketch after this list). If no files are available in the input directory, the process shuts itself down gracefully.

  3. If the mv was successful, the process proceeds to do its thing on the file (directory) it just moved to its work directory. When it finishes, it cleans up after itself, possibly by doing a recursive delete on its work directory.

  4. If the mv fails, it means that another process grabbed the new file out from under this one. The losing process shuts itself down gracefully.
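
A minimal Python sketch of steps 2–4 (the input/ and work-&lt;pid&gt;/ directory names and the "do the actual work" placeholder are assumptions for illustration, not anything Dagster provides):

import os
import shutil
import sys

INPUT_DIR = "input"                 # where CI/CD drops new files (assumed layout)
WORK_DIR = "work-%d" % os.getpid()  # work directory owned by this process

candidates = sorted(os.listdir(INPUT_DIR))
if not candidates:
    sys.exit(0)                     # nothing to do, shut down gracefully

os.makedirs(WORK_DIR, exist_ok=True)
src = os.path.join(INPUT_DIR, candidates[0])
dst = os.path.join(WORK_DIR, candidates[0])

try:
    # rename() is atomic on POSIX when src and dst are on the same
    # filesystem, so exactly one process wins the race for this file
    os.rename(src, dst)
except OSError:
    sys.exit(0)                     # another process grabbed the file first

# ... do the actual work on dst here ...

shutil.rmtree(WORK_DIR)             # clean up after ourselves when done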

For the second requirement of only running one pipeline process at a time, you could use an exclusive create of a sentinel file and have the process fail and exit if it doesn't create the sentinel file successfully. In Python 3, the code might look something like:

import os

try:
    # 'x' (exclusive creation) mode fails if the file already exists,
    # so only one process can win the race to create the sentinel
    open('sentinel', 'x').close()
except FileExistsError:
    exit("someone else already has sentinel")

do_stuff()

os.remove('sentinel')

Of course, if your process crashes somewhere in do_stuff(), you'll have to clean up the sentinel file by hand, or you could use an atexit handler to ensure that the sentinel is removed even in the case of a crash in do_stuff().
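
For example, a minimal sketch of the atexit variant (reusing the same 'sentinel' name and do_stuff() placeholder as above):

import atexit
import os

try:
    open('sentinel', 'x').close()
except FileExistsError:
    exit("someone else already has sentinel")

# Runs at interpreter shutdown, including after an unhandled exception in
# do_stuff(), though not if the process is killed outright (e.g. SIGKILL).
atexit.register(os.remove, 'sentinel')

do_stuff()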

Tom Barron
  • Indeed, this would be exactly my first approach to handling this kind of problem. Thanks for sharing actual code. Again, I just wanted to know whether this kind of feature is already implemented or whether an alternative solution is already available within Dagster (https://dagster.io/) – Thobial Nov 17 '20 at 17:35

Thanks for sharing your use case. I don't think Dagster currently supports these features natively. However, the 0.10.0 release (a few months out) will include run-level queuing, allowing you to place limits on concurrent pipeline runs. Currently it only supports a global limit on runs, but it will soon support adding rules based on pipeline tags (e.g. pipelines tagged 'resource-heavy' could be limited to 3 concurrent runs). It seems like that might fit this use case?

A guide to previewing the current queuing system is here. Also feel free to reach out to me on the Dagster slack at @johann!
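
As a rough sketch of how that could be wired up on the pipeline side (the tag key 'resource-heavy' is just the example name from above, and the queuing rules themselves would live in the run coordinator config, so treat this as an assumption rather than the final 0.10.0 API):

from dagster import pipeline, solid

@solid
def heavy_work(context):
    context.log.info("doing expensive work")

# Runs launched from this pipeline carry the tag, which a tag-based
# concurrency rule in the queued run coordinator could then match on.
@pipeline(tags={"resource-heavy": "true"})
def heavy_pipeline():
    heavy_work()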


A suggestion for scenario #2 (handling pipelines that are very resource-heavy and cannot be run in parallel) would be to use Dagster's Celery integrations, such as the celery_executor, celery_docker_executor, or the celery_k8s_job_executor (if you're on Kubernetes).

The way these work is that the Dagster pipeline run coordinator adds each solid execution task to a Celery queue, and Celery allows you to limit the number of active tasks within each queue. This is commonly used to, for example, ensure that only X solids are connected to Redshift at a given time.

Dagster also supports using multiple queues, so you could create one queue for solids that are resource-intensive and another for solids that aren't (which has higher concurrency limits).
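
A hedged sketch of what per-solid queue routing can look like with celery_executor; the 'dagster-celery/queue' tag key and the queue names 'heavy' and 'default' here are assumptions to verify against the dagster-celery docs for your version:

from dagster import ModeDefinition, pipeline, solid
from dagster_celery import celery_executor

# Route this solid to a dedicated, low-concurrency Celery queue
@solid(tags={"dagster-celery/queue": "heavy"})
def heavy_solid(context):
    context.log.info("resource-intensive step")

# No queue tag: falls back to the default queue with higher concurrency
@solid
def light_solid(context):
    context.log.info("cheap step")

@pipeline(mode_defs=[ModeDefinition(executor_defs=[celery_executor])])
def mixed_pipeline():
    light_solid()
    heavy_solid()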

Regarding scenario #1, I'm not sure what design constraints you have. One idea is to use a tagging scheme for pipeline run tags to track which pipeline run corresponds to which file; then, for each file, the process that performs file cleanup first verifies that a successful pipeline run exists (by querying the runs DB) before deleting.
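
A rough sketch of that check; the 'input-file' run tag is a hypothetical naming scheme, and the filter/status classes and import paths differ between Dagster versions, so treat these names as assumptions:

import os
from dagster import DagsterInstance
from dagster.core.storage.pipeline_run import PipelineRunStatus, PipelineRunsFilter

def safe_to_delete(path):
    # Only delete a file once at least one successful run is tagged with it
    instance = DagsterInstance.get()
    runs = instance.get_runs(
        filters=PipelineRunsFilter(
            tags={"input-file": os.path.basename(path)},
            statuses=[PipelineRunStatus.SUCCESS],
        )
    )
    return len(runs) > 0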

Catherine Wu