os.listdir vs grep in a Prefect schedule

Question

I'm scheduling tasks with Prefect this way:

#Python script
from prefect import task, Flow
from prefect.tasks.shell import ShellTask
from datetime import timedelta
from datetime import datetime
from prefect.schedules import IntervalSchedule
import fnmatch
import os
import sys

schedule = IntervalSchedule(start_date=datetime.now() + timedelta(seconds=10),interval=timedelta(minutes=1))
can_start = True

with Flow("List files", schedule) as flow:
    
    if can_start:
        can_start = False
        file_names = os.listdir("/home/admin/data/raw")
        file_names = fnmatch.filter(file_names, "*fact*")
        process_common.map(file_names)
        can_start = True
    
out = flow.run()

But if files arrive in my directory after the first Prefect run, file_names remains empty during the second run and during all subsequent runs.

I tried fetching my files with a grep command instead, and then it works!

file_names = ShellTask(command="ls /home/admin/data/raw | grep fact", return_all=True, log_stderr=True, stream_output=True)

Does someone know why that happens? Many thanks for your help.

Can you clarify your question? If there are no files when listing files, of course there will not be any files listed. — MisterMiyagi, Jan 26 '21 at 10:30
there are no files at the first run. Then I put files between run1 and run2. Run2 does not find anything. Is it clearer ? I'll add precisions to my question. — Pauline, Jan 26 '21 at 10:54

score 1 · Answer 1 · answered Jan 27 '21 at 19:24

This is a common confusion point - you are conflating build-time logic with runtime logic (see this SO post for another example).

All logic that you want to have effect at runtime should be encapsulated as a Prefect task - in your case, you may need to use Prefect's conditional tasks to achieve your outcome, although you might be able to get away with something much simpler.
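The difference between build time and runtime can be illustrated without Prefect at all. The sketch below is only an analogy: MiniFlow is a hypothetical stand-in that stores callables and invokes them on every run() (it is not Prefect's API), a temporary directory stands in for /home/admin/data/raw, and the file name fact_01.csv is made up. A listing computed while the flow is being built is frozen, whereas a listing wrapped in a callable is recomputed on each run.

```python
import fnmatch
import os
import tempfile


class MiniFlow:
    """Hypothetical stand-in for a flow: stores callables, calls them on each run()."""

    def __init__(self):
        self.tasks = []

    def add(self, fn):
        self.tasks.append(fn)

    def run(self):
        # Every registered callable is invoked again on every run.
        return [fn() for fn in self.tasks]


def list_fact_files(directory):
    """Return the sorted names in `directory` that match *fact*."""
    return sorted(fnmatch.filter(os.listdir(directory), "*fact*"))


# Assumption: an empty temporary directory replaces the real data directory.
data_dir = tempfile.mkdtemp()

# "Build time": this listing is evaluated exactly once, right here.
snapshot = list_fact_files(data_dir)

flow = MiniFlow()
flow.add(lambda: snapshot)                   # frozen build-time result
flow.add(lambda: list_fact_files(data_dir))  # re-evaluated at every run

first = flow.run()

# A file arrives between the two runs.
open(os.path.join(data_dir, "fact_01.csv"), "w").close()

second = flow.run()

print(first)
print(second)
```

In the second run only the callable that performs the listing itself sees the new file, which mirrors why a ShellTask (a real task, executed at each run) picked up the files while the bare os.listdir call inside the Flow block did not.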

In particular, the following code seems to have the desired outcome:

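import fnmatch
import os

from prefect import task, Flow


@task
def get_filenames():
    # Executed at every flow run, so newly arrived files are picked up.
    file_names = os.listdir("/home/admin/data/raw")
    file_names = fnmatch.filter(file_names, "*fact*")
    return file_names


# schedule and process_common are the ones defined in the question.
with Flow("List files", schedule) as flow:
    file_names = get_filenames()  # only registers the task; the listing happens at runtime
    process_common.map(file_names)  # if the list is empty, nothing will happen

out = flow.run()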

Lastly, note that you can effectively mark tasks as "skipped" based on dynamic runtime conditions using SKIP signals.
