
I have a Prefect server running locally (core version 0.13). I called flow.run() in a loop 1,000 times on a server with 64 GB of RAM and 32 CPU cores. At around 300 runs, it started throwing "connection refused" errors from GraphQL.

I am still considering whether to use Prefect for my workflows, but it looks like it's using far too much RAM. How does Prefect scale with thousands of concurrent workflows?

I am running the workflow with a simple example:

from flask import Flask
app = Flask(__name__)

import prefect
client = prefect.Client()

@app.route('/')
def hello_world():
    client.create_flow_run("032275d0-6c31-4dc5-bf32-5b2afadbe531")
    return 'Hello, World!'

Then I call this endpoint 1,000 times to trigger the flow:

for i in {1..1000}; do curl localhost:5000/; done
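
To pin down where the errors begin, the same loop can be written as a small Python driver that records which request numbers fail. This is only a sketch: `run_load` and `trigger_via_http` are hypothetical helper names, and the default URL assumes the Flask app above is listening on localhost:5000.

```python
import urllib.request
from typing import Callable, List, Tuple


def trigger_via_http(url: str = "http://localhost:5000/") -> None:
    """Hit the Flask route once; raises on connection errors or HTTP error statuses."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()


def run_load(trigger: Callable[[], None], total: int) -> Tuple[int, List[Tuple[int, str]]]:
    """Call `trigger` `total` times in sequence and return (successes, failures).

    Each failure is recorded as (1-based request number, exception summary).
    """
    successes = 0
    failures: List[Tuple[int, str]] = []
    for i in range(1, total + 1):
        try:
            trigger()
            successes += 1
        except Exception as exc:  # broad on purpose: record the error and keep going
            failures.append((i, f"{type(exc).__name__}: {exc}"))
    return successes, failures
```

Calling `run_load(trigger_via_http, 1000)` and inspecting the first entry of the failures list shows the request number at which the API first stopped responding.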

I see GraphQL using a lot of memory (up to 10 GB of RAM). The Prefect UI then starts to hang at around 100 runs.

I am not sure whether I am using Prefect as intended, but I would like to make this work if possible.

LifeAndHope
  • Also assessing Prefect for some workflows. I might be wrong, but looking at the database schema, it does not seem like something that scales easily. The system keeps the logs, flow run state, and task run state all in the same DB. From what I understood, the UI connects to Apollo, which connects to the GraphQL client that handles all mutations and so on. GraphQL uses Hasura to interact with Postgres. – lowercase00 Jul 05 '21 at 01:58
  • So I can imagine that a high number of requests can make things slow. I guess you could fix some of the Postgres bottlenecks and use a NoSQL DB to speed things up. But it's hard to imagine that they would rewrite such a significant part of the DB logic, I don't know... – lowercase00 Jul 05 '21 at 02:03
  • @LifeAndHope How many agents are handling these 1,000 flows? Are you running a single agent on the server? How computationally intensive is your flow? In a production environment, the server would just handle the scheduling, and the agents (plural) would run your flows horizontally. – sam-6174 Mar 18 '22 at 07:20

1 Answer

The open source Prefect Server was not designed for that sort of scale; as described in this new doc, this is one of the reasons people migrate to Prefect Cloud, which is designed for scale and performance.

chriswhite
  • But are they using Prefect correctly here? Is the limitation just the server's performance envelope, or could they be configuring tasks more efficiently? – Merlin Feb 18 '21 at 04:47
  • Yeah, definitely -- if they have a workflow they need to run / orchestrate that is triggered via calls to this /hello route, that's valid, and it's also valid at this scale (although most often 1k running flows in a single environment causes resource contention for CPU / processes / etc) – chriswhite Feb 24 '21 at 17:28
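
On the caller's side, one way to soften the "connection refused" errors from the question is to retry each submission with exponential backoff, so that a briefly overloaded API gets time to recover. The sketch below is not from the answer and does not make Prefect Server scale; it only spaces out requests. `submit_with_backoff` is a hypothetical helper, and `create_run` stands in for whatever call creates the flow run (for example, a lambda wrapping `client.create_flow_run(...)`). The `retry_on` default is an assumption: pass the exception types your client actually raises.

```python
import time
from typing import Callable, Tuple, Type


def submit_with_backoff(
    create_run: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),  # assumption
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `create_run`, retrying on `retry_on` errors with exponential backoff.

    Delays are base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    The last error is re-raised once `max_attempts` is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return create_run()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Backoff trades latency for fewer failed submissions; it does nothing about the CPU and process contention of 1,000 flows running in one environment, which still needs more agents or fewer concurrent runs.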