
I have a two-step processing pipeline that runs over many images:

  • Step 1: Load the image locally or download it (IO bound).
  • Step 2: Run a machine learning model (CPU/GPU compute bound, and single threaded because the model is big).

How do I limit the number of images held in memory from step 1 while they queue for step 2? This is called backpressure in reactive programming.

Without backpressure, the output of step 1 can pile up, leading to high memory usage just from having images open.

I guess I could use a semaphore (e.g. with a count of 5) that represents roughly the amount of memory I am willing to give to step 1 (5 pictures). I guess this would make up to 5 of my background threads block, which is probably a bad thing? (That's a serious question: is it bad to block a background thread, since a blocked thread still consumes resources?)
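The semaphore idea can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `runBoundedPipeline`, `PeakCounter`, and the byte-array "image" are hypothetical stand-ins, not part of any real pipeline. The loop that schedules the loads waits on the semaphore, so in this arrangement only that one producer thread blocks, rather than 5 loader threads.

```swift
import Foundation

/// Hypothetical helper: thread-safe count of "images" currently in memory.
final class PeakCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var current = 0
    private(set) var peak = 0
    func enter() { lock.lock(); current += 1; peak = max(peak, current); lock.unlock() }
    func leave() { lock.lock(); current -= 1; lock.unlock() }
}

/// Runs a stand-in two-step pipeline over `inputs`, keeping at most `limit`
/// loaded items alive at once. Returns the peak number of items held in memory.
/// Call this from a background thread: it blocks the caller until all work is done.
func runBoundedPipeline(inputs: [Int], limit: Int) -> Int {
    let slots = DispatchSemaphore(value: limit)   // free "image slots"
    let loadQueue = DispatchQueue(label: "load", attributes: .concurrent)
    let modelQueue = DispatchQueue(label: "model") // serial: one inference at a time
    let done = DispatchGroup()
    let counter = PeakCounter()

    for input in inputs {
        slots.wait()   // the producer loop blocks here when `limit` items are alive
        done.enter()
        loadQueue.async {
            // Step 1 stand-in: "load" an image (assumption: 1 KB of bytes).
            let image = [UInt8](repeating: UInt8(input % 256), count: 1024)
            counter.enter()
            modelQueue.async {
                // Step 2 stand-in: "run the model" on the image.
                _ = image.reduce(0) { $0 &+ Int($1) }
                counter.leave()
                slots.signal()   // free the slot only after step 2 has finished
                done.leave()
            }
        }
    }
    done.wait()
    return counter.peak
}
```

Because the slot is released only after step 2 completes, the number of loaded items never exceeds `limit`, however slow the model is.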

Ben Butterworth
  • If you're using Combine, look into [flatMap(maxPublishers:_:)](https://developer.apple.com/documentation/combine/publishers/merge/flatmap(maxpublishers:_:)-2z9wm?changes=__2) to exert backpressure. – New Dev Jan 26 '21 at 17:26
  • For Combine, it looks like I want the flatMap to create e.g. 5 publishers (for 5 pictures in memory at a time) and then immediately create 1 publisher to generate 1 stream from these publishers. I've never used Combine though. – Ben Butterworth Jan 26 '21 at 18:00
  • See https://stackoverflow.com/a/65889568/341994 and the comment discussion. Nothing is wrong with the semaphore approach, but the author of the answer supplies two other approaches. – matt Jan 26 '21 at 19:15
  • For Combine backpressure, see my https://www.apeth.com/UnderstandingCombine/operators/operatorsTransformersBlockers/operatorsflatmap.html – matt Jan 26 '21 at 19:17

2 Answers


If you're using Combine, `flatMap` can provide the back pressure. `flatMap` creates a publisher for each value it receives, but exerts back pressure once the number of publishers that haven't completed reaches the specified maximum.

Here's a simplified example. Assuming you have the following functions:

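import Combine
import UIKit

// Step 1: emits one image per URL (IO bound).
func loadImage(url: URL) -> AnyPublisher<UIImage, Error> {
   // ...
}

// Step 2: runs the model on one image and completes when it is done.
func doImageProcessing(image: UIImage) -> AnyPublisher<Void, Error> {
   // ...
}

let urls: [URL] = [...] // many image URLs

// Nothing runs until `processing` is subscribed to (e.g. with `sink`).
let processing = urls.publisher
    .flatMap(maxPublishers: .max(5)) { url in
        // At most 5 of these inner publishers are active at once.
        loadImage(url: url)
            .flatMap { uiImage in
                doImageProcessing(image: uiImage)
            }
    }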

In the example above, at most 5 images are in flight at any time, each either loading or being processed. The 6th image starts loading only when one of the earlier ones is done processing.
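To see the limit in action without real images, here is a minimal sketch that counts how many items are in flight at once. Everything named here (`InFlightCounter`, `fakeLoad`, `fakeProcess`, `peakInFlight`) is a hypothetical stand-in for the functions above, and the 10 ms delays are arbitrary assumptions; it uses `Int` in place of `UIImage` so it can run outside an app.

```swift
import Combine
import Foundation

/// Hypothetical helper: thread-safe count of items between "load started"
/// and "processing finished".
final class InFlightCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var current = 0
    private(set) var peak = 0
    func enter() { lock.lock(); current += 1; peak = max(peak, current); lock.unlock() }
    func leave() { lock.lock(); current -= 1; lock.unlock() }
}

/// Stand-in for loadImage(url:): counts the item as in flight when subscribed,
/// then emits it after a short delay.
func fakeLoad(_ id: Int, counter: InFlightCounter) -> AnyPublisher<Int, Never> {
    Just(id)
        .handleEvents(receiveSubscription: { _ in counter.enter() })
        .delay(for: .milliseconds(10), scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
}

/// Stand-in for doImageProcessing(image:): emits after a short delay and
/// marks the item as no longer in flight.
func fakeProcess(_ id: Int, counter: InFlightCounter) -> AnyPublisher<Int, Never> {
    Just(id)
        .delay(for: .milliseconds(10), scheduler: DispatchQueue.global())
        .handleEvents(receiveOutput: { _ in counter.leave() })
        .eraseToAnyPublisher()
}

/// Runs `count` items through both steps and returns the peak number in flight.
/// Blocks the calling thread until the pipeline completes.
func peakInFlight(count: Int, maxPublishers: Int) -> Int {
    let counter = InFlightCounter()
    let finished = DispatchSemaphore(value: 0)
    let cancellable = (0..<count).publisher
        .flatMap(maxPublishers: .max(maxPublishers)) { id -> AnyPublisher<Int, Never> in
            fakeLoad(id, counter: counter)
                .flatMap { fakeProcess($0, counter: counter) }
                .eraseToAnyPublisher()
        }
        .sink(receiveCompletion: { _ in finished.signal() }, receiveValue: { _ in })
    // Keep the subscription alive until the pipeline has completed.
    withExtendedLifetime(cancellable) { finished.wait() }
    return counter.peak
}
```

Calling `peakInFlight(count: 20, maxPublishers: 5)` should never report more than 5, because the outer `flatMap` asks upstream for the next value only after one of its inner publishers completes.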

New Dev
  • In my case though, step 2 uses an [MLModel](https://developer.apple.com/documentation/coreml/mlmodel), where the docs state `Use an MLModel instance on one thread or one dispatch queue at a time.` I think the key here is that you're solving this by using `flatMap` **twice**, but unfortunately `maxPublishers` defaults to unlimited. Don't you want your inner `flatMap` to have `maxPublishers: 1`? – Ben Butterworth Jan 26 '21 at 19:34
  • @BenButterworth, I haven't used ML Model, but as far as I see, nothing that I suggested here prevents you from doing processing on whatever queue you want. Just wrap it in a `Future` to create a publisher. As for restricting the inner `flatMap` to one - it's redundant; the inner `flatMap` only sees a single value - an image - emitted from each `loadImage` – New Dev Jan 26 '21 at 21:19

If you really do want to use `OperationQueue`, then simply set the queue's `maxConcurrentOperationCount` to 5 to prevent more than 5 operations from being started simultaneously.
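A minimal sketch of that setting, assuming a hypothetical `peakConcurrentOperations` helper and a short sleep standing in for the real work:

```swift
import Foundation

/// Hypothetical helper: thread-safe count of operations currently running.
final class RunningCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var current = 0
    private(set) var peak = 0
    func enter() { lock.lock(); current += 1; peak = max(peak, current); lock.unlock() }
    func leave() { lock.lock(); current -= 1; lock.unlock() }
}

/// Enqueues `count` stand-in operations on a queue capped at `limit`
/// and returns the peak number that ran at the same time.
func peakConcurrentOperations(count: Int, limit: Int) -> Int {
    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = limit   // the cap described above
    let counter = RunningCounter()

    for _ in 0..<count {
        queue.addOperation {
            counter.enter()
            Thread.sleep(forTimeInterval: 0.005)   // stand-in for the real work
            counter.leave()
        }
    }
    queue.waitUntilAllOperationsAreFinished()
    return counter.peak
}
```

The cap applies to running operations only, so it bounds the number of images in memory only while each image lives entirely inside one operation.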

matt
  • Actually this seems like it won't work, because sure we will cap the concurrent ops at 5, but they will keep coming (after the first 5 are done). – Ben Butterworth Jan 26 '21 at 19:21
  • Or am I wrong? I think the first 5 operations will finish, and nothing is stopping the 6th from starting, even if the 2nd step hasn't started yet. – Ben Butterworth Jan 26 '21 at 19:31
  • I'm unclear on what the overall requirement is here. What is the "2nd step"? Your description in the question is very abstract; I don't understand how the five images are used. Is the goal to use the five and then collect a new set of 5 or what? – matt Jan 26 '21 at 19:42
  • The 2nd step is the CoreML model (which is CPU/"Compute" bound and not thread safe). The issue I would have without backpressure is first step takes too much memory when the 2nd step is slow/ bottlenecks. – Ben Butterworth Jan 26 '21 at 19:44