
Imagine the following pattern:

execute(): The execute() method is called once; a piece of Reactor code scans the entire directory, parses the files, and puts the results into a blocking queue.

fetch(): The fetch() method is called multiple times from an external process; each call takes the next 1000 rows from the above queue.

In standard Java I would implement that with one thread writing into an ArrayBlockingQueue and another thread reading from it. But how can that be done efficiently and safely using Reactor as the producer of the data?
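For reference, the plain-Java pattern described above might look like the sketch below. The class and method names (QueueBridge, EOF) are illustrative, not from any library; the bounded queue gives natural backpressure because put() blocks once the queue is full, and a sentinel row tells fetch() that all data has been read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueBridge {
    static final String EOF = new String("<eof>");       // sentinel, compared by reference
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

    // execute(): start one producer thread that scans and fills the queue
    void execute(List<String> rows) {
        Thread producer = new Thread(() -> {
            try {
                for (String row : rows) {
                    queue.put(row);                      // blocks while the queue is full
                }
                queue.put(EOF);                          // signal "no more data"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();      // lets the producer be stopped
            }
        });
        producer.start();
    }

    // fetch(): pull up to n rows; a short or empty batch means end of data
    List<String> fetch(int n) throws InterruptedException {
        List<String> batch = new ArrayList<>();
        while (batch.size() < n) {
            String row = queue.take();                   // blocks until a row arrives
            if (row == EOF) {
                queue.put(EOF);                          // keep the sentinel for later calls
                break;
            }
            batch.add(row);
        }
        return batch;
    }
}
```

This covers requirement 1 (the bounded queue throttles the producer) and requirement 3 (an empty batch means done); the question is how to get the same behaviour when the producer side is a Flux.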

Requirements:

  1. The reader will be slower than the data producer, and I don't want the queue to fill up too much; hence a blocking queue.
  2. It must be possible to stop the process at any time, e.g. the producer found 100000 rows but fetch() read just 100 of them and then decided it had all the data it needed. Calling the dispose() method should stop all producer threads immediately, even if the Flux is waiting.
  3. fetch() has a way of knowing that all data has been read.

I understand what a producer/subscriber pattern is, but the subscriber would be a constantly running thread, not something I can call to get the next record when I am ready for it.

So it is sort of a push/pull: the producer pushes new data, and there should be a method with which I can pull the next row off the queue in my own time.
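Worth noting: Reactor itself ships a push-to-pull bridge that matches this description. Flux.toIterable(batchSize) blocks the calling thread until the next element is ready and requests upstream only in small batches, so the producer is throttled by how fast the consumer iterates. A minimal sketch with a toy Flux.range source, assuming reactor-core is on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

import reactor.core.publisher.Flux;

public class PullBridge {
    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();

        // toIterable(2) prefetches at most 2 elements at a time, so the
        // upstream producer cannot run far ahead of the consumer.
        for (Integer row : Flux.range(1, 5).toIterable(2)) {
            rows.add(row);          // pull the next row "in my own time"
        }

        System.out.println(rows);   // the five rows arrive in order
    }
}
```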

Any thoughts?

Producer (simplified):

DataLakeFileSystemAsyncClient asyncfsclient = asyncclient.getFileSystemAsyncClient(name);
ListPathsOptions options = new ListPathsOptions();
options.setPath("/");
options.setRecursive(true);
Flux<PathItem> items = asyncfsclient.listPaths(options).take(10000);
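A Flux like the one above can also be drained into a bounded queue with Reactor's BaseSubscriber, which gives manual control over request() and cancel(). A sketch, assuming reactor-core on the classpath; QueueSubscriber and POISON are made-up names for illustration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.reactivestreams.Subscription;
import reactor.core.publisher.BaseSubscriber;
import reactor.core.publisher.Flux;

class QueueSubscriber<T> extends BaseSubscriber<T> {
    static final Object POISON = new Object();          // marks completion
    final BlockingQueue<Object> queue;

    QueueSubscriber(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    @Override
    protected void hookOnSubscribe(Subscription s) {
        request(1);                                     // start the flow, one row at a time
    }

    @Override
    protected void hookOnNext(T value) {
        try {
            queue.put(value);                           // blocks while the queue is full
            request(1);                                 // ask for the next row only now
        } catch (InterruptedException e) {
            cancel();                                   // requirement 2: stop the producer
            Thread.currentThread().interrupt();
        }
    }

    @Override
    protected void hookOnComplete() {
        try {
            queue.put(POISON);                          // requirement 3: signal end of data
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

fetch() would then take() from the queue until it sees POISON, and dispose() maps to calling cancel() on the subscriber.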
Werner Daehn

1 Answer

You could implement a custom Publisher for that. Such a publisher would:

  • keep an internal buffer filled from the external fetch()
  • refill that buffer only when it is empty
  • emit to the subscriber only as much data as it has requested

Something like this (it may be missing some parts, but you should get the idea):

import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

import org.reactivestreams.Publisher;
import org.reactivestreams.Subscriber;
import org.reactivestreams.Subscription;

class MyDataPublisher implements Publisher<MyData> {

    private List<MyData> currentPage = null;
    private int index = 0;

    private final MyDataClient client;

    MyDataPublisher(MyDataClient client) {
        this.client = client;
    }

    @Override
    public void subscribe(Subscriber<? super MyData> s) {
        // these two just control the flow
        AtomicBoolean cancelled = new AtomicBoolean(false);
        AtomicLong limit = new AtomicLong(0);

        // start when subscribed
        s.onSubscribe(new Subscription() {
            @Override
            public void request(long n) {
                limit.addAndGet(n);   // requests are cumulative per the Reactive Streams spec
                scan(cancelled, limit, s);
            }

            @Override
            public void cancel() {
                cancelled.set(true);
            }
        });
    }

    private void scan(AtomicBoolean cancelled, AtomicLong limit, Subscriber<? super MyData> s) {
        // make sure some data is loaded already
        if (currentPage == null) {
            currentPage = client.fetch(1000);
        }

        while (!cancelled.get() && limit.get() > 0) {
            // publish the buffered items to the subscriber
            while (limit.get() > 0 && index < currentPage.size()) {
                MyData next = currentPage.get(index);
                index++;
                limit.decrementAndGet();
                s.onNext(next);
            }
            // once all buffered data is published, load the next batch
            if (index >= currentPage.size()) {
                currentPage = client.fetch(1000);
                index = 0;

                // HERE YOU SHOULD ALSO STOP IF NO MORE DATA IS AVAILABLE, LIKE:
                if (currentPage.isEmpty()) {
                    s.onComplete();
                    cancelled.set(true);
                }
            }
        }
    }
}

And then use that publisher as:

Flux<MyData> items = Flux.from(new MyDataPublisher(new MyDataClient())).take(10000);
Igor Artamonov