
I have a use case where I execute some calculations on a subset of the data, and these calculations depend on context (intermediate state).

For example: I have some orders and perform calculations on them. The calculations are performed on orders grouped by the `symbol` field.

    class Order {
        LocalDateTime ts;
        String symbol;
        ....
    }

So I decided to key the orders by the `symbol` field and keep separate state for each group:

    DataStream<Order> orders = tableEnv.toDataStream(selectStatement, Order.class);
    orders.keyBy(Order::getSymbol).flatMap(new SymbolExecutionContext()).addSink(jdbcSink);

Function with state:

    public class SymbolExecutionContext extends RichFlatMapFunction<Order, OrderBookRow> {

        private transient ValueState<OrderBook> orderBookState;

        @Override
        public void flatMap(Order input, Collector<OrderBookRow> out) throws Exception {
            OrderBook orderBook = this.orderBookState.value();
            if (orderBook == null) {
                orderBook = new OrderBook(input.getSymbol());
                this.orderBookState.update(orderBook);
                orderBook = this.orderBookState.value();
            }
            final List<OrderBookRow> execute = OrderBookService.execute(input, orderBook);
            for (final OrderBookRow orderBookRow : execute) {
                out.collect(orderBookRow);
            }
            this.orderBookState.update(orderBook);
        }

        @Override
        public void open(Configuration config) {
            ValueStateDescriptor<OrderBook> descriptor = new ValueStateDescriptor<>("orderbook", OrderBook.class);
            orderBookState = getRuntimeContext().getState(descriptor);
        }
    }

So I create a new OrderBook for each unique symbol. The OrderBook represents the per-symbol context in which calculations are performed for each Order with that symbol.

However, it doesn't seem to work. It works well when there is only one symbol; with more than one, it produces invalid results (missing or inaccurate), and the results are more or less unpredictable.

The Flink job is executed in batch mode.

Is there a better way to handle this use case?

Ardelia Lortz

2 Answers


It seems you are updating the order book state only when it is not already present. Is that a valid assumption? Won't an incoming order affect the order book?

Amareswar
  • An incoming order can modify the `OrderBook` (state), and it does so in `OrderBookService.execute(input, orderBook)`. `OrderBook` is a reference to the object kept in `ValueState orderBookState`, so I don't have to update the state each time a new Order appears; it's done once, when there is no value. Even when I run `orderBookState.update(orderBook)` each time a new order comes in, it behaves in exactly the same way. I updated the `SymbolExecutionContext` so you can see what I've changed – Ardelia Lortz Dec 23 '21 at 08:45

When the DataStream API is used in batch execution mode, events are sorted first by key, and then by timestamp, so that should satisfy your ordering requirement.

Since the keys are the order symbols, all of the events for the first symbol will be processed (ordered by time), followed by the next symbol, and so on. It's interesting that you say this works correctly only if there is just one symbol. What version of Flink is this? How are you setting up the table and stream execution environments, and configuring batch mode?
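For reference, here is a typical way to put a mixed Table/DataStream job into batch execution mode. This is a hedged sketch, not taken from the question (which doesn't show its environment setup), and the variable names are assumptions:

```java
// Sketch only: common setup for batch execution mode with the
// DataStream and Table APIs (Flink 1.12+).
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
```

In batch mode the runtime can also choose the mode automatically based on whether the sources are bounded, so how batch mode is actually being enabled is relevant to the question.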

You shouldn't think of

    orderBook = this.orderBookState.value();

as returning a reference to an object you can update in place. (That may accidentally be true for some of Flink's state backends, but it's not part of the public interface.) Instead, you should call `orderBookState.update(orderBook)` every time you want to update the state.
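To make that concrete, here is a hedged sketch of the `flatMap` rewritten to follow this pattern: read the state value, mutate a local object, and write it back unconditionally. It reuses the asker's `Order`, `OrderBook`, `OrderBookRow`, and `OrderBookService` types unchanged:

```java
// Sketch only: assumes the asker's own types; the key change is that
// no code path relies on value() returning a live, mutable reference.
@Override
public void flatMap(Order input, Collector<OrderBookRow> out) throws Exception {
    OrderBook orderBook = orderBookState.value();
    if (orderBook == null) {
        // First order for this symbol: start a fresh order book.
        orderBook = new OrderBook(input.getSymbol());
    }
    // execute() may mutate the local orderBook; emit the produced rows.
    for (OrderBookRow row : OrderBookService.execute(input, orderBook)) {
        out.collect(row);
    }
    // Always write the (possibly mutated) object back into state.
    orderBookState.update(orderBook);
}
```

This also drops the redundant `update` + re-read inside the null check from the original code, since the single `update` at the end covers that case.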

David Anderson
  • Does `OrderBookService.execute()` expect a list of `OrderBook` records, or just the most recent? If it's a list, you need to use `ListState`, not `ValueState` – kkrugler Dec 23 '21 at 18:58
  • I wondered about that too, but OrderBook could be (or contain) a list. `ListState` will perform much better than `ValueState` on RocksDB, but for batch I believe it doesn't really matter. – David Anderson Dec 23 '21 at 20:20