My use case involves a paginated API, such as http://someurl.com/next=abc, where next is a pointer to the next set of records. Each response returns a pointer to the following set of records, which I then need to pass as the next parameter of the URL in the following request.

My questions are:

  • Async I/O in Flink provides a mechanism for calling external APIs with an HTTP client. How can I use it to call a paginated API, in batch or streaming mode?
  • I also need to store the next pointer in a database, to keep an audit trail of what I have processed.

Does Flink allow that? I know we can use the Table API or Flink's batch processing mode for DataStreams.

Does anyone know how to do that?

Any help is greatly appreciated.

Regards

David Anderson

1 Answer

I assume the HTTP pagination API returns information about the value to use for next to get the next page. If so, and you're OK with the workflow running in at-least-once mode (vs. exactly once mode), then you can write a custom RichAsyncFunction that takes in a URL, and repeatedly queues up async HTTP requests (emitting results for each completed call) as it pages through that URL's result set. This assumes you have multiple URLs that you're paging through, and that you can't predict the pagination parameters in advance.
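
The paging logic described above can be sketched independently of Flink. This is only a minimal illustration, not a finished operator: `AsyncPager`, `Page`, and the `fetcher` function are hypothetical names, and the real HTTP call and response parsing are assumed to live inside `fetcher`. Since, as far as I know, a `ResultFuture` in `asyncInvoke` is completed once per input element, the sketch collects the records from all pages into one list rather than emitting page by page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class AsyncPager {
    /** One page of results; next is null on the last page. */
    static final class Page {
        final List<String> records;
        final String next;

        Page(List<String> records, String next) {
            this.records = records;
            this.next = next;
        }
    }

    // Hypothetical async fetcher: given a "next" value, requests that page.
    private final Function<String, CompletableFuture<Page>> fetcher;

    AsyncPager(Function<String, CompletableFuture<Page>> fetcher) {
        this.fetcher = fetcher;
    }

    /** Follows the next pointers from startCursor and collects every record. */
    CompletableFuture<List<String>> fetchAll(String startCursor) {
        return fetchFrom(startCursor, new ArrayList<>());
    }

    private CompletableFuture<List<String>> fetchFrom(String cursor, List<String> collected) {
        // Each request is chained onto the completion of the previous one,
        // because the next pointer is only known once a page has arrived.
        return fetcher.apply(cursor).thenCompose(page -> {
            collected.addAll(page.records);
            if (page.next == null) {
                return CompletableFuture.completedFuture(collected);
            }
            return fetchFrom(page.next, collected);
        });
    }
}
```

Inside a `RichAsyncFunction`, `asyncInvoke` would presumably call something like `fetchAll(...)` and then pass the resulting list to `resultFuture.complete(...)`, or the error to `resultFuture.completeExceptionally(...)`.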

An issue with this is that if you restart the workflow, there's no checkpointed state in the async function where you can record how far through the pagination you've gotten since the previous checkpoint. So you can wind up generating duplicate results for the same URL.

If you only have a single URL, then you don't really benefit from Flink's async IO support. Just create a KeyedProcessFunction, where you save the current page parameter in a ValueState, and you should be all set.
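
For the single-URL case, the core loop is "fetch a page, emit its records, remember the next pointer". Below is a rough, Flink-free sketch of that loop. `CheckpointedPager`, `Page`, `CursorStore`, and `fetcher` are all hypothetical names; `CursorStore` is only a stand-in for whatever holds the pointer, which in the setup described above would be a `ValueState<String>` inside the `KeyedProcessFunction`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class CheckpointedPager {
    /** One page of results; next is null on the last page. */
    static final class Page {
        final List<String> records;
        final String next;

        Page(List<String> records, String next) {
            this.records = records;
            this.next = next;
        }
    }

    /** Hypothetical stand-in for the saved pagination pointer. */
    interface CursorStore {
        String load();            // null if no pointer has been saved yet
        void save(String cursor);
    }

    private final Function<String, Page> fetcher; // hypothetical blocking page fetch
    private final CursorStore store;

    CheckpointedPager(Function<String, Page> fetcher, CursorStore store) {
        this.fetcher = fetcher;
        this.store = store;
    }

    /** Pages from the saved pointer if there is one, otherwise from initialCursor. */
    List<String> run(String initialCursor) {
        String saved = store.load();
        String cursor = (saved != null) ? saved : initialCursor;
        List<String> out = new ArrayList<>();
        while (cursor != null) {
            Page page = fetcher.apply(cursor);
            out.addAll(page.records);
            cursor = page.next;
            if (cursor != null) {
                store.save(cursor); // remember where to resume
            }
        }
        return out;
    }
}
```

In Flink terms, `load()` and `save()` would map onto `ValueState.value()` and `ValueState.update()`. The same `save` hook is one possible place to also write the pointer to a database for the audit trail, though that write would be an assumption on top of what is described here.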

kkrugler
  • Thank you for your response. Can you elaborate on what at-least-once mode vs. exactly-once mode means? Do you have an example of such a function, to help me understand better? Will Flink handle the queuing of async HTTP requests, or do I have to do that? I have a single URL; it's just that I get the next value from the result of the current URL. For example, http://someurl.com/next=abc will return def, and then I have to pass in http://someurl.com/next=def to get the next set of results. @kkrugler – user2386966 Sep 06 '22 at 18:28
  • 1. You should do some reading about at least once vs. exactly once processing modes in Flink, as this is a very important concept to understand. 2. If you're only paginating a single URL, then you don't need Async IO. Just create a KeyedProcessFunction, do a keyBy(url) before it, and save the current pagination parameter as state. – kkrugler Sep 07 '22 at 20:57