I need my Flink job to pull records from a database at a specified interval and archive them after processing. I have implemented a SourceFunction that fetches the required records from the database and added it as the source for the StreamExecutionEnvironment. How can I specify that the StreamExecutionEnvironment should fetch records from the database through the SourceFunction every 10 minutes?

SourceFunction:

public class MongoDBSourceFunction implements SourceFunction<List<Book>>{

    public void cancel() {
        // TODO Auto-generated method stub
    }

    public void run(SourceContext<List<Book>> context) throws Exception {

        List<Book> books = getBooks();

        context.collect(books);

    }

    public List<Book> getBooks() {
        List<Book> books = new ArrayList<Book>();

        //fetch all books from database     
        return books;
    }

}

Processor:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ArchiveJob {

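    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new MongoDBSourceFunction()).print();

        // Without execute() the pipeline is only defined, never run.
        env.execute("ArchiveJob");
    }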

}
tweeper

1 Answer

You need to add this functionality to the MongoDBSourceFunction itself. For example, you could extend RichSourceFunction (the plain SourceFunction interface has no open method), instantiate a ScheduledExecutorService in the open method, and schedule the read task using this executor.

Note that it is important to hold the checkpointing lock while emitting records.
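A minimal sketch of the scheduled approach described above, not a definitive implementation. To keep it runnable without Flink on the classpath, `Emitter` is a hypothetical stand-in for the two `SourceContext` methods involved (`collect` and `getCheckpointLock`), and `ScheduledPoller` and its `fetch` supplier are illustrative names. In a real job the same bodies would go into `open`, `run` and `cancel` of a `RichSourceFunction`.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical stand-in for SourceContext#collect and #getCheckpointLock. */
interface Emitter<T> {
    void collect(T record);
    Object getCheckpointLock();
}

/** Polls a supplier at a fixed rate and emits each batch under the lock. */
class ScheduledPoller<T> {
    private final Supplier<List<T>> fetch;   // e.g. the database query
    private final long intervalMillis;       // e.g. 10 minutes in millis
    private final CountDownLatch cancelled = new CountDownLatch(1);
    private ScheduledExecutorService executor;

    ScheduledPoller(Supplier<List<T>> fetch, long intervalMillis) {
        this.fetch = fetch;
        this.intervalMillis = intervalMillis;
    }

    /** Corresponds to open(): create the executor once per source instance. */
    void open() {
        executor = Executors.newSingleThreadScheduledExecutor();
    }

    /** Corresponds to run(): must block until the source is cancelled. */
    void run(Emitter<T> ctx) throws InterruptedException {
        executor.scheduleAtFixedRate(() -> {
            try {
                List<T> batch = fetch.get();
                // Emit while holding the checkpoint lock.
                synchronized (ctx.getCheckpointLock()) {
                    for (T record : batch) {
                        ctx.collect(record);
                    }
                }
            } catch (RuntimeException e) {
                // An uncaught exception would stop all further scheduled runs.
                System.err.println("poll failed: " + e);
            }
        }, 0, intervalMillis, TimeUnit.MILLISECONDS);
        try {
            cancelled.await();
        } finally {
            executor.shutdownNow();
        }
    }

    /** Corresponds to cancel(): unblocks run(). */
    void cancel() {
        cancelled.countDown();
    }
}
```

The query runs outside the synchronized block, so a slow database call does not hold the lock; only the emission does.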

Till Rohrmann
  • The [SourceFunction for the Apache NiFi Flink connector](https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-nifi/src/main/java/org/apache/flink/streaming/connectors/nifi/NiFiSource.java) uses `Thread.sleep()`. Is this acceptable, or is using `ScheduledExecutorService` the only way to go? – tweeper Oct 25 '18 at 05:32
  • `Thread.sleep()` should also work if you make sure that you release the checkpoint lock while sleeping. – Till Rohrmann Oct 25 '18 at 07:54
  • I am not using any checkpointing. I am simply querying the database every 15 minutes and emitting those records. Where does the checkpointing lock come into the picture? – tweeper Oct 25 '18 at 14:11
  • It is important to emit records under the checkpoint lock if you want to use checkpointing, because otherwise the state is not consistent. – Till Rohrmann Oct 25 '18 at 14:12
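The `Thread.sleep()` variant from the comments could look roughly like the sketch below, where the lock is held only while emitting and is released before the thread sleeps. As an assumption to keep the example runnable without Flink, `RecordContext` is a hypothetical stand-in for `SourceContext` (only `collect` and `getCheckpointLock`), and `SleepingPoller` and `fetch` are illustrative names.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for SourceContext#collect and #getCheckpointLock. */
interface RecordContext<T> {
    void collect(T record);
    Object getCheckpointLock();
}

/** Fetch, emit under the lock, then sleep with the lock released. */
class SleepingPoller<T> {
    private final Supplier<List<T>> fetch;   // e.g. the database query
    private final long intervalMillis;       // e.g. 15 minutes in millis
    private volatile boolean running = true;

    SleepingPoller(Supplier<List<T>> fetch, long intervalMillis) {
        this.fetch = fetch;
        this.intervalMillis = intervalMillis;
    }

    /** Corresponds to run(): loops until cancel() is called. */
    void run(RecordContext<T> ctx) {
        while (running) {
            List<T> batch = fetch.get();
            synchronized (ctx.getCheckpointLock()) {
                for (T record : batch) {
                    ctx.collect(record);
                }
            }
            // The lock is no longer held here, so sleeping blocks nobody.
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Corresponds to cancel(): the loop exits after the current sleep. */
    void cancel() {
        running = false;
    }
}
```

Because `cancel()` only flips a flag, a sleeping poller notices it at most one interval later unless its thread is interrupted.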