
I am trying to make Flink and Cassandra work together. Both are massively parallel environments, but I am having difficulty combining them.

Right now I need an operation that reads from Cassandra in parallel, one query per token range, with the ability to terminate the whole read after N objects have been fetched.
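
For reference, the token ranges themselves can be taken from the driver's cluster metadata. This is a minimal sketch, assuming the Murmur3 partitioner (Long token values) and a hypothetical CassandraTokenRange(start, end) data class standing in for the one used by the code below; the prepared query is assumed to have the form SELECT ... WHERE token(pk) > ? AND token(pk) <= ?.

import com.datastax.driver.core.Cluster

data class CassandraTokenRange(val start: Long, val end: Long)

// one work item per token range; ranges that wrap around the ring are
// split by unwrap() so that each piece can be bound as (start, end]
fun tokenRanges(cluster: Cluster): List<CassandraTokenRange> =
    cluster.metadata.tokenRanges
        .flatMap { it.unwrap() }
        .map { CassandraTokenRange(it.start.value as Long, it.end.value as Long) }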

Batch mode suits me better, but DataStreams are also possible. I tried a LongCounter (see below), but it does not work as I expected: I failed to get the global sum from it and only saw the local per-subtask values.
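
As far as I understand, accumulators are only merged on the client side: in a batch job the global sum becomes visible on the JobExecutionResult after env.execute() returns, never inside the running operators. A minimal sketch of where the merged value does show up (the job name is arbitrary):

import org.apache.flink.api.java.ExecutionEnvironment

val env = ExecutionEnvironment.getExecutionEnvironment()
// ... build the job that uses CassandraRequester ...
val result = env.execute("cassandra-read")
// merged across all subtasks, but only available once the job has finished
val totalRows: Long = result.getAccumulatorResult(CassandraRequester.COUNTER_ROWS_NUMBER)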

Async mode is not necessary, since this CassandraRequester operation is performed in a parallel context with a parallelism of about 64 or 128.

This is my attempt:

import com.datastax.driver.core.ConsistencyLevel
import com.datastax.driver.core.PreparedStatement
import com.datastax.driver.mapping.Mapper
import com.datastax.driver.mapping.MappingManager
import org.apache.flink.api.common.accumulators.LongCounter
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector
import org.slf4j.LoggerFactory

// ApplicationContext, FlinkCassandraContext and CassandraTokenRange are project classes
class CassandraRequester<T>(val klass: Class<T>, private val context: FlinkCassandraContext) :
        RichFlatMapFunction<CassandraTokenRange, T>() {

    companion object {
        private val session = ApplicationContext.session!!
        private var preparedStatement: PreparedStatement? = null
        private val manager = MappingManager(session)
        private var mapper: Mapper<*>? = null
        private val log = LoggerFactory.getLogger(CassandraRequester::class.java)

        const val COUNTER_ROWS_NUMBER = "flink-cassandra-select-count"
    }

    private lateinit var counter: LongCounter

    override fun open(parameters: Configuration?) {
        super.open(parameters)

        // the prepared statement and mapper live in the companion object,
        // so they are created once per task manager JVM and shared by all subtasks
        if (preparedStatement == null)
            preparedStatement = session.prepare(context.prepareQuery())
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)
        if (mapper == null) {
            mapper = manager.mapper(klass)
        }
        counter = runtimeContext.getLongCounter(COUNTER_ROWS_NUMBER)
    }

    override fun flatMap(tokenRange: CassandraTokenRange, collector: Collector<T>) {
        val bs = preparedStatement!!.bind(tokenRange.start, tokenRange.end)
        val rs = session.execute(bs)
        val resultSelect = mapper!!.map(rs)
        val iter = resultSelect.iterator()
        while (iter.hasNext()) when {
            // localValue is only this subtask's count, not the global sum
            context.maxRowsExtracted == 0L || counter.localValue < context.maxRowsExtracted -> {
                counter.add(1)
                collector.collect(iter.next() as T)
            }
            else -> {
                // attempt to stop emitting once the limit is reached
                collector.close()
                return
            }
        }
    }

}
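
For reference, a minimal sketch of how I wire this function into a batch job, assuming a hypothetical mapped class MyEntity, a cluster and a FlinkCassandraContext instance already in scope, and the tokenRanges(...) helper sketched above:

import org.apache.flink.api.java.ExecutionEnvironment

val env = ExecutionEnvironment.getExecutionEnvironment()

env.fromCollection(tokenRanges(cluster))
    .flatMap(CassandraRequester(MyEntity::class.java, flinkCassandraContext))
    .returns(MyEntity::class.java)   // T is erased, so Flink needs the output type
    .setParallelism(64)
    .writeAsText("/tmp/cassandra-extract")

env.execute("parallel-cassandra-read")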

Is it possible to terminate the query in such a case?

  • One way I can think of is to set up an HTTP server where we could cache the accumulated counter obtained by requesting the Flink job's REST API. In your job, instead of checking against the local value of the counter, you would ask that HTTP server for the accumulated one. – BrightFlow May 18 '18 at 01:41
  • Thank you for your suggestion. In that case, I suppose a specialized tool like Apache Ignite would be preferable; I expected to solve this problem by means of Flink alone. – Sergey Okatov May 18 '18 at 05:32
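
For reference, a minimal sketch of the reading side of BrightFlow's suggestion, assuming the JobManager's REST endpoint is reachable at jobmanager:8081 and the job id is known; the /jobs/&lt;jobid&gt;/accumulators endpoint returns the merged accumulator values as JSON:

import java.net.URL

// polls the Flink monitoring REST API for the merged accumulators of a
// running job; JSON parsing and caching are left out of this sketch
fun fetchAccumulatorsJson(jobId: String): String =
    URL("http://jobmanager:8081/jobs/$jobId/accumulators")
        .openStream()
        .bufferedReader()
        .use { it.readText() }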

0 Answers