
I have to process data streams from Kafka using Flink as the streaming engine. To analyze the data, I need to query some tables in Cassandra. What is the best way to do this? I have been looking for Scala examples for such cases, but couldn't find any. How can data from Cassandra be read in Flink using Scala as the programming language? Read & write data into cassandra using apache flink Java API is another question along the same lines, and it has multiple approaches mentioned in the answers. I would like to know which approach is best in my case. Also, most of the available examples are in Java; I am looking for Scala examples.

avidlearner

1 Answer


I currently read from Cassandra using async I/O in Flink 1.3. Here is the documentation on it:

https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html (where it shows a DatabaseClient, you will use com.datastax.driver.core.Cluster instead)

Let me know if you need a more in-depth example of using it to read from Cassandra specifically, but unfortunately I can only provide an example in Java.

EDIT 1

Here is an example of the code I am using for reading from Cassandra with Flink's Async I/O. I am still working on identifying and fixing an issue where, for large amounts of data returned by a single query, the async data stream's timeout is triggered even though Cassandra appears to return the results fine and well before the timeout. But assuming that is a bug in something else I am doing and not in this code, this should work fine for you (and has worked fine for months for me as well):

import java.util.Collections;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

public class GenericCassandraReader extends RichAsyncFunction<CustomInputObject, ResultSet> {

    // props is a custom configuration holder (not java.util.Properties) that
    // exposes cassandraUrl, cassandraPort and cassandraKeyspace
    private final Properties props;
    // the Session is not serializable, so it is created in open(), not the constructor
    private transient Session client;

    public GenericCassandraReader(Properties props) {
        super();
        this.props = props;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        client = Cluster.builder()
                .addContactPoint(props.cassandraUrl)
                .withPort(props.cassandraPort)
                .build()
                .connect(props.cassandraKeyspace);
    }

    @Override
    public void close() throws Exception {
        client.close();
    }

    @Override
    public void asyncInvoke(final CustomInputObject customInputObject, final AsyncCollector<ResultSet> asyncCollector) throws Exception {

        // note: for production use, prefer a PreparedStatement over string
        // concatenation to avoid CQL injection and repeated query parsing
        String queryString = "select * from table where fieldToFilterBy='" + customInputObject.id() + "';";

        ListenableFuture<ResultSet> resultSetFuture = client.executeAsync(queryString);

        Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {

            public void onSuccess(ResultSet resultSet) {
                asyncCollector.collect(Collections.singleton(resultSet));
            }

            public void onFailure(Throwable t) {
                // collect(Throwable) signals the failure to the async operator
                asyncCollector.collect(t);
            }
        });
    }
}
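Since the question specifically asks for Scala: a Scala class can extend Flink's Java RichAsyncFunction directly, so the class above translates almost one-to-one. This is a sketch of how that could look, not code I have run in production; CustomInputObject and CassandraProps are placeholders for your own input type and configuration holder.

    import java.util.Collections

    import com.datastax.driver.core.{Cluster, ResultSet, Session}
    import com.google.common.util.concurrent.{FutureCallback, Futures}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction
    import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector

    // CassandraProps is a placeholder for your own config type exposing
    // cassandraUrl, cassandraPort and cassandraKeyspace
    class GenericCassandraReader(props: CassandraProps)
        extends RichAsyncFunction[CustomInputObject, ResultSet] {

      // not serializable, so created in open(), not at construction time
      @transient private var client: Session = _

      override def open(parameters: Configuration): Unit = {
        client = Cluster.builder()
          .addContactPoint(props.cassandraUrl)
          .withPort(props.cassandraPort)
          .build()
          .connect(props.cassandraKeyspace)
      }

      override def close(): Unit = client.close()

      override def asyncInvoke(in: CustomInputObject,
                               collector: AsyncCollector[ResultSet]): Unit = {
        // as in the Java version, prefer a PreparedStatement in real code
        val query = s"select * from table where fieldToFilterBy='${in.id}';"
        Futures.addCallback(client.executeAsync(query), new FutureCallback[ResultSet] {
          override def onSuccess(rs: ResultSet): Unit =
            collector.collect(Collections.singleton(rs))
          override def onFailure(t: Throwable): Unit =
            collector.collect(t) // signals failure to the async operator
        })
      }
    }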

Again, sorry for the delay. I was hoping to have the bug resolved so I could be certain, but figured at this point having some reference would be better than nothing.

EDIT 2

So we finally determined that the issue wasn't with the code, but with the network throughput. Too many bytes were trying to come through a pipe that wasn't large enough to handle them, things started backing up, some trickled in, but (thanks to the DataStax Cassandra driver's QueryLogger, we could see this) the time it took to receive the result of each query climbed to 4 seconds, then 6, then 8, and so on.

TL;DR: the code is fine, just be aware that if you experience timeout exceptions from Flink's AsyncWaitOperator, it could be a network issue.
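For reference, the operator that raises those timeouts is the one created by AsyncDataStream. A rough sketch of the wiring, assuming Flink 1.3's Java AsyncDataStream API and the GenericCassandraReader class from the answer (the timeout argument is the one that fires when results back up):

    import java.util.concurrent.TimeUnit
    import org.apache.flink.streaming.api.datastream.DataStream
    import org.apache.flink.streaming.api.datastream.AsyncDataStream

    // input: the DataStream[CustomInputObject] built from your Kafka source;
    // props: the same configuration object the reader's constructor expects
    def wire(input: DataStream[CustomInputObject], props: Properties) =
      AsyncDataStream.unorderedWait(
        input,
        new GenericCassandraReader(props),
        10000L, TimeUnit.MILLISECONDS, // per-request timeout; raise it if the network is the bottleneck
        100)                           // max concurrent in-flight requests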

EDIT 2.5

It is also worth mentioning that, because of the network latency issue, we ended up moving to a RichMapFunction that holds the data we were reading from Cassandra in state. So the job just keeps track of all the records that come through it, instead of having to read from the table each time a new record arrives.
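The state-based alternative described above could look roughly like this. This is only a sketch: the RecordCache name and the per-key ValueState are assumptions about the shape of the job, not the actual code. It requires a keyed stream so Flink can scope the state to each key.

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration

    class RecordCache extends RichMapFunction[CustomInputObject, CustomInputObject] {

      @transient private var seen: ValueState[CustomInputObject] = _

      override def open(parameters: Configuration): Unit = {
        seen = getRuntimeContext.getState(
          new ValueStateDescriptor("seen", classOf[CustomInputObject]))
      }

      override def map(in: CustomInputObject): CustomInputObject = {
        // keep the latest record for this key in Flink state instead of
        // re-reading it from the Cassandra table on every new record
        seen.update(in)
        in
      }
    }

    // usage on a keyed stream: stream.keyBy(_.id).map(new RecordCache)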

Jicaar
  • Thanks Jicaar. I have been using Datastax's client in my java code (my code was migrated from Scala to Java due to a change in requirement) and it has been working fine, although I have not implemented asyncIO yet. As you mentioned in your answer, could you provide an example that implements asyncIO? – avidlearner Jun 29 '17 at 09:55
  • I will have to get back to you on it actually. I have a "working" instance, but recently started testing with larger volumes of data and it throws strange timeoutExceptions now (would get into it, but it would honestly deserve to be its own question on here). So once this is figured out I will edit the answer with the correction. – Jicaar Jun 30 '17 at 16:09
  • @Jicaar Do you read this way as a Stream (AsyncDataStream)? If so, how often this code queries Cassandra? – user1870400 Sep 07 '17 at 08:24
  • I do, and the queries are very frequent. I don't have an exact number on how often it would query and get results, but it kept up with the volume of requests from the flink job. Somewhere like 1-2 thousand requests per second. Cassandra's metrics said it got a request and sent a response in about 50 milliseconds for 80% of the requests (I believe. Been a while since I have looked at the numbers). – Jicaar Sep 08 '17 at 16:06
  • @Jicaar I am new to Flink and try to use Cassandra as source and just hit the "Failing the AsyncWaitOperator" exception, can you share how it is done with RichMapFunction? btw, how come this AsyncWaitOperator exception happens when I only inserted 10 rows of data into Cassandra? – James Yu Mar 29 '18 at 20:29
  • @JamesYu there are a number of possibilities it could be. Did you post this as a question on stackoverflow? Post the link to it and I will help answer it there. – Jicaar Apr 02 '18 at 17:46
  • @Jicaar I just posted the question and here is the link "https://stackoverflow.com/questions/49625265/anyway-to-read-cassandra-as-datastream-dataset-in-flink", thank you – James Yu Apr 03 '18 at 08:14
  • Can the `RichAsyncSink` class be used to write to Cassandra? – whatsinthename Jan 11 '22 at 17:24