
My Flink program should do a Cassandra lookup for each input record and, based on the results, do some further processing.

But I'm currently stuck at reading data from Cassandra. This is the code snippet I've come up with so far:

ClusterBuilder secureCassandraSinkClusterBuilder = new ClusterBuilder() {
    @Override
    protected Cluster buildCluster(Cluster.Builder builder) {
        return builder.addContactPoints(props.getCassandraClusterUrlAll().split(","))
                .withPort(props.getCassandraPort())
                .withAuthProvider(new DseGSSAPIAuthProvider("HTTP"))
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();
    }
};

for (int i = 1; i < 5; i++) {
    CassandraInputFormat<Tuple2<String, String>> cassandraInputFormat =
            new CassandraInputFormat<>("select * from test where id='hello" + i + "'", secureCassandraSinkClusterBuilder);
    cassandraInputFormat.configure(null);
    cassandraInputFormat.open(null);
    Tuple2<String, String> out = new Tuple2<>();
    cassandraInputFormat.nextRecord(out);
    System.out.println(out);
}

But the issue with this is that each lookup takes nearly 10 seconds; in other words, this for loop takes 50 seconds to execute.

How do I speed up this operation? Alternatively, is there any other way of looking up Cassandra in Flink?

Harshith Bolar
    Possible duplicate of [Read data from Cassandra for processing in Flink](https://stackoverflow.com/questions/43067681/read-data-from-cassandra-for-processing-in-flink) – David Anderson Jun 05 '18 at 11:27
  • Is your program meant for Batch processing or Stream processing? Are you receiving the input records as a Batch or in a Stream? – avidlearner Jun 06 '18 at 04:15
  • @avidlearner The program reads data from Kafka as a stream. For every record I receive, I should do a Cassandra look up. I have come up with a solution that works, which I will be sharing as an answer soon. But would love to know if there are more efficient ways of doing it. – Harshith Bolar Jun 06 '18 at 08:57
  • Then you can use any Java Client to fetch records from Cassandra. Datastax's client can be used in a map or flatMap operator while processing your stream. CassandraInputFormat is for getting the results of a Cassandra query as a DataSet in Flink. It applies only to Batch processing. – avidlearner Jun 06 '18 at 13:37
  • @avidlearner Could you please provide some examples? I searched a lot and wasn't able to find any :/ I've posted my answer now. – Harshith Bolar Jun 06 '18 at 13:41
  • I use the same APIs that you have used in your example. I usually use the first solution that you have mentioned - A RichFunction and create the session in the Open method. – avidlearner Jun 07 '18 at 10:06

1 Answer


I came up with a solution that is fairly fast at querying Cassandra with streaming data. It may be of use to someone with the same issue.

Firstly, Cassandra can be queried with as little code as this:

Session session = secureCassandraSinkClusterBuilder.getCluster().connect();
ResultSet resultSet = session.execute("SELECT * FROM TABLE");

But the problem with this is that creating a Session is a very expensive operation, and it should be done only once per keyspace. You create the Session once and reuse it for all read queries.
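
Relatedly, since the same lookup runs once per record, the statement can also be prepared once against the shared Session and only the key bound per record. This is a standard driver feature, not part of my original answer; a minimal sketch, with placeholder table and column names:

// Prepare once, e.g. right after creating the shared Session
// (the table/column names here are placeholders).
PreparedStatement lookup = session.prepare("SELECT * FROM test WHERE id = ?");

// Per incoming record, only the bind + execute cost is paid.
ResultSet resultSet = session.execute(lookup.bind("hello1"));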

Now, since Session is not Java-serializable, it cannot be passed as an argument to Flink operators like MapFunction or ProcessFunction. There are a few ways of solving this: you can use a RichFunction and initialize the Session in its open method (sketched below), or use a singleton. I will use the second solution.
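
For reference, a minimal sketch of the first approach, assuming the ClusterBuilder from the question; the class name, input type, and query are hypothetical placeholders:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.connectors.cassandra.ClusterBuilder;
import org.apache.flink.util.Collector;

public class CassandraLookupFlatMap extends RichFlatMapFunction<String, ResultSet> {
    private final ClusterBuilder clusterBuilder; // Serializable, safe to ship with the function
    private transient Session session;           // created per parallel instance, never serialized

    public CassandraLookupFlatMap(ClusterBuilder clusterBuilder) {
        this.clusterBuilder = clusterBuilder;
    }

    @Override
    public void open(Configuration parameters) {
        // one Session per parallel instance, created when the task starts
        session = clusterBuilder.getCluster().connect();
    }

    @Override
    public void flatMap(String id, Collector<ResultSet> out) {
        // positional bind; 'test' and 'id' are placeholder table/column names
        out.collect(session.execute("SELECT * FROM test WHERE id = ?", id));
    }

    @Override
    public void close() {
        if (session != null) {
            session.getCluster().close(); // closes the session as well
        }
    }
}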

Make a Singleton Class as follows where we create the Session.

public class CassandraSessionSingleton {
    private static CassandraSessionSingleton cassandraSessionSingleton = null;

    public Session session;

    private CassandraSessionSingleton(ClusterBuilder clusterBuilder) {
        Cluster cluster = clusterBuilder.getCluster();
        session = cluster.connect();
    }

    // synchronized, because several parallel subtasks in the same
    // task manager JVM may call getInstance() concurrently
    public static synchronized CassandraSessionSingleton getInstance(ClusterBuilder clusterBuilder) {
        if (cassandraSessionSingleton == null)
            cassandraSessionSingleton = new CassandraSessionSingleton(clusterBuilder);
        return cassandraSessionSingleton;
    }
}

You can then make use of this Session for all future queries. As an example, here I'm using a ProcessFunction to make the queries.

public class SomeProcessFunction extends ProcessFunction<Object, ResultSet> {
    private final ClusterBuilder secureCassandraSinkClusterBuilder;

    // ClusterBuilder is Serializable, so it can be passed in through the constructor
    public SomeProcessFunction(ClusterBuilder secureCassandraSinkClusterBuilder) {
        this.secureCassandraSinkClusterBuilder = secureCassandraSinkClusterBuilder;
    }

    @Override
    public void processElement(Object obj, Context ctx, Collector<ResultSet> out) throws Exception {
        ResultSet resultSet = CassandraLookUp.cassandraLookUp("SELECT * FROM TEST", secureCassandraSinkClusterBuilder);
        out.collect(resultSet);
    }
}

Note that you can pass a ClusterBuilder to the ProcessFunction, as it is Serializable. Now for the cassandraLookUp method, where we execute the query:

public class CassandraLookUp {
    public static ResultSet cassandraLookUp(String query, ClusterBuilder clusterBuilder) {
        CassandraSessionSingleton cassandraSessionSingleton = CassandraSessionSingleton.getInstance(clusterBuilder);
        Session session = cassandraSessionSingleton.session;
        ResultSet resultSet = session.execute(query);
        return resultSet;
    }
}

The singleton object is created only the first time the query is run; after that, the same object is reused, so there is no delay in the lookup.
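
Finally, the ProcessFunction is applied to the Kafka stream like any other operator. A minimal wiring sketch, assuming a DataStream<Object> named inputStream built from the Kafka source (the variable name is a placeholder):

// Hypothetical wiring: inputStream is the DataStream read from Kafka.
DataStream<ResultSet> lookupResults =
        inputStream.process(new SomeProcessFunction(secureCassandraSinkClusterBuilder));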

Harshith Bolar