
I have two simple Flink streaming jobs that read from Kafka, do some transformations, and put the result into a Cassandra sink. They read from different Kafka topics and save into different Cassandra tables.

When I run either of the two jobs alone, everything works fine: checkpoints are triggered and completed, and data is saved to Cassandra.

But whenever I run both jobs (or one of them twice), the second job fails at startup with this exception: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1] Error writing)).

I could not find much info about this error; it may be caused by any one of the following:

  • Flink (v 1.10.0-scala_2.12),
  • Flink Cassandra Connector (flink-connector-cassandra_2.11:jar:1.10.2, also tried with flink-connector-cassandra_2.12:jar:1.10.0),
  • Datastax underlying driver (v 3.10.2),
  • Cassandra v4.0 (same with v3.0),
  • Netty transport (v 4.1.51.Final).

I also use packages that may collide with the ones above (a quick runtime check for duplicate classes is sketched after this list):

  • mysql-connector-java (v 8.0.19),
  • cassandra-driver-extras (v 3.10.2)

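A minimal sketch of such a check, printing the jar a class was actually loaded from (the two classes are arbitrary picks for illustration; note the connector may relocate netty under a shaded package, so the netty line is only indicative):

import com.datastax.driver.core.Cluster;
import io.netty.channel.Channel;

public class ClasspathCheck {
    public static void main(String[] args) {
        // Prints the jar each class was actually loaded from,
        // which helps spot duplicate or shaded copies on the classpath.
        System.out.println(Cluster.class.getProtectionDomain().getCodeSource().getLocation());
        System.out.println(Channel.class.getProtectionDomain().getCodeSource().getLocation());
    }
}
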
Finally, this is my code for the cluster builder:

ClusterBuilder builder = new ClusterBuilder() {
    @Override
    protected Cluster buildCluster(Cluster.Builder builder) {
        Cluster cluster = null;
        try {
            cluster = builder
                    .addContactPoint("localhost")
                    .withPort(9042)
                    .withClusterName("Test Cluster")
                    .withoutJMXReporting()
                    .withProtocolVersion(ProtocolVersion.V4)
                    .withoutMetrics()
                    .build();

            // register codecs from datastax extras.
            cluster.getConfiguration().getCodecRegistry()
                    .register(LocalTimeCodec.instance);
        } catch (ConfigurationException e) {
            e.printStackTrace();
        } catch (NoHostAvailableException nhae) {
            nhae.printStackTrace();
        }

        return cluster;
    }
};

I tried different PoolingOptions and SocketOptions settings, but with no success.
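
For illustration, this is roughly the kind of tuning I attempted inside buildCluster() (the values below are placeholders, not the exact settings I used):

import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.SocketOptions;

// Example values only - I experimented with several combinations like this,
// with no effect on the error.
PoolingOptions poolingOptions = new PoolingOptions()
        .setConnectionsPerHost(HostDistance.LOCAL, 1, 2)
        .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024);
SocketOptions socketOptions = new SocketOptions()
        .setConnectTimeoutMillis(20000)
        .setReadTimeoutMillis(30000);

cluster = builder
        .addContactPoint("localhost")
        .withPort(9042)
        .withPoolingOptions(poolingOptions)
        .withSocketOptions(socketOptions)
        .build();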

Cassandra Sink:

CassandraSink.addSink(dataRows)
        .setQuery("insert into table_name_(16 columns names) " +
                "values (16 placeholders);")
        .enableWriteAheadLog()
        .setClusterBuilder(builder)
        .setFailureHandler(new CassandraFailureHandler() {
            @Override
            public void onFailure(Throwable throwable) {
                LOG.error("A {} occurred.", "Cassandra Failure", throwable);
            }
        })
        .build()
        .setParallelism(1)
        .name("Cassandra Sink For Unique Count every N minutes.");

The full stack trace from the Flink job manager:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1] Error writing))
    at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
    at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:77)
    at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1414)
    at com.datastax.driver.core.Cluster.init(Cluster.java:162)
    at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:333)
    at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:308)
    at com.datastax.driver.core.Cluster.connect(Cluster.java:250)
    at org.apache.flink.streaming.connectors.cassandra.CassandraSinkBase.createSession(CassandraSinkBase.java:143)
    at org.apache.flink.streaming.connectors.cassandra.CassandraSinkBase.open(CassandraSinkBase.java:87)
    at org.apache.flink.streaming.connectors.cassandra.AbstractCassandraTupleSink.open(AbstractCassandraTupleSink.java:49)
    at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
    at org.apache.flink.streaming.api.operators.StreamSink.open(StreamSink.java:48)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:1007)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:454)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:449)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:461)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
    at java.base/java.lang.Thread.run(Thread.java:834)

Any help is appreciated.

Edit:

  • I just tried using two separate Cassandra instances (different machines and different clusters). I then pointed one job at one instance and the other job at the other instance. Nothing changed; I still get the same error.
  • I tried to reduce dependencies; here is the new pom file:
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.abcde.ai</groupId>
    <artifactId>analytics-etl</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>Flink Quickstart Job</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.10.2</flink.version>
        <java.version>1.8</java.version>
        <scala.binary.version>2.11</scala.binary.version>
        <maven.compiler.source>${java.version}</maven.compiler.source>
        <maven.compiler.target>${java.version}</maven.compiler.target>
    </properties>

    <repositories>
        <repository>
            <id>apache.snapshots</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <!-- Apache Flink dependencies -->
        <!-- These dependencies are provided, because they should not be packaged into the JAR file. -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-cassandra_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>commons-configuration</groupId>
            <artifactId>commons-configuration</artifactId>
            <version>1.10</version>
        </dependency>
        <!-- Add logging framework, to produce console output when running in the IDE. -->
        <!-- These dependencies are excluded from the application JAR by default. -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.7</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
            <scope>runtime</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>

            <!-- Java Compiler -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>

            <!-- We use the maven-shade plugin to create a fat jar that contains all necessary dependencies. -->
            <!-- Change the value of <mainClass>...</mainClass> if your program entry point changes. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <!-- Run shade goal on package phase -->
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.apache.flink:force-shading</exclude>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>log4j:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <!-- Do not copy the signatures in the META-INF folder.
                                    Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.abcde.analytics.etl.KafkaUniqueCountsStreamingJob</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <pluginManagement>
            <plugins>

                <!-- This improves the out-of-the-box experience in Eclipse by resolving some warnings. -->
                <plugin>
                    <groupId>org.eclipse.m2e</groupId>
                    <artifactId>lifecycle-mapping</artifactId>
                    <version>1.0.0</version>
                    <configuration>
                        <lifecycleMappingMetadata>
                            <pluginExecutions>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>org.apache.maven.plugins</groupId>
                                        <artifactId>maven-shade-plugin</artifactId>
                                        <versionRange>[3.1.1,)</versionRange>
                                        <goals>
                                            <goal>shade</goal>
                                        </goals>
                                    </pluginExecutionFilter>
                                    <action>
                                        <ignore/>
                                    </action>
                                </pluginExecution>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>org.apache.maven.plugins</groupId>
                                        <artifactId>maven-compiler-plugin</artifactId>
                                        <versionRange>[3.1,)</versionRange>
                                        <goals>
                                            <goal>testCompile</goal>
                                            <goal>compile</goal>
                                        </goals>
                                    </pluginExecutionFilter>
                                    <action>
                                        <ignore/>
                                    </action>
                                </pluginExecution>
                            </pluginExecutions>
                        </lifecycleMappingMetadata>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>

Edit: I managed to narrow down the problem. The error goes away when I mark the flink-connector-cassandra dependency as provided and simply copy the jar file from my local maven repository (~/.m2/repository/org/apache/flink/flink-connector-cassandra_2.11/1.10.2/flink-connector-cassandra_2.11-1.10.2.jar) to Flink's lib folder. My problem is solved, but the root cause is still a mystery.

KLiFF
  • Might this be related to the Cassandra connections limit? Specifically, what is the `connection_pool_size` value? – Mikalai Lushchytski Oct 13 '20 at 14:19
  • It is the default value. I tried values like 100 and 500. Just to be sure, do you mean the PoolingOptions of the builder? – KLiFF Oct 13 '20 at 14:26
  • No, I'm talking about the Cassandra configuration settings in cassandra.yaml - the error you get when you try to connect from multiple applications looks like you may have hit the connection pool limit in Cassandra, so that Cassandra does not accept any further connections. – Mikalai Lushchytski Oct 13 '20 at 14:28
  • Adding this config ```connection_pool_size:50``` causes ```Invalid yaml. Please remove properties [connection_pool_size] from your cassandra.yaml``` when I start Cassandra. – KLiFF Oct 13 '20 at 14:49
  • Do you have any config setting like `native_transport_max_concurrent_connections`? If not, the default is unlimited. – Mikalai Lushchytski Oct 13 '20 at 14:58
  • I set ```streaming_connections_per_host: -1```, and ```native_transport_max_concurrent_connections``` and ```native_transport_max_concurrent_connections_per_ip``` are unchanged. – KLiFF Oct 13 '20 at 15:04
  • I also tried to play around with the builder's ```PoolingOptions``` and ```SocketOptions```, adding more time to every timeout. This does not seem to have an impact, because the delay between the task switching to RUNNING and the error is almost always the same: 3 to 4 seconds. – KLiFF Oct 13 '20 at 15:43
  • By the time you get to the second job, is Cassandra up and running? Can you watch the "nodetool status" output while the jobs are running? – dilsingi Oct 14 '20 at 05:03
  • Hi, yes, Cassandra is up and running. I can run queries using cqlsh. Also, the Cassandra logs (debug, system and gc) show the usual messages, nothing about connection loss or rejection or the cluster being shut down. – KLiFF Oct 14 '20 at 05:24
  • What is the replication factor in Cassandra for the corresponding table, and what write consistency is configured when pushing from Flink? – Jaya Ananthram Oct 16 '20 at 17:05
  • The replication factor for the table is 1 and the write consistency is ANY. I kept things simple, using only one Cassandra node, in order to eliminate any possible network congestion. – KLiFF Oct 16 '20 at 18:57
  • Could you please describe your env setup, if possible? Is it a local docker-compose or a remote cluster? – Mikalai Lushchytski Oct 19 '20 at 14:43
  • Hi, everything is located on my laptop. The error initially occurred on an on-premise 3-node cluster, but I managed to simplify the setup and reproduce the error on my laptop. All information provided so far is about my local setup. – KLiFF Oct 19 '20 at 15:18

2 Answers


I might be wrong, but most likely the issue is caused by a netty client version conflict. The error states NoHostAvailableException, however the underlying error is a TransportException with an "Error writing" message. Cassandra itself is definitely operating well.

There is a somewhat similar Stack Overflow case - Cassandra - error writing - with very similar symptoms: a single project running well, and an AllNodesFailedException with a TransportException ("Error writing") as the root cause when one more is added. The author was able to solve it by unifying the netty client version.

In your case, I'm not sure why there are so many dependencies, so I would try to get rid of all the extras and leave just the Flink (v 1.10.0-scala_2.12) and Flink Cassandra Connector (flink-connector-cassandra_2.12:jar:1.10.0) libraries. They must already include the necessary drivers, netty, etc. All other drivers should be skipped (at least for an initial iteration, to ensure that this solves the issue and that it is a library conflict).
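
One way to verify this (just a sketch, assuming netty 4 is on the classpath; Version.identify() is a standard netty utility) is to log which netty artifacts and versions are actually visible at runtime from the job:

import io.netty.util.Version;
import java.util.Map;

public class NettyVersionCheck {
    public static void main(String[] args) {
        // Each entry is a netty artifact found on the classpath and its version;
        // mixed or unexpected versions here would point to the conflict.
        for (Map.Entry<String, Version> entry : Version.identify().entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().artifactVersion());
        }
    }
}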

Mikalai Lushchytski
  • Hi, I reduced my project dependencies as you suggested. I even started from a new maven project, and I only use (this is the full list): ```flink-java, flink-streaming-java_2.12, flink-streaming-scala_2.12, com.google.code.gson, flink-connector-kafka_2.12, flink-connector-cassandra_2.12, commons-configuration```. I spent the afternoon inspecting the netty package deps. My guess is that the datastax driver either embeds a buggy version of it or there is a version mismatch. I tried downgrading Cassandra to 2.2.18 to check if the driver supports that version, but this did not fix the problem. – KLiFF Oct 20 '20 at 17:38
  • Hi @KLiFF, could you please provide your reduced maven pom.xml file? – Mikalai Lushchytski Oct 21 '20 at 08:59
  • Hi @Mikalai Lushchytski, I added the pom file content to the question. – KLiFF Oct 21 '20 at 09:32
  • @KLiFF, could you please clarify how you are running the jobs? Do you have a docker flink cluster setup? What is the flink version and the `flink/lib` folder content? Thank you. – Mikalai Lushchytski Oct 21 '20 at 13:42
  • Hi, I run Flink using the command $FLINK_HOME/bin/start-cluster.sh. Then I use the web UI to load the jar file and run the job directly from the UI. This is the content of the lib folder: flink-dist_2.12-1.10.0.jar, flink-table_2.12-1.10.0.jar, flink-table-blink_2.12-1.10.0.jar, log4j-1.2.17.jar, mysql-connector-java-5.1.48.jar. – KLiFF Oct 22 '20 at 08:46
  • @KLiFF, there is not a lot of chance, but it is at least worth trying - could you please try to remove the mysql-connector-java lib from `flink/lib` and restart? – Mikalai Lushchytski Oct 22 '20 at 08:51
  • Hi @Mikalai, I did that but it did not fix the problem; I still have the same error. – KLiFF Oct 22 '20 at 09:50
  • I added an interesting finding; please check the end of the question. – KLiFF Oct 22 '20 at 09:54
  • @KLiFF, does `flink-connector-cassandra_2.11` or `flink-connector-cassandra_2.12` resolve the issue? It seems to be a scala version mismatch compared to the local `flink/lib` jars. – Mikalai Lushchytski Oct 22 '20 at 09:58
  • They both do, as long as I mark the dependency as provided. I tried my code using ```flink-connector-cassandra_2.12``` while putting ```flink-connector-cassandra_2.11``` in the lib folder, and it works. I think this has something to do with the shading operation. – KLiFF Oct 22 '20 at 10:06

To fix the error, I marked the flink-connector-cassandra dependency as provided, copied the jar file from my local maven repository (~/.m2/repository/org/apache/flink/flink-connector-cassandra_2.11/1.10.2/flink-connector-cassandra_2.11-1.10.2.jar) to Flink's lib folder, and restarted Flink. Here is the relevant dependency in my new pom.xml file:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-cassandra_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>

How did I find this? I was about to try compiling the connector from source with a more recent driver version. First, I tried to reproduce the error with the unchanged sources, so I compiled them without modification and put the jar into Flink's lib folder - hooray, it works! I then suspected that the jar from maven was somehow different. I copied it into the lib folder and, to my surprise, it also worked.

My problem is solved, but the root cause remains a mystery.

My last attempt was to check if any packages conflict with the Cassandra connector, so I ran mvn dependency:tree -Dverbose. There was one conflict, with org.apache.flink:flink-metrics-dropwizard over metrics-core:

[INFO] +- org.apache.flink:flink-connector-cassandra_2.12:jar:1.10.0:provided
[INFO] |  +- (io.dropwizard.metrics:metrics-core:jar:3.1.2:provided - omitted for conflict with 3.1.5)
[INFO] |  \- (org.apache.flink:force-shading:jar:1.10.0:provided - omitted for duplicate)

I removed this dependency from my project, but the error remains unless the connector is marked as provided and also placed in the lib folder.

KLiFF