
I've set up PredictionIO v0.13 on my Linux machine in Docker (running in swarm mode). The setup includes:

  • one container for pio v0.13
  • one container for elasticsearch v5.6.4
  • one container for mysql v8.0.16
  • one container for spark-master v2.3.2
  • one container for spark-worker v2.3.2

The template I am using is ecomm-recommender-java, modified for my data. I don't know whether I made a mistake in the template or in the Docker setup, but something is really wrong:

  1. pio build succeeds
  2. pio train fails with Exception in thread "main" java.io.IOException: Connection reset by peer

Because of this, I added logging at various points in my template, and this is what I found:

  • The train fails after the model is computed. I am using a custom Model class, for holding the logistic-regression model and the various user and product indices.
  • The model is a PersistentModel. In the save method I put logging after every step. Those lines are logged, and I can find the saved results in the mounted Docker volume, so the save also seems to succeed, but after that I get the following exception:
[INFO] [Model] saving user index
[INFO] [Model] saving product index
[INFO] [Model] save done
[INFO] [AbstractConnector] Stopped Spark@20229b7d{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
Exception in thread "main" java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:204)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.predictionio.shaded.org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)

I couldn't find anything more relevant in any of the logs, but it's possible that I overlooked something.

I tried to play with the train parameters, like so: pio-docker train -- --master local[3] --driver-memory 4g --executor-memory 10g --verbose --num-executors 3

  • playing with the Spark modes (i.e. --master local[1-3], or leaving it out to use the Spark instances in the Docker containers)
  • playing with --driver-memory (from 4g to 10g)
  • playing with --executor-memory (also from 4g to 10g)
  • playing with --num-executors (from 1 to 3)

since most Google search results suggested these. My main problem is that I don't know where this exception is coming from or how to track it down.

Here is the save method, which may be relevant:

    public boolean save(String id, AlgorithmParams algorithmParams, SparkContext sparkContext) {
        try {
            logger.info("saving logistic regression model");
            logisticRegressionModel.save("/templates/" + id + "/lrm");
            logger.info("creating java spark context");
            // leftover from earlier debugging; jsc is not used below
            JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
            logger.info("saving user index");
            userIdIndex.saveAsObjectFile("/templates/" + id + "/indices/user");
            logger.info("saving product index");
            productIdIndex.saveAsObjectFile("/templates/" + id + "/indices/product");
            logger.info("save done");
        } catch (IOException e) {
            // note: failures are only printed; save still returns true
            e.printStackTrace();
        }
        return true;
    }

The hardcoded /templates/ path is the Docker volume mounted into both the pio and the spark containers.

The expected result is that train completes without error. I am happy to share more details if necessary; please ask for them, as I am not sure what would be helpful here.

EDIT1: Including docker-compose.yml

version: '3'

networks:
    mynet:
        driver: overlay

services:

    elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:5.6.4
        environment:
          - xpack.graph.enabled=false
          - xpack.ml.enabled=false
          - xpack.monitoring.enabled=false
          - xpack.security.enabled=false
          - xpack.watcher.enabled=false
          - cluster.name=predictionio
          - bootstrap.memory_lock=false
          - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
        volumes:
          - pio-elasticsearch-data:/usr/share/elasticsearch/data
        deploy:
            replicas: 1
        networks:
            - mynet

    mysql:
        image: mysql:8
        command: mysqld --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
        environment:
          MYSQL_ROOT_PASSWORD: somepass
          MYSQL_USER: someuser
          MYSQL_PASSWORD: someotherpass
          MYSQL_DATABASE: pio
        volumes:
          - pio-mysql-data:/var/lib/mysql
        deploy:
            replicas: 1
        networks:
            - mynet

    spark-master:
        image: bde2020/spark-master:2.3.2-hadoop2.7
        ports:
          - "8080:8080"
          - "7077:7077"
        volumes:
            - ./templates:/templates
        environment:
          - INIT_DAEMON_STEP=setup_spark
        deploy:
            replicas: 1
        networks:
            - mynet

    spark-worker:
        image: bde2020/spark-worker:2.3.2-hadoop2.7
        depends_on:
          - spark-master
        ports:
          - "8081:8081"
        volumes:
            - ./templates:/templates
        environment:
          - "SPARK_MASTER=spark://spark-master:7077"
        deploy:
            replicas: 1
        networks:
            - mynet

    pio:
        image: tamassoltesz/pio0.13-spark.230:1
        ports:
            - 7070:7070
            - 8000:8000
        volumes:
            - ./templates:/templates
        dns: 8.8.8.8
        depends_on:
          - mysql
          - elasticsearch
          - spark-master
        environment:
          PIO_STORAGE_SOURCES_MYSQL_TYPE: jdbc
          PIO_STORAGE_SOURCES_MYSQL_URL: "jdbc:mysql://mysql/pio"
          PIO_STORAGE_SOURCES_MYSQL_USERNAME: someuser
          PIO_STORAGE_SOURCES_MYSQL_PASSWORD: someuser
          PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME: pio_event
          PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE: MYSQL
          PIO_STORAGE_REPOSITORIES_MODELDATA_NAME: pio_model
          PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE: MYSQL
          PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE: elasticsearch
          PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS: predictionio_elasticsearch
          PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS: 9200
          PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES: http
          PIO_STORAGE_REPOSITORIES_METADATA_NAME: pio_meta
          PIO_STORAGE_REPOSITORIES_METADATA_SOURCE: ELASTICSEARCH
          MASTER: spark://spark-master:7077 #spark master
        deploy:
            replicas: 1
        networks:
            - mynet

volumes:
    pio-elasticsearch-data:
    pio-mysql-data:
  • Welcome to StackOverflow! IMHO - This is far more likely to be a networking issue than a code issue. Of course your networking is in Docker scripts, so it is code! Could you include your docker-compose.yml, or a similar example? This should help identify networking issues between the containers. – rbrtl May 03 '19 at 10:02
  • Thank you! I edited the question, you can find the docker-compose file there – tamassoltesz May 03 '19 at 11:03
  • I am by no means an expert in Docker - but I need to learn it, thus any excuse is a good one. Looking at the docker-compose file I'm confused by the host names: how do each of your containers know what the others are called? `- "SPARK_MASTER=spark://spark-master:7077"` how does this line end up pointing to the spark master container? – rbrtl May 03 '19 at 13:10
  • In swarm mode docker does the load balancing for the services and also the discovery. Long story short, they can find each other because they are all part of the same network (`mynet` in the example), and the spark-master's service name is `spark-master`. – tamassoltesz May 03 '19 at 13:27
  • Just found the docs for the network links. A quick sanity check: do you need quotes around the `MASTER` environment variable in the `pio` container? – rbrtl May 03 '19 at 13:34
  • I am not sure. Will try without this. – tamassoltesz May 03 '19 at 14:11
  • No, quotes are not needed. – tamassoltesz May 03 '19 at 14:23
  • Hi Tamas, I can't see the log line for the SparkContext. Has this been successfully created? I know less about Spark than I do about Docker (I really hoped someone would have come along to help by now) but that `jsc` variable doesn't seem to be used. [These docs](https://spark.apache.org/docs/latest/monitoring.html) mention new contexts creating a web UI server - this is the only connection I've made to the port `:4040` in your log listing – rbrtl May 03 '19 at 15:23
  • @rbrtl, you are right, that jsc is not used anywhere. That's just a leftover of trying to resolve this same issue. Anyhow, yes, that sparkContext is created, everything from my `save` gets logged. Moreover, I am now sure that the mentioned exception is not coming from the `catch` clause's `e.printStackTrace()`, as I put more logging into the catch clause, and I can't see those log lines, but I still see the exception. – tamassoltesz May 03 '19 at 15:39

1 Answer

I found out what the issue is: the connection to Elasticsearch is somehow lost during the long-running train. This is a Docker issue, not a PredictionIO issue. For now, I "solved" it by not using Elasticsearch at all.
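If you suspect the same thing, one quick way to check whether a container can still reach Elasticsearch is a plain TCP probe using bash's built-in /dev/tcp, which works even in minimal containers without curl or nc. This is a generic sketch: the `predictionio_elasticsearch` hostname below is the one from the compose file's `PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS`, so adjust it to your stack's service name.

```shell
#!/usr/bin/env bash
# TCP reachability probe using bash's /dev/tcp pseudo-device.
probe() {
    local host="$1" port="$2"
    if timeout 2 bash -c "</dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} reachable"
    else
        echo "${host}:${port} unreachable"
    fi
}

# Run from inside the pio container while the train is running, e.g.:
#   docker exec <pio-container> bash -c '... probe ...'
probe predictionio_elasticsearch 9200
```

Running this periodically during a long train would show whether the Elasticsearch connection drops mid-run.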

Another thing I was not aware of: it matters where you put --verbose in the command. Providing it the way I originally did (pio train -- --driver-memory 4g --verbose) has little or no effect on the verbosity of the logging. The right way is pio train --verbose -- --driver-memory 4g, i.e. before the --. This way I got much more log output, from which the origin of the issue became clear.
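For reference, the two placements side by side (pio-docker forwards arguments the same way; the memory flag is just the one from my runs):

```shell
# Little/no effect: everything after "--" is passed through to spark-submit,
# so this --verbose only affects spark-submit, not pio's own logging
pio train -- --driver-memory 4g --verbose

# Much more log output: before "--", --verbose is consumed by pio itself
pio train --verbose -- --driver-memory 4g
```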