
I often have a Weaviate container that freezes up with the following messages:

[
    {
        "level": "info",
        "msg": "No network configured. Not Joining one.",
        "time": "2020-02-28T11:46:02Z"
    },
    {
        "action": "esvector_startup",
        "level": "info",
        "maxWaitTime": 120000000000,
        "msg": "waiting for es vector to start up (maximum 2m0s)",
        "time": "2020-02-28T11:46:02Z"
    },
    {
        "action": "restapi_management",
        "level": "info",
        "msg": "Serving weaviate at http://[::]:8080",
        "time": "2020-02-28T11:46:02Z"
    },
    {
        "level": "warn",
        "ts": "2020-02-29T14:39:54.813Z",
        "caller": "clientv3/retry_interceptor.go:61",
        "msg": "retrying of unary invoker failed",
        "target": "endpoint://client-5b29a880-9d71-462e-b7ac-a573d5e2e6e5/etcd:2379",
        "attempt": 0,
        "error": "rpc error: code = Unavailable desc = etcdserver: request timed out"
    }
]
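
For context, the warning above comes from the etcd v3 Go client (clientv3) that Weaviate keeps open for the lifetime of the process. The following is a minimal sketch, not Weaviate's actual code, of the kind of call that surfaces this error; the key name is made up for illustration:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

func main() {
	// Endpoint matches the target shown in the Weaviate log above.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("creating etcd client: %v", err)
	}
	defer cli.Close()

	// Unary calls such as Get go through the client's retry interceptor
	// (clientv3/retry_interceptor.go in the warning above). When the etcd
	// cluster has no stable leader, the call eventually fails with
	// "rpc error: code = Unavailable desc = etcdserver: request timed out".
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "some-key"); err != nil { // key name is just an example
		log.Printf("etcd request failed: %v", err)
	}
}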

On the ETCD-0 pod I have the following log:

2020-02-29 14:39:56.955415 I | raft: 4871a93f1a47265c [term: 316] ignored a MsgHeartbeatResp message with lower term from 5b97572db0dcba3a [term: 311]
2020-02-29 14:39:57.805445 I | raft: 4871a93f1a47265c is starting a new election at term 316
2020-02-29 14:39:57.805480 I | raft: 4871a93f1a47265c became candidate at term 317
2020-02-29 14:39:57.805493 I | raft: 4871a93f1a47265c received MsgVoteResp from 4871a93f1a47265c at term 317
2020-02-29 14:39:57.805503 I | raft: 4871a93f1a47265c [logterm: 311, index: 70906] sent MsgVote request to 5b97572db0dcba3a at term 317
2020-02-29 14:39:57.845489 I | raft: 4871a93f1a47265c received MsgVoteResp from 5b97572db0dcba3a at term 317
2020-02-29 14:39:57.845522 I | raft: 4871a93f1a47265c [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections
2020-02-29 14:39:57.845541 I | raft: 4871a93f1a47265c became leader at term 317
2020-02-29 14:39:57.845551 I | raft: raft.node: 4871a93f1a47265c elected leader 4871a93f1a47265c at term 317
2020-02-29 15:37:06.688476 N | compactor: Starting auto-compaction at revision 68399 (retention: 4h0m0s)

On my ETCD-1 pod I have the following log:

2020-02-29 13:37:06.663962 I | mvcc: finished scheduled compaction at 66935 (took 859.28µs)
2020-02-29 14:37:06.685573 I | mvcc: store.index: compact 67655
2020-02-29 14:37:06.688036 I | mvcc: finished scheduled compaction at 67655 (took 1.405431ms)
2020-02-29 14:39:51.509681 I | raft: 5b97572db0dcba3a [logterm: 311, index: 70906, vote: 4871a93f1a47265c] ignored MsgVote from 4871a93f1a47265c [logterm: 311, index: 70906] at term 311: lease is not expired (remaining ticks: 10)
2020-02-29 14:39:52.811383 I | raft: 5b97572db0dcba3a [logterm: 311, index: 70906, vote: 4871a93f1a47265c] ignored MsgVote from 4871a93f1a47265c [logterm: 311, index: 70906] at term 311: lease is not expired (remaining ticks: 10)
2020-02-29 14:39:54.610822 I | raft: 5b97572db0dcba3a [logterm: 311, index: 70906, vote: 4871a93f1a47265c] ignored MsgVote from 4871a93f1a47265c [logterm: 311, index: 70906] at term 311: lease is not expired (remaining ticks: 10)
2020-02-29 14:39:55.610694 I | raft: 5b97572db0dcba3a [logterm: 311, index: 70906, vote: 4871a93f1a47265c] ignored MsgVote from 4871a93f1a47265c [logterm: 311, index: 70906] at term 311: lease is not expired (remaining ticks: 10)

After restarting the container, everything works well.

I have deployed Weaviate on Kubernetes and use the following configuration for etcd:

# Etcd
#
# Weaviate stores critical configuration where strong consistency is required
# in etcd.
etcd:
  fullnameOverride: etcd
  envVarsConfigMap: 'etcd-config'
  statefulset:
    replicaCount: 2
  ##
  auth:
    rbac:
      enabled: false
    client:
      ## Switch to encrypt client communication using TLS certificates
      secureTransport: false
      ## Switch to automatically create the TLS certificates
      useAutoTLS: false
      enableAuthentication: false
    peer:
      ## Switch to encrypt client communication using TLS certificates
      secureTransport: true
      ## Switch to automatically create the TLS certificates
      useAutoTLS: true
      ## Switch to enable host authentication using TLS certificates. Requires existing secret.
      enableAuthentication: false
  metrics:
    enabled: true
    podAnnotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port: '2379'
  disasterRecovery:
    # If you set `enabled: true` you need to make sure that an NFS provisioner
    # runs in your cluster! See
    # https://github.com/bitnami/charts/tree/master/bitnami/etcd#disaster-recovery
    # Defaults to 'false' so the chart works without an NFS provisioner.
    # However, 'enabled: true' is strongly recommended!
    enabled: false
    cronjob:
      schedule: '*/30 * * * *'
      historyLimit: 1
      podAnnotations: {}
    pvc:
      size: 2Gi
      storageClassName: default
  startFromSnapshot:
    enabled: false
    ## Existing PVC containing the etcd snapshot
    ##
    # existingClaim
    ## Snapshot filename
    ##
    # snapshotFilename
    #

Is there anything wrong in my configuration that causes the Weaviate container to freeze up?

Thanks!

  • The only issue I can see is that `default` is most likely not a supported storageClassName for `etcd.disasterRecovery.pvc.storageClassName` [as explained here](https://stackoverflow.com/questions/60231923/issues-while-deploying-weaviate-on-aks-azure-kubernetes-service/60505796#60505796). However, since `etcd.disasterRecovery.pvc` is set to `false` the above setting should have no effect anyway. If you can consistently reproduce the error, feel free to open an issue on [semi-technologies/weaviate-helm](https://github.com/semi-technologies/weaviate-helm/issues). Thanks. – etiennedi Mar 03 '20 at 11:22
  • Hi @etiennedi unfortunately I cannot reproduce it at this moment. The only actions I execute in Weaviate are creating things and classification jobs (after the thing is created). Is there a place/log that I could use to have a closer look at this issue? The issue occurs multiple times per day. – Jeroen Mar 03 '20 at 11:51
  • It seems that after a while the etcd client loses its ability to connect to etcd. The client itself is stateful in that it tries to keep the connection open. That explains why a restart of the Weaviate pod fixes the problem. After a restart you create a new client. We could potentially automate this behavior by making Weaviate fail its health checks if it can't connect to etcd (see the sketch after these comments). But before we put something like that in, we must understand why the connection to etcd leads to a timeout. – etiennedi Mar 03 '20 at 14:23
  • Totally understand that. Please let me know if I could help you by sending some logs or facilitate in a remote debugging session. – Jeroen Mar 03 '20 at 17:58
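
For illustration, here is a hypothetical sketch of the health-check idea from the comment above: a liveness endpoint that probes etcd with a short timeout, so that Kubernetes can restart the pod, and thereby recreate the etcd client, when the connection goes stale. The endpoint path, port, and use of clientv3.Status are assumptions, not Weaviate's actual implementation.

package main

import (
	"context"
	"log"
	"net/http"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("creating etcd client: %v", err)
	}
	defer cli.Close()

	// Liveness endpoint: Kubernetes would be configured to probe this path and
	// restart the pod after a few consecutive failures, which recreates the
	// etcd client (the same effect as the manual restart described in the question).
	http.HandleFunc("/live", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		// Status is a cheap round trip to a single etcd endpoint.
		if _, err := cli.Status(ctx, "etcd:2379"); err != nil {
			http.Error(w, "etcd unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}

A Kubernetes livenessProbe pointing at this path would then automate the manual restart described in the question.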

0 Answers