Provisioning a DSE cluster with the Lifecycle Manager fails consistently. The master node (also the one OpsCenter is running on) installed correctly. Every other node fails the install (and also the config) task. I have double-checked the SSH credentials and ports. Any ideas on how to investigate further and fix the issue would be great.

Please excuse the length - trying to provide all of the relevant info.

Ubuntu 14.04.4, JRE: 1.8.0.91, DSE 5.0.0

job events:

   ...
    "results": [
        {
            "event-subtype": "start",
            "event-type": "milestone",
            "message": "job started...",
            ...
        },
        {
            "event-subtype": "invocation",
            "event-type": "shell-command",
            "message": "Invoked command: if [ -x $(which yum) ] && [ -f /etc/redhat-release -o -f /etc/SuSE-release ]; then echo -n yum; elif [ -x $(which apt-get) ]; then echo -n apt; fi"
            ...
        },
        {
            "event-subtype": "uploaded-facts",
            "event-type": "milestone",
            "message": "Uploaded facts to OpsCenter server",
            ...
        },
        {
            "event-subtype": "meld-error",
            "event-type": "error",
            "message": "Unexpected error executing meld",
            ...
        },
        {
            "event-subtype": "MeldError",
            "event-type": "error",
            "message": "Meld failed on: name=\"NODE-2\" ssh-management-address=\"<IP>\" node-id=\"<node-id>\" job-id=\"<job-id>\" stdout=\"\r\n\" stderr=\"\"",
            ...
        }
    ]

opscenterd.log

/var/log/opscenter/opscenterd.log-2016-07-02 16:34:16,848 [opscenterd]  INFO: Install job started for node name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" (async-thread-macro-53)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:16,850 [opscenterd]  INFO: using ssh-private-key (async-thread-macro-53)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:18,478 [opscenterd]  INFO: Received milestone from node name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" message="Uploaded facts to OpsCenter server" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" (MainThread)
/var/log/opscenter/opscenterd.log:2016-07-02 16:34:18,675 [opscenterd] ERROR: Received error from node event-subtype="meld-error" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" name="NODE-2" traceback="Traceback (most recent call last):
/var/log/opscenter/opscenterd.log:  File \"meld.py\", line 3313, in run
/var/log/opscenter/opscenterd.log-    rc = engine.go()
/var/log/opscenter/opscenterd.log:  File \"meld.py\", line 2991, in go
/var/log/opscenter/opscenterd.log-    self.file_manager.get_config_files()
/var/log/opscenter/opscenterd.log:  File \"meld.py\", line 1280, in get_config_files
/var/log/opscenter/opscenterd.log-    {\"accept\": \"application/json\"})
/var/log/opscenter/opscenterd.log:  File \"meld.py\", line 598, in get
/var/log/opscenter/opscenterd.log-    return json.loads(response.read())
/var/log/opscenter/opscenterd.log-  File \"/usr/lib/python2.7/socket.py\", line 351, in read
/var/log/opscenter/opscenterd.log-    data = self._sock.recv(rbufsize)
/var/log/opscenter/opscenterd.log-  File \"/usr/lib/python2.7/httplib.py\", line 549, in read
/var/log/opscenter/opscenterd.log-    return self._read_chunked(amt)
/var/log/opscenter/opscenterd.log-  File \"/usr/lib/python2.7/httplib.py\", line 609, in _read_chunked
/var/log/opscenter/opscenterd.log-    value.append(self._safe_read(amt))
/var/log/opscenter/opscenterd.log-  File \"/usr/lib/python2.7/httplib.py\", line 666, in _safe_read
/var/log/opscenter/opscenterd.log-    raise IncompleteRead(''.join(s), amt)
/var/log/opscenter/opscenterd.log:IncompleteRead: IncompleteRead(4153 bytes read, 4039 more expected)" ssh-management-address="<IP>" node-id="<node-id>" event-type="error" message="Unexpected error executing meld" (MainThread)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:18,892 [opscenterd] ERROR: Install job a630c081-6ac1-4b00-ac08-18fef320e0d5 failed! (async-thread-macro-54)
/var/log/opscenter/opscenterd.log:2016-07-02 16:34:19,105 [opscenterd] ERROR: Meld failed on: name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" stdout="
/var/log/opscenter/opscenterd.log-" stderr="" (async-thread-macro-53)

Thank you

EDIT: I captured the HTTP traffic between NODE-2 and the master. The error occurs while the config files are being transferred. One of them is not transferred completely for some reason. The JSON looks reasonable until some gibberish appears.

 {"filename": "dse.yaml", "contents": {"internode_messaging_options": {"client_worker_threads": 16, "port": 8609, "server_worker_threads": 16, "server_acceptor_thread

Yvatv+~UK{.kMI4^QOrqQTDX_3"DPm,v!"H&M$!1M7

LRYCs{l>-df;cj

W6C9dq

The config files are valid and do work on the master node. Only the replication fails.
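For anyone trying to confirm the same truncation outside of meld, here is a minimal Python 3 sketch. It uses only the standard library; `read_json_body` and the keys of the dict it returns are hypothetical names chosen for illustration and are not part of meld.py or OpsCenter. It reads a response body and, if the transfer is cut short, reports how many bytes arrived and how many were still expected, which are the same two numbers that appear in the `IncompleteRead` message in the log above.

```python
import http.client
import json


def read_json_body(response):
    """Read an HTTP response body and decode it as JSON.

    `response` is any object with a read() method, for example the
    http.client.HTTPResponse returned by HTTPConnection.getresponse().
    If the body is cut short, return the byte counts instead of raising.
    """
    try:
        raw = response.read()
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes received before the stream ended;
        # exc.expected is how many more bytes were still expected.
        return {
            "ok": False,
            "bytes_read": len(exc.partial),
            "bytes_missing": exc.expected,
        }
    return {"ok": True, "data": json.loads(raw)}
```

Running this against the same endpoint from the failing node and from the master could show whether the truncation depends on the network path rather than on the file contents.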

Mike Lococo
kostja
  • this unanswered issue could be related: https://stackoverflow.com/questions/38153032 – kostja Jul 04 '16 at 07:08
  • Please confirm that you can log into Node 2 with the IP and credentials provided to LCM – phact Jul 04 '16 at 14:02
  • Yes, confirmed. LCM can login as well, deposits a meld.py over ssh. The meld.py then transfers some of the config files over HTTP but fails on transfer. Please see the edit. – kostja Jul 05 '16 at 11:16
  • Consider tagging future opscenter/LCM questions with the OpsCenter tag in addition to the datastax tag, as some OpsCenter devs keep an eye on that tag. I made a proposed edit to your post to add that tag here. – Mike Lococo Jul 05 '16 at 14:19
  • Sure, I wasn't aware of its existence yet – kostja Jul 05 '16 at 14:25

2 Answers

OpsCenter LCM developer here. Your issue is caused by OPSC-8851 in the LCM known issues list: http://docs.datastax.com/en/opscenter/6.0/opsc/release_notes/opscReleaseNotes600.html

This is only triggered under certain network conditions and was discovered too close to release to get fixed in 6.0.0. It's a high priority though, and will be fixed in a subsequent release soon. Unfortunately, I don't think there's anything you can do to work around this in the field. If you're a DataStax customer, you could contact support and potentially get a patch now to work around the issue. Otherwise, the only thing I can suggest is to watch the upcoming release notes.

Edit: I should also note that in our tests the issue is intermittent. LCM is designed so you can rerun failed jobs safely (aka it's idempotent) so in all but the most extreme cases you can also work around this just by rerunning your job.
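As a rough illustration of the "just rerun it" approach, a retry loop could look like the Python sketch below. `run_job` is a hypothetical callable standing in for whatever triggers one job run and reports success; it is not an OpsCenter or LCM API, and the attempt limit and delay are arbitrary assumptions. The loop is only safe because the job is idempotent, as noted above.

```python
import time


def rerun_until_success(run_job, max_attempts=5, delay_seconds=0.0):
    """Call run_job() until it returns True or the attempts run out.

    run_job is a hypothetical callable (not an OpsCenter API) that
    triggers one job run and returns True on success, False on failure.
    Returns the 1-based attempt number that succeeded, or None.
    """
    for attempt in range(1, max_attempts + 1):
        if run_job():
            return attempt
        if attempt < max_attempts:
            # Pause before the next attempt; skipped after the last one.
            time.sleep(delay_seconds)
    return None
```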

Mike Lococo
  • Thanks for the response. This does look plausible. Will wait for next bugfix release to confirm. – kostja Jul 05 '16 at 14:22
  • Rerunning the job does not seem to fix the issue in my case - I have been attempting to rerun it for some days now :) Perhaps a different issue then. – kostja Jul 05 '16 at 14:33
  • More likely a more extreme case of the same issue. As noted, it's related to network conditions and possibly your connection causes pathological behavior. Apologies for a poor introductory experience with the new feature. This is an ugly bug. I hope you'll give it another shot once this is fixed. – Mike Lococo Jul 05 '16 at 14:42
  • I will. I have had a good experience with 4.8.x and really look forward to expanding on it. Thanks for a great product. – kostja Jul 06 '16 at 09:03

You can specify the private IP for Listen Address and 0.0.0.0 for broadcast address and LCM should be able to provision appropriately.

  • Sounds interesting. Could you please add more info - where do I make the change: the C* yaml, the OpsCenter settings, or the LCM settings? I am not too familiar with the toolset. – kostja Jul 07 '16 at 09:21
  • Anything is worth a try, but I'm fairly certain that bad ip-address settings can't cause the IncompleteRead error. You'd see DSE fail to start up instead, but this error is caused by meld failing to download something successfully... the various DSE ip settings don't apply there (beyond ssh-management-address being correct enough to ssh in and start meld). – Mike Lococo Jul 07 '16 at 13:28