I'm trying to set up a three-node Aerospike cluster on Ubuntu 14.04. Apart from the IP address and hostname, each machine is identical. On each machine I installed Aerospike and the management console, following the documentation.

I then edited the network/service and network/heartbeat sections in /etc/aerospike/aerospike.conf:

network {
    service {
        address any
        port 3000
        access-address 10.0.1.11  # 10.0.1.12 and 10.0.1.13 on the other two nodes
    }

    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 10.0.1.11 3002
        mesh-seed-address-port 10.0.1.12 3002
        mesh-seed-address-port 10.0.1.13 3002
        interval 150
        timeout 10
    }

[...]

}

When I run sudo service aerospike start on each of the nodes, the service runs but the nodes don't form a cluster. If I try to add another node in the management console, it tells me: "Node 10.0.1.12:3000 cannot be monitored here as it belongs to a different cluster."

Can you see what I'm doing wrong? What changes should I make to aerospike.conf on each of the nodes in order to set up an Aerospike cluster instead of three isolated instances?

Alex Woolford

1 Answer

Your configuration appears correct.

Check whether each host can open a TCP connection to ports 3001 and 3002 on each of the other hosts:

nc -z -w5 <host> 3001; echo $?
nc -z -w5 <host> 3002; echo $?

If not, I would first suspect the firewall configuration.
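To run the same check from one host against every node in one go, a small loop like the following may help. It is a sketch that assumes an nc supporting -z and -w, as in the commands above; check_mesh_ports is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: probe the fabric (3001) and heartbeat (3002)
# ports on every host passed as an argument.
check_mesh_ports() {
  local host port
  for host in "$@"; do
    for port in 3001 3002; do
      # -z: only test whether the port accepts a connection; -w5: 5 s timeout
      if nc -z -w5 "$host" "$port" >/dev/null 2>&1; then
        echo "$host:$port OK"
      else
        echo "$host:$port FAIL"
      fi
    done
  done
}
```

For example, run check_mesh_ports 10.0.1.11 10.0.1.12 10.0.1.13 on each node; any FAIL line points at a port that is blocked or closed.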

Update 1:

The netcat commands returned 0, so let's try to get more information.

Run the following on each node and provide the output:

asinfo -v service
asinfo -v services
asadm -e info
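If it is more convenient to gather everything from a single machine, a loop like the one below may work. It assumes that both asinfo and asadm accept -h <host> to target a remote node; collect_node_info is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collect the diagnostics above for each node,
# assuming asinfo and asadm can reach remote nodes via -h <host>.
collect_node_info() {
  local host
  for host in "$@"; do
    echo "=== $host ==="
    asinfo -h "$host" -v service
    asinfo -h "$host" -v services
    asadm -h "$host" -e info
  done
}
```

For example: collect_node_info 10.0.1.11 10.0.1.12 10.0.1.13 > node-info.txt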

Update 2:

After inspecting the output in the gists, I found that asadm -e "info net" reported the same Node ID on all nodes.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node               Node                        Fqdn               Ip   Client     Current      HB        HB   
   .                 Id                           .                .    Conns        Time    Self   Foreign   
h      *BB9000000000094   hadoop01.woolford.io:3000   10.0.1.11:3000       15   174464730   37129         0   
Number of rows: 1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node               Node                        Fqdn               Ip   Client     Current      HB        HB   
   .                 Id                           .                .    Conns        Time    Self   Foreign   
h      *BB9000000000094   hadoop03.woolford.io:3000   10.0.1.13:3000        5   174464730   37218         0   
Number of rows: 1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node               Node                        Fqdn               Ip   Client     Current      HB        HB   
   .                 Id                           .                .    Conns        Time    Self   Foreign   
h      *BB9000000000094   hadoop02.woolford.io:3000   10.0.1.12:3000        5   174464731   37203         0   
Number of rows: 1

The Node ID is constructed from the fabric port (3001, which is BB9 in hex) followed by the MAC address in reverse byte order. Another red flag was that "HB Self" was non-zero; in a mesh configuration it is expected to be zero. (In a multicast configuration it is non-zero as well, because the nodes receive their own heartbeat messages.)

Identical Node IDs would indicate that all of the MAC addresses are the same (although it is also possible to change the node IDs using rack aware). Heartbeats that appear to have originated from the local node, i.e. heartbeats carrying the same node ID, are ignored, so the nodes never see each other.
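As a worked illustration of that rule, here is a sketch that derives an ID from a port and a MAC address. It follows the description above rather than Aerospike's source code, and node_id is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch of the rule described above (not Aerospike's actual code):
# fabric port in hex, followed by the MAC bytes in reverse order.
node_id() {
  local port=$1 mac=$2 reversed="" i
  local -a bytes
  # Split the colon-separated MAC into its six bytes
  IFS=: read -r -a bytes <<< "$mac"
  # Append the bytes from last to first
  for (( i=${#bytes[@]}-1; i>=0; i-- )); do
    reversed+=${bytes[i]}
  done
  # %X prints the port as uppercase hex (3001 -> BB9); ^^ uppercases the MAC
  printf '%X%s\n' "$port" "${reversed^^}"
}
```

For example, node_id 3001 6c:3b:e5:2b:9a:ef prints BB9EF9A2BE53B6C. The three em1 MAC addresses listed in the comments give three different IDs under this rule, so identical IDs suggest Aerospike was not reading those addresses.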

Update 3:

The MAC addresses are all unique, which contradicts my previous conclusion. A reply gave the name of the interface in use, em1, which is not an interface name Aerospike looks for: it looks for interfaces named eth#, bond#, or wlan#. I suspect that because the name wasn't one of those three, Aerospike failed to retrieve the real MAC address, which led to the identical node IDs. If so, I would expect to find the following warning in the logs:

Tried eth,bond,wlan and list of all available interfaces on device.Failed to retrieve physical address with errno %d %s
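To check both parts of this hypothesis on a node, the sketch below reports which interface names fit the eth#/bond#/wlan# pattern and searches the log for that warning. The function names are hypothetical, and the default log path /var/log/aerospike/aerospike.log is an assumption based on the stock aerospike.conf; adjust it to match the logging section of your config.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for diagnosing the pre-3.6.0 interface lookup.

# Report whether each interface name looks like eth#, bond# or wlan#.
# With no arguments, checks every interface under /sys/class/net.
check_iface_names() {
  local path name
  if [ "$#" -eq 0 ]; then
    for path in /sys/class/net/*; do
      [ -e "$path" ] || continue
      set -- "$@" "${path##*/}"
    done
  fi
  for name in "$@"; do
    case "$name" in
      eth[0-9]*|bond[0-9]*|wlan[0-9]*) echo "$name: match" ;;
      *) echo "$name: no match" ;;
    esac
  done
}

# Print, with line numbers, any log lines containing the warning above.
# The default path is an assumption; use the file from your logging section.
find_mac_warning() {
  grep -n "Failed to retrieve physical address" \
    "${1:-/var/log/aerospike/aerospike.log}"
}
```

On a node like the ones in the question, check_iface_names em1 prints em1: no match.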

In such cases, the network-interface-name parameter can be used to tell Aerospike which interface to use for node ID generation. This parameter also determines which interface's IP address is advertised to client applications.

network {
    service {
        address any
        port 3000
        access-address 10.0.1.11  # 10.0.1.12 and 10.0.1.13 on the other two nodes
        network-interface-name em1 # Needed for Node ID
    }

[...]

}

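Before adding network-interface-name, it can help to confirm on each node that the interface exists and reports the MAC address you expect. The sketch below reads the same /sys/class/net/<iface>/address file used in the comments; iface_mac is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm an interface exists before naming it in
# network-interface-name, and print the MAC address it reports.
iface_mac() {
  local iface=$1
  if [ ! -r "/sys/class/net/$iface/address" ]; then
    echo "no such interface: $iface" >&2
    return 1
  fi
  echo "$iface $(cat "/sys/class/net/$iface/address")"
}
```

For example, run iface_mac em1 on each node, update aerospike.conf, and restart with sudo service aerospike restart. Running asadm -e "info net" again should then show a different Node ID for each node.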
Update 4:

With the 3.6.0 release, these device names will be discovered automatically. See AER-4026 in the release notes.

kporter
  • Thanks, @kporter. Ports 3000, 3001, 3002, and 3003 all return a `0` between each node when I run the netcat test which, I believe, means they can all communicate. – Alex Woolford Jul 13 '15 at 06:07
  • See updated response. Responded there for formatting. – kporter Jul 13 '15 at 06:24
  • I posted the output on https://gist.github.com/alexwoolford/9c0e865768ac65b932ec since it's a bit verbose. I appreciate your help. :) – Alex Woolford Jul 13 '15 at 06:30
  • Alright, your problem is that all of your MAC addresses are the same. Aerospike uses the fabric port and MAC address to construct the node's ID. All of your nodes have the same node id. – kporter Jul 13 '15 at 06:36
  • If I run `ansible cluster -a "cat /sys/class/net/em1/address"` it returns three MAC addresses: 6c:3b:e5:2b:9a:ef, 6c:3b:e5:28:62:99, and 74:46:a0:c1:4f:87. It's strange that all the nodes ended up with the same node ID. – Alex Woolford Jul 13 '15 at 06:54
  • Yes, that is strange. You wouldn't happen to be using rack aware? – kporter Jul 13 '15 at 06:57
  • Not knowingly. I wonder if there's a way to force Aerospike to regenerate the node IDs. Reinstall? – Alex Woolford Jul 13 '15 at 06:58
  • I think this issue is because the interface name is em# and not eth#. Either [network-interface-name](http://www.aerospike.com/docs/reference/configuration/#network-interface-name) or [interface-address](http://www.aerospike.com/docs/reference/configuration/#interface-address) controls which interface will be used for generating the node_id. I'll look up which one it is. – kporter Jul 13 '15 at 07:06
  • You will need to set the **network-interface-name** to **em1** in the **network.service** context. Aerospike will need to be restarted after this is updated. – kporter Jul 13 '15 at 07:21
  • THAT WORKED! You are my hero! – Alex Woolford Jul 13 '15 at 12:39
  • @AlexWoolford, we are working to eliminate this issue, but there were some questions about why some existing code didn't find your interfaces. Do you still have this environment available, and could you provide the output of either `ifconfig` or `ip addr`? This should shed some light on the problem for us. – kporter Aug 06 '15 at 22:33
  • Great! We now know what happened. Thanks for the info. – kporter Aug 06 '15 at 23:30