I am setting up a Riak cluster of five physical nodes. Four of the nodes pass every test, but one fails `riak-admin test`. The cluster state reported by several `riak-admin` commands is shown below.

do-admin@DBNode1:~$ sudo riak-admin member-status  
=============== Membership ============================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      20.3%      --      'riak@dbnode1.do.ug'
valid      20.3%      --      'riak@dbnode2.do.ug'
valid      20.3%      --      'riak@dbnode3.do.ug'
valid      20.3%      --      'riak@dbnode4.do.ug'
valid      18.8%      --      'riak@dbnode5.do.ug'
-------------------------------------------------------------------------------
Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

dot-admin@DBNode1:~$ sudo riak-admin ring-status
================================== Claimant ===================================
Claimant:  'riak@dbnode2.do.ug'
Status:     up
Ring Ready: true

============================== Ownership Handoff ==============================
No pending changes.

============================== Unreachable Nodes ==============================
All nodes are up and reachable

do-admin@DBNode1:~$ sudo riak-admin cluster status
---- Cluster Status ----
Ring ready: true

+------------------------+------+-------+-----+-------+
|       node             |status| avail |ring |pending|
+------------------------+------+-------+-----+-------+
|     riak@dbnode1.do.ug |valid |  up   | 20.3|  --   |
| (C) riak@dbnode2.do.ug |valid |  up   | 20.3|  --   |
|     riak@dbnode3.do.ug |valid |  up   | 20.3|  --   |
|     riak@dbnode4.do.ug |valid |  up   | 20.3|  --   |
|     riak@dbnode5.do.ug |valid |  up   | 18.8|  --   |
+------------------------+------+-------+-----+-------+

Key: (C) = Claimant; availability marked with '!' is unexpected

do-admin@DBNode1:~$ curl -v http://dbnode1.dot.ug:8098/types/default/props
* Hostname was NOT found in DNS cache
*   Trying 192.168.172.38...
* Connected to dbnode1.dot.ug (192.168.172.38) port 8098 (#0)
> GET /types/default/props HTTP/1.1
> User-Agent: curl/7.35.0
> Host: dbnode1.dotshule.ug:8098
> Accept: */*
> 
< HTTP/1.1 200 OK
< Vary: Accept-Encoding
* Server MochiWeb/1.1 WebMachine/1.10.5 (jokes are better explained) is not blacklisted
< Server: MochiWeb/1.1 WebMachine/1.10.5 (jokes are better explained)
< Date: Sat, 17 Jan 2015 21:05:22 GMT
< Content-Type: application/json
< Content-Length: 428
< 
* Connection #0 to host dbnode1.dotshule.ug left intact
{"props":{"allow_mult":false,"basic_quorum":false,"big_vclock":50,"chash_keyfun":{"mod":"riak_core_util","fun":"chash_std_keyfun"},"dvv_enabled":false,"dw":"quorum","last_write_wins":false,"linkfun":{"mod":"riak_kv_wm_link_walker","fun":"mapreduce_linkfun"},"n_val":3,"notfound_ok":true,"old_vclock":86400,"postcommit":[],"pr":0,"precommit":[],"pw":0,"r":"quorum","rw":"quorum","small_vclock":50,"w":"quorum","young_vclock":20}}

dot-admin@DBNode1:~$ sudo riak-admin test
Node 'riak@dbnode1.dot.ug ' is not reachable from 'riak_test@dbnode1.dot.ug'.

All of these checks give the same results on every node. The only exception is **riak-admin test**, which fails on node one as shown above but succeeds on all the other nodes. For example, on node three:

dot-admin@DBNode3:~$ sudo riak-admin test
Successfully completed 1 read/write cycle to 'riak@dbnode3.dotshule.ug'

My doubt is whether this cluster is ready to be used to store data. The Basho documentation says you can use any of these methods to test whether a node is ready, but it does not say whether a node is still fine when one method succeeds and another fails. So I am stuck on whether to go ahead and use the cluster. Surprisingly, this node completed every step of joining the cluster without errors. I have tried rebuilding the node from scratch, but that has not helped.

I would be glad of any help.

  • Was the nodename changed in the config file after Riak was started? The `riak-admin test` command should pull the node name from the config file via the env.sh script. The fact that it is getting `dbnode1.dot.ug` when `member-status` shows `riak@dbnode1.do.ug` would seem to imply that it was changed. – Joe Jan 19 '15 at 15:32
  • No, that is an editing error. All nodes follow the naming scheme `riak@dbnodeN.do.ug`. – Vianney Sserwanga Jan 20 '15 at 12:02

1 Answer

The error `Node <targetnode> is not reachable from <sourcenode>.` indicates that `net_adm:ping(<targetnode>)` returned `pang` instead of `pong`.

Check that:

  • the epmd process is running
  • `/usr/lib/riak/erts-5.10.3/epmd -names` shows that the node is registered (adjust the path to match your install)
  • SELinux, iptables or other security software is not blocking ephemeral ports or the epmd port 4369
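
The second check can also be made over the network, which shows whether epmd is reachable on port 4369 from another machine. The following is a minimal Python sketch, not part of Riak: the function names are hypothetical, and it assumes the standard epmd `NAMES_REQ` exchange (a one-byte request tagged 110, answered with a 4-byte port number followed by `name <node> at port <port>` lines).

```python
import socket
import struct

EPMD_PORT = 4369  # default epmd port, as named in the checklist


def parse_names_response(payload: bytes) -> dict:
    """Parse an epmd NAMES_RESP body into {short_node_name: port}."""
    # The first 4 bytes repeat the epmd port; the rest is text with one
    # "name <node> at port <port>" line per registered node.
    text = payload[4:].decode("ascii", errors="replace")
    nodes = {}
    for line in text.splitlines():
        parts = line.split()
        if (len(parts) == 5 and parts[0] == "name"
                and parts[2:4] == ["at", "port"] and parts[4].isdigit()):
            nodes[parts[1]] = int(parts[4])
    return nodes


def epmd_names(host: str, timeout: float = 3.0) -> dict:
    """Ask epmd on `host` which node names it has registered."""
    with socket.create_connection((host, EPMD_PORT), timeout=timeout) as sock:
        # Request: 2-byte big-endian length, then the NAMES_REQ tag (110, 'n').
        sock.sendall(struct.pack(">HB", 1, 110))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # epmd closes the connection after replying
                break
            chunks.append(chunk)
    return parse_names_response(b"".join(chunks))
```

Calling `epmd_names("dbnode1.do.ug")` from another node should return a dictionary containing the short name `riak` if the node is registered. A connection error or timeout instead points at epmd not running or port 4369 being blocked.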

The net_adm module will attempt to resolve the host part of the target nodename, i.e. the part after the '@', contact epmd on port 4369 at that IP address to get the port for the named node, then establish a TCP connection to the node at that port. Something in this process is not completing.
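
To find out which of those steps is failing, the sequence can be reproduced by hand. The sketch below is an illustration under stated assumptions, not the code `net_adm:ping` runs: the function names are hypothetical, and it assumes the standard epmd `PORT_PLEASE2_REQ` (tag 122) and `PORT2_RESP` (tag 119) messages. It stops at the TCP connect, so it does not perform the Erlang distribution handshake or the cookie check that follow in a real ping.

```python
import socket
import struct

EPMD_PORT = 4369  # default epmd port


def split_nodename(nodename: str):
    """Split 'riak@host' into ('riak', 'host')."""
    name, sep, host = nodename.partition("@")
    if not sep or not name or not host:
        raise ValueError(f"not a name@host nodename: {nodename!r}")
    return name, host


def build_port_request(name: str) -> bytes:
    """PORT_PLEASE2_REQ: 2-byte length, tag 122 ('z'), then the short name."""
    body = bytes([122]) + name.encode("ascii")
    return struct.pack(">H", len(body)) + body


def parse_port_response(reply: bytes):
    """Return the node's distribution port, or None if epmd does not know it."""
    # PORT2_RESP: tag 119 ('w'), result byte (0 = found), then a 2-byte port.
    if len(reply) < 4 or reply[0] != 119 or reply[1] != 0:
        return None
    return struct.unpack(">H", reply[2:4])[0]


def trace_reachability(nodename: str, timeout: float = 3.0) -> str:
    """Walk the three steps and report the first one that fails."""
    name, host = split_nodename(nodename)
    try:
        ip = socket.gethostbyname(host)  # step 1: resolve the host part
    except OSError as exc:
        return f"cannot resolve {host!r}: {exc}"
    try:
        # step 2: ask epmd on that address for the node's port
        with socket.create_connection((ip, EPMD_PORT), timeout=timeout) as sock:
            sock.sendall(build_port_request(name))
            reply = sock.recv(1024)
    except OSError as exc:
        return f"cannot reach epmd at {ip}:{EPMD_PORT}: {exc}"
    port = parse_port_response(reply)
    if port is None:
        return f"epmd at {ip} has no node registered as {name!r}"
    try:
        # step 3: open a TCP connection to the node's distribution port
        socket.create_connection((ip, port), timeout=timeout).close()
    except OSError as exc:
        return f"cannot connect to {ip}:{port}: {exc}"
    return f"{nodename} accepts TCP connections on port {port}"
```

Running `trace_reachability("riak@dbnode1.do.ug")` on the affected machine, with the nodename exactly as it appears in the error message, returns a message naming the first step that fails.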

Joe
  • When I run `sudo riak attach` on any other node to get an Erlang shell and then run `net_adm:ping('riak@dbnode1.do.ug')`, I get `pong`. When I run `nodes()`, the node is also listed. I get the same results for the other nodes when I carry out the same procedure on this node. The **epmd port 4369** is also open on all the nodes. – Vianney Sserwanga Jan 21 '15 at 09:47
  • I do not know how the **riak-admin** shell command `sudo riak-admin test` works internally, and I still do not know where the other node, `riak_test@dbnode1.do.ug`, in the response `Node 'riak@dbnode1.do.ug ' is not reachable from 'riak_test@dbnode1.do.ug'.` comes from when I run `sudo riak-admin test`. [Basho](http://docs.basho.com/riak/latest/ops/building/installing/post-install/) says that if it fails, the node is not ready. – Vianney Sserwanga Jan 21 '15 at 09:56
  • What about `riak ping` from the node having trouble? – Joe Jan 21 '15 at 13:05
  • `sudo riak ping` returns `pong`. I think this indicates the Riak node is running on the local machine. – Vianney Sserwanga Jan 22 '15 at 10:27
  • That is very strange, because it executes exactly the same check that causes the error you noted. `riak ping` uses the code [here](https://github.com/basho/node_package/blob/develop/priv/base/nodetool#L41-L51) to ping the node, `riak-admin test` uses the code [here](https://github.com/basho/riak/blob/2.0/rel/files/riak-admin#L1004-L1018) to call the [client_test function](https://github.com/basho/riak_kv/blob/develop/src/riak.erl#L134-L156). Both test with `net_adm:ping`. If `riak ping` can ping the local node but `riak-admin test` cannot, perhaps a permissions or selinux issue? – Joe Jan 22 '15 at 13:58
  • SELinux is not installed and permissions are at their defaults. I have even run `riak-admin diag` and everything seems OK. I have used the cluster in this state and it seems to be working normally. Now this is strange. – Vianney Sserwanga Jan 22 '15 at 14:56
  • It is especially strange because `riak-admin test` first calls `node_up_check`, which calls `net_adm:ping`, which succeeds (it would abort otherwise). It then calls `riak:client_test` which also calls `net_adm:ping`, which then fails. – Joe Jan 22 '15 at 17:35
  • I think that when `riak-admin test` sets up the test node `riak_test@dbnode1.do.ug`, the cookie it uses may differ from the cookie the node uses, since I changed the cookie from its default value in the `riak.conf` file. However, all other nodes have the same changed cookie value in their `riak.conf` files and they are fine with `riak-admin test`. – Vianney Sserwanga Jan 23 '15 at 02:33
  • To verify that, edit the riak-admin script on the trouble machine, adding `echo` to the beginning of [line 1013](https://github.com/basho/riak/blob/2.0/rel/files/riak-admin#L1013). If the setcookie parameter there differs from the cookie in your config file, file a bug against the project. – Joe Jan 23 '15 at 14:31
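
The cookie comparison suggested in the last comment can be done by eye, but a small helper makes it repeatable across nodes. This is only a sketch: the function names are hypothetical, and it assumes the cookie is set with a `distributed_cookie = ...` line in `riak.conf` and that the echoed command passes the cookie with the standard `-setcookie` flag.

```python
import shlex


def parse_conf(text: str) -> dict:
    """Read simple 'key = value' lines, ignoring '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def setcookie_from_command(command: str):
    """Return the argument following -setcookie in an echoed command line."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens[:-1]):
        if token == "-setcookie":
            return tokens[i + 1]
    return None


def cookies_match(conf_text: str, echoed_command: str) -> bool:
    """True when the config cookie equals the one the script would pass."""
    # 'distributed_cookie' is the assumed riak.conf key for the cookie.
    configured = parse_conf(conf_text).get("distributed_cookie")
    return configured is not None and configured == setcookie_from_command(echoed_command)
```

Feed `cookies_match` the contents of the node's `riak.conf` and the line printed by the modified script. A `False` result on the failing node only would support the cookie theory.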