I am trying to upgrade a riak_core erlang application while it is running.
Simple upgrades work: I use rebar3 and relflow, and the application upgrades successfully. However, if I change the internals of a vnode and use relflow and rebar3 relup to generate a new release, the vnode stops working. The vnode is called 'cavv'.
After the hot upgrade, it crashes at this point:
DocIdx = riak_core_util:chash_key({<<"run">>, term_to_binary(os:timestamp())}),
which results in this error:
** exception error: bad argument
     in function  lists:keyfind/3
        called as lists:keyfind(chash_keyfun,1,[{name,<<"run">>}|undefined])
     in call from riak_core_util:chash_key/2 (_build/default/lib/riak_core/src/riak_core_util.erl, line 266)
     in call from cavv_vnode:run/0 (_build/prod/lib/cavv/src/cavv_vnode.erl, line 38)
My relup looks like this:
{"0.1.2",
 [{"0.1.1",[],
   [{load_object_code,{cavv,"20161203-211601-relflow",[cavv_vnode]}},
    point_of_no_return,
    {load,{cavv_vnode,brutal_purge,brutal_purge}}]}],
 [{"0.1.1",[],[point_of_no_return]}]}.
Am I missing something? Do I have to restart some master vnode? I tried restarting some supervisors, without success.
Looking at the source code of riak_core:
%% @spec chash_key(BKey :: riak_object:bkey()) -> chash:index()
%% @doc Create a binary used for determining replica placement.
chash_key({Bucket,_Key}=BKey) ->
    BucketProps = riak_core_bucket:get_bucket(Bucket),
    chash_key(BKey, BucketProps).

%% @spec chash_key(BKey :: riak_object:bkey(), [{atom(), any()}]) ->
%%          chash:index()
%% @doc Create a binary used for determining replica placement.
chash_key({Bucket,Key}, BucketProps) ->
    {_, {M, F}} = lists:keyfind(chash_keyfun, 1, BucketProps), %% <-- Line 266
    M:F({Bucket,Key}).
I tried to understand what is going on, but had a hard time following it. It looks as if part of BucketProps becomes undefined after the upgrade when it should not be.
When I restart the whole application, it works like a charm.
Am I missing something during my hot upgrade with riak_core? Or is it better to just shut down the whole node, then upgrade and start it up again and forget about hot code upgrading?
UPDATE: In the meantime, I have found that something goes wrong in riak_core_bucket.
Running the following: riak_core_bucket:get_bucket(<<"run">>).
Before the upgrade:
[{name,<<"run">>},
{allow_mult,false},
{basic_quorum,false},
{big_vclock,50},
{chash_keyfun,{riak_core_util,chash_std_keyfun}},
{dvv_enabled,false},
{dw,quorum},
{last_write_wins,false},
{linkfun,{modfun,riak_kv_wm_link_walker,mapreduce_linkfun}},
{n_val,3},
{notfound_ok,true},
{old_vclock,86400},
{postcommit,[]},
{pr,0},
{precommit,[]},
{pw,0},
{r,quorum},
{rw,quorum},
{small_vclock,50},
{w,quorum},
{young_vclock,20}]
After the upgrade:
[{name,<<"run">>}|undefined]
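The shape of that value seems to explain the badarg: it is an improper list whose tail is the atom undefined, so lists:keyfind/3 fails as soon as it walks past the first element. A minimal sketch of this, assuming nothing about riak_core itself (the module name improper_props_demo and the function safe_lookup/1 are hypothetical and only serve the illustration):

```erlang
-module(improper_props_demo).
-export([lookup/1, safe_lookup/1]).

%% Same kind of lookup as line 266 of riak_core_util.erl.
%% Raises error:badarg when Props is an improper list and the key
%% is not found before the improper tail is reached.
lookup(Props) ->
    lists:keyfind(chash_keyfun, 1, Props).

%% Hypothetical defensive variant (not part of riak_core):
%% turns the crash into a tagged error so the cause is visible.
safe_lookup(Props) ->
    try lists:keyfind(chash_keyfun, 1, Props) of
        {chash_keyfun, {M, F}} -> {ok, {M, F}};
        false                  -> {error, missing_chash_keyfun}
    catch
        error:badarg -> {error, improper_props}
    end.
```

Calling improper_props_demo:safe_lookup([{name, <<"run">>} | undefined]) returns {error, improper_props}, while the same call on the pre-upgrade list returns the {riak_core_util, chash_std_keyfun} pair.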
After the upgrade, app_helper:get_env(riak_core, default_bucket_props) returns undefined.
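If that environment key really is being dropped during the upgrade, one possible stopgap is to save its value beforehand and put it back afterwards. The sketch below is only an assumption about what might help, not a confirmed fix: the module env_restore and its functions are hypothetical, and only application:get_env/2 and application:set_env/3 are standard OTP calls. Whether restoring this single key is enough for riak_core is untested here.

```erlang
-module(env_restore).
-export([save/2, restore/3]).

%% Read the current value of Key for App.
%% Returns {ok, Value} or undefined, exactly as application:get_env/2 does.
save(App, Key) ->
    application:get_env(App, Key).

%% Write a previously saved value back into the application environment.
%% Does nothing if the key was not set when it was saved.
restore(App, Key, {ok, Value}) ->
    application:set_env(App, Key, Value);
restore(_App, _Key, undefined) ->
    ok.
```

In a shell this would look like Saved = env_restore:save(riak_core, default_bucket_props) before installing the release, followed by env_restore:restore(riak_core, default_bucket_props, Saved) afterwards.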
I have found out it tries to process sys.config during the upgrade:
Warning: "_build/prod/rel/cavv/releases/0.1.2/sys.config" missing (optional)
Using the generated app.conf is not enough, as it does not contain all the config values shown above. With it alone, the output is: [{n_val,3}]
Maybe something with Cuttlefish not properly reloading conf files?
UPDATE2
I did some more digging. After the upgrade, application:get_all_env(riak_core) returns different values than before. Any ideas?
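To pin down exactly which keys differ, the two environments can be compared key by key. This is a minimal sketch; the module env_diff and its functions are hypothetical helpers, and only application:get_all_env/1 is a standard OTP call.

```erlang
-module(env_diff).
-export([snapshot/1, diff/2]).

%% Capture the environment of App as a sorted proplist.
snapshot(App) ->
    lists:sort(application:get_all_env(App)).

%% Compare two snapshots.
%% Returns {Removed, Added, Changed} where Removed and Added are lists
%% of keys and Changed is a list of {Key, OldValue, NewValue}.
diff(Before, After) ->
    Removed = [K || {K, _} <- Before, not lists:keymember(K, 1, After)],
    Added   = [K || {K, _} <- After,  not lists:keymember(K, 1, Before)],
    Changed = [{K, V1, V2} || {K, V1} <- Before,
                              {K2, V2} <- After,
                              K =:= K2, V1 =/= V2],
    {Removed, Added, Changed}.
```

Taking Before = env_diff:snapshot(riak_core) ahead of the upgrade and then calling env_diff:diff(Before, env_diff:snapshot(riak_core)) afterwards would list the keys that disappeared, such as default_bucket_props if it is among them.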