For example, in jax.experimental.stax there is a Dense layer implemented like this:
```python
import jax.numpy as jnp
from jax import random
from jax.nn.initializers import glorot_normal, normal

def Dense(out_dim, W_init=glorot_normal(), b_init=normal()):
    """Layer constructor function for a dense (fully-connected) layer."""
    def init_fun(rng, input_shape):
        output_shape = input_shape[:-1] + (out_dim,)
        k1, k2 = random.split(rng)
        W, b = W_init(k1, (input_shape[-1], out_dim)), b_init(k2, (out_dim,))
        return output_shape, (W, b)
    def apply_fun(params, inputs, **kwargs):
        W, b = params
        return jnp.dot(inputs, W) + b
    return init_fun, apply_fun
```
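For reference, a layer built this way is used roughly as follows (a sketch; the layer width and input shape here are made up):

```python
init_fun, apply_fun = Dense(16)
rng = random.PRNGKey(0)
output_shape, params = init_fun(rng, (-1, 32))  # params == (W, b)
out = apply_fun(params, jnp.ones((4, 32)))      # shape (4, 16)
```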
If we allow the bias to be None, for example, or allow params to have length 1, there are implications for how grad works.
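As far as I can tell, the None route mostly just works, because JAX treats None as an empty pytree node, so grad simply returns None in the matching slot. A minimal sketch (my own apply_fun, not the stax one):

```python
import jax
import jax.numpy as jnp

def apply_fun(params, inputs):
    # b is None when the layer has no bias; the check is on the pytree
    # structure, which is static, so it also works under jit.
    W, b = params
    out = jnp.dot(inputs, W)
    return out if b is None else out + b

def loss(params, inputs):
    return jnp.sum(apply_fun(params, inputs) ** 2)

W = jnp.ones((3, 2))
grads = jax.grad(loss)((W, None), jnp.ones((4, 3)))  # -> (dW, None)
```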
What is the pattern one should aim for here? jax.jit has a static_argnums argument that I suppose could be used with some has_bias parameter, but the bookkeeping for that is involved, and I am sure there must be examples of this somewhere.
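One pattern I can imagine (a sketch only; the use_bias argument is my invention, not part of stax) is to make the flag a constructor argument. It is then an ordinary Python value captured by the closures, fixed at trace time, so jit needs no static_argnums at all:

```python
import jax.numpy as jnp
from jax import random
from jax.nn.initializers import glorot_normal, normal

def Dense(out_dim, W_init=glorot_normal(), b_init=normal(), use_bias=True):
    def init_fun(rng, input_shape):
        output_shape = input_shape[:-1] + (out_dim,)
        k1, k2 = random.split(rng)
        W = W_init(k1, (input_shape[-1], out_dim))
        # Hypothetical convention: params has length 1 when bias is off.
        params = (W, b_init(k2, (out_dim,))) if use_bias else (W,)
        return output_shape, params
    def apply_fun(params, inputs, **kwargs):
        # use_bias is a plain Python bool, so this branch is resolved
        # while jit traces apply_fun, not at runtime.
        if use_bias:
            W, b = params
            return jnp.dot(inputs, W) + b
        (W,) = params
        return jnp.dot(inputs, W)
    return init_fun, apply_fun
```

Since the branch on use_bias happens in Python during tracing, each configuration compiles to a straight-line computation with no runtime flag, and grad just sees params as a plain pytree of whichever length init_fun produced.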