
I'm trying to make the most basic of basic neural networks to get familiar with the functional API in TensorFlow 2.x.

Basically, what I'm trying to do with my simplified iris dataset (i.e. setosa or not) is the following:

  1. Use the 4 features as input
  2. Dense layer of 3
  3. Sigmoid activation function
  4. Dense layer of 2 (one for each class)
  5. Softmax activation
  6. Binary cross entropy / log-loss as my loss function

However, I can't figure out how to control one key aspect of the model. That is, how can I ensure that each feature from my input layer contributes to only one neuron in my subsequent dense layer? Also, how can I allow a feature to contribute to more than one neuron?

This isn't clear to me from the documentation.

# Load data
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
X, y = load_iris(return_X_y=True, as_frame=True)
X = X.astype("float32")
X.index = X.index.map(lambda i: "iris_{}".format(i))
X.columns = X.columns.map(lambda j: j.split(" (")[0].replace(" ","_"))
y.index = X.index
y = y.map(lambda i:iris.target_names[i])
y_simplified = y.map(lambda i: {True:1, False:0}[i == "setosa"])
# Note: the `columns=` argument has no effect on a Series, so name the dummy columns explicitly
y_simplified = pd.get_dummies(y_simplified)
y_simplified.columns = ["not_setosa", "setosa"]

# Train/test split
from sklearn.model_selection import train_test_split
seed=0
X_train, X_test, y_train, y_test = train_test_split(X, y_simplified, test_size=0.3, random_state=seed)

# Simple neural network
import tensorflow as tf
tf.random.set_seed(seed)


# Input[4 features] -> Dense layer of 3 neurons -> Activation function -> Dense layer of 2 (one per class) -> Softmax
inputs = tf.keras.Input(shape=(4,))
x = tf.keras.layers.Dense(3)(inputs)
x = tf.keras.layers.Activation(tf.nn.sigmoid)(x)
x = tf.keras.layers.Dense(2)(x)
outputs = tf.keras.layers.Activation(tf.nn.softmax)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs, name="simple_binary_iris")
model.compile(loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

history = model.fit(X_train, y_train, batch_size=64, epochs=10, validation_split=0.2)

test_scores = model.evaluate(X_test, y_test)
print("Test loss:", test_scores[0])
print("Test accuracy:", test_scores[1])

Results:

Model: "simple_binary_iris"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_44 (InputLayer)        [(None, 4)]               0         
_________________________________________________________________
dense_96 (Dense)             (None, 3)                 15        
_________________________________________________________________
activation_70 (Activation)   (None, 3)                 0         
_________________________________________________________________
dense_97 (Dense)             (None, 2)                 8         
_________________________________________________________________
activation_71 (Activation)   (None, 2)                 0         
=================================================================
Total params: 23
Trainable params: 23
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
2/2 [==============================] - 0s 40ms/step - loss: 0.6344 - accuracy: 0.6667 - val_loss: 0.6107 - val_accuracy: 0.7143
Epoch 2/10
2/2 [==============================] - 0s 6ms/step - loss: 0.6302 - accuracy: 0.6667 - val_loss: 0.6083 - val_accuracy: 0.7143
Epoch 3/10
2/2 [==============================] - 0s 7ms/step - loss: 0.6278 - accuracy: 0.6667 - val_loss: 0.6056 - val_accuracy: 0.7143
Epoch 4/10
2/2 [==============================] - 0s 7ms/step - loss: 0.6257 - accuracy: 0.6667 - val_loss: 0.6038 - val_accuracy: 0.7143
Epoch 5/10
2/2 [==============================] - 0s 7ms/step - loss: 0.6239 - accuracy: 0.6667 - val_loss: 0.6014 - val_accuracy: 0.7143
Epoch 6/10
2/2 [==============================] - 0s 7ms/step - loss: 0.6223 - accuracy: 0.6667 - val_loss: 0.6002 - val_accuracy: 0.7143
Epoch 7/10
2/2 [==============================] - 0s 7ms/step - loss: 0.6209 - accuracy: 0.6667 - val_loss: 0.5989 - val_accuracy: 0.7143
Epoch 8/10
2/2 [==============================] - 0s 7ms/step - loss: 0.6195 - accuracy: 0.6667 - val_loss: 0.5967 - val_accuracy: 0.7143
Epoch 9/10
2/2 [==============================] - 0s 7ms/step - loss: 0.6179 - accuracy: 0.6667 - val_loss: 0.5953 - val_accuracy: 0.7143
Epoch 10/10
2/2 [==============================] - 0s 7ms/step - loss: 0.6166 - accuracy: 0.6667 - val_loss: 0.5935 - val_accuracy: 0.7143
2/2 [==============================] - 0s 607us/step - loss: 0.6261 - accuracy: 0.6444
Test loss: 0.6261375546455383
Test accuracy: 0.644444465637207
O.rka
  • I do not understand the purpose... if you force contribution to only one neuron, you are not really allowing for (much) learning. There are things like *guided dropout* to somewhat address what you are asking, and [here is some example code](https://github.com/BDonnot/guided_dropout). – Mike Williamson Aug 06 '20 at 18:33
  • I'm trying to code the NN so it decides which neuron is the most important (not predefined) and then only use that. Essentially I'm trying to "combine" features mutually exclusively as the most basic form of feature engineering. I'm asking the question, how can combining features together be used to increase the accuracy? Therefore, the feature combinations are interpreted in the context of what is being classified. – O.rka Aug 06 '20 at 18:37
  • In that case, I am not sure a NN is the best solution. Regardless, I am not trying to dodge your question. Look at the link I provided above for an example. Specifically, [look here](https://github.com/BDonnot/guided_dropout/blob/master/GuidedDropout/GuidedDropout.py#L279). It is not the easiest to follow, but it has a "mask" example doing precisely what you want to do. – Mike Williamson Aug 06 '20 at 19:03
  • Thanks, I'm going to look into this. I'm wondering if a Bernoulli layer could be useful, i.e. if a probability could be assigned to each feature going to a particular neuron. The tricky part would be, again, the original problem of making sure only a single neuron is activated for each feature. Do you think this layout is even feasible with tensorflow_probability? https://i.imgur.com/UtAJ70d.png – O.rka Aug 06 '20 at 19:06

2 Answers


"how can I ensure that each feature from my input layer contributes to only one neuron in my subsequent dense layer?"

Have one input layer per feature and feed each input layer to a separate dense layer. Later you can concatenate the output of all the dense layers and proceed.

NOTE: One neuron can take any size input (in this case the input size is 1, since you want one feature to be used by the neuron) and its output size is always 1. A Dense layer with n units will have n neurons, and so its output size will be n.
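
As a quick illustration of that note, here is a minimal sketch (assuming TF 2.x; not part of the original sample below): a Dense layer with 3 units applied to a single feature holds a (1, 3) kernel, i.e. one input neuron feeding 3 neurons, each producing one output.

import tensorflow as tf

layer = tf.keras.layers.Dense(3)
out = layer(tf.zeros((1, 1)))  # batch of 1 sample with a single feature
print(layer.kernel.shape)      # (1, 3): one input connected to 3 neurons
print(out.shape)               # (1, 3): each of the 3 neurons outputs one value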

Working Sample

import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Model architecture
x1 = tf.keras.Input(shape=(1,))
x2 = tf.keras.Input(shape=(1,))
x3 = tf.keras.Input(shape=(1,))
x4 = tf.keras.Input(shape=(1,))

x1_ = tf.keras.layers.Dense(3, activation=tf.nn.relu)(x1)
x2_ = tf.keras.layers.Dense(3, activation=tf.nn.relu)(x2)
x3_ = tf.keras.layers.Dense(3, activation=tf.nn.relu)(x3)
x4_ = tf.keras.layers.Dense(3, activation=tf.nn.relu)(x4)

merged = tf.keras.layers.concatenate([x1_, x2_, x3_, x4_])
merged = tf.keras.layers.Dense(16, activation=tf.nn.relu)(merged)
outputs = tf.keras.layers.Dense(3, activation=tf.nn.softmax)(merged)

model = tf.keras.Model(inputs=[x1, x2, x3, x4], outputs=outputs)
model.compile(loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Load and prepare data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit the model
model.fit([X_train[:, 0], X_train[:, 1], X_train[:, 2], X_train[:, 3]], y_train, batch_size=64, epochs=100, validation_split=0.25)

# Evaluate the model
test_scores = model.evaluate([X_test[:, 0], X_test[:, 1], X_test[:, 2], X_test[:, 3]], y_test)
print("Test loss:", test_scores[0])
print("Test accuracy:", test_scores[1])

Output:

Epoch 1/100
2/2 [==============================] - 0s 75ms/step - loss: 1.6446 - accuracy: 0.4359 - val_loss: 1.6809 - val_accuracy: 0.5185
Epoch 2/100
2/2 [==============================] - 0s 10ms/step - loss: 1.4151 - accuracy: 0.6154 - val_loss: 1.4886 - val_accuracy: 0.5556
Epoch 3/100
2/2 [==============================] - 0s 9ms/step - loss: 1.2725 - accuracy: 0.6795 - val_loss: 1.3813 - val_accuracy: 0.5556
Epoch 4/100
2/2 [==============================] - 0s 9ms/step - loss: 1.1829 - accuracy: 0.6795 - val_loss: 1.2779 - val_accuracy: 0.5926
Epoch 5/100
2/2 [==============================] - 0s 10ms/step - loss: 1.0994 - accuracy: 0.6795 - val_loss: 1.1846 - val_accuracy: 0.5926
Epoch 6/100
.................. [ Truncated ] 
Epoch 100/100
2/2 [==============================] - 0s 2ms/step - loss: 0.4049 - accuracy: 0.9333
Test loss: 0.40491223335266113
Test accuracy: 0.9333333373069763

Pictorial representation of the above model architecture:

[model architecture diagram: four single-feature inputs, each feeding its own Dense(3) layer, concatenated and passed through Dense(16) to a Dense(3) softmax output]

mujjiga
  • Thanks for the answer! I like this implementation. I've adjusted the code just a bit and have copied it here: https://pastebin.com/7bX4pRRG . I'm trying to understand how this forces an input to go into a single neuron. I see there are 4 inputs and each of those inputs goes to a 3-neuron dense layer (3*4 = 12 neurons?). I'm trying to use this as a very, very basic feature engineering algorithm, but I don't think the multiple dense layers correspond with each other, or do they? What I'm trying to do is force each input to go to one of the 3 neurons in the dense layer. Is that what this is doing? – O.rka Aug 05 '20 at 23:27
  • Yes, each input goes through one neuron in the first layer. Since the dense layer has 3 neurons, it will produce 3 outputs. If you want a single output then you will have to define a dense layer with 1 neuron using `Dense(1)`. Updated the answer with a diagram. – mujjiga Aug 08 '20 at 13:50
  • I appreciate you adding the schematic above. It's much easier to see what is going on. Just so I understand, the neurons in each of the columns in the diagram are separate neurons, correct? And the outputs y1, y2, y3 are the 3 iris species? – O.rka Aug 08 '20 at 17:43
  • Yes, all are separate, and you have a total of 12 (4 * 3) neurons in the first layer. And yes, the outputs correspond to the 3 iris classes. – mujjiga Aug 08 '20 at 20:06

Dense layers in Keras/TF are fully connected layers. For example, when you use a Dense layer as follows

inputs = tf.keras.Input(shape=(4,))
x = tf.keras.layers.Dense(3)(inputs)

all 4 input neurons are connected to all 3 output neurons.
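
You can verify this full connectivity by inspecting the layer's kernel. The following minimal check (a sketch, assuming TF 2.x) recovers the 15 parameters reported in the question's model summary:

import tensorflow as tf

inputs = tf.keras.Input(shape=(4,))
dense = tf.keras.layers.Dense(3)
x = dense(inputs)
print(dense.kernel.shape)    # (4, 3): every input neuron feeds every output neuron
print(dense.bias.shape)      # (3,)
print(dense.count_params())  # 15 = 4*3 weights + 3 biases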

There isn't any predefined layer in Keras/TF to specify how to connect input and output neurons. However, Keras/TF is very flexible in that it allows you to define your custom layers easily.


Borrowing the idea from this answer, you could define a CustomConnected layer as follows:

class CustomConnected(tf.keras.layers.Dense):

    def __init__(self, units, connections, **kwargs):
        # `connections`: binary mask of shape (n_inputs, units); 1 keeps a connection, 0 removes it
        self.connections = connections
        super(CustomConnected, self).__init__(units, **kwargs)

    def call(self, inputs):
        # Mask the kernel locally rather than overwriting `self.kernel`,
        # so the layer's trainable variable stays intact across calls
        outputs = tf.matmul(inputs, self.kernel * self.connections)
        if self.use_bias:
            outputs = tf.nn.bias_add(outputs, self.bias)
        if self.activation is not None:
            outputs = self.activation(outputs)
        return outputs

Using this layer, you can then specify the connections between two layers through the connections argument. For example:

import numpy as np

inputs = tf.keras.Input(shape=(4,))
connections = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype="float32")
x = CustomConnected(3, connections)(inputs)

Here, the 1st, 2nd, and 3rd input neurons are connected to the 1st, 2nd, and 3rd output neurons, respectively. Additionally, the 4th input neuron is connected to the 3rd output neuron.
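
As a usage sketch (assuming the CustomConnected layer and the connection mask above, plugged into the question's 4 -> 3 -> 2 architecture; an illustration, not part of the original answer):

import numpy as np
import tensorflow as tf

# Rows are input features, columns are hidden neurons
connections = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1],
                        [0, 0, 1]], dtype="float32")

inputs = tf.keras.Input(shape=(4,))
x = CustomConnected(3, connections, activation=tf.nn.sigmoid)(inputs)  # masked first layer
outputs = tf.keras.layers.Dense(2, activation=tf.nn.softmax)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs, name="masked_binary_iris")
model.compile(loss="binary_crossentropy", metrics=["accuracy"])
model.summary()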


UPDATE: As discussed in the comments section, an adaptive approach (e.g. by using only the maximum weight for each output neuron) is also possible but not recommended. You could implement this via the following layer:

class CustomSparse(tf.keras.layers.Dense):

    def __init__(self, units, **kwargs):
        super(CustomSparse, self).__init__(units, **kwargs)

    def call(self, inputs):
        nb_in, nb_out = self.kernel.shape
        argmax = tf.argmax(self.kernel, axis=0)  # Shape=(nb_out,)
        argmax_onehot = tf.transpose(tf.one_hot(argmax, depth=nb_in))  # Shape=(nb_in, nb_out)
        kernel_max = self.kernel * argmax_onehot
        # tf.print(kernel_max)  # Uncomment this line to print the weights
        out = tf.matmul(inputs, kernel_max)

        if self.bias is not None:
            out += self.bias

        if self.activation is not None:
            out = self.activation(out)

        return out

The main issue of this approach is that you cannot propagate gradients through the argmax operation required to select the maximum weight. As a result, the network will only "switch input neurons" when the selected weight is no longer the maximum weight.
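
For reference, here is a minimal usage sketch (assuming the CustomSparse layer above; an illustration, not part of the original answer). The layer is used exactly like a regular Dense layer, with no connection mask to supply:

import tensorflow as tf

inputs = tf.keras.Input(shape=(4,))
x = CustomSparse(3, activation=tf.nn.sigmoid)(inputs)  # each hidden neuron keeps only its largest incoming weight
outputs = tf.keras.layers.Dense(2, activation=tf.nn.softmax)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss="binary_crossentropy", metrics=["accuracy"])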

rvinas
  • Thank you this is awesome. Is it trainable? For example, is there a way to do some type of argmax and one hot encoding in the custom layer? – O.rka Aug 02 '20 at 16:46
  • Yes, this is trainable. You can basically do whatever you want in the `call` method of your custom layer. However, argmax/one-hot vectors are not differentiable operations, so you will not be able to backpropagate the gradients and train the model via gradient descent. Usually, you compute these operations on the numpy output of your model (e.g. `model.predict` returns a numpy array with the probabilities of each class) – rvinas Aug 02 '20 at 18:29
  • How would you make it differentiable? Do you think a softmax operation could do the trick here? Are you suggesting that the connections should be manually input each time? Can you show an example of where you don't provide the connections manually? – O.rka Aug 03 '20 at 06:31
  • You cannot make those operations differentiable. Softmax is differentiable - how would you like to use it in the custom layer exactly? If you want to ensure that e.g. one input neuron contributes to one output neuron then, yes, connections should be manually input (but only once, when you instantiate the model). An example where you don't provide the connections manually would be a dense layer but in that case, you won't have the sparse connections that you are looking for. – rvinas Aug 03 '20 at 08:10
  • I'm trying to figure out how to condense 4 features into 3 features and have the DNN decide which features to merge, but I'm trying to make it so each feature contributes to only one neuron in the dense layer (only the first layer matters for this). Maybe I can just use tf.maximum? As this would propagate only the maximum weight to the next layer, correct? – O.rka Aug 03 '20 at 17:11
  • I understand what you mean, although I don't think it is a good idea because taking the maximum weight will make the training much harder. Nonetheless, I included an update of the answer to show how this could be done. In my opinion, it is much better to specify the connections in advance - you will end up with the same network structure (up to symmetries) and it will be easier to train. – rvinas Aug 05 '20 at 10:55
  • @O.rka did the answer help? I am happy to make adjustments or clarifications. – rvinas Aug 06 '20 at 17:13
  • I'm going to dive into it right now to try and understand. Thanks for the explanations and the updated answer! – O.rka Aug 06 '20 at 18:34
  • Right now I'm trying to figure out how to run the code above. Does this need predefined connections as input or is there a way to initialize these? I'm wondering if maybe I can try https://www.tensorflow.org/probability/api_docs/python/tfp/layers/IndependentBernoulli where each feature first goes through a Bernoulli distribution to see if it will go to one of the three neurons in the dense layer. The tricky part about this is how to make sure only one neuron gets activated. – O.rka Aug 06 '20 at 18:48
  • Do you think something like this would work? https://i.imgur.com/UtAJ70d.png – O.rka Aug 06 '20 at 19:05
  • The second approach does not need predefined connections. What is the problem of predefined connections? If only one input neuron is active per output neuron, you could always permute the output neurons and get the same network architecture. – rvinas Aug 07 '20 at 18:26
  • I understand the idea behind the Bernoulli but, in the end, the problem is always the same - it is not possible to backpropagate the gradient through the non-differentiable operations that would be required to select the neuron that gets activated. – rvinas Aug 07 '20 at 18:28
  • Hmm... yes, I understand what you’re saying about the predefined connections, but supplying the connections a priori, I think, defeats the purpose of using NNs as a feature engineering method. In that event, the NN would choose which combinations of features, combined mutually exclusively, result in the best accuracy. The accuracy would be a proxy for how good the feature engineering is. If we supply connections a priori then I feel we would just be making a classifier with an unnecessary step. Does this make sense or am I overly simplifying something more complex? – O.rka Aug 07 '20 at 21:04