
My question is related to this one here. I am using the cats and dogs dataset, so there are only these two outcomes. I found two implementations. The first one uses:

tf.keras.layers.Dense(1)

as the last layer in the model. The second implementation uses:

layers.Dense(2)

Now, I don't understand this. Which one is correct? Are they the same, and if so, why? (I cannot see why they should be the same.) Or what is the difference? Does the first solution model cat vs. dog, while the second models cat, dog, or anything else? Why would that be done if we only have cats and dogs? Which solution should I take?

Stat Tistician

1 Answer


Both are correct. One uses binary classification and the other uses categorical classification. Let's look at the differences.

Binary Classification: In this case, the output layer has only one neuron. From this single output, you have to decide whether it's a cat or a dog. You can set any threshold to classify the output. Let's say cats are labeled as 0, dogs are labeled as 1, and your threshold is 0.5. If the output is greater than 0.5, it's a dog because it's closer to 1; otherwise, it's a cat. binary_crossentropy is the loss used in most of these cases.
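For illustration, here is a minimal sketch of the single-neuron setup. The input shape and hidden-layer size are assumptions for the sake of a runnable example, not taken from either of the original implementations:

import tensorflow as tf

# Minimal binary-classification sketch; 160x160 RGB input is assumed.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(160, 160, 3)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)  # one logit per image
])

model.compile(
    optimizer='adam',
    # from_logits=True because the last layer has no sigmoid activation
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'])

# Labels stay plain integers: 0 = cat, 1 = dog.
# At prediction time: tf.sigmoid(model(images)) > 0.5 means "dog".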

Categorical Classification: The number of neurons in the output layer is exactly the same as the number of classes. This time you're not allowed to label your data as plain 0 or 1; the label shape must match the output layer. In your case, the output layer has two neurons (one per class), so you have to label your data in the same way. To achieve this, you have to encode your label data; this is called one-hot encoding. For example, cats would be encoded as (1,0) and dogs as (0,1). Now your prediction will consist of two floating-point numbers. If the first number is greater than the second, it's a cat; otherwise, it's a dog. These numbers are called confidence scores. Let's say, for a test image, your model predicts (0.70, 0.30): the model is 70% confident that it's a cat and 30% confident that it's a dog. Please note that the values in the output layer depend entirely on the activation of your last layer. To dig deeper, read about activation functions.
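Here is a matching sketch of the two-neuron setup with one-hot labels (again, input shape and layer sizes are assumptions):

import tensorflow as tf

# Two-neuron (categorical) sketch with softmax confidence scores.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(160, 160, 3)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')  # (P(cat), P(dog))
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # expects one-hot labels
              metrics=['accuracy'])

# Integer labels have to be one-hot encoded first:
labels = tf.constant([0, 1, 1])          # cat, dog, dog
one_hot = tf.one_hot(labels, depth=2)    # [[1,0], [0,1], [0,1]]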

Nazmul Hasan
  • But both are applied to the same data. There is no further manipulation of the labels in the categorical case. So how can one apply this different code to the same data? In both implementations the data is loaded with tfds.load. Or is this, so to speak, done implicitly when fitting the model, so I don't have to take care of it / think about which one I want to use? – Stat Tistician Jul 11 '20 at 14:20
  • They were able to use the same format of data because of sparse_categorical_crossentropy. You can take a look here: https://stackoverflow.com/questions/58565394/what-is-the-difference-between-sparse-categorical-crossentropy-and-categorical-c – Nazmul Hasan Jul 11 '20 at 15:01
  • If you have only two classes, I would recommend the binary one. If you have more than two classes, you can go with the categorical one. – Nazmul Hasan Jul 11 '20 at 15:02
  • Thanks for your answer; however, I still don't get how they can apply this to the same data if the labels must have a different shape/encoding. Is this done automatically when fitting, so I don't have to care about it? I mean the data preprocessing steps in both implementation examples. – Stat Tistician Jul 12 '20 at 08:34
  • You need to know the difference between categorical_crossentropy and sparse_categorical_crossentropy. I already added a question link about this above. Please take a look. – Nazmul Hasan Jul 12 '20 at 08:56
  • Yes, but that only comes later, when model.compile is used. So my question is more like: can I just apply either method (model.fit with Dense(1) and compile with binary_crossentropy, or model.fit with Dense(2) and compile with sparse_categorical_crossentropy) without having to worry about anything else, or do I have to do some data modifications in preprocessing depending on which one I use later? I would have expected that to be necessary, for example to get the labels into the right shape. However, this is not done: both implementations have the same data and preprocessing, and then the difference is just this part in model.fit and model.compile? – Stat Tistician Jul 12 '20 at 14:52
  • Yes. In model.compile, the loss is the main difference (see the sketch below). – Nazmul Hasan Jul 12 '20 at 16:11
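To make the point from the comments concrete, here is a hedged sketch of how sparse_categorical_crossentropy lets the Dense(2) implementation consume the same plain integer labels as the Dense(1) one (layer sizes and input shape are again assumptions):

import tensorflow as tf

# Dense(2) with sparse_categorical_crossentropy: integer labels
# (0 = cat, 1 = dog) work directly, no one-hot encoding needed.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(160, 160, 3)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2)  # two logits, one per class
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

# model.fit(images, tf.constant([0, 1, 1]))  # plain 0/1 labels

So the preprocessing really can stay identical; only the last layer and the loss passed to model.compile change.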