
Hello fellow TensorFlowians!

I have the following setup:

I input some continuous variables (actually, word embeddings I took from Google word2vec), and I am trying to predict an output that can be considered continuous as well as discrete (sorry, mathematicians! but it depends on one's training goal, actually). The output takes values from 0 to 100.0 with an interval of 0.25 (or a precision hyperparameter), so: 0, 0.25, 0.50, ..., 100.0.

I know that it is not possible to include something like tf.to_int32 (I can omit the fractions if necessary) or tf.round, because these are not differentiable, so we can't backpropagate through them. However, I feel that there is some solution that allows the network to "know" that it is searching for a rounded solution: small fractions of integers like 0.25 or 5.75, but I actually don't even know where to look. I looked up quantization, but that seems to be a bit of an overkill.
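For context: one standard trick for this "rounding is not differentiable" problem is the straight-through estimator from the quantization literature: round in the forward pass, but treat the rounding as the identity in the backward pass so gradients still flow. A minimal TF1-style sketch, where the function name and the step argument are illustrative, not an established API:

import tensorflow as tf

def round_straight_through(x, step=0.25):
    # Forward pass: snap x to the nearest multiple of `step`.
    rounded = tf.round(x / step) * step
    # Backward pass: stop_gradient hides the non-differentiable rounding,
    # so the gradient of this expression w.r.t. x is exactly 1 (identity).
    return x + tf.stop_gradient(rounded - x)

Applied to the network output just before the loss, this lets the optimizer see only quantized predictions while still receiving usable gradients.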

So my questions are:

  1. How do I inform the graph that we don't accept values below 0.0? Would taking the abs of the network output "logits" (regression predictions) be worth considering? If not, can I modify the loss term to severely punish scores below 0 and use absolute error instead of squared error? I may not be aware of the full consequences of doing that. (See the sketch after the code below.)
  2. I don't care whether a prediction for 4.5 comes out as 4.49999 or 4.4, because I round predictions to the nearest .25 to compute accuracy, and that's my final model evaluation metric. If so, can I use the following?

precision = 0.01  # so that sqrt(precision) == 0.1
loss = tf.reduce_mean(tf.maximum(0.0, tf.square(tf.sub(logits, targets)) - precision))
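(This is essentially an epsilon-insensitive squared loss: any error smaller than sqrt(precision) costs nothing.)

For question 1, here is a minimal TF1-style sketch of both options; the placeholder shapes and the neg_weight hyperparameter are illustrative, not from a tested model:

import tensorflow as tf

raw_logits = tf.placeholder(tf.float32, [None, 1])  # network output
targets = tf.placeholder(tf.float32, [None, 1])     # labels

# Option A: make predictions non-negative by construction.
# softplus is smooth everywhere, unlike abs, which has a kink at 0.
predictions = tf.nn.softplus(raw_logits)

# Option B: leave the output unconstrained, but add a hinge penalty that
# grows linearly with how far below 0 a prediction falls.
neg_weight = 10.0  # made-up value; tune on a validation set
negativity_penalty = neg_weight * tf.reduce_mean(tf.maximum(0.0, -raw_logits))
loss = tf.reduce_mean(tf.square(raw_logits - targets)) + negativity_penalty

Option A bakes the constraint into the model; option B only discourages violations, so predictions can still dip below 0 if the data pushes them there.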

mhnatiuk
  • Do your discrete values have any inherent ordering? Is a value of 0.5 "closer" to a value of 0.75 than it is to a value of 1000? Or are they all equally distinct? – kbrose Oct 21 '16 at 17:32
  • Indeed! They are actually numbers, but they have a lower "resolution", mostly values like: 0.5, 1.0, 4.0, 2.5, 10.5. I would like to quantize them so that the float precision is at most 0.1, so for the range (0, 100) (which is acceptable in my use case) that would be 1000 bins. I just don't know how to include this in the gradient calculations so that the optimizer operation is aware of it and doesn't produce real numbers like 1.778 or 1.9997, but rather 1.75 or 2.0. Of course I could just round the regression scores after training, but that would introduce an error I can't account for in my model. – mhnatiuk Oct 21 '16 at 20:59
  • And I would still like to use something like mean squared error as the loss, since it does make a difference whether the algorithm mistakes an actual 1.0 for 2.0 or for 100, so one-hot encoding doesn't seem like an option here – mhnatiuk Oct 21 '16 at 21:07
  • @kbrose do you have an idea how to tackle this one? Is that quantization? – mhnatiuk Oct 26 '16 at 18:32
  • Sorry, no concrete ideas. I would see how well it works if you just take the continuous output (trained with standard squared error) result and round to the nearest possible output (0, .25, etc.). I'm guessing that it will probably do fairly well... – kbrose Oct 27 '16 at 15:41
  • Yes, it does, but it doesn't take into account that the target values are multiples of 0.05 - imagine it as a grid that "pulls" gradients toward predicting multiples of 0.05. I was thinking about calculating two losses and then combining them in some way, say: sq_losses * classifier_loss, where the classifier loss outputs 1 or 0 depending on whether we "hit" an epsilon around our target point (say 0.01). That calculation, performed element-wise, should output 0 loss for regression scores that are around our target (if the target is 2.05, hitting 2.04 is enough, as we're rounding scores to the nearest .05). Any thoughts? (See the sketch after these comments.) – mhnatiuk Oct 27 '16 at 20:13
  • Without knowing more about your problem, I'd say try both and see which performs better on a validation set. Nothing better than some empirical results, right? – kbrose Oct 27 '16 at 20:35
  • I edited my post to provide more info – mhnatiuk Oct 28 '16 at 11:14
  • It didn't work so well afterwards. I mean, of course, I get 90% accuracy, but from the statistics I can see that backpropagating the losses would be so much easier if the optimizer searched for a "good enough" solution rather than treating 0 as the ultimate goal. However, I think it's impossible to define a loss function that would treat a 0.05 error as no error at all. I guess that could lead to unknown results, actually – mhnatiuk Nov 08 '16 at 21:36
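The gated loss proposed in the comments (squared error multiplied by a 0/1 "did we miss the epsilon band" indicator) might look like this in TF1-style code; the epsilon value and variable names are illustrative:

import tensorflow as tf

predictions = tf.placeholder(tf.float32, [None])
targets = tf.placeholder(tf.float32, [None])

epsilon = 0.025  # half the 0.05 grid step: anything that rounds to the
                 # correct bin incurs zero loss
abs_err = tf.abs(predictions - targets)
# The cast of a boolean comparison has zero gradient, so gradients flow
# only through the squared-error factor, and only outside the band.
outside_band = tf.cast(abs_err > epsilon, tf.float32)
loss = tf.reduce_mean(outside_band * tf.square(predictions - targets))

Note the trade-off the last comment runs into: inside the band the loss is exactly flat, so the optimizer receives no signal at all there.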

0 Answers