8

Bill Gates has recently said:

The ultimate is computers that learn. So called deep learning which started at Microsoft and is now being used by many researchers looks like a real advance that may finally learn. It has already made a big difference in video and audio recognition - more progress in the last 3 years than ever before.

The statement was quoted in many news articles.

How true is it that Deep Learning started at Microsoft?

Sklivvz
Franck Dernoncourt

1 Answer

9

That is incorrect. It might be true that Microsoft pioneered commercial applications of Deep Learning, though, which is what I suspect Bill was getting at.

Li Deng, who works for Microsoft Research, took the Deep Belief Nets (DBNs) devised by Hinton and his team at the University of Toronto and applied them (successfully) to the TIMIT dataset for speech recognition. This got Deep Learning a lot of interest in the commercial sector.

DBNs were the start of a resurgence of interest in deep learning. When multi-layer neural nets trained by back-propagation were popularised in 1986, they were often deep. Then, in 1989, the Universal Approximation Theorem was proven, which roughly says: "a neural net with a single, sufficiently large hidden layer can approximate any continuous function to arbitrary accuracy." This, combined with the practical difficulty of training deeper nets, largely ended interest in deep learning. Hinton's paper in 2006 and Bengio's monograph in 2007 sparked a resurgence: Hinton showed a novel technique that made training deep nets feasible (greedy layer-wise pretraining, which yields Deep Belief Networks), and Bengio argued that deep nets were important and offered many advantages over shallow ones.
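For reference, this is a sketch of the formal statement being paraphrased above, in the form given by Cybenko (1989) and Hornik et al. (1989); the exact hypotheses on the activation function vary between versions:

```latex
% Universal Approximation Theorem (informal form):
% for any continuous f on a compact set K in R^n and any eps > 0,
% there exist N, weights w_i, biases b_i and coefficients alpha_i such that
\[
  \sup_{x \in K} \Bigl|\, f(x) - \sum_{i=1}^{N} \alpha_i \,
      \sigma\bigl(w_i^{\top} x + b_i\bigr) \Bigr| < \varepsilon ,
\]
% where sigma is a suitable non-constant (e.g. sigmoidal) activation.
```

Note that this is purely a statement about representation; it says nothing about how hard it is to actually learn such an approximation from data.

Below is a minimal NumPy sketch of the greedy layer-wise idea behind Deep Belief Nets, using one-step contrastive divergence (CD-1). The layer sizes, hyperparameters and toy data are illustrative placeholders, not Hinton's actual setup:

```python
# Illustrative sketch of greedy layer-wise pretraining with RBMs
# (the idea behind Deep Belief Nets). Plain NumPy, CD-1 updates;
# all hyperparameters are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.05, batch=32):
    """Train one RBM with CD-1; return (weights, hidden bias, visible bias)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):
        for start in range(0, len(data), batch):
            v0 = data[start:start + batch]
            # Positive phase: hidden probabilities and samples given the data.
            p_h0 = sigmoid(v0 @ W + b_h)
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
            # Negative phase: one step of Gibbs sampling (reconstruction).
            p_v1 = sigmoid(h0 @ W.T + b_v)
            p_h1 = sigmoid(p_v1 @ W + b_h)
            # CD-1 approximation to the log-likelihood gradient.
            W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
            b_h += lr * (p_h0 - p_h1).mean(axis=0)
            b_v += lr * (v0 - p_v1).mean(axis=0)
    return W, b_h, b_v

def greedy_pretrain(data, layer_sizes):
    """Stack RBMs: each layer trains on the hidden activations of the one below."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b_h, _ = train_rbm(x, n_hidden)
        layers.append((W, b_h))
        x = sigmoid(x @ W + b_h)  # feed activations upward to train the next layer
    return layers  # these weights would then initialise a deep net for fine-tuning

# Toy usage with random binary "data", just to show the call pattern.
toy = (rng.random((256, 64)) > 0.5).astype(float)
stack = greedy_pretrain(toy, layer_sizes=[32, 16])
print([W.shape for W, _ in stack])  # [(64, 32), (32, 16)]
```

The point of the sketch is only the structure Hinton introduced: each layer is trained unsupervised on the outputs of the layer below, and the resulting weights are then used to initialise a deep network for ordinary supervised fine-tuning.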

Li Deng at Microsoft Research took this on board and, with his work on TIMIT, showed the world that Deep Learning was feasible and highly effective.

--

DBNs in academia

(Still trying to date Li Deng's work; citations needed.)

  • Hinton's early work was published in 1986 in http://mitpress.mit.edu/books/parallel-distributed-processing. That's one of the earliest examples of multi-layer non-linear perceptron training. –  Feb 12 '14 at 00:22
  • See also: http://www.nature.com/nature/journal/v323/n6088/abs/323533a0.html. –  Feb 12 '14 at 00:27
  • @Articuno: Deep Learning typically refers to 3 or more hidden layers. The Universal Approximation Theorem was not proven until much later. (For reference, my go-to paper for MLPs is Rumelhart, Hinton et al.'s paper "Learning Representations by back-propagating errors" in Nature, 1986.) – Frames Catherine White Feb 12 '14 at 00:29
  • That's what I linked to :) Also, the models introduced in 1986 were not limited to single hidden layers. My point in those comments was to give you references that you could use to establish that deep learning has been happening for at least as long as your answer claims. –  Feb 12 '14 at 00:32
  • Yeah, you ninja'd me. Posted while I was still trying to find the link. The problem with calling the stuff that was happening in the late 80's deep learning is that it was only deep because the Universal Approximation Theorem hadn't been proven yet. (Oh bother, now I will have to add that to my answer.) – Frames Catherine White Feb 12 '14 at 00:41
  • The things in the 80s were deep in that they could use several layers. The fact that they *could* be represented by a single layer is addressed by Bengio in [Learning Deep Architectures for AI](http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf). It's easy to construct examples of functions where a shallow MLP would require a number of units exponential in the number of inputs while a deep MLP would not. –  Feb 12 '14 at 00:52
  • Okay, now you've ninja'd me :P –  Feb 12 '14 at 00:52
  • While MLPs can have multiple layers, the back-propagation algorithm tends not to work very well as the errors back-propagated through the network begin to get very diffuse the further back through the network you go. So the PDP books are not really describing "deep" learning. Being able to represent a function compactly is one thing, being able to determine the weights that actually achieve that is often rather more difficult! The funny thing is that there is also interest in very shallow networks, such as extreme learning machines and SVMs etc. Good answer (+1). –  Feb 13 '14 at 16:57