This is the third part of ‘A Brief History of Neural Nets and Deep Learning’. Parts 1 and 2 are here and here, and part 4 is here. In this part, we will continue to see the swift pace of research in the 90s, and see why neural nets ultimately lost favor much as they did in the late 60s.
Neural Nets Make Decisions
Having discovered the application of neural nets to unsupervised learning, let us also quickly see how they were used in the third branch of machine learning: reinforcement learning. This one requires the most mathy notation to explain formally, but also has a goal that is very easy to describe informally: learn to make good decisions. Given some theoretical agent (a little software program, for instance), the idea is to make that agent able to decide on an action based on its current state, receiving some reward for each action and with the intent of getting the maximum utility in the long term. So, whereas supervised learning tells the learning algorithm exactly what it should learn to output, reinforcement learning provides ‘rewards’ as a by-product of making good decisions over time, and does not directly tell the algorithm the correct decisions to choose. From the outset it was a very abstracted decision-making model: there was a finite number of states, and a known set of actions with known rewards for each state. This made it easy to write very elegant equations for finding the optimal set of actions, but hard to apply to real problems - problems with continuous states or hard-to-define rewards.

Neural nets soon made their way into this branch too, with researchers applying them to problems ranging from balancing an inverted pendulum and controlling dynamical systems and robots, to playing games such as chess and Go. These disparate applications in other fields are certainly cool, but of course the most research on reinforcement learning and neural nets was happening within AI and Machine Learning. And here, one of the most significant results in the history of reinforcement learning was achieved: a neural net that learned to be a world class backgammon player. Dubbed TD-Gammon, the neural net was trained using a standard reinforcement learning algorithm and was one of the first demonstrations of reinforcement learning being able to outperform humans on relatively complicated tasks.
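To make those ‘elegant equations’ concrete, here is a minimal sketch of the abstract setting described above (the five-state chain, rewards, and discount factor are all invented for illustration). With states, actions, and rewards fully known, a few lines of value iteration applying the Bellman equation recover the optimal actions; note this is not the trial-and-error temporal difference method TD-Gammon used:

```python
# A tiny, fully known decision problem: 5 states in a chain, where moving
# onto the rightmost state yields a reward of 1 and everything else yields 0.
N_STATES = 5
ACTIONS = [-1, +1]               # step left or right along the chain
GAMMA = 0.9                      # discount factor: how much the future counts

def step(s, a):
    """Known model: the next state and reward for taking action a in state s."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

# Value iteration: repeatedly apply the Bellman optimality equation
#   V(s) = max_a [ r(s, a) + gamma * V(s') ]
V = [0.0] * N_STATES
for _ in range(100):
    V = [max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
         for s in range(N_STATES)]

# The optimal action in each state is whichever maximizes the same quantity.
policy = [max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in range(N_STATES)]
print(policy)                    # +1 in every state: always head for the reward
```

TD-Gammon had no such known model of where the rewards were; it had to estimate these long-term values purely from the rewards it received while playing.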
Neural Nets Get Loopy
As neat as unsupervised and reinforcement learning are, I think supervised learning is still my favorite use case for neural nets. Sure, learning probabilistic models of data is cool, but it’s simply much easier to get excited for the sorts of concrete problems solved by backpropagation. We already saw how Yann LeCun achieved quite good recognition of handwritten text (a technology which went on to be nationally deployed for check-reading, and much more a while later…), but there was another obvious and greatly important task being worked on at the same time: understanding human speech.

As with writing, understanding human speech is quite difficult due to the practically infinite variation in how the same word can be spoken. But here there is an extra challenge: long sequences of input. See, for images it’s fairly simple to crop out a single letter from an image and have a neural net tell you what letter that is, input->output style. But with audio it’s not so simple - separating out speech into characters is completely impractical, and even finding individual words within speech is less simple. Plus, if you think about human speech, generally hearing words in context makes them easier to understand than hearing them in isolation. While the standard feedforward structure of neural nets works quite well for processing things such as images one at a time, input->output style, it is not at all well suited to long streams of information such as audio or text. The neural net has no ‘memory’ with which one input can affect another input processed afterward, but this is precisely how we humans process audio or text - a string of word or sound inputs, rather than a single large input. Point being: to tackle the problem of understanding speech, researchers sought to modify neural nets to process input as a stream, as with speech, rather than in one batch, as with an image.
One approach to this, by Alexander Waibel et al. (including Hinton), was introduced in the 1989 paper “Phoneme recognition using time-delay neural networks”. These time-delay neural networks (TDNNs) handled sequences by giving each unit access not just to the current input but also to a fixed window of past inputs through time-delayed connections. The more general approach was the recurrent neural net (RNN), in which the graph of the net is allowed to contain cycles, so that the outputs of units feed back into the net at the next time step and act as a form of memory.
Again, those seeking greater insight into the distinctions between different neural nets would do well to just go back to the actual papers. Here is a nice summation of why RNNs are cooler than TDNNs for sequential data: “A recurrent network has cycles in its graph that allow it to store information about past inputs for an amount of time that is not fixed a priori, but rather depends on its weights and on the input data. The type of recurrent networks considered here can be used either for sequence recognition, production, or prediction; units are not clamped and we are not interested in convergence to a fixed point. Instead, the recurrent network is used to transform an input sequence (e.g. speech spectra) into an output sequence (e.g. degrees of evidence for phonemes). The main advantage of such recurrent networks is that the relevant past context can be represented in the activity of the hidden units and then used to compute the output at each time step. In theory, the network can learn how to extract the relevant context information from the input sequence. In contrast, in networks with time delays such as TDNNs, the designer of the network must decide a priori, by the choice of delay connections, which part of the past input sequence should be used to predict the next output. According to the terminology introduced in [...], the memory is static in the case of TDNNs, but it is adaptive in the case of recurrent networks.”
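The contrast in that quote is easy to see in code. Below is a minimal numpy sketch (all sizes and untrained weights are invented for illustration) of the two kinds of unit: a TDNN-style output that sees only a fixed, designer-chosen window of past frames, versus a recurrent hidden state that carries context forward through the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 8, 3, 20    # e.g. spectra in, phoneme scores out
xs = rng.normal(size=(T, n_in))        # an input sequence of T frames

# TDNN-style: each output depends only on a fixed window of past inputs,
# chosen a priori by the network designer (here, the last 3 frames).
WINDOW = 3
W_tdnn = rng.normal(size=(WINDOW * n_in, n_out))
tdnn_out = [np.tanh(xs[t - WINDOW:t].reshape(-1) @ W_tdnn)
            for t in range(WINDOW, T + 1)]

# RNN-style: a hidden state h carries the past forward, and how much of it
# is retained depends on the learned weights, not on a fixed delay choice.
W_xh = rng.normal(size=(n_in, n_hid))
W_hh = rng.normal(size=(n_hid, n_hid))
W_hy = rng.normal(size=(n_hid, n_out))
h = np.zeros(n_hid)
rnn_out = []
for x_t in xs:
    h = np.tanh(x_t @ W_xh + h @ W_hh)  # new state mixes input with memory
    rnn_out.append(h @ W_hy)            # output at time t can reflect all past inputs
```

Training the recurrent version means backpropagating through all those repeated uses of the same weights - ‘backpropagation through time’ - which is exactly where the trouble described next comes from.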
But recurrent nets came with a catch - they proved harder to train well:

“Although recurrent networks can in many instances outperform static networks, they appear more difficult to train optimally. Our experiments tended to indicate that their parameters settle in a suboptimal solution which takes into account short-term dependencies but not long-term dependencies. For example, in experiments described in [...] we found that simple duration constraints on phonemes had not at all been captured by the recurrent network. … Although this is a negative result, a better understanding of this problem could help in designing alternative systems for learning to map input sequences to output sequences with long-term dependencies, e.g. for learning finite state machines, grammars, and other language-related tasks. Since gradient-based methods appear inadequate for this kind of problem, we want to consider alternative optimization methods that give acceptable results even when the criterion function is not smooth.”
A New Winter Dawns
So, there was a problem. A big problem. And the problem, basically, was what so recently had been a huge advance: backpropagation. See, convolutional neural nets were important in part because backpropagation just did not work well for normal neural nets with many layers. And that’s the real key to deep learning - having many layers, in today’s systems as many as 20 or more. But already by the late 1980s, it was known that deep neural nets trained with backpropagation just did not work very well, and particularly did not work as well as nets with fewer layers. The reason, in basic terms, is that backpropagation relies on finding the error at the output layer and successively splitting up blame for it among the prior layers. Well, with many layers this calculus-based splitting of blame ends up with either huge or tiny numbers, and the resulting neural net just does not work very well - the ‘vanishing or exploding gradient problem’. Jurgen Schmidhuber, another Deep Learning luminary, summarizes the more formal explanation well:

“A diploma thesis (Hochreiter, 1991) represented a milestone of explicit DL research. As mentioned in Sec. 5.6, by the late 1980s, experiments had indicated that traditional deep feedforward or recurrent networks are hard to train by backpropagation (BP) (Sec. 5.5). Hochreiter’s work formally identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients. With standard activation functions (Sec. 1), cumulative backpropagated error signals (Sec. 5.5.1) either shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers or CAP depth (Sec. 3), or they explode.”

Backpropagation through time is essentially equivalent to a neural net with a whole lot of layers, so RNNs were particularly difficult to train with backpropagation. Both Sepp Hochreiter, advised by Schmidhuber, and Yoshua Bengio published papers on the inability to learn long-term information due to the limitations of backpropagation.
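The essence of the problem can be shown numerically. Here is a toy sketch (the one-unit chain of sigmoid layers is invented purely for illustration, not taken from any of the papers above) of how the backpropagated error shrinks exponentially with depth:

```python
import numpy as np

# The sigmoid's derivative is at most 0.25, so with a weight of 1.0 every
# layer of this toy chain shrinks the backpropagated error by at least 4x.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, x, depth = 1.0, 0.5, 20
zs = []
for _ in range(depth):            # forward pass through the chain of layers
    z = w * x
    zs.append(z)
    x = sigmoid(z)

grad = 1.0                        # error signal arriving at the output layer
for z in reversed(zs):            # backward pass: one chain-rule factor per layer
    grad *= sigmoid(z) * (1.0 - sigmoid(z)) * w
print(grad)                       # ~1e-13: the gradient has all but vanished
# In bigger nets, large weights can instead compound the other way, so the
# error signal grows out of bounds - the 'exploding' side of the problem.
```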
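A fix for recurrent nets did eventually arrive: in 1997 Hochreiter and Schmidhuber introduced the Long Short-Term Memory (LSTM) net, built specifically so that error signals could survive across many time steps. Before the formal description, here is a toy scalar sketch of its core trick, the Constant Error Carousel (the input value and time horizon are invented for illustration):

```python
# Forward: the CEC's self-connection has a fixed weight of 1.0 and an
# identity activation, so the cell simply accumulates its (gated) input.
T = 1000
c = 0.0
for t in range(T):
    c = 1.0 * c + 0.01            # identity activation, self-weight of 1.0

# Backward: each step contributes a factor dc_t/dc_{t-1} of exactly 1.0,
# so an error signal passes through all 1000 steps unchanged.
grad = 1.0
for t in range(T):
    grad *= 1.0
print(grad)                       # still exactly 1.0: no vanishing, no exploding
```

Schmidhuber’s overview describes the idea in full: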
“The basic LSTM idea is very simple. Some of the units are called Constant Error Carousels (CECs). Each CEC uses as an activation function f, the identity function, and has a connection to itself with fixed weight of 1.0. Due to f’s constant derivative of 1.0, errors backpropagated through a CEC cannot vanish or explode (Sec. 5.9) but stay as they are (unless they “flow out” of the CEC to other, typically adaptive parts of the NN). CECs are connected to several nonlinear adaptive units (some with multiplicative activation functions) needed for learning nonlinear behavior. Weight changes of these units often profit from error signals propagated far back in time through CECs. CECs are the main reason why LSTM nets can learn to discover the importance of (and memorize) events that happened thousands of discrete time steps ago, while previous RNNs already failed in case of minimal time lags of 10 steps.”

But this did little to fix the larger perception problem that neural nets were janky and did not work very well. They were seen as a hassle to work with - the computers were not fast enough, the algorithms were not smart enough, and people were not happy. So, around the mid 90s, a new AI Winter for neural nets began to emerge - the community once again lost faith in them. A new method called Support Vector Machines, which in the very simplest terms could be described as a mathematically optimal way of training an equivalent to a two-layer neural net, was developed and started to be seen as superior to the difficult-to-work-with neural nets. In fact, the 1995 “Comparison of Learning Algorithms For Handwritten Digit Recognition” reached just such a conclusion:
“The [support vector machine] classifier has excellent accuracy, which is most remarkable, because unlike the other high performance classifiers, it does not include a priori knowledge about the problem. In fact, this classifier would do just as well if the image pixels were permuted with a fixed mapping. It is still much slower and memory hungry than the convolutional nets. However, improvements are expected as the technique is relatively new.”

Other new methods, notably Random Forests, also proved to be very effective and with lovely mathematical theory behind them. So, despite the fact that CNNs consistently had state of the art performance, enthusiasm for neural nets dissipated and the machine learning community at large once again disavowed them. Winter was back. In part 4, we shall see how a small group of researchers persevered in this research climate and ultimately made Deep Learning what it is today.
- Anderson, C. W. (1989). Learning to control an inverted pendulum using neural networks. IEEE Control Systems Magazine, 9(3), 31-37.
- Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4-27.
- Lin, L. J. (1993). Reinforcement learning for robots using neural networks (No. CMU-CS-93-103). Carnegie Mellon University, Pittsburgh, PA, School of Computer Science.
- Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58-68.
- Thrun, S. (1995). Learning to play the game of chess. Advances in Neural Information Processing Systems, 7.
- Schraudolph, N. N., Dayan, P., & Sejnowski, T. J. (1994). Temporal difference learning of position evaluation in the game of Go. Advances in Neural Information Processing Systems, 817-817.
- Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328-339.
- LeCun, Y., & Bengio, Y. (1998). Convolutional networks for images, speech, and time series. In M. A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, 255-258. MIT Press, Cambridge, MA, USA.
- Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal of Pattern Recognition and Artificial Intelligence, 7, 647.
- Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. http://arxiv.org/abs/1404.7828
- Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. Advisor: J. Schmidhuber.
- Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. http://dx.doi.org/10.1162/neco.1997.9.8.1735
- LeCun, Y., Jackel, L. D., Bottou, L., Brunot, A., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Muller, U. A., Sackinger, E., Simard, P., & Vapnik, V. (1995). Comparison of learning algorithms for handwritten digit recognition. In Fogelman, F., & Gallinari, P. (Eds.), International Conference on Artificial Neural Networks, 53-60. EC2 & Cie, Paris.