The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions

@article{Hochreiter1998TheVG,
  title={The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions},
  author={Sepp Hochreiter},
  journal={Int. J. Uncertain. Fuzziness Knowl. Based Syst.},
  year={1998},
  volume={6},
  pages={107-116},
  url={https://api.semanticscholar.org/CorpusID:18452318}
}
  • Sepp Hochreiter
  • Published 1 April 1998
  • Computer Science
  • Int. J. Uncertain. Fuzziness Knowl. Based Syst.
The decaying error flow is theoretically analyzed, methods trying to overcome vanishing gradients are briefly discussed, and experiments comparing conventional algorithms and alternative methods are presented.
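The decay the abstract refers to can be made concrete with a short numerical sketch (a plain tanh RNN with small random weights; this is an illustration, not code from the paper): the gradient of a late hidden state with respect to an earlier one is a product of per-step Jacobians, and its norm shrinks rapidly as the time lag grows.

```python
# Minimal sketch (assumed setup, not the paper's): measure how the
# backpropagated signal dh_T/dh_(T-q) shrinks in a simple tanh RNN.
import numpy as np

rng = np.random.default_rng(0)
n, T = 20, 100
W = 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)  # small recurrent weights

h = np.zeros(n)
jacobians = []
for t in range(T):
    x = rng.standard_normal(n)
    pre = W @ h + x
    h = np.tanh(pre)
    # Jacobian of h_t w.r.t. h_{t-1}: diag(1 - tanh(pre)^2) @ W
    jacobians.append(np.diag(1.0 - h**2) @ W)

# ||dh_T / dh_(T-q)|| for growing lag q: product of per-step Jacobians.
grad = np.eye(n)
for q, J in enumerate(reversed(jacobians), start=1):
    grad = grad @ J
    if q in (1, 10, 50, 100):
        print(f"q={q:3d}  ||dh_T/dh_(T-q)|| ~ {np.linalg.norm(grad):.3e}")
```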

Linear Antisymmetric Recurrent Neural Networks

This paper suggests a new recurrent network structure called Linear Antisymmetric RNN (LARNN), based on the numerical solution of an Ordinary Differential Equation (ODE) whose stability properties yield a stable solution that corresponds to long-term memory.
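One way such a structure can be realized (a sketch under generic assumptions about step size and damping, not necessarily the paper's exact formulation) is a forward-Euler step of a linear ODE whose recurrent matrix is antisymmetric, so its eigenvalues are purely imaginary and the state neither explodes nor decays.

```python
# Hedged sketch: explicit-Euler step of a linear ODE with an antisymmetric
# recurrent matrix; eps and gamma (a small damping term) are illustrative.
import numpy as np

def antisymmetric_step(h, x, M, V, b, eps=0.1, gamma=0.01):
    W = M - M.T - gamma * np.eye(M.shape[0])  # antisymmetric part + damping
    return h + eps * (W @ h + V @ x + b)      # one Euler update of the state
```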

Learning Long Term Dependencies with Recurrent Neural Networks

It is shown that RNNs and especially normalised recurrent neural networks (NRNNs) unfolded in time are indeed very capable of learning time lags of at least a hundred time steps and it is demonstrated that the problem of a vanishing gradient does not apply to these networks.

Learning Longer Memory in Recurrent Neural Networks

This paper shows that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent, by using a slight structural modification of the simple recurrent neural network architecture.
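A hedged sketch of the kind of modification described: alongside the ordinary hidden state, a slowly changing context state is kept whose recurrence is close to the identity, so its contribution to the gradient does not shrink over long spans. The exact parameterization below is illustrative, not the paper's code.

```python
# Sketch (assumed parameterization): simple RNN augmented with slow context units.
import numpy as np

def slow_context_step(h, s, x, Wh, Wx, Ws, B, alpha=0.95):
    s_new = (1.0 - alpha) * (B @ x) + alpha * s        # near-identity recurrence
    h_new = np.tanh(Wx @ x + Wh @ h + Ws @ s_new)      # ordinary hidden units
    return h_new, s_new
```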

Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization

An initialization schema that pretrains the weights of a recurrent neural network to approximate the linear autoencoder of the input sequences is introduced and it is shown how such pretraining can better support solving hard classification tasks with long sequences.

Reinforcement learning with recurrent neural networks

RNNs can accurately map and reconstruct (partially observable) Markov decision processes, and the resulting inner state of the network can be used as a basis for standard RL algorithms; this forms a novel connection between recurrent neural networks (RNN) and reinforcement learning (RL) techniques.

Backpropagation-decorrelation: online recurrent learning with O(N) complexity

  • Jochen J. Steil
  • Computer Science
  • 2004
A new learning rule for fully recurrent neural networks is introduced which combines important principles: one-step backpropagation of errors and the usage of temporal memory in the network dynamics by means of decorrelation of activations.

Learning from Predictions: Fusing Training and Autoregressive Inference for Long-Term Spatiotemporal Forecasts

The results show that BPTT-SA effectively reduces iterative error propagation in convolutional RNNs and convolutional autoencoder RNNs, and demonstrate its capabilities in long-term prediction of high-dimensional fluid flows.

Decoupling Hierarchical Recurrent Neural Networks With Locally Computable Losses

It is empirically shown that in (deep) HRNNs, propagating gradients back from higher to lower levels can be replaced by locally computable losses, without harming the learning capability of the network, over a wide range of tasks.

Using recurrent networks for non-temporal classification tasks

This paper investigates the use of recurrent neural networks as an alternative to deep architectures and shows that, for a comparable number of parameters or complexity, replacing depth with recurrence can result in improved performance.
...

Learning long-term dependencies with gradient descent is difficult

This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.

Learning State Space Trajectories in Recurrent Neural Networks

A procedure is presented for finding ∂E/∂w_ij, where E is an error functional of the temporal trajectory of the states of a continuous recurrent network and w_ij are the weights of that network; it seems particularly suited for temporally continuous domains.

Gradient calculations for dynamic recurrent neural networks: a survey

The author discusses advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones and presents some "tricks of the trade" for training, using, and simulating continuous time and recurrent neural networks.

Learning Complex, Extended Sequences Using the Principle of History Compression

A simple principle for reducing the descriptions of event sequences without loss of information is introduced and this insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences.
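A toy sketch of the principle (illustrative only, with a generic predictor standing in for the lower-level network): only the symbols the predictor fails to predict, together with their time stamps, are passed to the next level, giving a shorter but lossless description of the sequence.

```python
# Hedged sketch of history compression: drop predictable symbols, keep the rest
# with their positions so the original sequence can be reconstructed.
def compress(sequence, predict):
    # predict(prefix) -> guess for the next symbol; any model can play this role
    unexpected = []
    for t, x in enumerate(sequence):
        if t == 0 or predict(sequence[:t]) != x:
            unexpected.append((t, x))  # unpredicted symbol + time stamp: lossless
    return unexpected
```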

Learning long-term dependencies in NARX recurrent neural networks

It is shown that the long-term dependencies problem is lessened for a class of architectures called nonlinear autoregressive models with exogenous inputs (NARX) recurrent neural networks, which have powerful representational capabilities.
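For context, a NARX-style recurrence lets the output depend directly on several delayed inputs and outputs (generic form; the delay orders n_u and n_y are illustrative):

    y(t) = f(u(t), u(t-1), ..., u(t-n_u), y(t-1), ..., y(t-n_y))

These direct delayed connections shorten the paths along which gradients must propagate.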

Credit Assignment through Time: Alternatives to Backpropagation

This work considers and compares alternative algorithms and architectures on tasks for which the span of the input/output dependencies can be controlled and shows performance qualitatively superior to that obtained with backpropagation.

Long Short-Term Memory

A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
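A minimal numpy sketch of an LSTM-style cell follows (modern gate layout including a forget gate, which the original formulation did not have; naming and weight stacking are assumptions, not the paper's notation). The additive cell-state update is the constant error carousel: gradients pass through it via elementwise gate products rather than repeated matrix multiplications.

```python
# Hedged sketch of one LSTM-style step with stacked gate weights.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # W: (4H, X), U: (4H, H), b: (4H,) stacked for the i, f, o, g blocks
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # constant error carousel
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new
```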

Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm

A more powerful recurrent learning procedure, called real-time recurrent learning (RTRL), is applied to some of the same problems studied by Servan-Schreiber, Cleeremans, and McClelland; the internal representations developed by RTRL networks reveal that they learn a rich set of internal states representing more about the past than is required by the underlying grammar.

LSTM can Solve Hard Long Time Lag Problems

This work shows that problems used to promote various previous algorithms can be solved more quickly by random weight guessing than by the algorithms proposed for them, and then uses LSTM, the authors' own recent algorithm, to solve a hard problem.

Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks

These simulations suggest that recurrent controller networks trained by Kalman filter methods can combine the traditional features of state-space controllers and observers in a homogeneous architecture for nonlinear dynamical systems, while simultaneously exhibiting less sensitivity than do purely feedforward controller networks to changes in plant parameters and measurement noise.