Unlike multilayer perceptrons, recurrent networks can use their internal memory to process sequences of arbitrary length. This makes RNNs applicable to tasks where the data arrives as a sequence of segments, such as handwriting recognition or speech recognition. Many architectures for recurrent networks, from simple to complex, have been proposed; the most widely used today are the long short-term memory (LSTM) network and the gated recurrent unit (GRU).

In the diagram above, the neural network A receives some input x and outputs some value h. The cyclic connection in an RNN allows information to be passed from the current step of the network to the next. There are many variants, solutions, and building blocks for recurrent neural networks. The difficulty with recurrent networks is that if every time step is modeled explicitly, each step needs its own layer of neurons, which leads to serious computational cost. In addition, such deep unrolled implementations are numerically unstable, since the gradients tend to vanish or explode. If we restrict the computation to a fixed time window, the resulting models fail to capture long-term trends. Various approaches try to improve the memory model and the mechanisms for remembering and forgetting.
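To make the recurrence concrete, here is a minimal sketch of one step of a simple RNN cell in Python with NumPy. All names and dimensions are illustrative, not taken from any particular library:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        # One step of a simple recurrent cell: the new hidden state
        # combines the current input x_t with the previous hidden
        # state h_prev through the same weights at every time step.
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

The cyclic connection is precisely the dependence of the result on h_prev: the state computed at one step becomes the input state of the next.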

Recurrent neural networks are not so different from ordinary neural networks. They can be thought of as multiple copies of the same network, each passing a message to its successor. See what happens if we unroll the cycle:

This chain-like structure shows that recurrent neural networks are by their nature closely related to sequences and lists.
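Unrolling is easy to express in code. Assuming the rnn_step helper sketched above, the "copies" in the chain all share a single set of weights; only the hidden state is handed from one step to the next:

    def unroll(xs, h0, W_xh, W_hh, b_h):
        # Apply the same cell to every element of the sequence xs.
        # Each iteration is one "copy" of the network in the chain.
        h, hs = h0, []
        for x_t in xs:               # xs: one input vector per time step
            h = rnn_step(x_t, h, W_xh, W_hh, b_h)
            hs.append(h)
        return hs                    # one hidden state per time step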


Fully recurrent network

This basic RNN architecture was developed in the 1980s. The network is built from nodes, each of which is connected to every other node. Each neuron has a time-varying, real-valued activation, and each connection carries a modifiable real-valued weight. The nodes are divided into input, output, and hidden nodes.
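As a toy sketch of this architecture (all sizes and node assignments below are made up for illustration), a single real-valued weight matrix connects every node to every other node, and the input nodes are clamped to the incoming data at each step:

    import numpy as np

    n = 8                               # total number of nodes (illustrative)
    inp, out = [0, 1], [6, 7]           # indices of input and output nodes
    W = np.random.randn(n, n) * 0.1     # one real weight per pair of nodes
    b = np.zeros(n)

    def step(a, x_t):
        # Every node receives signals from every other node;
        # the input nodes are then overwritten with the current data.
        a = np.tanh(W @ a + b)
        a[inp] = x_t
        return a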

In supervised learning with discrete time, an input vector is fed to the input nodes at each time step, while every other node computes its activation from the weighted signals arriving from the nodes connected to it. If, for example, the network performs speech recognition, the output nodes eventually produce the labels (the recognized words).
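Continuing the toy sketch above, supervised training compares the activations of the output nodes with the target labels at every step; the squared-error loss here is just one illustrative choice:

    def supervised_error(seq_x, seq_y, a0):
        # Run the network over one training sequence and sum the
        # deviation of the output-node activations from the targets.
        a, err = a0, 0.0
        for x_t, y_t in zip(seq_x, seq_y):
            a = step(a, x_t)             # step() from the sketch above
            err += np.sum((a[out] - y_t) ** 2)
        return err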

In reinforcement learning, there is no teacher providing target signals for the network. Instead, a fitness function or a reward function is used to evaluate the quality of the network's behavior, and its output values influence the inputs it receives from the environment. In particular, if the network plays a game, its output is scored by the number of points won or by a position evaluation. Each sequence produces an error equal to the total deviation of the network's output signals from the targets; if there is a set of training samples, the total error takes into account the errors of each individual sample.
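A sketch of that feedback loop, using the step() toy network from above and a purely hypothetical environment object (reset, observe, and act are stand-ins, not a real API):

    def fitness(env, steps=100):
        # No teacher signal: the network's outputs are fed back to the
        # environment as actions, and the accumulated reward (e.g. game
        # points) is the only measure of the network's quality.
        a = np.zeros(n)
        total_reward = 0.0
        env.reset()
        for _ in range(steps):
            a = step(a, env.observe())       # outputs influence future inputs
            total_reward += env.act(a[out])  # environment scores the action
        return total_reward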

The problem of long-term dependencies

One of the ideas that makes RNNs so appealing is that they might be able to use information from the past for the current task. For example, they could use previous video frames to understand subsequent ones. Sometimes recent information is all we need to perform the current task. Imagine, for example, a language model that tries to predict the next word based on the previous ones. If we try to predict the last word in the sentence “the clouds are in the sky”, we need no further context: it is quite obvious that the last word will be “sky”. In such cases, where the gap between the relevant information and the place where it is needed is small, RNNs can learn to use the information obtained earlier.


But there are also cases where we need wider context. Suppose we need to predict the last word in the text “I grew up in France … I speak French fluently”. The recent words suggest that the next one is probably the name of a language, but to narrow down which language, we need the earlier context, all the way back to the mention of France. It is not uncommon for the gap between the relevant information and the place where it is needed to become very large. Unfortunately, as the gap grows, RNNs become unable to learn to connect the two.

In theory, RNNs can handle such long-term dependencies: a human could carefully hand-pick parameters that solve toy problems of this form. In practice, however, RNNs prove unable to learn them.
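One way to see the difficulty is to measure how strongly an early state can still influence a late one. In a simple RNN, the Jacobian of the hidden state at step T with respect to the initial state is a product of T per-step Jacobians, and its norm typically shrinks (with small weights) or blows up (with large ones) exponentially. A minimal sketch, with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    W_hh = rng.standard_normal((d, d)) * 0.1  # small weights: gradients vanish
                                              # (larger ones tend to explode)
    h = rng.standard_normal(d)
    J = np.eye(d)                             # d h_T / d h_0, built step by step
    for T in range(1, 101):
        h = np.tanh(W_hh @ h)
        # Chain rule: this step contributes diag(1 - h^2) @ W_hh.
        J = (np.diag(1.0 - h ** 2) @ W_hh) @ J
        if T % 20 == 0:
            print(T, np.linalg.norm(J))       # decays roughly exponentially

The gradient used to learn a dependency spanning T steps flows through exactly this product, which is why very long gaps are so hard to train across.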