Long short-term memory (LSTM) networks are a special kind of recurrent neural network capable of learning long-term dependencies. They work remarkably well on a wide variety of problems and are now widely used. LSTMs are specifically designed to avoid the long-term dependency problem: their architecture lets training by error backpropagation through time sidestep the vanishing-gradient problem. An LSTM network is controlled by recurrent gates, such as the “forget” gate. Errors are propagated back in time through a potentially unlimited number of virtual layers, so an LSTM can learn while preserving memories of events thousands or even millions of time steps in the past. Network topologies such as LSTM can be tailored to the specifics of the task. An LSTM network can take into account even long delays between significant events, and can therefore mix high-frequency and low-frequency components.

All recurrent neural networks have the form of a chain of repeating modules of a neural network. In a standard recurrent neural network, this repeating module has a very simple structure, for example a single hyperbolic-tangent (tanh) layer.
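To make this concrete, here is a minimal NumPy sketch of such a repeating module; the weight names and sizes are illustrative and not taken from the text above.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One repeating module of a plain RNN: a single tanh layer.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy usage with random weights on a short sequence.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_x = rng.standard_normal((hidden_size, input_size))
W_h = rng.standard_normal((hidden_size, hidden_size))
b = np.zeros(hidden_size)
h = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):
    h = rnn_step(x_t, h, W_x, W_h, b)
```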

LSTMs also have this chain structure, but the repeating module is different: instead of a single neural network layer there are four, and they interact in a particular way.

In the diagram above, each line carries an entire vector from the output of one node to the inputs of others. Pink circles represent pointwise operations, such as vector addition, while yellow rectangles are learned neural network layers. Merging lines denote concatenation, while branching lines indicate that their contents are copied and the copies sent to different places.

The main idea of LSTM

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is something like a conveyor belt: it runs straight along the entire chain with only minor linear interactions, so information can flow along it unchanged.

An LSTM can remove information from the cell state or add information to it, but this ability is carefully regulated by structures called gates. Gates are a way to let information through selectively. Each gate consists of a sigmoid neural network layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through the gate. Zero means “let nothing through”; one means “let everything through”. An LSTM has three such gates to protect and control the cell state.
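As a rough illustration (not part of the original text), a gate can be sketched as a sigmoid layer whose output scales a signal elementwise; the names apply_gate and signal below are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def apply_gate(x, W, b, signal):
    # The sigmoid layer outputs values in (0, 1), one per component of `signal`.
    gate_values = sigmoid(W @ x + b)
    # Pointwise multiplication: 0 lets nothing through, 1 lets everything through.
    return gate_values * signal
```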

The first step in our LSTM is to decide what information we are going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer”. It looks at the previous output and the current input and produces, for each component of the cell state, a number between 0 and 1. A one means “keep this completely”, while a zero means “get rid of this completely”.
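A minimal sketch of this step in the standard LSTM formulation, with illustrative weight names W_f and b_f, might look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    # f_t: one value in (0, 1) for each component of the previous cell state.
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
```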

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the current subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the previous subject.

The next step is to decide what new information we are going to store in the cell state. This step has two parts. First, a sigmoid layer called the “input gate layer” decides which values we will update. Next, a hyperbolic-tangent layer creates a vector of new candidate values, C̃t, that could be added to the state. In the next step we combine these two parts to create an update to the state.
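A sketch of this step in the same style, again with made-up weight names (W_i, b_i for the input gate layer and W_c, b_c for the tanh layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidates(h_prev, x_t, W_i, b_i, W_c, b_c):
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ z + b_i)      # which components of the state to update
    c_tilde = np.tanh(W_c @ z + b_c)  # vector of new candidate values
    return i_t, c_tilde
```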

In the example of our language model, we would want to add the gender of the new subject to the cell state, to replace the gender of the old one that we are forgetting.

Now it’s time to update the old cell state, Ct−1, into the new cell state Ct. All the decisions have already been made in the previous steps; it only remains to carry them out. We multiply the old state by ft, forgetting everything we earlier decided to forget, and then add the new candidate values C̃t, scaled by the output of the input gate. In the case of the language model, this is exactly where we drop the information about the old subject’s gender and add the new information, as decided in the previous steps.
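Written as a one-line sketch, using the outputs of the hypothetical helper functions above:

```python
def update_cell_state(c_prev, f_t, i_t, c_tilde):
    # Forget what the forget gate says to forget, then add the scaled candidate values.
    return f_t * c_prev + i_t * c_tilde
```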

Finally, we need to decide what we are going to output. This output will be based on our cell state, but will be a filtered version of it. First we run a sigmoid layer, which decides which parts of the cell state we are going to output. Then we pass the cell state through a hyperbolic tangent (tanh), to push the values into the interval from −1 to 1, and multiply it by the output of the sigmoid gate, so that we output only the parts we decided to.
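A sketch of this output step, with illustrative names W_o and b_o:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_step(h_prev, x_t, c_t, W_o, b_o):
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)  # which parts of the state to output
    return o_t * np.tanh(c_t)                                 # filtered cell state becomes the new output
```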

In the example with the language model, having just seen a subject, it might output information relevant to a verb, in case the next word is a verb. For example, it might output whether the subject is singular or plural, so that we know which form a verb should take if that is what comes next.