Saturday, June 19, 2021

Advancing AI theory with a first-principles understanding of deep neural networks

 

 https://ai.facebook.com/blog/advancing-ai-theory-with-a-first-principles-understanding-of-deep-neural-networks/

The steam engine powered the Industrial Revolution and changed manufacturing forever — and yet it wasn’t until the laws of thermodynamics and the principles of statistical mechanics were developed over the following century that scientists could fully explain at a theoretical level why and how it worked.

Lacking theoretical understanding didn’t stop people from improving on the steam engine, of course, but discovering the principles of the heat engine led to rapid improvements. And when scientists finally grasped statistical mechanics, the ramifications went far beyond building better and more efficient engines. Statistical mechanics led to an understanding that matter is made of atoms, foreshadowed the development of quantum mechanics, and (if you take a holistic view) even led to the transistor that powers the computer you’re using today.

AI today is at a similar juncture. Deep neural networks (DNNs) are a fixture of modern AI research, but they are more or less treated as a “black box.” While substantial progress has been made by AI practitioners, DNNs are typically thought of as too complicated to understand from first principles. Models are fine-tuned largely by trial and error — and while trial and error can be done intelligently, often informed by years of experience, it is carried out without any unified theoretical language with which to describe DNNs and how they function.

Today we are announcing the publication of The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks, a collaboration between Sho Yaida of Facebook AI Research, Dan Roberts of MIT and Salesforce, and Boris Hanin at Princeton. At a fundamental level, the book provides a theoretical framework for understanding DNNs from first principles. For AI practitioners, this understanding could significantly reduce the amount of trial and error needed to train these DNNs. It could, for example, reveal the optimal hyperparameters for any given model without going through the time- and compute-intensive experimentation required today.

The Principles of Deep Learning Theory will be published by Cambridge University Press in early 2022 and the manuscript is now publicly available. “The book presents an appealing approach to machine learning based on expansions familiar in theoretical physics," said Eva Silverstein, a Professor of Physics at Stanford University. "It will be exciting to see how far these methods go in understanding and improving AI."

This is only the first step toward the much larger project of reimagining a science of AI, one that’s both derived from first principles and at the same time focused on describing how realistic models actually work. If successful, such a general theory of deep learning could potentially enable vastly more powerful AI models and perhaps even guide us toward a framework for studying universal aspects of intelligence.

Interacting neurons

Until now, theorists trying to understand DNNs typically relied on an idealization of such networks, the so-called infinite-width limit, in which DNNs are modeled with an infinite number of neurons per layer. Like the ideal gas law compared with a real gas, the infinite-width abstraction provides a starting point for theoretical analysis. But it often bears little resemblance to real-world deep learning models — especially neural networks of nontrivial depth, where the abstraction will deviate more and more from an accurate description. While occasionally useful, the infinite-width limit is overly simplistic and ignores many of the key features of real DNNs that make them such powerful tools.

Approaching the problem from a physicist’s perspective, The Principles of Deep Learning Theory improves on this infinite-width limit by laying out an effective theory of DNNs at finite width. Physicists traditionally aim to create the simplest and most ideal model possible that also incorporates the minimum complexity necessary for describing the real world. Here, that required backing off the infinite-width limit and systematically incorporating all the corrections needed to account for finite-width effects. In the language of physics, this means modeling the tiny interactions between neurons both in a layer and across layers.

These may sound like small changes, but the results are qualitatively different between the existing toy models and the one described in the book. Imagine two billiard balls heading toward each other. If you used a noninteracting model analogous to the infinite-width limit to calculate what was about to happen, you’d find that the balls pass right through each other and continue in the same direction. But obviously that’s not what happens. The electrons in the balls cannot occupy the same space, so they ricochet off each other.

Those interactions — however small they may be for individual electrons — are what prevent you from falling through your chair, through the floor, and straight toward the center of the earth. Those interactions matter in real life, they matter in physics, and they matter to DNNs as well.


Taking into account similar interactions between neurons, the book’s theory finds that the real power of DNNs — their ability to learn representations of the world from data — is proportional to their aspect ratio, i.e., the depth-to-width ratio. This ratio is zero for infinite-width models, so these toy models fail to capture depth, and their description becomes less and less accurate as the depth of the DNNs increases. In contrast, working with finite-width layers, the effective theory actually factors in depth — which is vital for representation learning and other applications where the depth of the DNN really matters.

"In physics, effective field theories are a rigorous and systematic way to understand the complex interactions of particles,” said Jesse Thaler, Associate Professor of Physics at MIT and Director of the NSF AI Institute for Artificial Intelligence and Fundamental Interaction. “It is exciting to see that a similarly rigorous and systematic approach applies to understanding the dynamics of deep networks. Inspired by these developments, I look forward to more fruitful dialogue between the physics and AI communities."

Opening the box

While the framework described in the book can extend to the real-world DNNs used by the modern AI community — and provides a blueprint for doing so — the book itself mostly focuses on the simplest deep learning models (deep multilayer perceptrons) for the purposes of instruction.

Applied to this simplest architecture, the equations of the effective theory can be solved systematically. This means that we can have a first-principles understanding of the behavior of a DNN over the entire training trajectory. In particular, we can explicitly write down the function that a fully trained DNN is computing in order to make predictions on novel test examples.

Armed with this new effective theory, we hope theorists will be able to push for a deeper and more complete understanding of neural networks. There is much left to compute, but this work potentially brings the field closer to understanding what particular properties of these models enable them to perform intelligently.

We also hope that the book will help the AI community reduce the cycles of trial and error that sometimes constrain current progress. We want to help practitioners rapidly design better models — more efficient, better performing, faster to train, or perhaps all of these. In particular, those designing DNNs will be able to pick optimal hyperparameters without any training, and choose the optimal algorithms and model architecture for the best results.

These are questions that over the years many felt could never be answered or explained. The Principles of Deep Learning Theory demonstrates that AI isn’t an inexplicable art, and that practical AI can be understood through fundamental scientific principles.

Theory informing practice

Hopefully this is just the beginning. We plan to continue our research, extending our theoretical framework to other model architectures and acquiring new results. And on a broader level, we hope the book demonstrates that theory can provide an understanding of real models of practical interest.

"In the history of science and technology, the engineering artifact often comes first: the telescope, the steam engine, digital communication. The theory that explains its function and its limitations often appears later: the laws of refraction, thermodynamics, and information theory,” said Facebook VP and Chief AI Scientist Yann LeCun. “With the emergence of deep learning, AI-powered engineering wonders have entered our lives — but our theoretical understanding of the power and limits of deep learning is still partial. This is one of the first books devoted to the theory of deep learning, and lays out the methods and results from recent theoretical approaches in a coherent manner."

While empirical results have propelled AI to new heights in recent years, we firmly believe that practice grounded in theory could help accelerate AI research — and possibly lead to the discovery of new fields we can’t even conceive of yet, just as statistical mechanics led to the Age of Information over a century ago.


Sunday, June 13, 2021

Top 5 GPT-3 Successors You Should Know in 2021

 https://towardsdatascience.com/top-5-gpt-3-successors-you-should-know-in-2021-42ffe94cbbf

A brief summary of a handful of language models beyond GPT-3 with links to the associated original publications as well as specific articles by the author on some of these models.
Top 5 GPT-3 Successors You Should Know in 2021
0. GPT-3: Generative Pre-Trained language model (OpenAI)
1. Switch Transformer — The trillion-parameter pioneer (Google)
2. DALL·E — The creative artist (OpenAI)
3. LaMDA: Language Model for Dialogue Applications (Google)
4. MUM : Multitask Unified Model (Google)
5. Wu Dao 2.0 aka Enlightenment
 

GPT-3 is already old if we compare it with what AI is showing us this year. Since the transformer came out in 2017, it has seen wild success in diverse tasks, from language to vision. GPT-3 revolutionized the world last year and since then multiple breakthrough models have been presented. Countries and companies are immersed in a race to build better and better models.

The premise is that bigger models, bigger datasets, and more computational power comprise the AI-dominance trinity. Even if this paradigm has important detractors, its success is simply undeniable.

In this article, I’ll review the 5 most important transformer-based models from 2021. I’ll open the list with GPT-3 because of its immense significance, and then continue in chronological order — the last one was published just two weeks ago!

GPT-3 — The AI rockstar

OpenAI presented GPT-3 in May 2020 in a paper titled Language Models are Few-Shot Learners. In July 2020, the company released a beta API for developers to play with, and the model became an AI rockstar overnight.

GPT-3 is the third version of a family of Generative Pre-Trained language models. Its main features are its multitasking and meta-learning abilities. Trained in an unsupervised way on 570GB of Internet text data, it is able to learn tasks it hasn’t been trained on after seeing just a few examples (few-shot). It can also work in zero- and one-shot settings, but the performance is usually worse.

GPT-3 has demonstrated crazy language generation abilities. It can have conversations (impersonating historical figures, alive or dead), write poetry, songs, fiction, and essays. It can write code, music sheets, and LaTeX-formatted equations. It shows a modest level of reasoning, logic, and common sense. And it can ponder the future, the meaning of life, and itself.

Apart from this, GPT-3 showed great performance on standardized benchmarks, achieving SOTA in some of them. It shines the most on generative tasks, such as writing news articles. For this task, it reached human levels, confusing judges trying to separate its articles from human-made ones.

Here’s a complete overview I wrote about GPT-3 for Towards Data Science:

 

Switch Transformer — The trillion-parameter pioneer

In January 2021, Google published the paper Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. They presented the Switch Transformer, a new neural net whose goal is to facilitate the creation of larger models without increasing computational costs.

The feature that distinguishes this model from previous ones is a simplification of the Mixture of Experts algorithm. A Mixture of Experts (MoE) is a system in which tokens (elemental parts of the input) entering the model are routed to be processed by different parts of the neural net (experts). Thus, to process a given token, only a subsection of the model is active; we have a sparse model. This reduces the computational costs, allowing them to reach the trillion-parameter mark.

With the original MoE, each token was sent to at least two experts so that their outputs could be compared. With the Switch Transformer, Google simplified the routing process so that each token is sent to only one expert, as sketched below. This further reduces computational and communication costs. Google showed that a large Switch Transformer would outperform a large dense model (such as GPT-3, although they didn’t compare the two). This is a huge milestone in reducing the carbon footprint of large pre-trained models, which are state-of-the-art in language — and now also vision — tasks.
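To make the routing idea concrete, here is a minimal NumPy sketch of top-1 (“switch”) routing. All the sizes, the random router, and the random “experts” are illustrative assumptions for this post, not the paper’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's): 4 experts, 8-dim tokens, 6 tokens.
n_experts, d_model, n_tokens = 4, 8, 6
tokens = rng.normal(size=(n_tokens, d_model))

# Router: a single linear layer producing one logit per expert.
w_router = rng.normal(size=(d_model, n_experts))
# Each "expert" here is just an independent weight matrix standing in
# for a feed-forward block.
experts = rng.normal(size=(n_experts, d_model, d_model))

logits = tokens @ w_router                           # (n_tokens, n_experts)
gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
chosen = gates.argmax(-1)                            # top-1: one expert per token

out = np.empty_like(tokens)
for t in range(n_tokens):
    e = chosen[t]
    # Only the selected expert runs for this token; scaling by the gate
    # probability is what lets gradients flow back into the router.
    out[t] = gates[t, e] * (tokens[t] @ experts[e])

print(chosen)      # which expert processed each token
print(out.shape)
```

Because only one expert runs per token, adding more experts grows the parameter count without growing the per-token compute, which is the whole point of the sparse design.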

DALL·E — The creative artist

OpenAI presented DALL·E in February 2021 in a paper titled Zero-Shot Text-to-Image Generation. The system, named after Spanish painter Salvador DalĂ­ and Pixar’s cute robot WALL·E, is a smaller version of GPT-3 (12 billion parameters), trained specifically on text-image pairs. In the words of OpenAI’s researchers: “Manipulating visual concepts through language is now within reach.”

DALL·E explores the possibilities of image generation using the “compositional structure of language.” It combines the meaning of a written sentence with the potential visual representations it may have. Still, like GPT-3, it is highly dependent on the wording of the sentence to avoid mistakes in the images. Its strength lies in its zero-shot capabilities; it can perform generation tasks it hasn’t been trained on without the need for examples.

Among other capabilities, it can generate images from scratch given a written prompt, regenerate hidden parts of images, control attributes of objects, or integrate them in a single image. Even more impressive, DALL·E can also combine concepts at high levels of abstraction (when told “a snail made of harp,” it often draws the snail with a harp as its shell) and translate image-to-image (when told “the exact same cat on the top as a sketch on the bottom,” it draws a cat similar to the original picture).

DALL·E shows a rudimentary form of artistry. From the loosely interpretable descriptions of written language, it creates a visual reality. We may be closer to an AI version of “a picture is worth a thousand words” than ever.

Here’s the blog post from OpenAI with visual instances of DALL·E’s abilities:

LaMDA — The next generation of chatbots

Google presented LaMDA at their annual I/O conference in May 2021. LaMDA is expected to revolutionize chatbot technology with its amazing conversational skills. There’s no paper or API yet, so we’ll have to wait to get some results.

LaMDA, which stands for Language Model for Dialogue Applications, is the successor to Meena, another Google AI presented in 2020. LaMDA was trained on dialogue and optimized to minimize perplexity, a measure of how confident a model is in predicting the next token. Perplexity correlates highly with human evaluation of conversational skills.
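As a rough illustration of the metric (not Google’s evaluation code), perplexity is just the exponential of the average negative log-likelihood the model assigns to the correct next tokens, so a more confident model gets a lower score:

```python
import numpy as np

def perplexity(correct_token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the predicted tokens."""
    p = np.asarray(correct_token_probs)
    return float(np.exp(-np.mean(np.log(p))))

# Probabilities a hypothetical model assigned to the correct next token.
confident_model = [0.90, 0.80, 0.95, 0.85]
unsure_model    = [0.30, 0.20, 0.25, 0.40]

print(perplexity(confident_model))  # ~1.15  (lower is better)
print(perplexity(unsure_model))     # ~3.6
```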

LaMDA stands out as a sensible, specific, interesting, and factual chatbot. In contrast with previous chatbots, it can navigate the open-ended nature of conversations while keeping its responses sensible. It can make them specific, avoiding always-valid responses such as “I don’t know.” It can give “insightful and unexpected” responses that keep the conversation interesting. And it gives correct answers when factual knowledge is involved.

Here’s a complete review I wrote about LaMDA for Towards Data Science:

MUM — The brain of the search engine

Together with LaMDA, Google presented MUM, a system meant to revolutionize the search engine, in a similar — but more impactful — way to what BERT did a couple of years ago. As with LaMDA, there is no further information apart from Google’s demo and blog post, so we’ll have to wait for more.

MUM stands for Multitask Unified Model. It is a multitasking and multimodal language model 1,000x more powerful than BERT, its predecessor. It has been trained on 75 languages and many tasks, which gives it a better grasp of the world. However, its multimodal capabilities are what make MUM stronger than previous models. It can tackle text+image information and tasks, which gives it a versatility neither GPT-3 nor LaMDA has.

MUM is capable of tackling complex search queries such as “You’ve hiked Mt. Adams. Now you want to hike Mt. Fuji next fall, and you want to know what to do differently to prepare.” With today’s search engine, a precise and sensible answer would take a bunch of searches and a compilation of information. MUM can do it for you and give you a curated answer. Even more striking, because it is multimodal, “eventually, you might be able to take a photo of your hiking boots and ask, ‘Can I use these to hike Mt. Fuji?’”

Here’s a complete review I wrote of MUM for Towards Data Science:

Wu Dao 2.0 — The largest neural network

On the 1st of June, the BAAI annual conference presented Wu Dao 2.0 — translated as Enlightenment. This amazing AI now holds the title of largest neural network, a title that a year ago belonged to GPT-3. Wu Dao 2.0 has a striking 1.75 trillion parameters, 10x GPT-3.

Wu Dao 2.0 was trained on 4.9TB of high-quality text and image data. In comparison, GPT-3 was trained on 570GB of text data, almost 10 times less. Wu Dao 2.0 follows the multimodal trend and is able to perform text+image tasks. To train it, researchers invented FastMoE, a successor of Google’s MoE, which is “simple to use, flexible, high-performance, and supports large-scale parallel training.” We’ll probably see other versions of MoE in future models.

Its multimodal nature allows Wu Dao 2.0 to manage a wide set of tasks. It’s able to process and generate text, recognize and generate images, and mixed tasks such as captioning images and creating images from textual descriptions. It can also predict the 3D structures of proteins, like DeepMind’s AlphaFold. It even created a virtual student that can learn continuously. She can write poetry and draw pictures and will learn to code in the future.

Wu Dao 2.0 achieved SOTA levels on some standard language and vision benchmarks, such as LAMBADA, SuperGLUE, MS COCO, or Multi30K, surpassing GPT-3, DALL·E, CLIP, and CL². These amazing achievements make Wu Dao 2.0 the most powerful, versatile AI today. Yet, it’s only a matter of time before another, bigger AI appears on the horizon. Keep your eyes open!

Here’s a complete review I wrote of Wu Dao 2.0 for Towards Data Science:


Friday, June 11, 2021

How Attention works in Deep Learning: understanding the attention mechanism in sequence models


I have always worked on computer vision applications. Honestly, transformers and attention-based methods were always the fancy things that I never spent the time to study. You know, maybe later, and so on. Now they have managed to reach state-of-the-art performance on ImageNet [3].

In NLP, transformers and attention have been utilized successfully in a plethora of tasks including reading comprehension, abstractive summarization, word completion, and others.

After a lot of reading and searching, I realized that it is crucial to understand how attention emerged from NLP and machine translation. This is what this article is all about. After this article, we will inspect the transformer model like a boss. I give you my word.

Let’s start from the beginning: What is attention? Glad you asked!

Memory is attention through time. ~ Alex Graves 2020 [1]

Always keep this in the back of your mind.

The attention mechanism emerged naturally from problems that deal with time-varying data (sequences). So, since we are dealing with “sequences”, let’s formulate the problem in terms of machine learning first. Attention became popular in the general task of dealing with sequences.

Sequence to sequence learning

Before attention and transformers, Sequence to Sequence (Seq2Seq) worked pretty much like this:

seq2seq

The elements of the sequence $x_1, x_2$, etc. are usually called tokens. They can be literally anything. For instance, text representations, pixels, or even images in the case of videos.

OK. So why do we use such models?

The goal is to transform an input sequence (source) to a new one (target).

The two sequences can have the same or different lengths.

In case you are wondering, recurrent neural networks (RNNs) dominated this category of tasks. The reason is simple: we liked to treat sequences sequentially. Sounds obvious and optimal? Transformers proved that it’s not!

A high-level view of encoder and decoder

The encoder and decoder are nothing more than stacked RNN layers, such as LSTMs. The encoder processes the input and produces one compact representation, called z, from all the input timesteps. It can be regarded as a compressed format of the input.

encoder

On the other hand, the decoder receives the context vector z and generates the output sequence. The most common application of Seq2seq is language translation. We can think of the input sequence as the representation of a sentence in English and the output as the same sentence in French.

decoder
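Here is a toy NumPy sketch of that structure, with untrained random weights and made-up sizes: the encoder folds the whole source sequence into one vector z, and the decoder generates only from that vector, which is exactly the bottleneck discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T_in, T_out = 5, 8, 5, 4, 3   # illustrative sizes

# Untrained, random parameters: this shows only the structure.
W_xh, W_hh = rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h))
W_zh, W_hy = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_out))

def encoder(xs):
    """Vanilla RNN: squash all input timesteps into one context vector z."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h  # z: a single fixed-length summary of the whole input

def decoder(z, steps):
    """Generate the output sequence using only z as its starting memory."""
    h, outputs = z, []
    for _ in range(steps):
        h = np.tanh(h @ W_zh)
        outputs.append(h @ W_hy)
    return outputs

source = rng.normal(size=(T_in, d_in))
z = encoder(source)
target = decoder(z, T_out)
print(z.shape, len(target), target[0].shape)   # (8,) 3 (5,)
```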

In fact, RNN-based architectures used to work very well especially with LSTM and GRU components.

The problem? Only for small sequences (<20 timesteps). Visually:

scope-per-sequence-length

Let’s inspect some of the reasons why this holds true.

The limitations of RNN’s

The intermediate representation z cannot encode information from all the input timesteps. This is commonly known as the bottleneck problem. The vector z needs to capture all the information about the source sentence.

In theory, the mathematics indicates that this is possible. In practice, however, how far we can see into the past (the so-called reference window) is finite. RNNs tend to forget information from timesteps that are far behind.

Let’s see a concrete example. Imagine a sentence of 97 words:

“On offering to help the blind man, the man who then stole his car, had not, at that precise moment, had any evil intention, quite the contrary, what he did was nothing more than obey those feelings of generosity and altruism which, as everyone knows, are the two best traits of human nature and to be found in much more hardened criminals than this one, a simple car-thief without any hope of advancing in his profession, exploited by the real owners of this enterprise, for it is they who take advantage of the needs of the poor.” ~ Jose Saramago, “Blindness.”

Notice anything wrong? Hmmm… The words that facilitate the understanding are quite far apart!

In most cases, the vector z will be unable to compress the information of the early words as well as the 97th word.

Eventually, the system pays more attention to the last parts of the sequence. However, this is not usually the optimal way to approach a sequence task and it is not compatible with the way humans translate or even understand language.

Furthermore, stacked RNN layers usually create the well-known vanishing gradient problem, as perfectly visualized in the Distill article on RNNs:

memorization-rnns The stacked layers in RNNs may result in the vanishing gradient problem. Source

Thus, let us move beyond the standard encoder-decoder RNN.

Attention to the rescue!

Attention was born in order to address these two problems of the Seq2seq model. But how?

The core idea is that the context vector z should have access to all parts of the input sequence instead of just the last one.

In other words, we need to form a direct connection with each timestep.

This idea was originally proposed for computer vision. Larochelle and Hinton [5] proposed that by looking at different parts of the image (glimpses), we can learn to accumulate information about a shape and classify the image accordingly.

The same principle was later extended to sequences. We can look at all the different words at the same time and learn to “pay attention“ to the correct ones depending on the task at hand.

And behold. This is what we now call attention, which is simply a notion of memory, gained from attending at multiple inputs through time.

It is crucial, in my humble opinion, to understand the generality of this concept. To this end, we will cover all the different types into which one can divide attention mechanisms.

Types of attention: implicit VS explicit

Before we continue with a concrete example of how attention is used on machine translation, let’s clarify one thing:

Very deep neural networks already learn a form of implicit attention [6].

Deep networks are very rich function approximators. So, without any further modification, they tend to ignore parts of the input and focus on others. For instance, when working on human pose estimation, the network will be more sensitive to the pixels of the human body. Here is an example of self-supervised approaches to videos:

activations-focus-in-ssl Where activations tend to focus when trained in a self-supervised way. Image from Misra et al. ECCV 2016. Source

“Many activation units show a preference for human body parts and pose.” ~ Misra et al. 2016

One way to visualize implicit attention is by looking at the partial derivatives with respect to the input. In math, this is the Jacobian matrix, but a full treatment is beyond the scope of this article.
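As a minimal sketch of that idea (a toy untrained network, with finite differences standing in for the Jacobian that an autodiff framework would give you), we can measure how sensitive the output is to each input element:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(6, 4)), rng.normal(size=4)

def net(x):
    """A tiny untrained MLP standing in for a deep network."""
    return float(np.tanh(x @ W1) @ w2)

def input_sensitivity(x, eps=1e-5):
    """Finite-difference estimate of d net(x) / d x_i for every input i."""
    grads = np.zeros_like(x)
    for i in range(len(x)):
        up, down = x.copy(), x.copy()
        up[i] += eps
        down[i] -= eps
        grads[i] = (net(up) - net(down)) / (2 * eps)
    return grads

x = rng.normal(size=6)
# Inputs with a large |gradient| are the ones the network implicitly "attends" to.
print(np.abs(input_sensitivity(x)).round(3))
```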

However, we have many reasons to enforce this idea of attention explicitly. Attention is quite intuitive and interpretable to the human mind. Thus, by asking the network to ‘weigh’ its sensitivity to the input based on memory from previous inputs, we introduce explicit attention. From now on, we will refer to this as attention.

Types of attention: hard VS soft

Another distinction we tend to make is between hard and soft attention. In all the previous cases, we refer to attention that is parametrized by differentiable functions. For the record, this is termed soft attention in the literature. Officially:

Soft attention means that the function varies smoothly over its domain and, as a result, it is differentiable.

Historically, we had another concept called hard attention.

An intuitive example: You can imagine a robot in a labyrinth that has to make a hard decision on which path to take, as indicated by the red dots.

labyrinth-hard-attention A decision in the labyrinth. Source

In general, hard means that it can be described by discrete variables while soft attention is described by continuous variables. In other words, hard attention replaces a deterministic method with a stochastic sampling model.

In the next example, the model starts from a random location in the image and tries to find the “important pixels” for classification. Roughly, during training, the algorithm has to choose a direction to move inside the image.

hard-attention An example of hard attention.Source

Since hard attention is non-differentiable, we can’t use standard gradient descent. That’s why we need to train such models using Reinforcement Learning (RL) techniques such as policy gradients and the REINFORCE algorithm [6].

Nevertheless, the major issue with the REINFORCE algorithm and similar RL methods is that they have a high variance. To summarize:

Hard attention can be regarded as a switch mechanism to determine whether to attend to a region or not, which means that the function has many abrupt changes over its domain.

Ultimately, given that we already have all the sequence tokens available, we can relax the definition of hard attention. In this way, we have a smooth differentiable function that we can train end to end with our favorite backpropagation.

Let’s get back to our showcase to see it in action!

Attention in our encoder-decoder example

In the encoder-decoder RNN case, given the previous decoder state $\textbf{y}_{i-1}$ and the encoder hidden states $\textbf{h} = \{ \textbf{h}_1, \textbf{h}_2, \dots, \textbf{h}_n \}$, we have something like this:

$\textbf{e}_{i} = \operatorname{attention_{net}}\left(\textbf{y}_{i-1}, \textbf{h}\right) \in \mathbb{R}^{n}$

The index i indicates the prediction step. Essentially, we define a score between the hidden state of the decoder and all the hidden states of the encoder.

More specifically, for each hidden state (denoted by $j$) in $\textbf{h}_1, \textbf{h}_2, \dots, \textbf{h}_n$ we will calculate a scalar:

$e_{ij} = \operatorname{attention_{net}}\left(\textbf{y}_{i-1}, \textbf{h}_{j}\right)$

Visually, in our beloved example, we have something like this:

seq2seq-attention

Notice anything strange?

I used the symbol e in the equation and α in the diagram! Why?

Because we want some extra properties: a) to make it a probability distribution, and b) to push the scores far apart. The latter results in more confident predictions and is nothing more than our well-known softmax.

$\alpha_{ij} = \frac{\exp\left(e_{ij}\right)}{\sum_{k=1}^{T_x} \exp\left(e_{ik}\right)}$

Finally, here is where the new magic will happen:

$z_{i} = \sum_{j=1}^{T} \alpha_{ij} \textbf{h}_{j}$

In theory, attention is defined as the weighted average of values. But this time, the weighting is a learned function! Intuitively, we can think of $\alpha_{ij}$ as data-dependent dynamic weights. Therefore, it is obvious that we need a notion of memory, and, as we said, the attention weights store the memory that is gained through time.
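Putting the three equations together, here is a bare-bones NumPy sketch of one decoding step. The score function below is a plain dot product purely to keep the example short; as discussed next, the scoring can be any trainable function.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                         # encoder timesteps, hidden size (illustrative)
H = rng.normal(size=(T, d))         # encoder hidden states h_1 ... h_T
y_prev = rng.normal(size=d)         # previous decoder state y_{i-1}

# e_ij: one score per encoder state (dot product as a stand-in scorer).
e = H @ y_prev                      # shape (T,)

# alpha_ij: softmax turns the scores into a probability distribution.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# z_i: the context vector is the attention-weighted average of the encoder states.
z = alpha @ H                       # shape (d,)

print(alpha.round(3), alpha.sum())  # the weights sum to 1
print(z.shape)
```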

All the aforementioned are independent of how we choose to model attention! We will get down to that in a bit.

Attention as a trainable weight mean for machine translation

I find that the most intuitive way to understand attention in NLP tasks is to think of it as a (soft) alignment between words. But what does this alignment look like? Excellent question!

In machine translation, we can visualize the attention of a trained network using a heatmap such as below. Note that scores are computed dynamically.

attention-alignment Image from the neural machine translation paper [2]. Source

Notice what happens in the active non-diagonal elements. In the marked red area, the model learned to swap the order of words in translation. Also note that this is not a 1-1 relationship but a 1 to many, meaning that an output word is affected by more than one input word (each one with different importance).

How do we compute attention?

In our previous encoder-decoder example, we denoted attention as $\operatorname{attention_{net}}\left(\textbf{y}_{i-1}, \textbf{h}\right)$, which indicates that it is the output of a small neural network with inputs the previous decoder state $\textbf{y}_{i-1}$ and the encoder hidden states $\textbf{h} = \{ \textbf{h}_1, \textbf{h}_2, \dots, \textbf{h}_n \}$. In fact, all we need is a score that describes the relationship between the two states and captures how “aligned” they are.

While a small neural network is the most prominent approach, over the years there have been many different ideas for computing that score. The simplest one, as shown by Luong [7], computes attention as the dot product between the two states, $\textbf{y}_{i-1} \cdot \textbf{h}$. Extending this idea, we can introduce a trainable weight matrix in between, $\textbf{y}_{i-1} W_a \textbf{h}$, where $W_a$ is an intermediate matrix with learnable weights. Extending even further, we can also include an activation function in the mix, which leads to our familiar neural network approach $v_a^{T} \tanh\left(W_a [\textbf{h}; \textbf{y}_{i-1}]\right)$, proposed by Bahdanau [2].

In certain cases, the alignment is only affected by the position of the hidden state, which can be formulated using simply a softmax function: $\operatorname{softmax}\left(\textbf{y}_{i-1}, \textbf{h}\right)$.

The last one worth mentioning can be found in Graves A. [8] in the context of Neural Turing Machines and calculates attention as a cosine similarity: $\operatorname{cosine}\left[\textbf{y}_{i-1}, \textbf{h}\right]$.

To summarize the different techniques, I’ll borrow this table from Lilian Weng’s excellent article. The symbol $s_t$ denotes the predictions (I used $\textbf{y}_t$), while the different $W$’s indicate trainable matrices:

attention-calculation Ways to compute attention. Source

The approach that stood the test of time, however, is the last one proposed by Bahdanau et al. [2]: They parametrize attention as a small fully connected neural network. And obviously, we can extend that to use more layers.

This effectively means that attention is now a set of trainable weights that can be tuned using our standard backpropagation algorithm.
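For reference, here is how the scoring variants mentioned above look side by side in NumPy, with random untrained parameters and made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)                 # one encoder hidden state h_j
y = rng.normal(size=d)                 # previous decoder state y_{i-1}

W_a = rng.normal(size=(d, d))          # trainable matrix for the "general" score
W_c = rng.normal(size=(d, 2 * d))      # trainable matrix for the concat score
v_a = rng.normal(size=d)               # trainable vector (Bahdanau-style)

score_dot      = y @ h                                          # Luong dot product
score_general  = y @ W_a @ h                                    # dot product with W_a in between
score_additive = v_a @ np.tanh(W_c @ np.concatenate([h, y]))    # v_a^T tanh(W_a [h; y])

print(score_dot, score_general, score_additive)
```

In a real model, these parameters would be learned jointly with the rest of the network by backpropagation, which is exactly the point made above.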

As perfectly stated by Bahdanau et al. [2]:

“Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach, the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.” ~ Neural machine translation by jointly learning to align and translate

So, what do we lose? Hmm... I am glad you asked!

We sacrificed computational complexity. We have another neural network to train and we need $O(T^2)$ weights (where $T$ is the length of the input and output sentences).

Quadratic complexity can often be a problem! Unless you own Google ;)

And that brings us to local attention.

Global vs Local Attention

Until now we assumed that attention is computed over the entire input sequence (global attention). Despite its simplicity, it can be computationally expensive and sometimes unnecessary. As a result, there are papers that suggest local attention as a solution.

In local attention, we consider only a subset of the input units/tokens.

Evidently, this can sometimes be better for very long sequences. Local attention can also be merely seen as hard attention since we need to take a hard decision first, to exclude some input units.

Let’s wrap up the operations in a simple diagram:

attention

The colors in the attention block indicate that these weights are constantly changing with the input, while in convolutional and fully connected layers the weights change only slowly, via gradient descent.

The last and undeniably the most famous category is self-attention.

Self-attention: the key component of the Transformer architecture

We can also define the attention of the same sequence, called self-attention. Instead of looking for an input-output sequence association/alignment, we are now looking for scores between the elements of the sequence, as depicted below:

attention-graph

Personally, I like to think of self-attention as a graph. Actually, it can be regarded as a (k-vertex) connected undirected weighted graph. Undirected indicates that the matrix is symmetric.

In maths we have: $\operatorname{self\text{-}attention_{net}}\left(x, x\right)$. Self-attention can be computed with any of the trainable scoring functions mentioned above. The end goal is to create a meaningful representation of the sequence before transforming it into another.
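To see the graph picture in code, here is a bare-bones NumPy sketch of self-attention over a single sequence, with no learned projections, purely to show the structure: every token scores every other token, each row of scores is softmaxed, and each new token representation is a weighted average of all tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 6
X = rng.normal(size=(T, d))          # one sequence of T token vectors

scores = X @ X.T                     # (T, T): every token scored against every other token
scores /= np.sqrt(d)                 # common scaling to keep the softmax well-behaved

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row is a probability distribution

Z = weights @ X                      # new, context-aware token representations

print(weights.round(2))              # the weighted "attention graph" between tokens
print(Z.shape)                       # (4, 6)
```

Note that the raw score matrix here is symmetric, matching the undirected-graph picture above; after the row-wise softmax the weights are, in general, no longer symmetric.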

Advantages of Attention

Admittedly, attention has a lot of reasons to be effective apart from tackling the bottleneck problem. First, it usually eliminates the vanishing gradient problem, as it provides direct connections between the encoder states and the decoder. Conceptually, these connections act similarly to skip connections in convolutional neural networks.

One other aspect that I’m personally very excited about is explainability. By inspecting the distribution of attention weights, we can gain insights into the behavior of the model, as well as to understand its limitations.

Think, for example, of the English-to-French heatmap we showed before. I had an aha moment when I saw the swap of words in translation. Don’t tell me that isn’t extremely useful.

Attention beyond language translation

Sequences are everywhere!

While transformers are definitely used for machine translation, they are often considered as general-purpose NLP models that are also effective on tasks like text generation, chatbots, text classification, etc. Just take a look at Google’s BERT or OpenAI’s GPT-3.

But we can also go beyond NLP. We briefly saw attention being used in image classification models, where we look at different parts of an image to solve a specific task. In fact, visual attention models recently outperformed the state-of-the-art ImageNet models [3]. We have also seen examples in healthcare, recommender systems, and even graph neural networks.

To summarize everything said so far in a nutshell, I would say: Attention is much more than transformers and transformers are more than NLP approaches.

Only time will prove me right or wrong!

Conclusion

For a more holistic approach to NLP with attention models, we recommend the Coursera course. So if you aim to understand transformers, you are now ready to go! This article was about seeing through the equations of attention.

Attention is a general mechanism that introduces the notion of memory. The memory is stored in the attention weights through time, and it gives us an indication of where to look. Finally, we clarified all the possible distinctions of attention and showed a couple of famous ways to compute it.

As a next step, I would advise the TensorFlow tutorial on attention, which you can run in Google Colab. If you want to discover in more depth the principles of attention, the best resource is undeniably Alex Graves’ video from DeepMind:

 

 https://www.youtube.com/watch?v=AIiwuClvH6k

 

If you reached this point, I guess you are super ready for our Transformer article.

Cited as:

@article{adaloglou2020normalization,
title = "How attention works in deep learning: understanding the attention mechanism in sequence models",
author = "Adaloglou, Nikolas and Karagiannakos, Sergios",
journal = "https://theaisummer.com/",
year = "2020",
url = "https://theaisummer.com/attention/"
}

Acknowledgements

Thanks to the awesome Reddit community for identifying my mistake. Memory is attention through time and not vice versa.

References