This page is part of a multi-part series on Model-Agnostic Meta-Learning. If you are already familiar with the topic, use the menu on the right side to jump straight to the part that interests you. Otherwise, we suggest you start at the beginning.
After a thorough introduction to MAML, we will next take a look at some of its variants. The variants presented on the following pages all address a particular weakness of MAML, which we first have to uncover; this page is dedicated to exactly that.
We will start with a short recap of the meta-gradient and follow up with a step-by-step expansion of said gradient.
Recall the expectation we wanted to minimize with MAML,
$$\mathbb{E}_\tau \left[\,
\mathcal{L}_{\tau, \text{test}}(\phi)
\,\right], $$
as well as the resulting meta-gradient (based on samples, rather than computing the whole expectation-integral):
$$\nabla_\theta \mathcal{L}(\theta) = \sum_{i} \nabla_\theta
\mathcal{L}_{\tau_i, \text{test}}(\phi_i)
,$$
where
$$ \phi_i = U_{\tau_i}(\theta) $$
and \(U_{\tau_i}\) is the fine-tuning function associated with task \(\tau_i\).
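To make the notation concrete, here is a minimal PyTorch sketch of the sampled meta-gradient. The `Task` objects and the `inner_update` function (playing the role of \(U_{\tau}\)) are hypothetical stand-ins, not part of any particular library:

```python
import torch

def maml_meta_gradient(theta, tasks, inner_update):
    """Sampled meta-gradient: sum over tasks i of  d L_test(phi_i) / d theta.

    `tasks` and `inner_update` are hypothetical stand-ins: each task is assumed
    to expose `train_loss` and `test_loss` functions, and `inner_update` plays
    the role of U_tau. For the gradient below to flow through the fine-tuning,
    `inner_update` must build a differentiable graph from theta to phi
    (as in the sketch further below).
    """
    meta_grad = torch.zeros_like(theta)
    for task in tasks:
        phi = inner_update(theta, task.train_loss)                 # phi_i = U_tau(theta)
        grad, = torch.autograd.grad(task.test_loss(phi), theta)    # differentiate through U_tau
        meta_grad = meta_grad + grad
    return meta_grad
```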
We will now take a deeper look at \(\nabla_\theta \mathcal{L}(\theta)\) by writing it out more explicitly. To avoid cluttered notation, we will focus on a single summand and drop the task indices.
In order to gain some understanding of what it actually means to differentiate through the fine-tuning function \(U_\tau\), we write out one summand of the meta-gradient: $$ \nabla_\theta \mathcal{L}_{\tau, \text{test}}(\phi) = \nabla_\theta \mathcal{L}_{\tau, \text{test}}(U_{\tau}(\theta)) = \nabla_{U_{\tau}(\theta)} \mathcal{L}_{\tau, \text{test}}(U_{\tau}(\theta)) \nabla_\theta U_{\tau}(\theta) , $$ where we used the chain rule to get from the second to the third expression. Here, \( \nabla_{U_{\tau}(\theta)}\, \mathcal{L}_{\tau, \text{test}} \) is the gradient of the loss function of task \(\tau\) at the optimized parameter \(\phi\), and \(\nabla_\theta U_{\tau}(\theta)\) is the Jacobian of the optimization algorithm (it describes how a change of \(\theta\) is locally transformed into a change of \( \phi \)).
Even if we assume that the optimizer takes only one gradient descent step (as in vanilla MAML), this term becomes \[ \nabla_\theta U_{\tau}(\theta) = \nabla_\theta (\theta - \alpha \nabla_\theta \mathcal{L}_{\tau, \text{train}}(\theta) ) = I - \alpha \nabla^2_\theta \mathcal{L}_{\tau, \text{train}}(\theta). \] Hence, MAML requires us to compute second derivatives in order to optimize \(\theta\), which is computationally costly, especially in high-dimensional problems (such as learning neural nets).
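To see what this means in code, here is a minimal PyTorch sketch of the one-step case. The quadratic train and test losses and the learning rate are hypothetical stand-ins for \(\mathcal{L}_{\tau,\text{train}}\) and \(\mathcal{L}_{\tau,\text{test}}\); the crucial detail is `create_graph=True`, which keeps the graph of the inner gradient so that backpropagation can include the second derivatives:

```python
import torch

alpha = 0.1  # inner-loop learning rate (hypothetical value)

# Toy stand-ins for the task losses.
def train_loss(p): return ((p - 1.0) ** 2).sum()
def test_loss(p):  return ((p + 2.0) ** 2).sum()

theta = torch.tensor([0.5, -0.3], requires_grad=True)

# One inner gradient step; create_graph=True keeps the graph of the inner
# gradient so that the meta-gradient can include second derivatives.
g_train, = torch.autograd.grad(train_loss(theta), theta, create_graph=True)
phi = theta - alpha * g_train

# Meta-gradient: backpropagating the test loss through the inner update
# implicitly multiplies by (I - alpha * Hessian of the train loss).
meta_grad, = torch.autograd.grad(test_loss(phi), theta)
print(meta_grad)
```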
However, one immediate takeaway should also be that the latter factor (the Jacobian) of the product $$ \nabla_{U_{\tau}(\theta)} \mathcal{L}_{\tau, \text{test}}(U_{\tau}(\theta)) \nabla_\theta U_{\tau}(\theta) $$ is costly to compute. We will investigate the Jacobian further and then discuss two ways to implement its calculation in practice.
If we don't restrict ourselves to making just a single gradient update, the term becomes slightly more complicated. Let \(k\) be the number of updates we make and \( \phi^i \) denote the parameters after the \(i\)th step of gradient descent (with \(\phi^0 = \theta\)), then: $$\begin{align} \nabla_\theta \phi^k &= \prod_{i=1}^{k} \nabla_{\phi^{i-1}} \phi^i \\ &= \prod^{k}_{i=1} \left( I - \alpha \nabla^2_{\phi^{i-1}}\mathcal L_{\tau,\text{train}}(\phi^{i-1}) \right). \end{align}$$ Here we have simply applied the chain rule across the \(k\) update steps and then substituted each factor $$\nabla_{\phi^{i-1}} \phi^i,$$ the Jacobian of a single update step (from \(\phi^{i-1}\) to \(\phi^{i}\)), using the expression derived above.
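For instance, writing out this product for \(k = 2\) (with \(\phi^0 = \theta\)) gives
\[
\nabla_\theta \phi^2
= \left( I - \alpha \nabla^2_{\theta}\mathcal L_{\tau,\text{train}}(\theta) \right)
  \left( I - \alpha \nabla^2_{\phi^{1}}\mathcal L_{\tau,\text{train}}(\phi^{1}) \right),
\qquad
\phi^1 = \theta - \alpha \nabla_\theta \mathcal L_{\tau,\text{train}}(\theta),
\]
so each additional inner step contributes one more Hessian-dependent factor to the optimizer Jacobian.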
There are two prevalent methods of calculating the meta-gradient in practice. Both methods have their unique limitations, which we will discuss briefly.
The most straightforward approach is to calculate the Jacobian in each update step. Considering the product from above, we can iteratively multiply the new Jacobian term with the accumulated Jacobian matrix, ending up with the full optimizer Jacobian after \(k\) update steps.
The big issue with Jacobian matrices, however, is that they are huge! Deep neural networks commonly have parameters numbering in the millions, which makes the full square Jacobian a matrix with trillions of entries. Though we may not need to store the complete square matrix and can potentially sparsify it, the resulting number of required entries for a large model will still be enormous. This places a huge burden on computational and memory resources. One advantage, however, is that the required memory depends only on the model size, not on the number of update steps.
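As a concrete (if impractical for large models) illustration, here is a sketch of this iterative accumulation in PyTorch. The toy quadratic loss and learning rate are hypothetical, and the product is accumulated in the order of the formula above; the whole point of the discussion is that a real implementation could never afford to materialize these matrices for large networks:

```python
import torch
from torch.autograd.functional import hessian

alpha = 0.1  # inner-loop learning rate (hypothetical value)

def train_loss(p):                                   # toy stand-in for L_train
    return ((p - 1.0) ** 2).sum()

def optimizer_jacobian(theta, k):
    """Accumulate the optimizer Jacobian over k inner gradient steps.

    Memory: a single (d x d) matrix, independent of k; but for models with
    millions of parameters this matrix is far too large to store in practice.
    """
    phi = theta.detach().clone()
    d = phi.numel()
    jac = torch.eye(d)                               # Jacobian of zero steps
    for _ in range(k):
        H = hessian(train_loss, phi)                 # second derivatives at phi^{i-1}
        jac = jac @ (torch.eye(d) - alpha * H)       # multiply in the factor for step i
        p = phi.requires_grad_(True)                 # take the actual inner step ...
        g, = torch.autograd.grad(train_loss(p), p)
        phi = (p - alpha * g).detach()               # ... to obtain phi^i
    return jac

print(optimizer_jacobian(torch.tensor([0.5, -0.3]), k=3))
```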
To find out how one can get around calculating any second-order derivatives, take a look at FOMAML and Reptile.
In order to circumvent the need for calculating and storing a Jacobian matrix, we can also use a different approach: If we take a step back and look at what we ultimately want to calculate, we recognize that we are not really interested in the Jacobian itself, but in the product of the Jacobian (first factor) with a gradient vector (second factor): \[ \nabla_\theta \mathcal{L}_{\tau, \text{test}}(\phi) = \left( \frac{\mathrm d U_{\tau}(\theta)}{\mathrm d\theta} \right) \left( \nabla_{U_{\tau}(\theta)} \mathcal{L}_{\tau, \text{test}} \right) .\]
This fact can be cleverly exploited.
One key ingredient is a function which we will call
\( \mathrm {sg} \) (for "stop gradient"). This function is very simple,
both in its implementation and in its properties: it returns its input value unchanged
but stops the flow of the gradient during backpropagation.
In TensorFlow this function is implemented with tf.stop_gradient()
while in PyTorch one can use the .detach()
function on a tensor
to produce the same effect. Each time we want to evaluate a gradient of
\(\mathrm{sg}(f(x))\) with respect to any variable \(x\) and with any function
\( f \), we assume
\[
\frac{\mathrm d\, \mathrm {sg}(f(x))}{\mathrm dx} = 0
.\]
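As a small illustration, here is how the stop-gradient behaves in PyTorch; the tiny example below is hypothetical and only meant to show that the detached branch contributes nothing to the gradient:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Without stop-gradient: d/dx (x**2) = 2x = 4 at x = 2.
y = x ** 2
y.backward()
print(x.grad)                    # tensor(4.)

x.grad = None                    # reset the accumulated gradient

# With stop-gradient: sg(x**2) is treated as a constant during backpropagation.
# The `0 * x` term only keeps a path to x so that backward() does not error out.
z = (x ** 2).detach() + 0 * x
z.backward()
print(x.grad)                    # tensor(0.)
```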
This allows us to transform the gradient. Since we only evaluate the test-loss gradient and never differentiate through it, we may wrap it in \(\mathrm{sg}\) (second line). We then add a term which is zero under the stop-gradient assumption (third line) and apply the product rule in reverse (fourth line), arriving at the following equations:
\[
\begin{align}
\nabla_\theta
\mathcal{L}_{\tau, \text{test}}(\phi) &=
\left(
\frac{\mathrm d \phi}{\mathrm d\theta}
\right)
\nabla_{\phi} \mathcal{L}_{\tau, \text{test}} (\phi) \\
&=
\left(
\frac{\mathrm d \phi}{\mathrm d\theta}
\right)
\mathrm{sg} \left(
\nabla_{\phi} \mathcal{L}_{\tau, \text{test}}(\phi)
\right)
\\
&=
\left(
\frac{\mathrm d \phi}{\mathrm d\theta}
\right)
\mathrm{sg} \left(
\nabla_{\phi} \mathcal{L}_{\tau, \text{test}}(\phi)
\right)
+
\left(
\frac{\mathrm d\, \mathrm{sg} \left(
\nabla_{\phi} \mathcal{L}_{\tau, \text{test}}(\phi)
\right)}{\mathrm d\theta}
\right)
\phi
\\
&=
\frac{\mathrm {d}}{\mathrm {d} \theta}
\left(
\phi\, \cdot
\mathrm{sg} \left(\nabla_{\phi} \mathcal{L}_{\tau, \text{test}}(\phi)\right)
\right)
.
\end{align}\]
Essentially, instead of backpropagating a change in the vector-valued output (as one might do when calculating a Jacobian), we backpropagate the dot product of the gradient on the test loss (treated as a constant) and the task parameters produced by the inner (task) optimizer. As a consequence, we obtain the desired gradient without ever computing the large Jacobian matrix.
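Reusing the toy setup from the earlier sketch, the following PyTorch snippet checks that backpropagating the dot product \(\phi \cdot \mathrm{sg}\left(\nabla_\phi \mathcal{L}_{\tau,\text{test}}\right)\) yields the same meta-gradient as backpropagating the test loss directly; the losses and learning rate are again hypothetical stand-ins:

```python
import torch

alpha = 0.1  # inner-loop learning rate (hypothetical value)

def train_loss(p): return ((p - 1.0) ** 2).sum()   # toy stand-in for L_train
def test_loss(p):  return ((p + 2.0) ** 2).sum()   # toy stand-in for L_test

theta = torch.tensor([0.5, -0.3], requires_grad=True)

# Inner loop: one gradient step, keeping the graph so that phi depends on theta.
g_train, = torch.autograd.grad(train_loss(theta), theta, create_graph=True)
phi = theta - alpha * g_train

# (a) Direct meta-gradient: backpropagate the test loss through the inner update.
meta_grad_direct, = torch.autograd.grad(test_loss(phi), theta, retain_graph=True)

# (b) Dot-product trick: treat the test-loss gradient at phi as a constant
#     (stop-gradient / detach) and backpropagate the scalar  phi . sg(grad).
grad_test, = torch.autograd.grad(test_loss(phi), phi, retain_graph=True)
surrogate = (phi * grad_test.detach()).sum()
meta_grad_trick, = torch.autograd.grad(surrogate, theta)

print(torch.allclose(meta_grad_direct, meta_grad_trick))  # True
```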
One way to think of this is the following: the derivative of this dot product is, as we now know, equal to the meta-gradient. By moving the meta-parameters in the opposite direction of it (i.e., doing gradient descent), we indirectly move the task-parameters so that they align better with the negative gradient of the test loss. Moving the task-parameters towards the negative test-loss gradient is, however, exactly what decreases the test loss. Hence, minimizing the dot product and minimizing the test loss are equivalent when doing gradient descent.
While this approach does not require us to store or calculate Jacobian matrices, it also comes with a downside: Calculating the derivative of the dot product requires backpropagating through the whole optimization chain. Hence, we need to store the complete computational graph and can't employ an iterative calculation as with the Jacobian matrices. The memory burden of this increases linearly with the number of updates the optimizer takes. It should also be mentioned that while this approach does not require the calculation of Jacobians, it does still require second-order derivatives.
To find out how one might remove the need to store the complete computational graph of the optimization procedure, take a look at iMAML.
Luis Müller implemented the visualization of MAML, FOMAML, Reptile and the Comparison. Max Ploner created the visualization of iMAML and the Svelte elements and components. Both wrote the introduction together and contributed most of the text of the other parts. Thomas Goerttler came up with the idea and sketched out the project. He also wrote parts of the manuscript and helped with finalizing the document. Klaus Obermayer provided feedback on the project.
† equal contributors