An Interactive Introduction to Model-Agnostic Meta-Learning

Exploring the world of model-agnostic meta-learning and its variants.

This page is part of a multi-part series on Model-Agnostic Meta-Learning. If you are already familiar with the topic, use the menu on the right side to jump straight to the part that interests you. Otherwise, we suggest you start at the beginning.

iMAML: Implicit Gradients

In this part, we will take a closer look at another variant of MAML, which introduces regularization to model-agnostic meta-learning. To explain how Implicit Model-Agnostic Meta-Learning (iMAML) works, we will start with an observation: If we do many gradient steps in regular MAML, apart from an enormous computational burden, we face the issue that the model-parameters \( \phi \) depend less and less on the meta-parameter \( \theta \). If the parameters (\( \phi \) and \( \theta \)) are largely independent, placing \( \theta \) becomes more difficult, since its effect on \( \phi \) diminishes.

The vanilla MAML approach mitigates this by restricting itself to only a few gradient steps. This early stopping acts as an implicit regularizer that keeps \( \phi \) close to \( \theta \), and it can be interpreted as a Bayesian prior on the task parameters. iMAML, on the other hand, employs an explicit regularization term to counteract this effect.

Let's take another look at the few-shot learning objective that we have used throughout the article: $$ \min_\theta \mathbb E_\tau \left[ \mathcal L \left( \phi_\tau , \mathcal D ^ {test}_\tau \right)\right] $$ Here \( \phi_\tau \) is the task parameter that we acquire after solving the inner optimization problem and \( \mathcal D ^ {test}_\tau \) is the task-level test dataset.

In MAML, \( \phi_\tau \) is obtained by computing a small number of gradient descent steps starting from \( \theta \) (using our update function \(U_\tau: \Phi \rightarrow \Phi \)). For a single step, this reads: $$ \phi_\tau = U_\tau (\theta) = \theta - \alpha \nabla_\theta \mathcal L \left( \theta, \mathcal D^{train}_\tau \right) $$

The key idea of iMAML is to use an arbitrary optimizer that optimizes the task parameter until it reaches a minimum, while adding an \( L_2 \) regularization term to the task loss, instead of restricting the inner optimizer to only a few update steps: $$ \phi_\tau = U^\ast_\tau (\theta) = \arg\min_\phi \left( \mathcal L \left( \phi, \mathcal D^{train}_\tau \right) + \frac{\lambda}{2} \| \phi - \theta \| ^ 2 \right) $$ The objective is almost the same as the one we are used to, but now we encourage the optimizer to keep the Euclidean distance between \(\theta\) and the optimal task parameter small. In the following figure, you can explore the impact of the regularization term on the task loss space visually.

Play around with \( \theta \) and \( \lambda \) to get a feeling for the resulting loss space of the inner optimization objective. High \( \lambda \) will encourage the algorithm to place the task-parameter close to the meta-parameter \( \theta \). \( \lambda = 0\) results in the original loss function.
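For readers who prefer code to sliders, here is a minimal sketch of the regularized inner problem. The quadratic toy loss, the names `CENTER`, `CURVATURE` and `solve_inner`, and the use of plain gradient descent are all assumptions made for illustration; iMAML itself does not prescribe a particular inner optimizer.

```python
# Toy task loss (an assumption for illustration):
#   L(phi) = 0.5 * sum_i CURVATURE[i] * (phi[i] - CENTER[i])**2
CENTER = (2.0, -1.0)
CURVATURE = (1.0, 4.0)


def task_grad(phi):
    """Gradient of the toy task loss with respect to phi."""
    return [c * (p - m) for p, m, c in zip(phi, CENTER, CURVATURE)]


def solve_inner(theta, lam, steps=2000):
    """Minimize L(phi) + lam/2 * ||phi - theta||^2 by gradient descent."""
    # Step size chosen from the largest curvature of the regularized loss
    # so that the iteration is stable for every lam >= 0.
    lr = 1.0 / (max(CURVATURE) + lam)
    phi = list(theta)
    for _ in range(steps):
        grad = task_grad(phi)
        # Gradient of the regularized objective: task gradient + lam * (phi - theta)
        phi = [p - lr * (g + lam * (p - t)) for p, g, t in zip(phi, grad, theta)]
    return phi


if __name__ == "__main__":
    theta = [0.0, 0.0]
    for lam in (0.0, 1.0, 100.0):
        print(lam, [round(p, 3) for p in solve_inner(theta, lam)])
```

With \( \lambda = 0 \) the solver should end up at the minimum of the task loss, while a large \( \lambda \) should keep the solution close to \( \theta \), mirroring what the figure shows.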

Computing the Gradient

Now, in order to minimize the new meta-objective, we again calculate the gradient: $$ \begin{align} \nabla_\theta\, \mathbb E_\tau \left[ \mathcal L \left( {\phi}_\tau , \mathcal D ^ {test}_\tau \right)\right] &= \mathbb E_\tau \left[ \nabla_{\theta}\, \mathcal L \left( {\phi}_\tau , \mathcal D ^ {test}_\tau \right)\right]\\ &= \mathbb E_\tau \left[ \nabla_{\phi_\tau} \, \mathcal L \left( \phi_\tau , \mathcal D ^ {test}_\tau\right) \cdot \frac{\mathrm d \phi_\tau}{\mathrm d \theta} \right] \end{align} $$ Calculating the first part \( \nabla_{\phi_\tau} \, \mathcal L \left( \phi_\tau , \mathcal D ^ {test}_\tau\right) \) can be done using back-propagation. This is the gradient of the test loss with respect to \(\phi_\tau\), evaluated at the parameter that was found by the optimizer. The term \( \frac{\mathrm d \phi_\tau}{\mathrm d \theta} \) is what MAML struggles with, since it involves second-order terms. Depending on how complex the optimization algorithm is, it is undesirable to compute this term, as we have discussed already. In the following, we will study the unique solution iMAML offers to this issue.

Here's the awesome part: Assuming that our inner optimizer found a local minimum, we can conclude that the gradient of the inner objective with respect to the task parameter is 0. This gives us the following equation: $$ \begin{align} \mathbf {0} &= \nabla_{\phi} \left( \mathcal L \left( \phi, \mathcal D^{train} \right) + \frac{\lambda}{2} \| \phi - \theta \| ^ 2 \right)\\ &= \nabla_{\phi} \mathcal L \left( \phi, \mathcal D^{train} \right) + \lambda \left( \phi - \theta \right) \end{align} $$ Rearranging the terms, we get: $$ \phi = \theta - \frac{1}{\lambda} \nabla_\phi \mathcal L \left( \phi, \mathcal D^{train} \right) $$

The red arrow denotes the gradient \( \nabla_\phi \mathcal L \left( \phi, \mathcal D^{train} \right) \). The gradient pulls the task parameters \( \phi \) towards the minimum of the task loss. You can imagine the green arrow as being the counter-force that pulls \( \phi \) toward the meta-parameter \( \theta \).

These forces need to cancel out at the optimum, since moving in any direction will not improve the regularized loss. Hence, the gradient needs to be orthogonal to the isocurve (white circle): moving along the circle does not change the regularization term, and since \( \phi \) is optimal for the joint term, the projection of the task-loss gradient onto the circle must be zero (otherwise, moving some distance along the circle would improve the joint loss).

Using this result, we can calculate the Jacobian of the task-parameter \( \phi \) with respect to the meta-parameter \( \theta \) as follows: $$ \begin{align} \frac{\mathrm d \phi}{\mathrm d \theta} &= \frac{\mathrm d }{\mathrm d \theta} \left( \theta - \frac{1}{\lambda} \nabla_\phi \mathcal L \left( \phi, \mathcal D^{train} \right) \right)\\ &= \frac{\mathrm d \theta}{\mathrm d \theta} - \frac{1}{\lambda}\frac{ \mathrm d }{\mathrm d \theta} \nabla_\phi \mathcal L \left( \phi, \mathcal D^{train} \right)\\ &= I - \frac{1}{\lambda} \nabla^2_\phi \mathcal L \left( \phi, \mathcal D^{train} \right) \frac{\mathrm d \phi}{\mathrm d \theta} \end{align} $$ Here, to get from the 2nd to the 3rd line, we applied the chain rule, as \( \phi \) is a function of \( \theta \). As a result, we have two terms to calculate: the outer derivative (which results in the Hessian) and the total derivative \( \frac{\mathrm d \phi}{\mathrm d \theta} \).
Solving for \( \frac{\mathrm d \phi}{\mathrm d \theta} \) we get (assuming the inverse exists): $$\begin{align} &&\left(I + \frac{1}{\lambda} \nabla^2_\phi \mathcal L \left( \phi, \mathcal D^{train} \right)\right)\frac{\mathrm d \phi}{\mathrm d \theta} = I\\ \Rightarrow&& \frac{\mathrm d \phi}{\mathrm d \theta} = (I + \frac{1}{\lambda} \nabla^2_\phi \mathcal L \left( \phi, \mathcal D^{train} \right))^{-1} \end{align}$$ Let that sink in for a moment: By assuming that our inner optimizer found an optimal solution for our inner objective, we can derive a closed-form solution for the total derivative \( \frac{\mathrm d \phi}{\mathrm d \theta} \) that does not involve differentiating through the optimizer. To now calculate the meta-gradient, we just need to know the solution of the inner optimization problem without knowing the steps to get there!
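As a quick sanity check, consider a one-dimensional toy example. The quadratic loss and the symbols \( a \) and \( c \) are assumptions made purely for illustration. Let \( \mathcal L \left( \phi, \mathcal D^{train} \right) = \frac{a}{2} (\phi - c)^2 \) with \( a > 0 \). The optimality condition of the regularized inner problem can then be solved by hand: $$ \begin{align} 0 &= a (\phi - c) + \lambda (\phi - \theta)\\ \Rightarrow \quad \phi &= \frac{a c + \lambda \theta}{a + \lambda}\\ \Rightarrow \quad \frac{\mathrm d \phi}{\mathrm d \theta} &= \frac{\lambda}{a + \lambda} = \left( 1 + \frac{a}{\lambda} \right)^{-1} \end{align} $$ Since the Hessian of this loss is simply \( a \), differentiating the explicit solution gives exactly the same expression as the implicit formula. The two limits also match the intuition from the beginning of this part: for \( \lambda \rightarrow \infty \) the derivative approaches 1 (\( \phi \) follows \( \theta \)), while for \( \lambda \rightarrow 0 \) it approaches 0 (\( \phi \) no longer depends on \( \theta \)).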

In iMAML, the steps leading up to the optimal solution are not of interest when computing the meta-gradient, and hence we could even use an optimizer that cannot be differentiated through. Instead, the optimizer can be treated as a black box, and we only require the final solution.

Before moving on to the actual iMAML algorithm, there is a fantastic read on implicit differentiation in the paper "Efficient and Modular Implicit Differentiation". The authors offer a more general framework for computing gradients without needing to backpropagate through the unrolled forward propagation. Instead, they use an optimality condition - in the iMAML case, it is given by the gradient of the inner loop objective - in order to calculate the gradient implicitly.

Welcome back to reality and its approximations

In the above derivation, we have made two crucial assumptions that might not hold up in real-world scenarios:

  1. ... that we can find the exact \(\phi_\tau\). We are typically unable to obtain the exact optimum for each task on the regularized loss. Instead, the most common optimizers merely find a (hopefully good) approximation.
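  2. ... that we can "just invert that matrix". Numerical matrix inversion is not that easy: explicitly forming and inverting a matrix whose side length equals the number of model parameters is computationally heavy, and the result may be subject to numerical errors.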

But do not despair! The authors of iMAML have you covered. They realized that these assumptions would be problematic and offer an approach to mitigate both issues, leading up to the practical iMAML algorithm. In the following paragraphs, we want to briefly outline how iMAML deals with the above issues.

Let \( g \) be the meta-gradient for a single task that we want to find. Then we know from the equations above that the following identities hold: $$ \begin{align} &&g &= \frac{\mathrm d \phi}{\mathrm d \theta}\, \nabla_\phi \mathcal L \left( \phi, \mathcal D ^{test} \right)\\ &\Rightarrow& \left(I + \frac{1}{\lambda} \nabla^2_\phi \mathcal L \left( \phi, \mathcal D^{train} \right)\right)\, g &= \nabla_\phi \mathcal L \left( \phi, \mathcal D ^{test} \right) \end{align} $$ The second equation is a linear system of the form \( Ax = b \), with \( A = I + \frac{1}{\lambda} \nabla^2_\phi \mathcal L \left( \phi, \mathcal D^{train} \right) \), \( x = g \), and \( b = \nabla_\phi \mathcal L \left( \phi, \mathcal D ^{test} \right) \).

We are in luck, as there exist many common numerical approaches to solve such equations, one of them being an algorithm called "Conjugate Gradient" (short: CG). An explanation of how the algorithm works is outside the scope of this article, but you should know the following: If \( A \) is symmetric and positive definite, CG finds the exact solution (in exact arithmetic) after at most as many steps as the matrix has dimensions, and in practice a small number of steps often yields a good approximation. Additionally, we never really need to calculate the matrix \(A\) explicitly; it suffices to compute the product of the matrix with some vector \( v \), which only requires a Hessian-vector product.
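To make the matrix-free aspect concrete, here is a minimal sketch of CG that only ever calls a function computing \( Av \). The small matrix `HESSIAN`, the value of `LAM`, the right-hand side, and all function names are hypothetical choices for this example. In a real implementation, the Hessian-vector product would come from automatic differentiation rather than from an explicit matrix.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(matvec, b, max_iters=10, tol=1e-20):
    """Approximately solve A x = b, where A is only accessible via matvec(v) = A v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs = dot(r, r)
    for _ in range(max_iters):
        if rs < tol:     # squared residual norm is small enough
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Hypothetical stand-ins: a fixed symmetric positive-definite "Hessian"
# and a regularization strength, used only to exercise the solver.
HESSIAN = [[2.0, 1.0], [1.0, 3.0]]
LAM = 2.0


def hessian_vector_product(v):
    return [dot(row, v) for row in HESSIAN]


def apply_A(v):
    """Compute (I + H / LAM) v without ever forming the matrix I + H / LAM."""
    hv = hessian_vector_product(v)
    return [vi + hvi / LAM for vi, hvi in zip(v, hv)]


if __name__ == "__main__":
    b = [1.0, -1.0]      # plays the role of the right-hand side of A x = b
    g = conjugate_gradient(apply_A, b)
    print(g, apply_A(g))  # apply_A(g) should be close to b
```

Note that `conjugate_gradient` never sees `HESSIAN` directly; swapping `hessian_vector_product` for an autodiff-based version would leave the solver unchanged.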

Solving the above system for \( g \) gives us the meta-gradient without ever inverting a matrix explicitly. Further, Rajeswaran et al. prove theoretically that the meta-gradient remains well-behaved when the optimal task parameter \(\phi_\tau\) is only approximated. They also show empirically that iMAML is competitive with the other methods we discussed.
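The following sketch checks the implicit meta-gradient numerically on a one-dimensional toy problem. Everything here is an assumption for illustration: the quadratic train and held-out losses, the constants, and the function names. The inner problem is solved in closed form so that the comparison isolates the implicit formula, which is applied to the gradient of the held-out (test) loss as in the chain rule of the meta-objective.

```python
# Toy losses (assumptions for illustration):
#   train loss:    0.5 * A_TRAIN * (phi - C_TRAIN)**2
#   held-out loss: 0.5 * A_HELD * (phi - C_HELD)**2
A_TRAIN, C_TRAIN = 3.0, 1.0
A_HELD, C_HELD = 2.0, 1.5
LAM = 2.0


def inner_solution(theta):
    """Exact minimizer of train loss + LAM/2 * (phi - theta)**2."""
    return (A_TRAIN * C_TRAIN + LAM * theta) / (A_TRAIN + LAM)


def held_out_loss(phi):
    return 0.5 * A_HELD * (phi - C_HELD) ** 2


def implicit_meta_gradient(theta):
    """(1 + H / LAM)^(-1) times the held-out gradient; in 1-D, H = A_TRAIN."""
    phi = inner_solution(theta)
    held_out_grad = A_HELD * (phi - C_HELD)
    jacobian = 1.0 / (1.0 + A_TRAIN / LAM)
    return jacobian * held_out_grad


def finite_difference_meta_gradient(theta, eps=1e-5):
    """Central difference of theta -> held_out_loss(inner_solution(theta))."""
    upper = held_out_loss(inner_solution(theta + eps))
    lower = held_out_loss(inner_solution(theta - eps))
    return (upper - lower) / (2.0 * eps)


if __name__ == "__main__":
    theta = 0.0
    print(implicit_meta_gradient(theta), finite_difference_meta_gradient(theta))
```

Both numbers should agree up to floating-point error, even though the implicit version never differentiates through the steps of an optimizer.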

Discussion

As we have seen in the meta-gradient above, iMAML requires the computation of a second-order derivative. The huge benefit is that this second-order derivative only needs to be calculated for the last point the optimizer arrived at. We do not need to pass the gradient information through the steps of gradient-descent.

While calculating the gradient is comparatively easy, iMAML requires an optimizer that finds a quasi-optimal solution. Rajeswaran et al. show that the gradient is still approximately correct as long as the solution provided by the inner-loop optimizer is approximately correct. Still, we need more gradient steps if we use SGD than in regular MAML, where even one step may suffice.

According to the same paper, iMAML produces better results than MAML while consuming comparable resources. Whereas iMAML requires more inner-loop steps, MAML requires either more outer-loop steps or the expensive computation of a long back-propagation chain. Compared to first-order MAML (FOMAML) and Reptile, the authors report better results on Omniglot (remember the little exercise on the introduction page?) and Mini-ImageNet, two common few-shot classification datasets.

As Ferenc Huszár points out in his wonderful blog post on iMAML, iMAML does not consider the stochasticity of stochastic gradient descent: SGD may have non-zero probabilities of finding more than one task-level optimum, but iMAML will only derive the gradient with respect to the optimum that was actually found.

If you are interested in this consideration, you may also want to take a look at the paper titled "Probabilistic Model-Agnostic Meta-Learning".

After having studied three of the most prominent variants of MAML, we will spend some time comparing MAML and its variants interactively in the next part. Better close some background tasks on your device, 'cuz it'll get computationally heavy 👩‍💻.

Author Contributions

Luis Müller implemented the visualization of MAML, FOMAML, Reptile and the comparison. Max Ploner created the visualization of iMAML and the Svelte elements and components. Both wrote the introduction together and contributed most of the text of the other parts. Thomas Goerttler came up with the idea and sketched out the project. He also wrote parts of the manuscript and helped with finalizing the document. Klaus Obermayer provided feedback on the project.

† equal contributors