An Interactive Introduction to Model-Agnostic Meta-Learning

Exploring the world of model-agnostic meta-learning and its variants.

This page is part of a multi-part series on Model-Agnostic Meta-Learning. If you are already familiar with the topic, use the menu on the right side to jump straight to the part that interests you. Otherwise, we suggest you start at the beginning.

Reptile

Reptile, proposed by Nichol et al., is another first-order version of MAML that uses a simple update procedure to find the optimal initialization \(\theta\). Let's have a look:

1. Sample task \(\tau_i\) from \(p(\tau)\)
2. Compute \(\phi_i := U_{\tau_i}(\theta)\), where \(U_{\tau_i}(\theta)\) is a gradient-based optimizer with \(k > 1\) gradient descent steps.
3. Update \(\theta\) according to \(\theta = \theta + \beta(\phi_i - \theta)\).
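
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy task distribution (quadratic losses with randomly drawn centers), the helper names `inner_optimizer`, `reptile` and `sample_task`, and all hyperparameter values are hypothetical choices made for this example.

```python
import numpy as np


def inner_optimizer(theta, grad_fn, alpha=0.1, k=5):
    """U_tau(theta): k plain gradient descent steps starting from theta."""
    phi = theta.copy()
    for _ in range(k):
        phi = phi - alpha * grad_fn(phi)
    return phi


def reptile(theta, sample_task, beta=0.1, alpha=0.1, k=5, iterations=1000):
    """Repeat the three Reptile steps for a fixed number of iterations."""
    theta = theta.copy()
    for _ in range(iterations):
        grad_fn = sample_task()                          # 1. sample a task
        phi = inner_optimizer(theta, grad_fn, alpha, k)  # 2. adapt to the task
        theta = theta + beta * (phi - theta)             # 3. interpolate
    return theta


# Hypothetical toy task distribution: each task is the quadratic loss
# 0.5 * ||params - center||^2 with a randomly drawn center.
rng = np.random.default_rng(0)


def sample_task():
    center = rng.normal(loc=[1.0, -2.0], scale=0.5)
    return lambda params: params - center  # gradient of the quadratic loss


theta = reptile(np.zeros(2), sample_task)
print(theta)
```

For this toy distribution, the learned \(\theta\) should end up close to the mean of the task centers, here \((1, -2)\). That holds for these simple quadratic tasks and is not a general guarantee.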

Note the two most noticeable differences from MAML. First, we only sample one task at a time. This is more similar to the training of the pretrained model than to MAML; remember that in MAML, we minimize the expected meta-loss over several sampled tasks. Second, it seems like we are not computing a meta-gradient at all, but rather a difference of parameters. In fact, the formula we use for updating \(\theta\), \[ \theta = \theta + \beta(\phi_i - \theta) = (1 - \beta)\,\theta + \beta\,\phi_i, \] is the formula for linear interpolation between \( \theta \) and \( \phi_i \), with interpolation rate \(\beta \in [0, 1]\): \(\beta = 0\) leaves \(\theta\) unchanged, while \(\beta = 1\) moves it all the way to \(\phi_i\). If you are not sure what that means in the context of parameter optimization, take a look at this figure. There you can play around with the position of \(\theta\), the optimizer \(U_{\tau_i}\) and the interpolation rate \(\beta\) and see how Reptile would calculate the update of \( \theta \).

One detail of Reptile is not apparent just from looking at the three steps, but is still important to note: Reptile does not (need to) differentiate between test and training sets when computing the loss \(\mathcal{L}_{\tau_i}\), since it does not update according to test-set performance.

Furthermore, it is worth studying why the authors of Reptile explicitly state that the optimizer \(U_{\tau_i}\) must perform more than one gradient descent step (\(k > 1\)). With a single step, \[ U_{\tau_i}(\theta) = \theta - \alpha \nabla_\theta \mathcal{L}_{\tau_i}(\theta) \] and the Reptile update becomes \[ \theta = \theta + \beta (\theta - \alpha \nabla_\theta \mathcal{L}_{\tau_i}(\theta) - \theta) = \theta - \alpha \beta \nabla_\theta \mathcal{L}_{\tau_i}(\theta), \] which corresponds to updating \(\theta \) according to standard gradient descent with learning rate \(\alpha \cdot \beta\). And this, in turn, is more or less the update scheme we used for the pretrained model, which we have already seen failing.
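
The reduction for \(k = 1\) can also be checked numerically. The snippet below is a small sanity check, not part of Reptile itself; the loss (a sum of fourth powers), the starting point and the values of \(\alpha\) and \(\beta\) are arbitrary assumptions chosen for illustration.

```python
import numpy as np


def grad(theta):
    # Gradient of the hypothetical loss L(theta) = sum(theta**4) / 4
    return theta ** 3


alpha, beta = 0.1, 0.5
theta = np.array([1.0, -2.0, 0.5])

# Reptile with a single inner step (k = 1)
phi = theta - alpha * grad(theta)
reptile_step = theta + beta * (phi - theta)

# Plain gradient descent with learning rate alpha * beta
sgd_step = theta - alpha * beta * grad(theta)

same_for_k1 = np.allclose(reptile_step, sgd_step)
print(same_for_k1)
```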

You might have already figured this out yourself if you set the inner steps of the figure from above to \(1\) (they were deliberately set to \(2\) by default, and now you know why). However, it should also be noted, as the authors of Reptile point out, that as soon as \(k > 1\), the update step cannot be reduced to simple gradient descent anymore, since it involves terms accounting for meta-performance.
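
To see where such terms come from, here is a short sketch for \(k = 2\). It assumes, for simplicity, that both inner steps use the same loss \(\mathcal{L}_{\tau_i}\) (the same batch). It writes \(g = \nabla_\theta \mathcal{L}_{\tau_i}(\theta)\) for the gradient and \(H = \nabla^2_\theta \mathcal{L}_{\tau_i}(\theta)\) for the Hessian at \(\theta\); these shorthand symbols are introduced here only for this sketch. The two inner steps are \[ \phi^{(1)} = \theta - \alpha g, \qquad \phi^{(2)} = \phi^{(1)} - \alpha \nabla_\theta \mathcal{L}_{\tau_i}(\phi^{(1)}). \] A first-order Taylor expansion of the second gradient around \(\theta\) gives \[ \nabla_\theta \mathcal{L}_{\tau_i}(\theta - \alpha g) = g - \alpha H g + O(\alpha^2), \] so that \[ \phi^{(2)} - \theta = -2\alpha g + \alpha^2 H g + O(\alpha^3). \] The first term is a plain gradient step, but the second depends on the curvature \(H\) of the task loss, so the Reptile update is in general no longer a rescaled gradient at \(\theta\).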

Now, at this point, you might have already understood how the Reptile update works, but still have no idea if and why it would find the same optimal initialization that MAML does! As for the if, Reptile does not (always) find the same optimal initialization that MAML would find, since Reptile does not minimize the same objective. However, Reptile performs competitively with MAML on several few-shot learning problems.

As for the why, let us revisit the update step of Reptile, which we called linear interpolation: \[ \theta = \theta + \beta(\phi_i - \theta). \] This formula is also known as the update rule for computing an exponential moving average. Starting with a current estimate \(\theta\) for the empirical mean of a distribution, we update our belief based on a new observation \(\phi_i\) and discount the update, trading off our confidence in our previous belief against our confidence that the new observation is close to the true mean.
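
The moving-average reading of the update can be made concrete with a small numerical check. The snippet below is an illustrative sketch: the observations are random numbers standing in for the adapted parameters \(\phi_i\), and the distribution, the rate \(\beta\) and the starting value are assumptions made for this example. It compares the iterated update with the unrolled form \(\theta_n = (1-\beta)^n \theta_0 + \beta \sum_{i=1}^{n} (1-\beta)^{n-i} \phi_i\), in which older observations receive exponentially smaller weights.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.1

# Random stand-ins for the adapted parameters phi_1, ..., phi_n
observations = rng.normal(loc=3.0, scale=1.0, size=200)

# Iterated Reptile-style update: theta <- theta + beta * (phi - theta)
theta0 = 0.0
theta = theta0
for phi in observations:
    theta = theta + beta * (phi - theta)

# Unrolled form: exponentially decaying weights, largest for the newest phi
n = len(observations)
weights = beta * (1 - beta) ** np.arange(n - 1, -1, -1)
closed_form = (1 - beta) ** n * theta0 + np.sum(weights * observations)

print(theta, closed_form)
```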

Hence, we can interpret Reptile as computing an estimate of the average of the task-adapted parameters \(\phi_i\) across tasks. This confirms what we already discovered with FOMAML: the information important for learning across tasks is contained not within one task (i.e., the pretrained approach) and, for the most part, not in higher-order derivatives (MAML), but within optimizing on a task with respect to the initial parameters.

Next up is iMAML, which changes the narrative and introduces us to yet another approach to bypass second-order gradients.

Author Contributions

Luis Müller implemented the visualization of MAML, FOMAML, Reptile and the Comparison. Max Ploner created the visualization of iMAML and the svelte elements and components. Both wrote the introduction together and contributed most of the text of the other parts. Thomas Goerttler came up with the idea and sketched out the project. He also wrote parts of the manuscript and helped with finalizing the document. Klaus Obermayer provided feedback on the project.

† equal contributors