*This page is part of a multi-part series on Model-Agnostic Meta-Learning.
If you are already familiar with the topic, use the menu on the right
side to jump straight to the part that interests you. Otherwise,
we suggest you start at the beginning.*

Reptile, proposed by Nichol et al., repeats the following three steps:

1. Sample task \(\tau_i\) from \(p(\tau)\).
2. Compute \(\phi_i := U_{\tau_i}(\theta)\), where \(U_{\tau_i}\) is a gradient-based optimizer performing \(k > 1\) gradient descent steps.
3. Update \(\theta\) according to \(\theta = \theta + \beta(\phi_i - \theta)\).
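As a rough sketch, the three steps above can be written as a plain training loop. The 1-D quadratic tasks and all hyperparameter values below are illustrative assumptions, not taken from the original paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Hypothetical task: 1-D regression toward a random target.
    Returns the gradient of the loss 0.5 * (theta - target)**2."""
    target = rng.normal()
    return lambda theta: theta - target

def inner_optimizer(theta, grad_fn, alpha=0.1, k=5):
    """U_tau: k > 1 plain gradient descent steps starting from theta."""
    phi = theta
    for _ in range(k):
        phi = phi - alpha * grad_fn(phi)
    return phi

theta = 0.0  # meta-parameters
beta = 0.5   # interpolation rate
for _ in range(1000):
    grad_fn = sample_task()                # 1. sample a task tau_i
    phi = inner_optimizer(theta, grad_fn)  # 2. phi_i = U_tau(theta)
    theta = theta + beta * (phi - theta)   # 3. interpolate toward phi_i
```

Because the task optima here are drawn from a zero-mean distribution, \(\theta\) ends up hovering around their average.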

Note the two most noticeable differences from MAML:
First, we only sample one task at a time. This is more
similar to the training of the *pretrained*
model than to MAML; remember that in MAML, we minimize the expected meta-loss over several sampled tasks.
Second, it seems that we are not computing a meta-gradient at all, but rather a difference of parameters. In fact, the formula we use for updating \(\theta\),
\[ \theta = \theta + \beta(\phi_i - \theta), \]
is the formula for linear interpolation between \( \theta \) and \( \phi_i \), with interpolation rate \(\beta \in [0, 1]\).
If you are not sure what that means in the context of parameter optimization, take a look at this figure, where you can play around with the position of \(\theta\), the optimizer \(U_{\tau_i}\), and the interpolation rate \(\beta\), and see how Reptile would compute the update of \( \theta \).

One detail of Reptile is not apparent just from looking at the three steps, but is still important to note: Reptile does not (need to) distinguish between training and test sets when computing the loss \(\mathcal{L}_{\tau_i}\), since it does not update according to test-set performance.

Furthermore, it is worth spending some time on why the authors of Reptile explicitly state that the optimizer \(U_{\tau_i}\) must perform more than one gradient descent step (\(k > 1\)). This is because with \(k = 1\),
\[ U_{\tau_i}(\theta) = \theta - \alpha \nabla_\theta \mathcal{L}_{\tau_i}(\theta) \]
and Reptile updates
\[ \theta = \theta + \beta (\theta - \alpha \nabla_\theta \mathcal{L}_{\tau_i}(\theta) - \theta)
= \theta - \alpha \beta \nabla_\theta \mathcal{L}_{\tau_i}(\theta), \]
which corresponds to updating \(\theta\) by standard gradient descent with learning rate \(\alpha \cdot \beta\).
And this, in turn, is more or less the update scheme we used for the *pretrained* model, which we have
already seen failing.
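This collapse for \(k = 1\) is easy to verify numerically. The quadratic loss below is an illustrative assumption; any differentiable loss would do:

```python
import numpy as np

# Illustrative check on a quadratic loss L(theta) = 0.5 * (theta - c)**2,
# whose gradient is (theta - c); c is an arbitrary task optimum.
c = 3.0
grad = lambda th: th - c

alpha, beta = 0.1, 0.5
theta = 0.0

# One Reptile update with a single inner step (k = 1):
phi = theta - alpha * grad(theta)            # U_tau(theta)
reptile_step = theta + beta * (phi - theta)  # theta + beta(phi - theta)

# Plain gradient descent with learning rate alpha * beta:
sgd_step = theta - alpha * beta * grad(theta)

assert np.isclose(reptile_step, sgd_step)    # identical updates
```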

You might have already figured this out yourself if you set the inner steps in the figure above
to \(1\) (they were deliberately set to \(2\) by default - and now you know why).

Now, at this point, you might already understand *how* the Reptile update works, but have no idea
*if* and *why* it would
find the same optimal initialization that MAML does! As for the *if*: Reptile does not (always) find
the same optimal initialization
that MAML would, since it does not minimize the same objective. However, Reptile performs
competitively with MAML on several few-shot learning
problems.

As for the *why*, let us revisit the update step of Reptile, which we called *linear
interpolation*:
\[ \theta = \theta + \beta(\phi_i - \theta). \]
This formula is also known as the update rule for computing an *exponential moving
average*. Starting from
a current estimate \(\theta\) of the empirical mean of a distribution, we update our belief based on a new
observation
\(\phi_i\) and discount the update, trading off confidence in our previous belief against
the new observation
being close to the true mean.
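A minimal sketch of this reading of the update rule, with made-up values standing in for the adapted parameters \(\phi_i\):

```python
def ema(estimate, observation, beta):
    """Exponential moving average update: interpolate between the
    current estimate and a new observation with rate beta."""
    return estimate + beta * (observation - estimate)

# Averaging a stream of adapted parameters phi_i (hypothetical values):
theta = 0.0
for phi_i in [2.0, 4.0, 3.0, 3.5]:
    theta = ema(theta, phi_i, beta=0.5)
# theta is now 3.125, a discounted average of the observed phi_i
```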

Hence, we can interpret Reptile as computing an estimate of the average optimal parameters across tasks. This confirms what we already discovered with FOMAML: the information important for learning across tasks is contained neither within a single task (i.e., the *pretrained* approach) nor, for the most part, in higher-order derivatives (MAML), but within the process of optimizing on a task with respect to the initial parameters.

Next up is iMAML, which changes the narrative and introduces us to yet another approach to bypass second-order gradients.

**Luis Müller** implemented the visualization of MAML, FOMAML, Reptile and the Comparison. **Max Ploner** created the visualization of iMAML and the svelte elements and components. Both wrote the introduction together and contributed most of the text of the other parts. **Thomas Goerttler** came up with the idea and sketched out the project. He also wrote parts of the manuscript and helped with finalizing the document. **Klaus Obermayer** provided feedback on the project.

† equal contributors