*
This page is part of a multi-part series on Model-Agnostic Meta-Learning.
If you are already familiar with the topic, use the menu on the right
side to jump straight to the part that interests you. Otherwise,
we suggest you start at the beginning.
*

First-Order MAML (FOMAML) was proposed by Finn et al. in the very paper that introduces MAML.
It is a straightforward heuristic for getting rid of
the second-order terms (which we introduced on
the last page):
simply set them to zero! As a result,
\[\nabla_\theta U_{\tau_i}(\theta) = I\]
and the overall meta loss gradient reduces to
\[ \nabla_\theta \mathcal{L}(\theta) = \sum_{i} \nabla_{U_{\tau_i}(\theta)} \mathcal{L}_{\tau_i,
\text{test}} (U_{\tau_i}(\theta))
.\]
Simple, right? Maybe a bit too simple. Let us take a closer look at the term we are discarding, namely
\[ \nabla^2_\theta \mathcal{L}_{\tau_i, \text{train}}(\theta). \]
This term is the *Hessian* of the loss function \(\mathcal{L}_{\tau_i, \text{train}}\), which
describes the local curvature of the loss. To see what it contains, consider an *MSE* loss, a neural net \(M\) and a dataset \(\mathcal{D} := (x, y)\). We omit some of the subscripts to
make the formulae more readable and write
\[ \nabla^2 \mathcal{L}(\theta) = \nabla^2 \frac{1}{2} (y - M(x; \theta))^T (y - M(x; \theta)) \]
\[ = \nabla M(x; \theta)\nabla M(x; \theta)^T -(y - M(x; \theta))^T \nabla^2 M(x; \theta). \]
So the only second-order term in the Hessian of the loss function is the Hessian of the neural net \(M\).
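
The decomposition above can be checked numerically on a toy problem. The following is a minimal sketch, not taken from the MAML paper: it assumes a hypothetical one-parameter model \(M(x; \theta) = \tanh(\theta x)\) and a single data point, and compares the sum of the two terms against a finite-difference estimate of the second derivative of the loss.

```python
import math

# Illustrative assumption: a one-parameter "network" M(x; theta) = tanh(theta * x).
def model(theta, x):
    return math.tanh(theta * x)

def loss(theta, x, y):
    """MSE loss 1/2 * (y - M(x; theta))^2 for a single data point."""
    return 0.5 * (y - model(theta, x)) ** 2

def hessian_terms(theta, x, y):
    """Return the two terms of the loss Hessian separately."""
    t = math.tanh(theta * x)
    d_model = x * (1.0 - t * t)                    # first derivative of M w.r.t. theta
    d2_model = -2.0 * x * x * t * (1.0 - t * t)    # second derivative of M w.r.t. theta
    first_order_term = d_model * d_model           # nabla M nabla M^T
    curvature_term = -(y - t) * d2_model           # -(y - M)^T nabla^2 M
    return first_order_term, curvature_term

theta, x, y = 0.3, 1.2, 0.9
first_order_term, curvature_term = hessian_terms(theta, x, y)

# Central finite difference of the loss as an independent check.
h = 1e-4
numeric = (loss(theta + h, x, y) - 2.0 * loss(theta, x, y) + loss(theta - h, x, y)) / h**2

print("first-order term:", first_order_term)
print("curvature term:  ", curvature_term)
print("sum vs. numeric: ", first_order_term + curvature_term, numeric)
```

Note that the curvature term is weighted by the residual \(y - M(x; \theta)\), so in this sketch it vanishes whenever the model fits the data point exactly.
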
There is empirical evidence that the local curvature
of neural nets is near zero **after training**, and near-zero local curvature would easily
justify dropping the Hessian in the MAML meta-update altogether. However, the same study also indicates that this is
not necessarily the case
for randomly initialized weights.
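
To make the difference between the full and the first-order meta-gradient concrete, here is a small sketch for a single task. It rests on several assumptions that are not part of the text above: the update \(U_{\tau_i}\) is a single gradient step with a hypothetical learning rate `alpha`, the model is the scalar linear map \(M(x; \theta) = \theta x\), and the data points are made up for illustration.

```python
# Toy setup (all names and numbers are illustrative assumptions):
# scalar linear model M(x; theta) = theta * x, MSE loss, one inner gradient step.

def loss(theta, data):
    """MSE loss: mean of 1/2 * (y - theta * x)^2 over the dataset."""
    return sum(0.5 * (y - theta * x) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    """First derivative of the loss with respect to theta."""
    return sum(-x * (y - theta * x) for x, y in data) / len(data)

def hessian(theta, data):
    """Second derivative of the loss (independent of theta for this model)."""
    return sum(x * x for x, _ in data) / len(data)

def inner_update(theta, train, alpha):
    """U(theta): one gradient step on the task's training set."""
    return theta - alpha * grad(theta, train)

def maml_grad(theta, train, test, alpha):
    """Full meta-gradient: (1 - alpha * Hessian) times the test gradient at U(theta)."""
    d_update = 1.0 - alpha * hessian(theta, train)
    return d_update * grad(inner_update(theta, train, alpha), test)

def fomaml_grad(theta, train, test, alpha):
    """First-order meta-gradient: the Hessian is set to zero, so dU/dtheta = 1."""
    return grad(inner_update(theta, train, alpha), test)

def meta_objective(theta, train, test, alpha):
    """Test loss after adaptation, i.e. the quantity MAML differentiates."""
    return loss(inner_update(theta, train, alpha), test)

train = [(1.0, 2.0), (2.0, 3.5)]
test = [(1.5, 3.0), (3.0, 6.5)]
theta, alpha = 0.5, 0.1

full = maml_grad(theta, train, test, alpha)
first_order = fomaml_grad(theta, train, test, alpha)

# Central finite difference of the meta-objective as a sanity check for `full`.
h = 1e-6
numeric = (meta_objective(theta + h, train, test, alpha)
           - meta_objective(theta - h, train, test, alpha)) / (2.0 * h)

print("MAML:  ", full)
print("FOMAML:", first_order)
print("check: ", numeric)
```

In this toy case the two gradients differ only by the scalar factor `1 - alpha * hessian(theta, train)`, so they point in the same direction as long as that factor is positive. In higher dimensions the Hessian can also rotate the gradient, which this one-parameter sketch cannot show.
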

Empirically, however, if you compare, e.g., Table 1 in the MAML paper, you will find that FOMAML easily keeps up with its second-order counterpart in terms of classification performance. So depending on your personal taste in theoretical rigor, this explanation might be more or less satisfactory.

If you are nonetheless interested in how local curvature affects a function space, take a look at the following figure. Here we prepared a very simple function space, namely the space of
\[ f(x) := \frac{1}{2} (x - \frac{1}{2})^T C (x - \frac{1}{2}) + g^T x, \]
with Hessian \(C \in \mathbb{R}^{2 \times 2}\), constant \(g \in \mathbb{R}^2\), and gradient \(C(x - \frac{1}{2}) + g\), where we assume that \(C\) is a symmetric matrix, i.e., that it has the form
\[ C = \begin{bmatrix} a & b \\ b & c \end{bmatrix}. \]
Changing \(a, b, c\) lets you observe the effect of curvature on the shape of the function space. As you should be able to verify, non-zero entries in the Hessian curve the space, and the more curvature we introduce, the poorer the first-order approximation \(\nabla f_{C=0}(x) = g\) to the gradient becomes.
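
The same effect can be reproduced without the figure. The sketch below evaluates the exact gradient \(C(x - \frac{1}{2}) + g\) and measures its distance to the curvature-free gradient \(g\) while the Hessian is scaled up. The evaluation point, the vector `g` and the entries of `C` are arbitrary values chosen for illustration.

```python
import math

def make_hessian(a, b, c):
    """Symmetric 2x2 Hessian C = [[a, b], [b, c]]."""
    return [[a, b], [b, c]]

def grad_f(x, C, g):
    """Exact gradient C (x - 1/2) + g of the quadratic f."""
    d = [x[0] - 0.5, x[1] - 0.5]
    return [C[0][0] * d[0] + C[0][1] * d[1] + g[0],
            C[1][0] * d[0] + C[1][1] * d[1] + g[1]]

def approximation_error(x, C, g):
    """Euclidean distance between the exact gradient and the C = 0 gradient g."""
    exact = grad_f(x, C, g)
    return math.hypot(exact[0] - g[0], exact[1] - g[1])

# Arbitrary illustrative values, not taken from the figure.
x = [1.0, 0.0]
g = [0.3, -0.2]

errors = {}
for scale in (0.0, 1.0, 2.0, 4.0):
    C = make_hessian(scale * 1.0, scale * 0.5, scale * 2.0)
    errors[scale] = approximation_error(x, C, g)
    print(f"curvature scale {scale}: error {errors[scale]:.4f}")
```

Because the error equals the norm of \(C(x - \frac{1}{2})\), it grows linearly with the scale of the Hessian and is zero only where that product vanishes, for example at \(x = \frac{1}{2}\) or for \(C = 0\).
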

Hopefully, you have gained some understanding of how FOMAML works and what effect second-order terms (encoding local curvature) can have on the loss space, as well as arguments for and against linear approximations of the meta-gradient.

The fact that FOMAML can compete so easily with MAML tells us that the information necessary to learn across tasks is contained, for the most part, not in any Hessian but in the first-order parts of the meta-gradient. Following up on this narrative, we will next study Reptile, another prominent first-order method that takes a slightly different approach.

**Luis Müller** created the visualization of MAML, FOMAML, Reptile and the Comparison. **Max Ploner** created the visualization of iMAML and the svelte elements and components. Both wrote the introduction together and contributed most of the text of the other parts. **Thomas Goerttler** came up with the idea and sketched out the project. He also wrote parts of the manuscript and helped with finalizing the document. **Klaus Obermayer** provided feedback on the project.

† equal contributors