**From Bootstrapped Meta-Learning to Time Series Forecasting with Deep Learning, the Relationship between Extrapolation & Generalization and Exploring Diverse Optima with Ridge Rider**

Welcome to the January edition of the ‘Machine-Learning-Collage’ series, where I provide an overview of different Deep Learning research streams. So what is an ML collage? Simply put, I draft a one-slide visual summary of one of my favourite recent papers. Every single week. At the end of the month, all of the resulting visual collages are collected in a summary blog post. With this, I hope to give you a visual and intuitive deep dive into some of the coolest trends. So without further ado: here are my four favourite papers that I recently read and why I believe them to be important for the future of Deep Learning.


**‘Bootstrapped Meta-Learning’**

*Authors: Flennerhag et al. (2021)* | 📝 Paper

**One Paragraph Summary:** Meta-learning algorithms aim to automatically discover inductive biases that allow for fast adaptation across many tasks. Classic examples include MAML and RL^2. Commonly, these systems are trained on a bi-level optimisation problem: in a fast inner loop, one considers only a single task instantiation. In a second, slower outer loop, the weights of the system are updated by batching across many such individual tasks. The system is thereby forced to discover and exploit the underlying structure of the task distribution. Most of the time, the outer update has to propagate gradients through the inner loop update procedure. This leads to two problems. First, how should one choose the length of the inner loop? Short horizons allow for easier optimisation but are potentially short-sighted. Second, the meta-objective can behave erratically, suffering from vanishing and exploding gradients. So how might we overcome this myopia and optimisation difficulty? Bootstrapped meta-learning proposes to construct a so-called bootstrap target by running the inner loop a little longer. We can then use the resulting network as a teacher for a shorter-horizon student. Similar to the targets in DQNs, the bootstrap target is detached from the computation graph and simply acts as a fixed quantity in the loss computation. Thereby, we essentially pull the meta-agent forward. The metric used to compare the teacher and the student can furthermore control the curvature of the meta-objective. In a set of toy RL experiments, the authors show that bootstrapping allows for fast adaptation of exploration despite a short horizon, and that it outperforms plain meta-gradients with a longer horizon. Together with the STACX meta-gradient agent, bootstrapped meta-gradients set a new state of the art on Atari and can also be applied to multi-task few-shot learning. All in all, this work opens many new perspectives on how to positively manipulate the meta-learning problem formulation.
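
To make the bootstrap target a little more concrete, here is a minimal sketch on a toy problem of my own choosing, not the setup from the paper. The inner loop is plain gradient descent on a one-dimensional quadratic, the only meta-parameter is the inner learning rate `eta`, and the matching metric is a squared distance. All names (`inner_rollout`, `bootstrapped_meta_train`, `student_steps`, `extra_steps`) are hypothetical, and the target here is produced by simply continuing the same inner loop, which is a simplification.

```python
def inner_rollout(x, eta, curvature, steps):
    """Run `steps` gradient steps with learning rate `eta` on f(x) = 0.5 * curvature * x**2."""
    for _ in range(steps):
        x = x - eta * curvature * x
    return x


def d_rollout_d_eta(x0, eta, curvature, steps):
    """Analytic derivative of x_K = (1 - eta * curvature)**K * x0 with respect to eta."""
    return -steps * curvature * (1.0 - eta * curvature) ** (steps - 1) * x0


def bootstrapped_meta_train(eta=0.1, x0=1.0, curvature=1.0, student_steps=1,
                            extra_steps=5, meta_lr=0.5, meta_updates=50):
    """Meta-learn the inner learning rate by matching a short rollout to a longer one."""
    for _ in range(meta_updates):
        # Short-horizon student: the only part the meta-gradient flows through.
        student = inner_rollout(x0, eta, curvature, student_steps)
        # Bootstrap target: keep running the inner loop from the student's end point.
        # It is treated as a constant (the "detach"), so it contributes no derivative.
        target = inner_rollout(student, eta, curvature, extra_steps)
        # Matching loss 0.5 * (student - target)**2, differentiated via the student only.
        meta_grad = (student - target) * d_rollout_d_eta(x0, eta, curvature, student_steps)
        eta -= meta_lr * meta_grad
    return eta


if __name__ == "__main__":
    print("meta-learned inner learning rate:", bootstrapped_meta_train())
```

With these defaults the learning rate moves from 0.1 towards `1 / curvature`, even though the student only ever unrolls a single step. The target does the "looking ahead" without any backpropagation through the extra steps.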

**‘N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting’**

*Authors: Oreshkin et al. (2020)* | 📝 Paper | 🤖 Code

**One Paragraph Summary:** Traditional time series forecasting models such as ARIMA come from the world of financial econometrics and rely on fitted moving averages for trend and seasonality components. They tend to have only a few parameters while maintaining clear interpretability. Recently, hybrid models that combine recurrent neural networks with differentiable forecasts have become more and more popular. This allows for flexible function fitting while maintaining the inductive biases of more classic approaches. But is it also possible to train competitive forecasters that are based on pure Deep Learning approaches? In N-BEATS, the authors introduce a new network architecture for univariate time series forecasting, which establishes a new state of the art on the M3, M4 and tourism benchmarks. The architecture consists of multiple stacks of residual blocks, which simultaneously perform both forecasting and backcasting. The partial forecasts of the individual stacks are combined into the final prediction for the considered time horizon. Furthermore, the basis of the individual block predictions can either be learned or fixed to a suitable and interpretable functional form. This can, for example, be low-degree polynomials to capture a trend or periodic functions for seasonal components. The authors combine their approach with ensembling techniques, merging models trained on different metrics, input windows and random initialisations. They additionally show that the performance gains saturate as more stacks are added, and show visually that the fixed-basis stack predictions are indeed interpretable.
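
The backcast/forecast mechanics described above can be sketched in a few lines of plain Python. This is an untrained, heavily simplified forward pass and not the authors' implementation: each block has a single hidden layer, uses a fixed polynomial (trend) basis, and all names and sizes (`make_trend_block`, `hidden`, `degree`, ...) are assumptions made for illustration.

```python
import random


def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def polynomial_basis(length, degree):
    """Rows are t**0, t**1, ..., t**degree evaluated on the grid t = i / length."""
    return [[(i / length) ** p for i in range(length)] for p in range(degree + 1)]


def expand(theta, basis):
    """Basis expansion: sum over p of theta[p] * basis[p]."""
    return [sum(t * row[i] for t, row in zip(theta, basis)) for i in range(len(basis[0]))]


def make_trend_block(rng, backcast_len, horizon, hidden, degree):
    """Randomly initialised (untrained) block; all sizes are illustrative."""
    def init(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    n_coef = degree + 1
    return {
        "w_in": init(hidden, backcast_len), "b_in": [0.0] * hidden,
        "w_back": init(n_coef, hidden), "b_back": [0.0] * n_coef,
        "w_fore": init(n_coef, hidden), "b_fore": [0.0] * n_coef,
        "basis_back": polynomial_basis(backcast_len, degree),
        "basis_fore": polynomial_basis(horizon, degree),
    }


def block_forward(block, x):
    """Map the block input to basis coefficients, then to a backcast and a forecast."""
    hidden = [max(0.0, h) for h in linear(block["w_in"], block["b_in"], x)]  # ReLU
    backcast = expand(linear(block["w_back"], block["b_back"], hidden), block["basis_back"])
    forecast = expand(linear(block["w_fore"], block["b_fore"], hidden), block["basis_fore"])
    return backcast, forecast


def nbeats_forward(blocks, x):
    """Residual stacking: subtract each backcast from the input, sum the partial forecasts."""
    residual = list(x)
    forecast = [0.0] * len(blocks[0]["basis_fore"][0])
    for block in blocks:
        backcast, partial = block_forward(block, residual)
        residual = [r - b for r, b in zip(residual, backcast)]  # remove what was explained
        forecast = [f + p for f, p in zip(forecast, partial)]   # accumulate partial forecasts
    return forecast, residual


if __name__ == "__main__":
    rng = random.Random(0)
    blocks = [make_trend_block(rng, backcast_len=8, horizon=4, hidden=16, degree=2)
              for _ in range(3)]
    forecast, residual = nbeats_forward(blocks, [0.1 * i for i in range(8)])
    print("forecast:", forecast)
```

Because every block here projects onto a degree-2 polynomial basis, the summed forecast is itself a degree-2 polynomial in time. That constraint is what makes a fixed-basis stack readable as a "trend" component.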

**‘Learning in High Dimension Always Amounts to Extrapolation’**

*Authors: Balestriero et al. (2021)* | 📝 Paper | 🗣 Podcast

**One Paragraph Summary:** Can neural networks (NNs) only learn to interpolate? Balestriero et al. argue that NNs have to extrapolate in order to solve high-dimensional tasks. Their reasoning relies on a simple definition of interpolation: it occurs whenever a datapoint falls into the convex hull of the observed training data. As the dimensionality of the raw input space grows linearly, the volume of this space grows at an exponential rate. We humans struggle to build geometric intuition beyond 3D spaces, but this phenomenon is commonly known as the curse of dimensionality. But what if the data lies on a lower-dimensional manifold? Is it then possible to circumvent the curse of dimensionality and to obtain interpolation with only a few samples? In a set of synthetic experiments, the authors show that what actually matters is not the dimension of the manifold itself but the so-called intrinsic dimension, i.e. the dimension of the smallest affine subspace containing the data manifold. They show that for common computer vision datasets, the probability of a test set sample being contained in the convex hull of the training set decreases rapidly as the number of considered input dimensions increases. The authors also highlight that this phenomenon is present for neural network embeddings and for different dimensionality reduction techniques. In all cases, the interpolation percentage decreases as more input dimensions are considered. So what can this tell us? In order for NNs to succeed at solving a task, they have to operate in the “extrapolation” regime! But not all of them generalise as well as others. So this opens up new questions about the relationship between this specific notion of extrapolation and generalisation more generally. What roles do data augmentation and regularization play, for example?
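
The convex-hull definition of interpolation used above can be written out explicitly. The notation below ($x_1, \dots, x_N$ for the training points in $\mathbb{R}^d$, $\lambda$ for the mixing weights) is my own and may differ from the paper's:

$$
x \in \mathrm{Hull}(x_1, \dots, x_N)
\iff
\exists\, \lambda \in \mathbb{R}^N :\quad
\lambda_i \ge 0, \qquad
\sum_{i=1}^{N} \lambda_i = 1, \qquad
\sum_{i=1}^{N} \lambda_i x_i = x .
$$

Checking whether a test point is "interpolated" is therefore a linear feasibility problem with $N$ unknowns and $d + 1$ equality constraints, which any linear programming solver can decide.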

**‘Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian’**

*Authors: Parker-Holder et al. (2020)* | 📝 Paper | 🗣 Talk | 🤖 Code

**One Paragraph Summary:** Modern deep learning problems often have to deal with many local optima. Gradient descent has been shown to be biased towards simple, high-curvature solutions. Classic examples of this problem include shape versus texture optima in computer vision or self-play policies that do not generalise to new players. Which local optimum the optimization procedure ends up in may depend on many arbitrary factors such as the initialisation, data ordering or details such as regularization. But what if, instead of trying to obtain a single optimum, we aim to simultaneously explore a diverse set of optima? The Ridge Rider algorithm aims to do so by iteratively following the eigenvectors of the Hessian with negative eigenvalues, the so-called ridges. The authors show that this procedure is locally loss-reducing as long as the eigenvectors vary smoothly along the trajectory. By following these different ridges, Ridge Rider is capable of covering many different local optima in the contexts of tabular RL and MNIST classification. The authors show that Ridge Rider can also help in discovering optimal zero-shot coordination policies without having access to the underlying problem symmetries. In summary, Ridge Rider turns a continuous optimization problem into a discrete search over the different ridges. It opens up a promising future direction for robust optimization. But there also remain many open questions with regard to the scalability of the method, including efficient eigendecomposition and simultaneous exploration of multiple eigenvectors.
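
As a rough illustration of "following ridges", here is a minimal sketch on a two-dimensional toy loss that I made up for this purpose. It is not the authors' algorithm: the loss, the `COUPLING` constant, the closed-form 2x2 eigendecomposition and the final gradient descent polish are all assumptions, and a real network would need an approximate eigendecomposition, which this sketch sidesteps.

```python
import math

COUPLING = 6.0  # hypothetical constant; makes the four axis points local minima


def gradient(x, y):
    """Gradient of the toy loss (x^2 - 1)^2 + 2 (y^2 - 1)^2 + COUPLING * x^2 * y^2."""
    return (4 * x * (x * x - 1) + 2 * COUPLING * x * y * y,
            8 * y * (y * y - 1) + 2 * COUPLING * x * x * y)


def hessian(x, y):
    """Entries (h_xx, h_xy, h_yy) of the symmetric Hessian of the toy loss."""
    return (12 * x * x - 4 + 2 * COUPLING * y * y,
            4 * COUPLING * x * y,
            24 * y * y - 8 + 2 * COUPLING * x * x)


def eigenpairs_2x2(a, b, c):
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, c]]."""
    if b == 0.0:
        return [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
    mean, radius = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    pairs = []
    for eigval in (mean - radius, mean + radius):
        vx, vy = b, eigval - a
        norm = math.hypot(vx, vy)
        pairs.append((eigval, (vx / norm, vy / norm)))
    return pairs


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def ride(point, direction, step=0.05, max_steps=100):
    """Follow the eigenvector closest to `direction` while its eigenvalue stays negative."""
    x, y = point
    for _ in range(max_steps):
        eigval, vec = max(eigenpairs_2x2(*hessian(x, y)),
                          key=lambda pair: abs(dot(pair[1], direction)))
        if eigval >= 0.0:  # the ridge has ended
            break
        sign = 1.0 if dot(vec, direction) >= 0.0 else -1.0  # keep a consistent orientation
        direction = (sign * vec[0], sign * vec[1])
        x, y = x + step * direction[0], y + step * direction[1]
    return x, y


def gradient_descent(point, lr=0.02, steps=500):
    x, y = point
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


def ridge_rider(start):
    """Branch along +/- every negative-curvature eigenvector at `start`, then polish with GD."""
    optima = []
    for eigval, vec in eigenpairs_2x2(*hessian(*start)):
        if eigval >= 0.0:
            continue
        for sign in (1.0, -1.0):
            end = ride(start, (sign * vec[0], sign * vec[1]))
            optima.append(gradient_descent(end))
    return optima


if __name__ == "__main__":
    for optimum in ridge_rider((0.0, 0.0)):
        print(optimum)
```

Plain gradient descent started at the origin would not move at all, since the gradient there is zero. Branching over the two ridges and their two orientations instead turns the problem into a small discrete search that reaches four different optima.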

This is it for this month 🤗 Let me know what your favourite papers have been. If you want to get some weekly ML collage input, check out the Twitter hashtag #mlcollage and you can also have a look at more collages in the last summary 📖 blog post: