Papers I Read: Notes and Summaries

Modular meta-learning


  • The paper proposes an approach for learning neural networks (modules) that can be combined in different ways to solve different tasks (combinatorial generalization).

  • The proposed approach is called BOUNCEGRAD.

  • Link to the paper

  • Link to the code


Setup

  • Focuses on supervised learning.

  • Task distribution p(T).

  • Each task T is a joint distribution p_T(x, y) over (x, y) data pairs.

  • Given data from m meta-training tasks and a meta-test task, find a hypothesis h that performs well on unseen data drawn from the meta-test task.
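
To make the setup concrete, here is a minimal sketch in Python. The task representation, function names, and the toy linear tasks are our own illustration, not from the paper.

```python
import numpy as np

# Illustrative sketch of the meta-learning setup; all names and the toy
# linear tasks are our own, not the paper's.
rng = np.random.default_rng(0)

def sample_task():
    """One draw T ~ p(T). Each task is a joint distribution p_T(x, y);
    random linear functions serve purely as a stand-in."""
    w, b = rng.normal(size=2)
    def sample_pairs(n):
        x = rng.uniform(-1.0, 1.0, size=n)
        y = w * x + b + 0.01 * rng.normal(size=n)
        return x, y
    return sample_pairs

meta_train_tasks = [sample_task() for _ in range(10)]  # m = 10 meta-training tasks
meta_test_task = sample_task()  # hypothesis h is judged on fresh draws from this task
```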

Structured Hypothesis

  • Given a compositional scheme C, a set of modules F_1, …, F_k (collectively denoted F), and their respective parameters θ_1, …, θ_k (collectively denoted θ), the triple (C, F, θ) defines the set of possible functional input-output mappings. These mappings form the hypothesis space.

  • A structured hypothesis model specifies which modules to use and their parametric forms, but not their parameter values.

Examples of compositional schemes

  • Choosing a single module for the task at hand.

  • Fixed compositional structure but different modules selected every time.

  • Weighted ensemble (possibly using an attention mechanism)

  • General function composition tree
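
As a concrete illustration, here is a hedged PyTorch sketch of two of these schemes (single-module choice and a weighted ensemble). The MLP architecture, dimensions, and all names are our own choices, not the paper's.

```python
import torch
import torch.nn as nn

# Sketch of two compositional schemes over a shared module library; the MLP
# architecture and dimensions are arbitrary choices for illustration.
k, d_in, d_hidden, d_out = 4, 1, 32, 1
modules = nn.ModuleList(
    nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
    for _ in range(k)
)  # F_1, ..., F_k with parameters theta_1, ..., theta_k

def single_module(x, i):
    """Scheme 1: the 'structure' is just an index choosing one module."""
    return modules[i](x)

def weighted_ensemble(x, logits):
    """Scheme 3: the 'structure' is a set of mixture weights over modules;
    an attention mechanism would make these weights input-dependent."""
    w = torch.softmax(logits, dim=0)                    # one weight per module
    outs = torch.stack([m(x) for m in modules], dim=0)  # shape (k, batch, d_out)
    return (w.view(k, 1, 1) * outs).sum(dim=0)

x = torch.randn(8, d_in)
y_single = single_module(x, i=2)
y_mix = weighted_ensemble(x, logits=torch.zeros(k))
```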


Learning

  • Offline Meta-Learning Phase:

    • Take the training and validation datasets for the meta-training tasks and learn a parameterization for each of the k modules, θ_1, …, θ_k.

    • The hypothesis (or composition) to use comes from the online meta-test learning phase.

    • In this stage, find the best θ given a structure.

  • Online Meta-Test Learning Phase:

    • Given the hypothesis space and θ, the output is a compositional form (or hypothesis) that specifies how to compose the modules.

    • In this stage, find the best structure given the hypothesis space and θ.
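
A minimal sketch of what this structure search could look like, assuming a structure is a flat list assigning one of k modules to each slot of a fixed composition. The helper names and this encoding are our assumptions, not the paper's implementation.

```python
import math
import random

k = 4  # size of the module library, matching the sketch above

def propose(structure):
    """Local move: reassign one slot of the composition to a random module."""
    new = list(structure)
    new[random.randrange(len(new))] = random.randrange(k)
    return new

def anneal_step(structure, T, loss):
    """One Metropolis accept/reject step at temperature T: downhill moves are
    always taken, uphill moves with probability exp(-delta / T)."""
    candidate = propose(structure)
    delta = loss(candidate) - loss(structure)
    if delta < 0 or random.random() < math.exp(-delta / T):
        return candidate
    return structure
```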

Learning Algorithm

  • During the meta-test learning phase, simulated annealing is used to find the optimal structure, with the temperature T decreased over time.

  • During the meta-learning phase, the actual objective function is replaced by a smooth surrogate objective (during the search step) to avoid local minima.

  • Once a structure has been picked, any gradient descent based approach can be used to optimize the modules.

  • The state of the optimization process comprises the parameters θ and the temperature T. Together, they induce a distribution over structures. Given a structure, θ is optimized by gradient descent and T is annealed over time.

  • The learning procedure can be further improved by performing parameter tuning during the online (meta-test learning) phase as well. The resulting approach is referred to as MOMA (MOdular MAML).
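
Putting the pieces together, a compressed, assumption-laden sketch of the offline loop: alternate the annealing step from the sketch above ("bounce") with a gradient step on the shared parameters θ ("grad"). The batching interface, annealing schedule, and hyperparameters are invented for illustration.

```python
import torch

def bouncegrad(modules, task_batches, structures, loss_fn, steps, T0=1.0):
    """task_batches[i]() yields an (x, y) minibatch for task i; loss_fn(S, x, y)
    returns the differentiable loss of structure S under the current theta.
    Both interfaces are our own, as is the 1/(1 + step) annealing schedule."""
    opt = torch.optim.Adam(modules.parameters(), lr=1e-3)
    for step in range(steps):
        T = T0 / (1 + step)            # anneal the temperature toward 0
        i = step % len(task_batches)   # cycle over the meta-training tasks
        x, y = task_batches[i]()
        # Search step ("bounce"): simulated annealing on this task's structure,
        # reusing anneal_step from the sketch above.
        structures[i] = anneal_step(structures[i], T, lambda S: float(loss_fn(S, x, y)))
        # Gradient step ("grad"): optimize theta under the chosen structure.
        opt.zero_grad()
        loss_fn(structures[i], x, y).backward()
        opt.step()
    return structures

# MOMA would additionally take a few MAML-style gradient steps on theta at
# meta-test time, on top of the structure search.
```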



Experiments

The following models are compared:

  • Pooled - A single network trained on the combined data of all tasks.

  • MAML - A single network trained with MAML.

  • BOUNCEGRAD - The modular network, without MAML adaptation during online learning.

  • MOMA - BOUNCEGRAD with MAML adaptation during online learning.


Simple Functional Relationships

  • Sine-function prediction problem

  • In general, MOMA outperforms other models.

  • With a small amount of online training data, BOUNCEGRAD outperforms other models as it has a better structural prior.
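
For reference, a sketch of a sine-task generator. The amplitude and phase ranges follow MAML's sine benchmark and are assumptions; the paper's exact settings are not recorded in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sine_task():
    """One regression task: predict y = A * sin(x + phi) from x.
    The ranges below mirror MAML's sine benchmark, not necessarily the paper's."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    def sample_pairs(n):
        x = rng.uniform(-5.0, 5.0, size=n)
        return x, amplitude * np.sin(x + phase)
    return sample_pairs
```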

Predicting the results of pushing objects (real-robot push data)

  • 11 different objects (with different shapes) on 4 surfaces with different friction properties.

  • 2 meta-learning scenarios are considered: in the first, the object-surface combination in the test task was present in some meta-training task; in the second, it was not.

  • For previously seen combinations, MOMA performs the best, followed by BOUNCEGRAD and MAML.

  • For unseen combinations, all three models perform equally well.

  • The compositional scheme used here is the attention-based weighted ensemble.

  • An interesting result is that the modules seem to specialize (and activate more often) based on the shape of the object.

Predicting the next frame of a kinematic skeleton (using motion capture data)

  • Composition structure - kinematic subtrees are generated for each body part (2 legs, 2 arms, 2 parts of the torso).

  • Again, 2 setups are used - one where all activities in the meta-training and meta-test tasks are shared, and another where the activities are not shared.

  • For known activities, MOMA and BOUNCEGRAD perform the best, while for unknown activities, MOMA performs the best.


Notes

  • While the approach is interesting, evaluating on a set of tasks more suitable from the point of view of composition would make the results more convincing.

  • It would be useful to see the computational tradeoff between MAML, BOUNCEGRAD, and MOMA.