Papers I Read Notes and Summaries

Efficient Lifelong Learning with A-GEM


  • A new (and more realistic) evaluation protocol for lifelong learning in which each data point is observed just once and disjoint sets of tasks are used for training and validation.

  • A new metric that focuses on the efficiency of the models - in terms of sample complexity and computational (and memory) costs.

  • A modification of Gradient Episodic Memory (GEM) that reduces the computational overhead of GEM without compromising on the results.

  • Empirical validation that using task descriptors helps lifelong learning models and improves their few-shot learning capabilities.

  • Link to the paper

  • Link to the code

Learning Protocol

  • Two groups of datasets - one for training and evaluation ($D^{EV}$) and the other for cross-validation ($D^{CV}$).

  • Data can be sampled multiple times from the cross-validation datasets but only once from the training datasets.

  • Each group of datasets (say $D^{EV}$ or $D^{CV}$) is a list of task-specific datasets $D_k$ (k is the task index).

  • Each sample in $D_k$ is of the form (x, t, y) where x is the data, t is the task descriptor and y is the output.

  • $D_k$ contains $B_k$ minibatches of data.
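
The protocol above can be sketched as a small data layout plus a single-pass iterator. This is only an illustration: the toy data, the integer task descriptors and the names `make_toy_task` and `single_pass_stream` are assumptions, not something taken from the paper's code.

```python
import random
from typing import Iterator, List, Tuple

# One sample is (x, t, y): input, task descriptor, target.
# Using an int as the task descriptor is a simplification for this sketch.
Sample = Tuple[List[float], int, int]
Minibatch = List[Sample]
TaskDataset = List[Minibatch]  # D_k: a list of B_k minibatches


def make_toy_task(task_id: int, num_batches: int, batch_size: int,
                  rng: random.Random) -> TaskDataset:
    """Build a hypothetical task with random 2-d inputs and binary labels."""
    return [
        [([rng.random(), rng.random()], task_id, rng.randint(0, 1))
         for _ in range(batch_size)]
        for _ in range(num_batches)
    ]


def single_pass_stream(tasks: List[TaskDataset]) -> Iterator[Tuple[int, int, Minibatch]]:
    """Yield (k, i, minibatch) exactly once, task by task (both 1-indexed)."""
    for k, task in enumerate(tasks, start=1):
        for i, minibatch in enumerate(task, start=1):
            yield k, i, minibatch


rng = random.Random(0)
# Cross-validation tasks: may be revisited many times while tuning.
d_cv = [make_toy_task(t, num_batches=2, batch_size=4, rng=rng) for t in range(3)]
# Training/evaluation tasks: disjoint from d_cv and streamed only once.
d_ev = [make_toy_task(t, num_batches=2, batch_size=4, rng=rng) for t in range(3, 8)]

seen = [(k, i) for k, i, _ in single_pass_stream(d_ev)]
```

Here `seen` records every (task, minibatch) pair in the order a learner would encounter it, and each pair appears exactly once.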



Average Accuracy

  • $a_{k,i,j}$ = accuracy on test task j after training on the ith minibatch of training task k.

  • $A_k = \frac{1}{k} \sum_{j=1}^{k} a_{k,B_k,j}$ ie train the model on all the data for task k and then test it on all the tasks seen so far.
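
A minimal sketch of the average accuracy computation, assuming the accuracies are stored in a dictionary keyed by (k, i, j). The storage layout, the function name and the numbers are made up for illustration.

```python
from typing import Dict, Tuple

# (k, i, j) -> accuracy on task j after the ith minibatch of task k
Acc = Dict[Tuple[int, int, int], float]


def average_accuracy(a: Acc, num_batches: Dict[int, int], k: int) -> float:
    """A_k: mean accuracy on tasks 1..k after the last minibatch of task k."""
    last = num_batches[k]  # B_k
    return sum(a[(k, last, j)] for j in range(1, k + 1)) / k


# Hypothetical accuracies: two tasks with two minibatches each.
a = {
    (1, 2, 1): 0.90,
    (2, 2, 1): 0.70,
    (2, 2, 2): 0.80,
}
num_batches = {1: 2, 2: 2}
a_2 = average_accuracy(a, num_batches, k=2)  # mean of 0.70 and 0.80
```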

Forgetting Measure

  • $f_j^k$ = forgetting on task j after training on all the minibatches up to task k.

  • $f_j^k = \max_{l \in \{1, \ldots, k-1\}} a_{l,B_l,j} - a_{k,B_k,j}$

  • Forgetting = $F_k = \frac{1}{k-1} \sum_{j=1}^{k-1} f_j^k$
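
The forgetting measure can be computed in the same way. This sketch again assumes a dictionary keyed by (k, i, j); the function name and the numbers are hypothetical.

```python
from typing import Dict, Tuple

# (k, i, j) -> accuracy on task j after the ith minibatch of task k
Acc = Dict[Tuple[int, int, int], float]


def forgetting(a: Acc, num_batches: Dict[int, int], k: int) -> float:
    """F_k: mean over j < k of (best earlier accuracy on j) minus (accuracy on j after task k)."""
    if k < 2:
        raise ValueError("forgetting is only defined once at least two tasks are learned")
    total = 0.0
    for j in range(1, k):
        # Best end-of-task accuracy on task j over all earlier tasks l = 1..k-1.
        best_past = max(a[(l, num_batches[l], j)] for l in range(1, k))
        total += best_past - a[(k, num_batches[k], j)]
    return total / (k - 1)


# Hypothetical accuracies: three tasks with one minibatch each.
# Note that (1, 1, 2) is the accuracy on task 2 before it has been trained on.
a = {
    (1, 1, 1): 0.90, (1, 1, 2): 0.10,
    (2, 1, 1): 0.80, (2, 1, 2): 0.95,
    (3, 1, 1): 0.70, (3, 1, 2): 0.85,
}
num_batches = {1: 1, 2: 1, 3: 1}
f_3 = forgetting(a, num_batches, k=3)  # mean of (0.90 - 0.70) and (0.95 - 0.85)
```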

LCA - Learning Curve Area

  • $Z_b$ = average b-shot performance where b is the minibatch number.

  • $Z_b = \frac{1}{T} \sum_{k=1}^{T} a_{k,b,k}$ where T is the number of tasks.

  • $LCA_\beta = \frac{1}{\beta+1} \sum_{b=0}^{\beta} Z_b$

  • One special case is $LCA_0$, which is the forward transfer performance ie the performance on an unseen task.

  • In experiments, β is kept small as we want the model to learn from few examples.
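
A small sketch of the LCA computation under the same assumed (k, i, j) dictionary layout. The entry with i = 0 stands for the accuracy on task k before any of its minibatches has been seen; the function names and numbers are invented for the example.

```python
from typing import Dict, Tuple

# (k, i, j) -> accuracy on task j after the ith minibatch of task k
Acc = Dict[Tuple[int, int, int], float]


def z_b(a: Acc, num_tasks: int, b: int) -> float:
    """Z_b: accuracy on task k after b minibatches of task k, averaged over tasks."""
    return sum(a[(k, b, k)] for k in range(1, num_tasks + 1)) / num_tasks


def lca(a: Acc, num_tasks: int, beta: int) -> float:
    """LCA_beta: mean of Z_0 .. Z_beta, ie the area under the early learning curve."""
    return sum(z_b(a, num_tasks, b) for b in range(beta + 1)) / (beta + 1)


# Hypothetical accuracies for two tasks over the first two minibatches.
a = {
    (1, 0, 1): 0.10, (1, 1, 1): 0.50, (1, 2, 1): 0.70,
    (2, 0, 2): 0.30, (2, 1, 2): 0.60, (2, 2, 2): 0.80,
}
lca_0 = lca(a, num_tasks=2, beta=0)  # zero-shot (forward transfer) performance
lca_2 = lca(a, num_tasks=2, beta=2)
```

A model that learns faster from the first few minibatches gets a higher LCA even if its final accuracy is the same.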


Model

  • GEM has been shown to be very effective in the single-epoch setting but introduces a very high computational overhead.

  • Averaged GEM (AGEM) reduces this overhead by sampling (and using) only some examples from the episodic memory instead of using all the examples.

  • While GEM provides better guarantees in terms of worst-case forgetting, AGEM provides better guarantees in terms of average accuracy.
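
These notes do not spell out the update rule. As described in the paper, AGEM computes a reference gradient on a batch sampled from the episodic memory and, if the current gradient conflicts with it (negative dot product), projects the current gradient so that the conflict is removed. A minimal sketch on plain Python lists follows; the function names are mine and a real implementation would work on flattened model gradients.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def agem_project(grad: Sequence[float], grad_ref: Sequence[float]) -> List[float]:
    """Project `grad` so that it no longer conflicts with `grad_ref`.

    grad:     gradient on the current minibatch
    grad_ref: gradient on a batch sampled from the episodic memory
    """
    g_dot_ref = dot(grad, grad_ref)
    if g_dot_ref >= 0.0:
        # No conflict: the step should not increase the average memory loss.
        return list(grad)
    # A negative dot product implies grad_ref is non-zero, so the division is safe.
    scale = g_dot_ref / dot(grad_ref, grad_ref)
    return [g - scale * r for g, r in zip(grad, grad_ref)]


conflicting = agem_project([1.0, -2.0], [0.0, 1.0])  # component along grad_ref is removed
agreeing = agem_project([1.0, 2.0], [0.0, 1.0])      # returned unchanged
```

Only one dot product and one scaled subtraction are needed per step, which is where the savings over GEM's per-task constraints come from.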

Joint Embedding Model Using Compositional Task Descriptors

  • Compositional task descriptors are used to speed up training on the subsequent tasks.

  • A matrix specifying the attribute values of the objects (to be recognized in the task) is used as the task descriptor.

  • A joint-embedding space between image features and attribute embeddings is learned.
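
One way to picture the joint embedding is sketched below: each class in the task is embedded from its row of the attribute matrix, and classes are scored by the dot product with the image feature. The linear attribute embedding, the shapes and all the names are assumptions made for illustration, not the exact architecture from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def embed_classes(attributes: Matrix, attr_embed: Matrix) -> Matrix:
    """Map a (classes x attributes) descriptor to (classes x dim) class embeddings."""
    dim = len(attr_embed[0])
    return [
        [sum(row[a] * attr_embed[a][d] for a in range(len(row))) for d in range(dim)]
        for row in attributes
    ]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(image_feat: List[float], attributes: Matrix,
                        attr_embed: Matrix) -> List[float]:
    """Score each class by the dot product of image feature and class embedding."""
    class_emb = embed_classes(attributes, attr_embed)
    scores = [sum(f * e for f, e in zip(image_feat, emb)) for emb in class_emb]
    return softmax(scores)


# Hypothetical task: two classes described by two binary attributes.
attributes = [[1.0, 0.0], [0.0, 1.0]]
attr_embed = [[1.0, 0.0], [0.0, 1.0]]  # would be learned; identity used here
probs = class_probabilities([2.0, 0.5], attributes, attr_embed)
```

Because a class is described by attributes shared across tasks, a new task can reuse the learned attribute embeddings instead of starting from scratch.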





Results

  • AGEM outperforms other models on all the datasets except MNIST, where Progressive Neural Networks lead. One reason could be that MNIST has a large number of training examples per task. However, Progressive Neural Networks add new capacity for every task and so utilize capacity poorly.

  • While AGEM and GEM have similar performance, GEM has a much higher computational and memory overhead.

  • Use of task descriptors improves the accuracy for all the models.

  • Overall, AGEM seems to offer a good tradeoff between average accuracy and efficiency - in terms of sample efficiency, memory requirements and computational costs.