Papers I Read Notes and Summaries

Smooth Loss Functions for Deep Top-k Classification


  • For top-k classification tasks, cross entropy is widely used as the learning objective even though it is the optimal metric only in the limit of infinite data.

  • The paper introduces a family of smoothed loss functions that are specially designed for top-k optimization.

  • Paper

  • Code


  • Inspired by the multi-loss SVMs, a surrogate loss (lk) is introduced that creates a margin between the ground truth and the kth largest score.

Equation 1

  • Here s denotes the output of the classifier model to be learnt, y is the ground truth label, s[p] denotes the kth largest element of s and s\p denotes the vector s without pth element.

  • This lk loss has two limitations:

    • It is continous but not differentiable in s.

    • Its weak derivatives have at most 2-nonzero elements.

  • The loss can be reformulated by adding and subtracting the k-1 largest scores of s\y and sy and by introducing a temperature parameter τ.

Equation 2

Properties of L

  • For any τ > 0, L is infinite-differentiable and has non-sparse gradients.

  • Under mild conditions, L apporachs lk (in a pointwise sense) as τ approaches to 0++.

  • It is an upper bound on the actual loss (up to a constant factor).

  • It is a generalization of the cross-entropy loss for different values of k, and τ and higher margins.

Computational Challenges

  • nCk number of terms needs to be evaluated for computing the loss for one sample (n is number of classes).

  • Loss L can be expressed in terms of elementary symmetric polynomials σi(e) (sum of all products of i distinct elements of vector e). Thus the challenge is to compute σk efficiently.

Forward Computation

  • Compute σk(e) where e is a n-dimensional vector and k« n and e[i]!=0 for all i.

  • σi(e) can be computed using the coefficients of the polynomial (X+e1)(X+e2)…(X+en) by divide and conquer approach with polynomial multiplication.

  • With some more optimizations (eg log(n) levels of recursion and each level being parallelized on a GPU), the resulting algorithms scale well with n on a GPU.

  • Operations are performed in the log-space using the log-sum-exp trick to achieve numerical stability in single floating point precision.

Backward computation

  • The backward pass uses optimizations like computing derivative of σj with respect to ei in a recursive manner.

  • Appendix of the paper describes these techniques in detail.


  • Experiments are performed on CIFAR-100 (with noise) and Imagenet.

  • For CIFAR-100 with noise, the labels are randomized with probability p (within the same top-level class).

  • The proposed loss function is very robust to both noise and reduction in the amount of training dataset as compared to cross-entropy loss function for both top-k and top-1 performance.