Smooth Loss Functions for Deep Top-k Classification
25 Dec 2018

Introduction

For top-k classification tasks, cross-entropy is widely used as the learning objective, even though it is the optimal choice only in the limit of infinite data.

The paper introduces a family of smoothed loss functions that are specially designed for top-k optimization.
Idea
Inspired by multi-class SVM losses, a surrogate loss l_{k} is introduced that enforces a margin between the ground-truth score and the k-th largest score.

Here s denotes the output of the classifier model to be learnt, y is the ground-truth label, s[p] denotes the p-th largest element of s, and s\p denotes the vector s without its p-th element.
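As a concrete illustration, here is a minimal NumPy sketch of a top-k hinge surrogate in the spirit of l_{k}: the loss is zero when the ground-truth score beats the k-th largest competing score by a unit margin. The exact form and normalization in the paper may differ; this is illustrative only.

```python
import numpy as np

def topk_hinge(s, y, k=5):
    """Top-k hinge surrogate: penalize unless the ground-truth score s[y]
    exceeds the k-th largest of the remaining scores by a margin of 1.
    (A sketch in the spirit of l_k; the paper's exact form may differ.)"""
    s_y = s[y]
    s_rest = np.delete(s, y)            # s \ y
    kth_largest = np.sort(s_rest)[-k]   # k-th largest element of s \ y
    return max(0.0, 1.0 + kth_largest - s_y)

scores = np.array([2.0, 0.5, 1.8, 1.9, 0.1])
print(topk_hinge(scores, y=0, k=3))  # 0.0: ground truth safely in the top-3
```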

This l_{k} loss has two limitations:

It is continuous but not differentiable in s.

Its weak derivatives have at most two nonzero elements.


The loss can be reformulated by adding and subtracting the k−1 largest scores of s\y and s_{y}, and by introducing a temperature parameter τ.
Properties of L_{kτ}

For any τ > 0, L_{kτ} is infinitely differentiable and has non-sparse gradients.

Under mild conditions, L_{kτ} approaches l_{k} (in a pointwise sense) as τ approaches 0^{+}.

It is an upper bound on the actual loss (up to a constant factor).

It generalizes the cross-entropy loss: cross-entropy is recovered for particular values of k and τ, and higher margins can be incorporated.
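One way to see the smoothing is that each maximum in the reformulated l_{k} is replaced by a τ-scaled log-sum-exp over k-element subsets of the classes. The following naive enumeration (feasible only for tiny n) is a sketch under that assumption; the paper's exact normalization may differ.

```python
import itertools
import math
import numpy as np

def smooth_max(vals, tau):
    # tau-scaled log-sum-exp: a smooth upper bound on max(vals)
    vals = np.asarray(vals)
    m = vals.max()
    return m + tau * math.log(np.exp((vals - m) / tau).sum())

def smooth_topk_loss(s, y, k=2, tau=1.0):
    """Naive O(C(n, k)) sketch of L_{k,tau}: the two maxima in the
    reformulated l_k become smooth maxima over k-subsets. A subset not
    containing the ground truth y pays a unit margin. Illustrative only."""
    all_terms, y_terms = [], []
    for A in itertools.combinations(range(len(s)), k):
        avg = sum(s[j] for j in A) / k
        margin = 0.0 if y in A else 1.0
        all_terms.append(avg + margin)
        if y in A:
            y_terms.append(avg)
    return smooth_max(all_terms, tau) - smooth_max(y_terms, tau)
```

As τ → 0⁺ the two smooth maxima collapse to hard maxima, recovering the non-smooth surrogate, which matches the pointwise-convergence property above.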
Computational Challenges

C(n, k) terms need to be evaluated to compute the loss for one sample (n is the number of classes).

The loss L_{kτ} can be expressed in terms of elementary symmetric polynomials σ_{i}(e) (the sum of all products of i distinct elements of a vector e). Thus the challenge is to compute σ_{k} efficiently.
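To make the combinatorial blow-up concrete, the number of subsets grows rapidly even for modest k:

```python
import math

# Naive evaluation of the loss enumerates C(n, k) subsets, which is
# intractable at e.g. ImageNet scale (n = 1000 classes, k = 5):
print(math.comb(1000, 5))  # 8250291250200, i.e. ~8.25e12 terms per sample
```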
Forward Computation

Compute σ_{k}(e), where e is an n-dimensional vector, k ≪ n, and e_{i} ≠ 0 for all i.

σ_{i}(e) can be read off from the coefficients of the polynomial (X+e_{1})(X+e_{2})…(X+e_{n}), computed by a divide-and-conquer approach with polynomial multiplication.
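A minimal sketch of this divide-and-conquer computation, using `np.convolve` for polynomial multiplication (the paper's GPU implementation is far more optimized, but the structure is the same):

```python
import numpy as np

def esp_coeffs(e):
    """All elementary symmetric polynomials sigma_0..sigma_n of vector e,
    obtained as the coefficients of prod_i (X + e_i), built bottom-up by
    divide-and-conquer polynomial multiplication (log n levels)."""
    polys = [np.array([1.0, ei]) for ei in e]   # linear factors X + e_i
    while len(polys) > 1:
        nxt = []
        for i in range(0, len(polys) - 1, 2):
            nxt.append(np.convolve(polys[i], polys[i + 1]))
        if len(polys) % 2:                       # carry odd one out
            nxt.append(polys[-1])
        polys = nxt
    return polys[0]                              # index j holds sigma_j(e)

print(esp_coeffs([1.0, 2.0, 3.0]))  # [1. 6. 11. 6.]: sigma_1=6, sigma_2=11, sigma_3=6
```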

With further optimizations (e.g., only log(n) levels of recursion, each of which can be parallelized), the resulting algorithm scales well with n on a GPU.

Operations are performed in log-space using the log-sum-exp trick to achieve numerical stability in single-precision floating point.
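The log-space version replaces each coefficient sum in the polynomial multiplication by a log-sum-exp. A simple (quadratic, non-vectorized) sketch, assuming all coefficients are positive so their logs are real:

```python
import numpy as np

def log_convolve(la, lb):
    """Multiply two polynomials whose nonnegative coefficients are given
    in log-space: lc[k] = logsumexp over i+j=k of (la[i] + lb[j]).
    Keeps the sigma_k computation stable in single precision when the
    entries e_i span many orders of magnitude."""
    lc = np.full(len(la) + len(lb) - 1, -np.inf)
    for i, a in enumerate(la):
        for j, b in enumerate(lb):
            m = max(lc[i + j], a + b)  # shift by the max before exponentiating
            lc[i + j] = m + np.log(np.exp(lc[i + j] - m) + np.exp(a + b - m))
    return lc

# (X + 1)(X + 2) = X^2 + 3X + 2, computed entirely in log-space:
print(np.exp(log_convolve(np.log([1.0, 1.0]), np.log([1.0, 2.0]))))  # [1. 3. 2.]
```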
Backward Computation

The backward pass uses optimizations such as computing the derivative of σ_{j} with respect to e_{i} recursively.
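The key identity is dσ_{k}/de_{i} = σ_{k−1}(e\i), and σ_{j}(e\i) can be peeled out of the already-computed σ_{j}(e) via σ_{j}(e\i) = σ_{j}(e) − e_{i}·σ_{j−1}(e\i). A sketch of that recursion, vectorized over i (the paper's appendix uses a numerically safer variant):

```python
import numpy as np

def esp_grad(e, sigma, k):
    """d sigma_k / d e_i = sigma_{k-1}(e \\ i), computed for all i at once
    with the recursion sigma_j(e \\ i) = sigma_j(e) - e_i * sigma_{j-1}(e \\ i),
    starting from sigma_0(e \\ i) = 1. Costs O(nk) instead of recomputing
    the polynomial product n times."""
    e = np.asarray(e, dtype=float)
    d = np.ones_like(e)               # sigma_0(e \ i) = 1 for every i
    for j in range(1, k):
        d = sigma[j] - e * d          # peel e_i out of sigma_j(e)
    return d                          # = sigma_{k-1}(e \ i)

e = [1.0, 2.0, 3.0]
sigma = np.array([1.0, 6.0, 11.0, 6.0])  # coefficients of (X+1)(X+2)(X+3)
print(esp_grad(e, sigma, k=2))  # [5. 4. 3.] = d sigma_2 / d e_i
```

For example, dσ_{2}/de_{1} = e_{2} + e_{3} = 5, which the recursion recovers as σ_{1}(e) − e_{1} = 6 − 1.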

The appendix of the paper describes these techniques in detail.
Experiments

Experiments are performed on CIFAR-100 (with label noise) and ImageNet.

For CIFAR-100 with noise, each label is randomized with probability p, replaced by a label from the same top-level (coarse) class.
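This noise protocol might be sketched as follows, assuming access to the mapping from CIFAR-100 fine labels to their 20 coarse classes (the `coarse_of` and `fine_in_coarse` mappings here are hypothetical inputs, and the paper's exact protocol may differ in details such as whether the original label can be re-drawn):

```python
import random

def randomize_label(fine_label, coarse_of, fine_in_coarse, p, rng=random):
    """With probability p, replace a fine label by a uniformly random fine
    label from the same coarse (top-level) class.
    coarse_of: dict fine -> coarse; fine_in_coarse: dict coarse -> list of fines.
    Illustrative sketch of the noise protocol described above."""
    if rng.random() < p:
        return rng.choice(fine_in_coarse[coarse_of[fine_label]])
    return fine_label
```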

Compared to the cross-entropy loss, the proposed loss function is markedly more robust both to label noise and to reductions in the amount of training data, for both top-k and top-1 performance.