Papers I Read: Notes and Summaries

Net2Net: Accelerating Learning via Knowledge Transfer


  • The paper presents a simple yet effective approach for transferring knowledge from a trained neural network (referred to as the teacher network) to a larger, untrained neural network (referred to as the student network).

  • The key idea is a function-preserving transformation that guarantees that, for any given input, the teacher network and the newly created student network produce the same output.

  • Link to the paper

  • Link to an implementation

  • The approach works as follows: say the teacher network is represented by the transformation y = f(x, θ), where θ refers to the parameters of the network. The task is to choose a new set of parameters θ' for the student network g(x, θ') such that for all x, f(x, θ) = g(x, θ').

  • To start, we can assume that f and g are composed of standard linear layers. Layers i and i+1 are represented by the weight matrices W^(i) of shape m×n and W^(i+1) of shape n×p.

  • We want to grow layer i to have q output units (where q > n) and layer i+1 to have q input units. The new weight matrices are U^(i) of shape m×q and U^(i+1) of shape q×p.

  • The first n columns of W^(i) (rows of W^(i+1)) are copied as-is into U^(i) (U^(i+1)).

  • To fill the remaining q − n slots, columns of W^(i) (rows of W^(i+1)) are sampled randomly, with replacement.

  • Finally, each row of U^(i+1) is divided by the replication factor of its corresponding unit, so that the output of the function is unchanged by the operation (see the sketch after this bullet).
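A minimal NumPy sketch of this widening construction for fully connected layers, assuming a ReLU hidden layer; the function name net2wider, the explicit mapping vector g, and the weight shapes are illustrative choices, not notation from the paper:

```python
import numpy as np

def net2wider(W1, b1, W2, q, rng=None):
    """Widen a hidden layer from n to q units (q > n), preserving the
    network's function. W1: (m, n), b1: (n,), W2: (n, p)."""
    rng = np.random.default_rng() if rng is None else rng
    n = W1.shape[1]
    # Random mapping g: the first n new units copy the originals;
    # the remaining q - n units replicate randomly chosen originals.
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=q - n)])
    U1, c1 = W1[:, g], b1[g]              # replicate columns of W1 and biases
    counts = np.bincount(g, minlength=n)  # replication factor per unit
    # Divide the rows of W2 leaving each replicated unit so that their
    # summed contribution equals the original unit's contribution.
    U2 = W2[g, :] / counts[g][:, None]
    return U1, c1, U2

# Function-preservation check with a ReLU hidden layer.
rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(5, 3)), rng.normal(size=3), rng.normal(size=(3, 4))
x = rng.normal(size=(2, 5))
U1, c1, U2 = net2wider(W1, b1, W2, q=6, rng=rng)
assert np.allclose(np.maximum(x @ W1 + b1, 0) @ W2,
                   np.maximum(x @ U1 + c1, 0) @ U2)
```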

  • Since a convolution can be seen as multiplication by a doubly block circulant matrix, the approach can be readily extended to convolutional networks (sketched below).
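The same replicate-and-rescale construction written for convolution weights; the (out_channels, in_channels, kh, kw) layout is a PyTorch-style assumption on my part, and conv2wider is a made-up name:

```python
import numpy as np

def conv2wider(W1, b1, W2, q, rng=None):
    """Widen a conv layer from n to q output channels (q > n).
    Assumed layout: W1 is (n, in_ch, kh, kw), W2 is (p, n, kh, kw)."""
    rng = np.random.default_rng() if rng is None else rng
    n = W1.shape[0]
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=q - n)])
    counts = np.bincount(g, minlength=n)
    U1, c1 = W1[g], b1[g]             # replicate whole filters and biases
    # Rescale the next layer's kernels along its input-channel axis.
    U2 = W2[:, g] / counts[g][None, :, None, None]
    return U1, c1, U2

# Check with 1x1 kernels, where convolution reduces to a per-pixel matmul.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 5, 1, 1)), rng.normal(size=3)
W2 = rng.normal(size=(4, 3, 1, 1))
U1, c1, U2 = conv2wider(W1, b1, W2, q=6, rng=rng)
x = rng.normal(size=(2, 5))  # a single "pixel" with 5 input channels
h = np.maximum(x @ W1[:, :, 0, 0].T + b1, 0)
hw = np.maximum(x @ U1[:, :, 0, 0].T + c1, 0)
assert np.allclose(h @ W2[:, :, 0, 0].T, hw @ U2[:, :, 0, 0].T)
```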

  • The benefits of using this approach are the following:

    • The newly created student network performs at least as well as the teacher network.
    • Any subsequent change to the network is guaranteed to be an improvement, so long as each local training step improves the objective.
    • It is safe to optimize all the parameters in the network.
  • The variant discussed above is called Net2WiderNet. There is another variant, called Net2DeeperNet, that enables the network to grow in depth.

  • In that case, a new layer, with weight matrix U initialized as the identity matrix, is added to the network. Note that, unlike Net2WiderNet, this approach does not work with an arbitrary activation function between the layers: it needs an activation φ satisfying φ(I·φ(v)) = φ(v), which holds for ReLU but not for sigmoid (see the sketch below).
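A minimal sketch of this identity initialization and a check of why it needs a ReLU-like activation; the name net2deeper and the shapes are illustrative:

```python
import numpy as np

def net2deeper(n_units):
    """New fully connected layer initialized to the identity.
    Inserting relu((.) @ W + b) after an existing ReLU layer leaves the
    function unchanged, since relu(relu(h) @ I) == relu(h)."""
    return np.eye(n_units), np.zeros(n_units)

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 4))  # pre-activation of an existing hidden layer
W, b = net2deeper(4)
assert np.allclose(np.maximum(h, 0),
                   np.maximum(np.maximum(h, 0) @ W + b, 0))
# The same trick fails for sigmoid: sigmoid(sigmoid(h)) != sigmoid(h).
```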


  • The approach can accelerate the training of neural networks, especially during the development cycle, when designers try out different models.

  • The approach could potentially be used in life-long learning systems where the model is trained over a stream of data and needs to grow over time.


  • The function-preserving transformations need to be worked out manually. Extra care is needed when operations like concatenation or batch normalization are present.