PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks
24 Dec 2017

Introduction
Unsupervised text embeddings generalise across different tasks, but they have weaker predictive power for any particular task than end-to-end trained deep learning methods. Deep learning techniques, however, are computationally expensive and need a large amount of supervised data and a large number of parameters to tune.

The paper introduces Predictive Text Embedding (PTE), a semi-supervised approach which learns an effective low-dimensional representation from a large amount of unsupervised data and a small amount of supervised data.

The work can be extended to general information networks as well, since classic techniques like MDS, Isomap, and Laplacian Eigenmaps do not scale well to large graphs.

Further, this model can be applied to heterogeneous networks, unlike the previous works LINE and DeepWalk, which work on homogeneous networks only.
Approach

The paper proposes 3 different kinds of networks:
- Word-Word Network, which captures the word co-occurrence information (local level).
- Word-Document Network, which captures the word-document co-occurrence information (local + document level).
- Word-Label Network, which captures the word-label co-occurrence information (bipartite graph).

All 3 graphs are integrated into one heterogeneous text network.
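As a minimal sketch of how the three edge-weight maps could be built from a toy corpus (hypothetical helper `build_networks` with whitespace tokenisation; the paper constructs the same kinds of weighted edges at corpus scale):

```python
from collections import Counter

def build_networks(docs, labels, window=2):
    """Toy construction of edge weights for the three bipartite networks."""
    ww, wd, wl = Counter(), Counter(), Counter()
    for d, (doc, label) in enumerate(zip(docs, labels)):
        words = doc.split()
        for i, w in enumerate(words):
            wd[(w, d)] += 1        # word-document edge: w occurs in document d
            wl[(w, label)] += 1    # word-label edge: w occurs under this label
            # word-word edges: co-occurrence within a sliding window
            for j in range(i + 1, min(i + window + 1, len(words))):
                ww[(w, words[j])] += 1
                ww[(words[j], w)] += 1
    return ww, wd, wl
```

Each map sends an edge (pair of vertices) to its weight, i.e. the co-occurrence count used as w_{i,j} below.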

First, the authors extend their previous work, LINE, to heterogeneous bipartite text networks as explained below:

Given a bipartite graph G = (V_{A} \cup V_{B}, E), where V_{A} and V_{B} are disjoint sets of vertices, the conditional probability of v_{a} (in set V_{A}) being generated by v_{b} (in set V_{B}) is given as the softmax score between the embeddings of v_{a} and v_{b}, normalised by the sum of exponentials of dot products between v_{b} and all nodes in V_{A}.
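Concretely, writing \vec{u}_i for the embedding of vertex v_i, this conditional probability is:

```latex
p(v_a \mid v_b) = \frac{\exp\left(\vec{u}_a^{\top} \vec{u}_b\right)}{\sum_{a' \in V_A} \exp\left(\vec{u}_{a'}^{\top} \vec{u}_b\right)}
```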

The second-order proximity can be determined by the conditional distributions p(· | v_{j}).
The objective to be minimised is the KL divergence between the conditional distribution p(· | v_{j}) and the empirical distribution \hat{p}(· | v_{j}) (given as w_{i, j} / deg_{j}).
The objective can be further simplified and optimised using SGD with edge sampling and negative sampling.
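A minimal sketch of this optimisation loop on one bipartite network (a hypothetical simplification: negatives are drawn uniformly and edges via `numpy.random.Generator.choice`, whereas the paper uses alias tables and a degree^{3/4} noise distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bipartite(edges, weights, n_a, n_b, dim=8, n_samples=5000, k=5, lr=0.025):
    """SGD with edge sampling and negative sampling on a bipartite graph."""
    # Embeddings for the two disjoint vertex sets V_A and V_B.
    U = rng.normal(scale=0.01, size=(n_a, dim))
    V = rng.normal(scale=0.01, size=(n_b, dim))
    p = np.asarray(weights, dtype=float)
    p /= p.sum()
    for _ in range(n_samples):
        # Edge sampling: pick an edge with probability proportional to its weight.
        a, b = edges[rng.choice(len(edges), p=p)]
        # Positive update: increase sigma(u_a . u_b).
        g = 1.0 - sigmoid(U[a] @ V[b])
        grad_b = g * U[a]
        U[a] = U[a] + lr * g * V[b]
        # Negative sampling: k random vertices of V_A pushed away from v_b.
        for neg in rng.integers(0, n_a, size=k):
            g = -sigmoid(U[neg] @ V[b])
            grad_b += g * U[neg]
            U[neg] = U[neg] + lr * g * V[b]
        V[b] += lr * grad_b
    return U, V
```

The per-edge update approximates the gradient of the weighted KL objective without ever computing the full softmax denominator.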


Now, the 3 individual networks can all be interpreted as bipartite networks, so node representations for all 3 networks are obtained as described above.

For the word-label network, since the labelled training data is sparse, one could either train on the unlabelled networks first and then fine-tune on the labelled network, or train on all of them jointly.

For the case of joint training, edges are sampled from the 3 networks alternately.

For the fine-tuning case, edges are first sampled from the unlabelled networks and then from the labelled network.
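The two sampling schedules can be sketched as follows (hypothetical helper names; each element of the returned list names the network to draw the next edge from):

```python
from itertools import cycle, islice

def joint_schedule(networks, steps):
    """Joint training: sample edges from the networks alternately
    (round-robin over, e.g., Gww, Gwd, Gwl)."""
    return list(islice(cycle(networks), steps))

def finetune_schedule(unlabelled, labelled, pre_steps, tune_steps):
    """Pre-training + fine-tuning: sample only from the unlabelled
    networks first, then only from the labelled one."""
    return list(islice(cycle(unlabelled), pre_steps)) + [labelled] * tune_steps
```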

Once the word embeddings are obtained, the text embeddings may be obtained by simply averaging the word embeddings.
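That averaging step, d = (1/n) \sum_i u_{w_i}, is a one-liner (a sketch assuming words are already mapped to integer ids indexing an embedding matrix):

```python
import numpy as np

def text_embedding(word_ids, word_emb):
    """Embedding of a piece of text = average of its word embeddings."""
    return word_emb[np.asarray(word_ids)].mean(axis=0)
```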
Evaluation

Baseline Models
- Local word co-occurrence based methods: Skip-Gram, LINE(Gww)
- Document-word co-occurrence based methods: LINE(Gwd), PV-DBOW
- Combined method: LINE(Gww + Gwd)
- CNN
- PTE

For long documents, PTE (joint) outperforms the CNN and the other PTE variants, and is around 10 times faster to train than the CNN model.

For short documents, PTE (joint) does not always outperform the CNN model, probably because word-sense ambiguity is more pronounced in short documents.