Papers I Read Notes and Summaries

VQA-Visual Question Answering

Problem Statement

  • Given an image and a free-form, open-ended, natural language question (about the image), produce the answer for the image.

  • Link to the paper

VQA Challenge and Workshop

  • The authors organise an annual challenge and workshop to discuss the state-of-the-art methods and best practices in this domain.
  • Interestingly, the second version is starting on 27th April 2017 (today).

Benefits over tasks like image captioning:

  • Simple, n-gram statistics based methods are not sufficient.
  • Requires the system to blend in different aspects of knowledge - object detection, activity recognition, commonsense reasoning etc.
  • Since only short answers are expected, evaluation is easier.


  • Created a new dataset of 50000 realistic, abstract images.
  • Used AMT to crowdsource the task of collecting questions and answers for MS COCO dataset (>200K images) and abstract images.
  • Three questions per image and ten answers per question (along with their confidence) were collected.
  • The entire dataset contains over 760K questions and 10M answers.
  • The authors also performed an exhaustive analysis of the dataset to establish its diversity and to explore how the content of these question-answers differ from that of standard image captioning datasets.

Highlights of data collection methodology

  • Emphasis on questions that require an image, and not just common sense, to be answered correctly.
  • Workers were shown previous questions when writing new questions to increase diversity.
  • Answers collected from multiple users to account for discrepancies in answers by humans.
  • Two modalities supported:
    • Open-ended - produce the answer
    • multiple-choice - select from a set of options provided (18 options comprising of popular, plausible, random and ofc correct answer)

Highlights from data analysis

  • Most questions range from four to ten words while answers range from one to three words.
  • Around 40% questions are “yes/no” questions.
  • Significant (>80%) inter-human agreement for answers.
  • The authors performed a study where human evaluators were asked to answer the questions without looking at the images.
  • Further, they performed a study where evaluators were asked to label if a question could be answered using common sense and what was the youngest age group, they felt, could answer the question.
  • The idea was to establish that a sufficient number of questions in the dataset required more than just common sense to answer.

Baseline Models

  • random selection
  • prior (“yes”) - always answer as yes.
  • per Q-type prior - pick the most popular answer per question type.
  • nearest neighbor - find the k nearest neighbors for the given (image, question) pair.


  • 2-channel model (using vision and language models) followed by softmax over (K = 1000) most frequent answers.

  • Image Channel
    • I - Used last hidden layer of VGGNet to obtain 4096-dim image embedding.
    • norm I - : l2 normalized version of I.
  • Question Channel
    • BoW Q - Bag-of-Words representation for the questions using the top 1000 words plus the top 1- first, second and third words of the questions.
    • LSTM Q - Each word is encoded into 300-dim vectors using fully connected + tanh non-linearity. These embeddings are fed to an LSTM to obtain 1024d-dim embedding.
    • Deeper LSTM Q - Same as LSTM Q but uses two hidden layers to obtain 2048-dim embedding.
  • Multi-Layer Perceptron (MLP) - Combine image and question embeddings to obtain a single embedding.
    • BoW Q + I method - concatenate BoW Q and I embeddings.
    • LSTM Q + I, deeper LSTM Q + norm I methods - image embedding transformed to 1024-dim using a FC layer and tanh non-linearity followed by element-wise multiplication of image and question vectors.
  • Pass combined embedding to an MLP - FC neural network with 2 hidden layers (1000 neurons and 0.5 dropout) with tanh, followed by softmax.
  • Cross-entropy loss with VGGNet parameters frozen.


  • Deeper LSTM Q + norm I is the best model with 58.16% accuracy on open-ended dataset and 63.09% on multiple-choice but far behind the human evaluators (>80% and >90% respectively).
  • The best model performs well for answers involving common visual objects but performs poorly for answers involving counts.
  • Vision only model performs even worse than the model which always produces “yes” as the answer.