Papers I Read Notes and Summaries

HoME - a Household Multimodal Environment


  • Environment for learning using modalities like vision, audio, semantics, physics and interaction with objects and other agents.

  • Link to the paper


  • Humans learn by interacting with their surroundings (environment).

  • Similarly training an agent in an interactive multi-model environment (virtual embodiment) could be useful for a learning agent.


  • Open-source and Open-AI gym compatible

  • Built on top of 45000 3D house layouts from SUNCG dataset.

  • Provides both 3D visual and audio recording.

  • Semantic image segmentation and langauge description of objects.


  • Rendering Engine

    • Implemented using Panda 3D game engine.

    • Renders RGB+depth scenes based on textures, multi-source lightings and shadows.

  • Acoustic Engine

    • Implemented using EVERT

    • Supports multiple microphones, sound sources, sound absorption based on material, atmospheric conditions etc.

  • Semantics Engine

    • Provides a short textual description for each object, along with information like color, category, material size, location etc.
  • Physics Engine

    • Implemented using Bullet3 Engine

    • Supports physical interaction, external forces like gravity and position and velocity information for multiple agents.

Potential Applications

  • Visual Question Answering

  • Conversational Agents

  • Training an agent to follow instructions

  • Multi-agent communication