SNU Biointelligence Lab

Video Turing Test (VTT)

Toward Human-Level Video Story Understanding

2017-2021, IITP grant No. 2017-0-01772-VTT, funded by the Korean Ministry of Science and ICT (MSIT)

The Video Turing Test (VTT) project pursued a grand challenge: building AI systems capable of human-level comprehension of video narratives. Adapting the Turing Test to video story understanding, the project proposed a blind evaluation framework in which an AI agent’s video comprehension is judged by how indistinguishable its responses are from those of a human viewer. TV drama and movie content served as the primary testbed, demanding that machines extract events, analyse causal relationships, resolve characters across scenes, and engage in open-ended dialogue about video content.
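As a rough illustration of the blind evaluation idea, the sketch below scores how well judges can tell AI answers from human answers. The trial format, the function names, and the scoring formula are assumptions made for this example; they are not the project's published protocol.

```python
def judge_accuracy(trials):
    """Fraction of trials in which the judge named the true source.

    trials: list of (true_source, judged_source) pairs, each "human" or "ai".
    This trial format is a hypothetical one chosen for the sketch.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    hits = sum(1 for true_source, judged in trials if true_source == judged)
    return hits / len(trials)


def indistinguishability(trials):
    """Illustrative score: 1.0 when judges are at chance (accuracy 0.5),
    0.0 when they are always right or always wrong."""
    return 1.0 - 2.0 * abs(judge_accuracy(trials) - 0.5)


# Four blind trials: the judge is right three times out of four.
trials = [
    ("ai", "ai"),
    ("human", "human"),
    ("ai", "human"),
    ("human", "human"),
]
print(judge_accuracy(trials))        # 0.75
print(indistinguishability(trials))  # 0.5
```

Under this assumed scoring, an agent passes more convincingly as judge accuracy approaches 0.5, the level expected from guessing.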

Overview

Story comprehension is widely regarded as a robust proxy for measuring intelligence. Understanding a video story requires integrating visual perception, natural language processing, and multimodal reasoning over long temporal spans, a combination that remains far beyond the reach of standard video recognition systems.

The VTT project addressed this challenge by developing:

The flagship benchmark produced by the project is DramaQA, a character-centered video question-answering dataset built from the Korean TV drama Another Miss Oh. DramaQA provides 17,983 QA pairs across 23,928 video clips at four hierarchical difficulty levels aligned to human cognitive developmental stages, together with 217,308 annotated images containing character bounding boxes, behaviors, emotions, and coreference-resolved scripts.
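To make the hierarchical difficulty levels concrete, here is a minimal sketch of how QA items could be represented and scored per level. The `QAItem` fields, the multiple-choice format, and the sample questions are hypothetical; this is not the actual DramaQA file schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One question about a video clip (hypothetical schema)."""
    clip_id: str
    level: int          # difficulty level, 1 (easiest) to 4 (hardest)
    question: str
    choices: tuple      # candidate answers (multiple choice is assumed here)
    answer_index: int   # index of the correct choice


def accuracy_by_level(items, predictions):
    """Return {level: fraction correct}, given one predicted index per item."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.level] += 1
        correct[item.level] += int(predicted == item.answer_index)
    return {level: correct[level] / total[level] for level in sorted(total)}


# Made-up items for illustration only.
items = [
    QAItem("clip_0001", 1, "Who is holding the cup?", ("Person A", "Person B"), 0),
    QAItem("clip_0001", 1, "What is Person B doing?", ("Sitting", "Running"), 0),
    QAItem("clip_0002", 3, "Why does Person A leave?", ("Angry", "Late"), 1),
]
print(accuracy_by_level(items, [0, 1, 1]))  # {1: 0.5, 3: 1.0}
```

Reporting accuracy separately per level, rather than as one aggregate number, shows whether a model handles only the easier questions or also the harder ones.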

A culminating system called VINCENT (Video INtelligence CENter Toolkit) was proposed to operationalise the VTT evaluation, demonstrating end-to-end multimodal video comprehension and dialogue.

Research Team

Principal Investigator

Researchers

Affiliation: Video Intelligence Center @ Seoul National University (SNU AIIS)

Technical Approach

The VTT project tackled video story understanding across several complementary research thrusts:

Benchmark Construction and Evaluation

Multimodal Representation Learning

Character Understanding and Re-identification

Scene Graph and Structured Reasoning

Open-ended Video Question Answering

Workshop and Community Building

The project organised the VTT Workshop at ICCV 2019 (November 2, 2019, Seoul), a half-day workshop that brought together leading researchers in video understanding from institutions including UC Berkeley, University of British Columbia, and University of Amsterdam. Nine papers were presented, spanning video stabilisation, character identification in dramatic content, and novel video understanding benchmarks.

The DramaQA challenge was also hosted at ECCV 2020, bringing the benchmark to a wider research community.

Publications

Funding

This project was funded by the Institute for Information & Communications Technology Promotion (IITP), grant No. 2017-0-01772-VTT, under the Korean Ministry of Science and ICT (MSIT). Additional related support was provided under IITP grants 2015-0-00310 (SW StarLab), 2018-0-00622 (RMI), and 2019-0-01367 (BabyMind).

Search related publications on the Research page