Learning Common Sense Through Visual Abstraction

Ramakrishna Vedantam*1, Xiao Lin*1, Tanmay Batra2, C. Lawrence Zitnick3, Devi Parikh1

1Virginia Tech, 2Carnegie Mellon University, 3Microsoft Research

*Equal Contribution


Common sense is essential for building intelligent machines. While some commonsense knowledge is explicitly stated in human-generated text and can be learnt by mining the web, much of it is unwritten. It is often unnecessary and even unnatural to write about commonsense facts.

While unwritten, this commonsense knowledge is not unseen! The visual world around us is full of structure modeled by commonsense knowledge. Can machines learn common sense simply by observing our visual world? Unfortunately, this requires automatic and accurate detection of objects, their attributes, poses, and interactions between objects, which remain challenging problems.

Our key insight is that while visual common sense is depicted in visual content, it is the semantic features that are relevant, not low-level pixel information. In other words, photorealism is not necessary to learn common sense. We explore the use of human-generated abstract scenes made from clipart for learning common sense. In particular, we reason about the plausibility of an interaction or relation between a pair of nouns by measuring the similarity of the relation and nouns with other relations and nouns we have seen in abstract scenes. We show that the commonsense knowledge we learn is complementary to what can be learnt from sources of text.
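The plausibility scoring described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the toy embedding vectors, the element-wise averaging of tuple similarity, and the max-over-seen-tuples scoring rule are all simplifying assumptions standing in for the full word2vec-based model.

```python
import numpy as np

# Toy word vectors standing in for learnt word2vec embeddings
# (hypothetical values chosen only for illustration).
EMB = {
    "dog":    np.array([0.9, 0.1, 0.2]),
    "cat":    np.array([0.8, 0.2, 0.3]),
    "chases": np.array([0.1, 0.9, 0.1]),
    "eats":   np.array([0.2, 0.8, 0.2]),
    "bone":   np.array([0.7, 0.1, 0.6]),
    "food":   np.array([0.6, 0.2, 0.7]),
}

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tuple_similarity(t1, t2):
    """Average element-wise similarity of two (tP, tR, tS) tuples."""
    return sum(cos(EMB[a], EMB[b]) for a, b in zip(t1, t2)) / 3.0

def plausibility(query, seen_tuples):
    """Score a query assertion by its best match among tuples seen in training."""
    return max(tuple_similarity(query, t) for t in seen_tuples)

# Tuples observed in training (illustrative only).
seen = [("dog", "chases", "cat"), ("cat", "eats", "food")]

# A novel assertion is plausible if it is similar to seen assertions.
score = plausibility(("dog", "eats", "bone"), seen)
```

A query tuple identical to a seen tuple scores 1.0; unrelated tuples score lower, giving a graded notion of plausibility.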


Qualitative Examples

A subset of relations along with all corresponding human illustrations collected to form the TRAIN set can be found in [clipart_browser.html].

The predictions of the classifier trained on visual features to predict tP, tR, and tS, corresponding to Figure 5 in the main paper, are shown in [clipart_browser_w_pred.html]. These are qualitative visualizations of which relations are most similar visually. We also show the similarity between the predictions and the ground-truth tuples using our text model based on word2vec.

The predictions of the text+vision model, along with text only and vision only models are given, categorized by relation tR in [assertion_browser.html]. The text tuples and visual illustrations which give most support to the TEST assertion are also shown.

Code and Data

Our assertion dataset, our abstract scene illustration dataset, and the code for this project are available here [cs_code_data.zip (70MB)].

The raw data for the abstract scene illustrations is available here [abstract_scene_illustrations_raw.zip (1GB)]. Refer to [Github] for guides on format, rendering and feature extraction.



Citation

@inproceedings{vedantam2015commonsense,
  author    = {Ramakrishna Vedantam and Xiao Lin and Tanmay Batra and C. Lawrence Zitnick and Devi Parikh},
  title     = {Learning Common Sense Through Visual Abstraction},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year      = {2015}
}



Acknowledgments

We thank Stanislaw Antol for his help with the tuple illustration interface. This work is supported in part by an Allen Distinguished Investigator Award from the Paul G. Allen Family Foundation and by a Google Faculty Research Award to Devi Parikh.