Publications (FIS)
Interpretable Visual Understanding with Cognitive Attention Network
- authored by
- Xuejiao Tang, Wenbin Zhang, Yi Yu, Kea Turner, Tyler Derr, Mengyu Wang, Eirini Ntoutsi
- Abstract
While image understanding at the recognition level has achieved remarkable advancements, reliable visual scene understanding requires comprehensive image understanding not only at the recognition level but also at the cognition level, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module to fuse information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at github.com/tanjatang/CAN.
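The abstract's image-text fusion step can be illustrated with a minimal sketch: text token embeddings attend over pre-extracted image region features via cross-modal attention. The module name, feature dimensions, and single cross-attention layer below are illustrative assumptions, not the authors' CAN implementation (see github.com/tanjatang/CAN for the actual code).

```python
# Minimal sketch of attention-based image-text fusion (illustrative only;
# dimensions and architecture are assumptions, not the CAN paper's design).
import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)   # project image region features
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)   # project text token embeddings
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (batch, num_regions, img_dim); txt_feats: (batch, num_tokens, txt_dim)
        q = self.txt_proj(txt_feats)            # queries come from the text side
        kv = self.img_proj(img_feats)           # keys/values come from image regions
        fused, _ = self.attn(q, kv, kv)         # cross-modal attention
        return self.out(fused + q)              # residual fusion of both modalities

# Toy usage with random features (e.g. 36 detected regions, 20 query/response tokens).
fusion = ImageTextFusion()
img = torch.randn(2, 36, 2048)
txt = torch.randn(2, 20, 768)
print(fusion(img, txt).shape)                   # torch.Size([2, 20, 512])
```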
- External Organisation(s)
-
Carnegie Mellon University
Research Organization of Information and Systems, National Institute of Informatics
University of South Florida
Vanderbilt University
Harvard University
Freie Universität Berlin (FU Berlin)
- Type
- Conference contribution
- Pages
- 555-568
- No. of pages
- 14
- Publication date
- 2021
- Publication status
- Published
- Peer reviewed
- Yes
- ASJC Scopus subject areas
- Theoretical Computer Science, Computer Science (all)
- Electronic version(s)
-
https://doi.org/10.48550/arXiv.2108.02924 (Access: Open)
https://doi.org/10.1007/978-3-030-86362-3_45 (Access: Closed)