Publications (FIS)
Interpretable Visual Understanding with Cognitive Attention Network
- Authored by
- Xuejiao Tang, Wenbin Zhang, Yi Yu, Kea Turner, Tyler Derr, Mengyu Wang, Eirini Ntoutsi
- Abstract
While recognition-level image understanding has achieved remarkable advances, reliable visual scene understanding requires comprehensive image understanding not only at the recognition level but also at the cognition level, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module that fuses information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at github.com/tanjatang/CAN.
- External organization(s)
-
Carnegie Mellon University
Research Organization of Information and Systems National Institute of Informatics
University of South Florida
Vanderbilt University
Harvard University
Freie Universität Berlin (FU Berlin)
- Type
- Article in conference proceedings
- Pages
- 555-568
- Number of pages
- 14
- Publication date
- 2021
- Publication status
- Published
- Peer-reviewed
- Yes
- ASJC Scopus subject areas
- Theoretical Computer Science, General Computer Science
- Electronic version(s)
-
https://doi.org/10.48550/arXiv.2108.02924 (Access: Open)
https://doi.org/10.1007/978-3-030-86362-3_45 (Access: Closed)