Multi-Modal Knowledge Acquisition from Documents
Navy STTR FY2010.A


Sol No.: Navy STTR FY2010.A
Topic No.: N10A-T019
Topic Title: Multi-Modal Knowledge Acquisition from Documents
Proposal No.: N10A-019-0065
Firm: ObjectVideo
11600 Sunrise Valley Drive
Suite # 290
Reston, Virginia 20191
Contact: Gaurav Aggarwal
Phone: (703) 654-9300
Web Site: www.objectvideo.com
Abstract: Images with associated text are now available in vast quantities and provide a rich resource for mining the relationship between visual information and the semantics encoded in language. In particular, the quantity of such data means that sophisticated machine learning approaches can be applied to determine effective models for objects, backgrounds, and scenes. Such understanding can then be used to: (1) understand, label, and index images that do not have text; and (2) augment the semantic understanding of images that do have text. This points to great potential power for searching, browsing, and mining documents containing image data. To this end, this STTR effort proposes a pipeline-based framework that focuses on the difficult task of text-image alignment (or correspondence). The proposed pipeline will take images and associated text and reduce correspondence ambiguity in stages. The framework will include both feed-forward and feedback controls passing partially inferred information from one stage to another, leading to information enrichment and providing inputs for the learning and understanding of novel objects and concepts. Ideas from both stochastic grammar representations and (joint) probabilistic representations will be investigated to facilitate modeling of text-image associations and visual modeling of objects, scenes, etc.
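The staged ambiguity-reduction idea described above can be illustrated with a minimal sketch. All names here (the `Candidate` record, the stage functions, the toy visual-model scores) are hypothetical illustrations, not part of the proposal; the sketch only shows how a set of candidate text-image correspondences might be enumerated, reweighted by a later stage, and then pruned in a feedback pass:

```python
# Hypothetical sketch of a staged text-image alignment pipeline.
# Each stage refines a set of candidate phrase-region correspondences;
# partially inferred scores are fed forward, and a pruning pass feeds
# back to discard low-confidence alignments, reducing ambiguity.

from dataclasses import dataclass

@dataclass
class Candidate:
    phrase: str   # text fragment (e.g., a noun phrase)
    region: str   # image region identifier
    score: float  # current alignment confidence

def stage_lexical(phrases, regions):
    """Stage 1: enumerate all phrase-region pairs with a uniform prior."""
    return [Candidate(p, r, 1.0 / len(regions)) for p in phrases for r in regions]

def stage_visual(cands, visual_score):
    """Stage 2 (feed-forward): reweight each candidate by a visual-model score."""
    return [Candidate(c.phrase, c.region, c.score * visual_score(c.phrase, c.region))
            for c in cands]

def stage_prune(cands, threshold=0.1):
    """Feedback: drop low-confidence correspondences and renormalize."""
    kept = [c for c in cands if c.score >= threshold]
    total = sum(c.score for c in kept) or 1.0
    return [Candidate(c.phrase, c.region, c.score / total) for c in kept]

# Toy run with an assumed visual model: "tank" should align to region "r1".
phrases, regions = ["tank", "road"], ["r1", "r2"]
visual = {("tank", "r1"): 0.9, ("tank", "r2"): 0.1,
          ("road", "r1"): 0.2, ("road", "r2"): 0.8}
cands = stage_prune(stage_visual(stage_lexical(phrases, regions),
                                 lambda p, r: visual[(p, r)]))
best = max((c for c in cands if c.phrase == "tank"), key=lambda c: c.score)
print(best.region)  # "r1"
```

In the actual framework, each stage would be a learned model (e.g., a stochastic grammar or joint probabilistic model) rather than a lookup table, but the control flow, scoring forward and pruning back, is the same.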
Benefits: The proposed framework will provide a platform to integrate state-of-the-art text and image understanding algorithms to reduce cross-modal ambiguity in stages in a principled, robust manner. The resulting text-image alignment will enable extraction of information not present in either modality alone. The self-learning aspect of the framework has the potential to discover novel textual and visual concepts, along with novel text-image associations not present in the initial knowledge base. Potential benefits of the proposed effort include: (1) improved content-based search in multimodal documents, where the query can be framed using text, an image sample, or both, and retrieval is based on information extracted from both the text and the embedded images in the indexed documents; (2) improved content-based image search, as opposed to the keyword-based search performed by current search engines; and (3) enrichment of knowledge bases with discovered novel textual and visual concepts, leading to improved knowledge extraction and search capabilities. For DoD, the proposed framework can facilitate automatic enrichment of intelligence reports that contain embedded images, by automatically generating expanded reports with additional information extracted from the images, providing better actionable information to analysts. Another possible application of the proposed knowledge acquisition framework of interest to DoD is automatic multimodal analysis of Internet blogs for suspicious activity. Potential commercial applications include law enforcement, document search, image search, etc.