PhD Student at Simula Metropolitan Center for Digital Engineering
HomePage: https://simula.no/people/sushant
LinkedIn: https://www.linkedin.com/in/esushant/
Bridging Multimedia Modalities: Enhanced Multimodal AI Understanding and Intelligent Agents
With the increasing availability of multimodal data, there is growing interest in developing Artificial Intelligence (AI) models capable of comprehending the world more holistically. This research proposes to leverage multiple input modalities to improve multimodal understanding, thereby enhancing reasoning, generation, and intelligent behavior in conversational agents. The research methodology comprises a literature review, model development, evaluation, domain-specific applications, and ethical considerations.
Multimodal understanding involves processing information from multiple modalities simultaneously, yielding a more comprehensive interpretation of content. The proposed research aims to enhance AI models’ comprehension by effectively integrating and aligning input modalities such as text, speech, and images. Key challenges to be addressed include bridging the semantic gap between modalities and effectively fusing their complementary information.
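To illustrate the kind of cross-modal integration involved, the sketch below projects text, audio, and image embeddings into a shared space and fuses them with a single attention layer. It is a minimal illustration only: the module names, dimensions, and fusion scheme are assumptions for exposition, not the architecture proposed in this research.

```python
# Minimal sketch of late fusion across modalities (illustrative placeholder, not the
# proposed architecture). Assumes pre-extracted embeddings from off-the-shelf encoders;
# all dimensions are arbitrary.
import torch
import torch.nn as nn


class SimpleMultimodalFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, image_dim=1024, shared_dim=256):
        super().__init__()
        # Project each modality into a shared embedding space to bridge the semantic gap.
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        # Self-attention over the three modality tokens fuses complementary information.
        self.fusion = nn.MultiheadAttention(shared_dim, num_heads=4, batch_first=True)

    def forward(self, text_emb, audio_emb, image_emb):
        # One token per modality: (batch, 3, shared_dim).
        tokens = torch.stack(
            [self.text_proj(text_emb), self.audio_proj(audio_emb), self.image_proj(image_emb)],
            dim=1,
        )
        fused, _ = self.fusion(tokens, tokens, tokens)
        # Pool across modalities to obtain a single joint representation.
        return fused.mean(dim=1)


# Example usage with random embeddings standing in for encoder outputs.
model = SimpleMultimodalFusion()
joint = model(torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 1024))
print(joint.shape)  # torch.Size([2, 256])
```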
The integration of multimodal approaches and language models has driven significant advancements in AI systems. Previous research has explored methods to fuse diverse data types and leverage language models for enhanced AI understanding. Researchers have also investigated agent-based frameworks to enhance language model capabilities and provide modular components for various tasks.
The research aims to synthesize different modalities for enhanced multimodal understanding. The main research question focuses on how existing AI models can be adapted or extended to handle multimodal data and to enhance their reasoning and generation capabilities. Sub-questions examine how multimodal data can be integrated into conversational agents and how that integration can be optimized for domain-specific applications.
The research plan involves literature review, data curation, model development, integration with conversational agents, domain-specific applications, evaluation, and documentation. The primary focus will be on adapting existing models to process multiple input modalities, enhancing agents’ capabilities, and optimizing models for specific domains.
Preliminary experiments involve fine-tuning a language model to understand soccer game events using video, audio, and commentary data. The model’s vision-language and audio-language branches are being fine-tuned to capture visual and auditory cues, contributing to a comprehensive understanding of game events.
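To make the branch-wise fine-tuning setup concrete, the sketch below freezes a stand-in language backbone and trains only a vision-language projection on placeholder frame features and commentary targets. All module names, dimensions, losses, and data here are assumptions for illustration; they are not the actual models or soccer data used in the preliminary experiments.

```python
# Hedged sketch of branch-wise fine-tuning: the language backbone stays frozen and only
# the vision-language projection receives gradient updates. Everything below is a
# placeholder, not the actual training pipeline.
import torch
import torch.nn as nn

llm_dim = 512  # placeholder embedding size of the frozen language backbone

# Trainable projection from pre-extracted video-frame features into the LLM token space.
vision_branch = nn.Linear(1024, llm_dim)

# Frozen stand-in for the language model.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
    num_layers=2,
)
for p in llm.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(vision_branch.parameters(), lr=1e-4)

# One illustrative training step on random tensors standing in for frame features and
# target commentary embeddings.
frame_features = torch.randn(4, 16, 1024)          # (batch, frames, feature_dim)
commentary_targets = torch.randn(4, 16, llm_dim)   # (batch, frames, llm_dim)

visual_tokens = vision_branch(frame_features)      # project frames into the LLM space
outputs = llm(visual_tokens)                       # frozen backbone contextualizes tokens
loss = nn.functional.mse_loss(outputs, commentary_targets)

loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```

The same pattern would apply to an audio-language branch: keep the backbone fixed and train only the modality-specific projection, so each branch learns to map its cues into the language model’s representation space.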
The proposed research aims to enhance multimodal understanding by integrating diverse modalities and developing intelligent agents. Ongoing work includes evaluating model outputs, refining model architectures, and exploring domain-specific optimizations. Ultimately, this work seeks to contribute to the advancement of more sophisticated AI systems and richer multimodal understanding.