Multimodal Retrieval-Augmented Generation for Financial Report Question-Answering: Architecture and Evaluation
Abstract
The exponential growth of digital financial reports, which are rich in text, tables, and graphical data, poses significant challenges for efficient information extraction and analysis. This paper presents a multimodal Large Language Model (LLM)-powered Question Answering (QnA) system for financial documents, focusing on a Retrieval-Augmented Generation (RAG) pipeline. Our system integrates advanced natural language processing, vision-language models, and optimized chunking strategies to retrieve and synthesize information from complex financial filings. Comprehensive evaluation across multiple configurations demonstrates the pipeline's robustness, highlighting the impact of hyperparameter tuning, chunking methods, and embedding strategies on answer faithfulness, relevance, and factual correctness.
License
Copyright (c) 2025 Pratham Shelke, Omkar Yellaram, Sunil Ghane

This work is licensed under a Creative Commons Attribution 4.0 International License.
