Multimodal Retrieval-Augmented Generation for Financial Report Question-Answering: Architecture and Evaluation

Authors

  • Pratham Shelke, Student, Department of Computer Science and Engineering, Sardar Patel Institute of Technology, Mumbai, India
  • Omkar Yellaram, Student, Department of Computer Science and Engineering, Sardar Patel Institute of Technology, Mumbai, India
  • Sunil Ghane, Professor, Department of Computer Engineering, Sardar Patel Institute of Technology, Mumbai, India

Abstract

The exponential growth of digital financial reports, rich in text, tables, and graphical data, poses significant challenges for efficient information extraction and analysis. This paper presents a multimodal Large Language Model (LLM)-powered Question Answering (QnA) system for financial documents, focusing on a Retrieval-Augmented Generation (RAG) pipeline. Our system integrates advanced natural language processing, vision-language models, and optimized chunking strategies to retrieve and synthesize information from complex financial filings. Comprehensive evaluation across multiple configurations demonstrates the pipeline's robustness, highlighting the impact of hyperparameter tuning, chunking methods, and embedding strategies on answer faithfulness, relevance, and factual correctness.
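
The abstract outlines a retrieval-augmented pipeline built from document chunking, embedding-based retrieval, and LLM answer synthesis. The paper's actual components, models, chunk sizes, and retrieval depth are not given on this page, so the Python sketch below is only a minimal illustration of that general shape: the chunk_text, embed, and retrieve functions, the 500-character chunk size, and the top-4 retrieval are all illustrative assumptions, and the embedding and generation backends are placeholders rather than the authors' implementation. In a multimodal setting, tables and chart images would typically be converted to text by a vision-language model before this chunking step.

    # Illustrative sketch only: a minimal chunk -> embed -> retrieve -> prompt loop
    # of the kind the abstract describes. Embedding and generation are placeholders.
    import numpy as np

    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Fixed-size character chunking with overlap (sizes are assumptions)."""
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - overlap
        return chunks

    def embed(texts: list[str]) -> np.ndarray:
        """Placeholder embedder; a real pipeline would call a text or
        vision-language embedding model here."""
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(texts), 384))

    def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 4) -> list[str]:
        """Cosine-similarity retrieval of the top-k chunks for a query."""
        q = embed([query])[0]
        sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[:k]
        return [chunks[i] for i in top]

    def build_prompt(query: str, report_text: str) -> str:
        """Assemble a grounded prompt; in a real pipeline this would be sent to an LLM."""
        chunks = chunk_text(report_text)
        vecs = embed(chunks)
        context = "\n\n".join(retrieve(query, chunks, vecs))
        return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"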

Published

30-11-2025

How to Cite

[1] P. Shelke, O. Yellaram, and S. Ghane, “Multimodal Retrieval-Augmented Generation for Financial Report Question-Answering: Architecture and Evaluation”, IJRESM, vol. 8, no. 11, pp. 85–89, Nov. 2025, Accessed: Jan. 10, 2026. [Online]. Available: https://journal.ijresm.com/index.php/ijresm/article/view/3383