A Multimodal Agentic RAG System for Real-Time Guidance in Composite Manufacturing Facilities

K. Chawla, K. Mora-Gonzalez, T. Ghosal, W. Halsey, V. Paquit, A.A. Hassen, S. Kim
Oak Ridge National Laboratory,
United States

Keywords: agentic RAG, multimodal retrieval, vision-language models, composite manufacturing equipment, self-validation

Summary:

Advances in digital manufacturing, particularly within composite processing environments, have created an urgent need for intelligent systems capable of interpreting complex equipment documentation and providing real-time operational support. Recent advances in Large Language Models (LLMs) have enabled conversational interfaces for general knowledge retrieval, but they remain fundamentally limited for specialized industrial domains. Traditional Retrieval-Augmented Generation (RAG) pipelines generally follow a fixed linear pattern: identify relevant documents → retrieve them → feed them to a model → produce an answer. While functional, this workflow is static, retrieval-dependent, and highly vulnerable to missing or incomplete context.

Agentic RAG fundamentally transforms this paradigm. Rather than passively retrieving text, it actively reasons about what to retrieve and how the retrieved evidence should constrain the final answer, shifting from a lookup mechanism to an autonomous analytical process. In other words, the system behaves less like a search tool and more like a domain specialist conducting a structured investigation. Even an LLM augmented with RAG, however, leaves major gaps when applied to industrial manuals: most RAG systems embed only text and therefore cannot retrieve visual content, yet equipment manuals contain roughly 30–50% visual information. Without figure retrieval, the assistant cannot guide users accurately. Multimodal agentic RAG closes this visual gap by using vision-language models to embed both text and images into a shared vector space, enabling cross-modal retrieval (i.e., a text query retrieves both the relevant text and the relevant figure).

This work introduces an agentic RAG-based LLM assistant specifically designed for composite manufacturing equipment. A defining innovation is the assistant’s ability to automatically integrate visual elements directly into responses. Images are retrieved along with their captions, page numbers, and semantic context, enabling evidence-based answers with embedded visual references. This significantly increases transparency and reduces hallucination by grounding every statement in verifiable source content. The agentic controller further enforces a reasoning-before-answering approach, in which the LLM generates a hidden chain-of-thought and performs an internal self-validation step, but outputs only a concise, citation-backed response.

Unlike conventional chat-based assistants, the proposed system integrates domain-specific multimodal retrieval and structured validation pipelines to overcome limitations inherent in general-purpose LLMs: hallucination, ambiguous instructions, and the lack of equipment-specific knowledge. By integrating image-based retrieval, structured citation, agentic reasoning, and rigorous validation, the system provides highly reliable, context-rich operational guidance. This approach establishes a foundation for next-generation intelligent manufacturing assistants capable of reducing downtime, accelerating training, and enhancing safety in complex composite processing environments.
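
To make the cross-modal retrieval step concrete, the minimal sketch below illustrates how manual text chunks and extracted figures could be embedded into one shared vector space and ranked against a single text query. It is not the paper's implementation: the ManualChunk structure, the embed_text/embed_image wrappers, and the toy encoder are illustrative assumptions, and a real deployment would call a CLIP-style vision-language encoder instead of the seeded random stand-in used here to keep the sketch runnable.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ManualChunk:
    content: str                      # text passage or figure caption
    modality: str                     # "text" or "image"
    page: int                         # page number, kept for citation
    image_path: str | None = None
    embedding: np.ndarray | None = field(default=None, repr=False)

def _toy_encoder(key: str, dim: int = 512) -> np.ndarray:
    # Stand-in for a CLIP-style vision-language encoder: a seeded random unit
    # vector keeps this sketch runnable without any model weights.
    rng = np.random.default_rng(abs(hash(key)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def embed_text(text: str) -> np.ndarray:
    return _toy_encoder("text:" + text)

def embed_image(path: str) -> np.ndarray:
    return _toy_encoder("image:" + path)

def build_index(chunks: list[ManualChunk]) -> list[ManualChunk]:
    # Text chunks and figures land in the SAME vector space, which is what
    # allows a text query to pull back the relevant figure as well.
    for c in chunks:
        c.embedding = embed_image(c.image_path) if c.modality == "image" else embed_text(c.content)
    return chunks

def retrieve(query: str, index: list[ManualChunk], k: int = 5) -> list[ManualChunk]:
    # Embeddings are unit-normalized, so the dot product is cosine similarity.
    q = embed_text(query)
    return sorted(index, key=lambda c: -float(np.dot(q, c.embedding)))[:k]
```

With a real encoder, a query such as "How do I thread the tow into the placement head?" would return a mixed list of procedural passages and figures, each carrying its caption and page number so the answer can cite them directly.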
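The second sketch outlines the reasoning-before-answering loop with internal self-validation, reusing the ManualChunk records retrieved above. It assumes only that the underlying LLM is wrapped as a plain prompt-in, text-out callable; the prompts, retry policy, and SUPPORTED/UNSUPPORTED check are illustrative choices, not the paper's exact validation pipeline.

```python
from typing import Callable

def answer_with_validation(
    query: str,
    evidence: list[ManualChunk],
    llm: Callable[[str], str],   # hypothetical wrapper around any chat-completion endpoint
    max_retries: int = 2,
) -> str:
    # Serialize evidence with page numbers and figure markers so the final
    # answer can cite verifiable source content.
    context = "\n\n".join(
        f"[p.{c.page}]{' FIGURE:' if c.modality == 'image' else ''} {c.content}"
        for c in evidence
    )
    for _ in range(max_retries + 1):
        # Hidden chain-of-thought: reasoning stays internal and is never shown to the user.
        reasoning = llm(
            "Reason step by step using ONLY the evidence below.\n\n"
            f"Evidence:\n{context}\n\nQuestion: {query}"
        )
        draft = llm(
            "Write a concise answer with page and figure citations, based on this reasoning.\n\n"
            f"Reasoning:\n{reasoning}\n\nQuestion: {query}"
        )
        # Self-validation: every claim must be grounded in the retrieved evidence.
        verdict = llm(
            "Does the answer make any claim not supported by the evidence? "
            "Reply SUPPORTED or UNSUPPORTED.\n\n"
            f"Evidence:\n{context}\n\nAnswer:\n{draft}"
        )
        if verdict.strip().upper().startswith("SUPPORTED"):
            return draft  # only the concise, citation-backed answer is surfaced
    return "The manual does not contain enough grounded evidence to answer reliably."
```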