John Hedlund-Fay, University of Sheffield, UK
Enterprise NL-to-SQL generation remains brittle in high-compliance environments, where query correctness depends on external, prescriptive regulatory logic. Retrieval-Augmented Generation (RAG) is a candidate solution, but its efficacy in bridging this semantic gap remains under-explored. We present a 400-question benchmark in the football domain and compare a modular decomposition pipeline (RAG-R) with a monolithic agentic architecture (RAG-C). RAG-C underperformed the best non-RAG baseline, ten-shot Chain-of-Thought (CoT), whereas RAG-R outperformed it. Notably, RAG-R exceeded the CoT baseline by 0.116 in average Exact Set Match (EM) and showed a 0.278 EM gain on the highest-difficulty domain-specific queries (p < 0.001). These findings demonstrate the importance of task decomposition in prescriptive RAG systems, where correctness relies on reconciling ambiguous intent with rigid regulatory logic rather than on schema knowledge alone.
NL-to-SQL, Retrieval-Augmented Generation (RAG), Modular Decomposition, Benchmark Construction, Domain-Specific Knowledge (DSK), Enterprise NLP.