RAG for Finance:
From Genesis
to Production
From the original 2020 RAG paper to a glass-box agentic implementation in regulated finance, what the research proves, and what we built at Nexqion.
Every RAG demo looks good. You throw a few PDFs into a vector store, wire up OpenAI embeddings, retrieve the top-5 chunks, and the LLM produces a coherent answer. Five minutes of work. Convincing output.
Then you try to do it in regulated finance. Suddenly the schema that took twenty minutes to design needs to survive a data protection impact assessment.
The constraints are concrete:
- Annual reports that look semantically near-identical from one year to the next.
- GDPR erasure obligations that cascade through your entire knowledge base.
- DORA audit requirements where every retrieval event needs a provenance trail.
- MiFID suitability rules where the wrong portfolio recommendation is a regulatory event, not bad UX.
- Tenant isolation that has to hold under cross-fund query attacks.
I have been building for these constraints for the past few months as part of Nexqion, an AI platform for institutional portfolio management. This article traces how six years of RAG research shaped the architecture: from the original 2020 paper that gave us the term, through the survey papers that mapped the field, into the finance-specific work that exposed which assumptions break under regulatory pressure, and what we built in response.
- RAG's killer feature is hot-swappable knowledge, and that property is what makes it the architecture best suited to GDPR, DORA and MiFID.
- General-purpose RAG breaks against five specific assumptions when finance documents and EU compliance enter the picture.
- The architecture that wins is glass-box, scope-keyed, intent-whitelisted and evidence-grounded. We built one.
Where RAG started
In 2020, before ChatGPT existed, Patrick Lewis and eleven co-authors at Facebook AI Research, with collaborators at UCL and NYU, published a paper called Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. They were not trying to build a chatbot. They were trying to solve a different problem: how do you give a language model access to facts it can update, inspect and cite, without retraining the model from scratch every time the world changes?
Their answer separated knowledge into two memories. Parametric memory lives inside the model's weights. It is dense, opaque, expensive to update, and prone to hallucination because the model cannot tell you which training example produced a given output. Non-parametric memory lives in an external document index. It is sparse, transparent, cheap to update, and every retrieval can be traced back to a specific source passage.
The killer feature in the original paper was not accuracy. It was a small experiment buried in the results section. They built one Wikipedia index from December 2016 and another from December 2018, then asked the same RAG model questions about world leaders.
With the 2016 index: 70% correct on 2016 questions, 4% on 2018 questions. With the 2018 index, the numbers inverted: 12% on 2016, 68% on 2018. Same model. Different index. The world's knowledge had been hot-swapped.
"For a chatbot, hot-swappable knowledge is convenient. For a regulated financial product, it is the difference between a viable architecture and a non-starter."
When a client exercises their GDPR right to erasure, you delete rows from a database, you do not retrain a model. When fund performance changes quarterly, you ingest new run summaries, you do not fine-tune. When a regulator asks which document grounded a particular recommendation, you point to a row with a hash. RAG is not just a technique. It is the architecture whose properties, hot-swappable knowledge, traceable retrieval, deletable rows, survive contact with EU financial compliance most cleanly. Other architectures can be made to comply, but at considerably higher cost.
Six years and 100 papers later
The six years between Lewis et al.'s paper and this one were busy. BlackRock-affiliated authors published a multi-agent equity-research RAG architecture last August. Altbridge published the Self-Driving Portfolio framework in April. UZH and the WMO deployed an agentic RAG system in production for climate-finance tracking. The research has caught up with the constraints, and most of it points in the same direction.
In 2024, Yunfan Gao and colleagues at Tongji University and Fudan published a survey reviewing more than 100 RAG studies. Their technology tree maps the field's branches (pre-training-augmented, fine-tuning-augmented and inference-time variants), all proliferating after ChatGPT made retrieval-augmented architectures suddenly relevant to industry. The survey distilled three paradigms.
Naïve RAG is the demo: BM25 or TF-IDF keyword retrieval, static datasets, retrieve-read workflow. Easy to ship. Falls over on anything that requires semantic understanding.
Advanced RAG adds dense retrieval (DPR), neural reranking, and multi-hop retrieval. This is where most production systems land, or claim to.
Modular RAG decomposes the pipeline into composable components: hybrid sparse+dense retrievers, query rewriters, rerankers, tool integrations, evaluators. Each module can be swapped without rebuilding the whole system.
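The modular decomposition can be sketched as a set of swappable callables behind one small interface. Everything here (the class name, the toy retriever and generator) is illustrative, not from the survey or any production system:

```python
# Minimal sketch of Modular RAG: each pipeline stage is a swappable callable,
# so a retriever or reranker can be replaced without rebuilding the system.
from dataclasses import dataclass
from typing import Callable

Query = str
Chunk = str

@dataclass
class ModularRAG:
    rewrite: Callable[[Query], Query] = lambda q: q          # query rewriter
    retrieve: Callable[[Query], list[Chunk]] = lambda q: []  # sparse, dense, or hybrid
    rerank: Callable[[Query, list[Chunk]], list[Chunk]] = lambda q, cs: cs
    generate: Callable[[Query, list[Chunk]], str] = lambda q, cs: ""

    def answer(self, query: Query) -> str:
        q = self.rewrite(query)
        chunks = self.rerank(q, self.retrieve(q))
        return self.generate(q, chunks)

# Swapping the retriever does not touch the rest of the pipeline:
rag = ModularRAG(
    retrieve=lambda q: [c for c in ["alpha report", "beta memo"] if q.split()[0] in c],
    generate=lambda q, cs: f"grounded on {len(cs)} chunk(s)",
)
```

The point of the pattern is exactly what the survey describes: each module is replaceable in isolation, which is also what makes individual modules auditable.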
Gao et al. drew a quadrant chart that should be on the wall of every team building with LLMs. One axis is external knowledge required; the other is model adaptation required. Prompt engineering sits at the bottom-left: low knowledge, low adaptation. Fine-tuning sits at the top-right: high adaptation, knowledge internalized in weights. RAG sits in between: high external knowledge, low-to-moderate model adaptation.
Most teams treat the choice between RAG and fine-tuning as a binary. The survey makes clear it is not. They are complementary. To, Bui and Le (RAG-IT, arXiv:2412.08179, submitted December 2024; latest revision December 2025) demonstrated this concretely: they used a teacher LLM to generate 800 instruction Q&A pairs from chunked NVIDIA and AMD earnings reports, then fine-tuned Llama-2-7b with LoRA on the generated data. The fine-tuned model scored 4.6/10 on correctness against an unseen Broadcom earnings report, versus 2.8 for baseline Llama-2 and 5.3 for GPT-3.5. RAG was not the inference architecture. RAG was the data factory.
For a private financial model deployed inside a tenant boundary, this changes the conversation. The question is not "RAG or fine-tune?" The question is where on the quadrant your use case lives, and finance lives at the high-external-knowledge, high-compliance end where both axes matter.
We hit this fork in week three of building Nexqion. Fine-tuning was tempting on inference cost: a private Llama-class model is materially cheaper per turn than calling GPT-4o, and a self-hosted model carries less DPA-renegotiation risk than an external embedding API. The GDPR-erasure cost killed it before we wrote the migration: every weight trained against client data is a row you cannot delete. RAG-as-data-factory is the eventual path; today the non-parametric memory carries everything.
Five assumptions that fail in finance
Most RAG tutorials are built on assumptions that hold for general knowledge bases and break in regulated finance. Each assumption has a corresponding research finding that exposes the gap.
Assumption 1: semantic similarity retrieves the right document.
True for Wikipedia. False for fund reports. The Q4 2023 and Q4 2024 annual reports of the same fund will share 80% of their language verbatim (boilerplate, methodology, governance disclosures), and the 20% that differs is exactly what the user is asking about. Standard HyDE retrieval generates one hypothetical document per query, embeds it, finds the nearest chunk, and confidently returns the wrong year.
Multi-HyDE (Srinivasan et al., IIT Madras, FinNLP EMNLP 2025) was built specifically to solve this. It generates N non-equivalent hypothetical documents per query, not similar variants but contextually distinct ones (e.g. one about fraud investigations, one about criminal cases, derived from the same query). Combined with BM25 and a cross-encoder reranker, it lifts retrieval recall on the financial-qa-10K subset from 68.85% to 84.84% (Multi-HyDE + BM25 + cross-encoder configuration), and reduces hallucinations by 15% on FinanceBench/ConvFinQA in human evaluation.
We hit this exact failure on a client fund's Q4 2023 vs Q4 2024 reports: the first prototype retrieved the wrong year roughly 40% of the time on questions like "what was last quarter's drawdown?" Multi-HyDE-style disambiguation is on the next-cycle retrieval-upgrade list.
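The Multi-HyDE idea can be sketched in a few lines: embed N contextually distinct hypothetical documents per query and take the union of their nearest chunks, rather than one hypothetical document as in standard HyDE. `generate_hypotheticals` and `embed` are stand-ins for an LLM and an embedding model; the real system also fuses with BM25 and a cross-encoder reranker, which this sketch omits:

```python
# Hedged sketch of Multi-HyDE-style retrieval (our reconstruction, not the
# paper's code): union of top-k cosine neighbours over N non-equivalent
# hypothetical-document embeddings.
import numpy as np

def multi_hyde_retrieve(query, generate_hypotheticals, embed, chunks, chunk_vecs, k=3):
    """chunk_vecs: pre-normalised matrix, one row per chunk. Returns ranked chunks."""
    scores = {}
    for doc in generate_hypotheticals(query):   # N contextually distinct drafts
        v = np.asarray(embed(doc), dtype=float)
        n = np.linalg.norm(v)
        if n == 0:
            continue
        sims = chunk_vecs @ (v / n)             # cosine similarity per chunk
        for i in np.argsort(sims)[::-1][:k]:
            scores[i] = max(scores.get(i, -1.0), float(sims[i]))  # best-of fusion
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [chunks[i] for i in ranked]
```

The key design choice is that the hypothetical documents must be non-equivalent: similar paraphrases collapse to nearly the same embedding and recover nothing beyond single-document HyDE.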
Assumption 2: more documents make a better corpus.
The most counterintuitive finding in the entire literature comes from a controlled experiment by Hornuf, Streich and Töllich (CESifo Working Paper 11862, April 2025). They tested seven LLMs on a 12-stock portfolio-allocation task across four information conditions, against a benchmark of US financial advisors recruited via an incentivized survey.
The result: generic finance theory had zero effect on LLM recommendations, it was already in the pre-training data. Quantitative firm-specific metrics (beta, size, book-to-market, ESG scores, momentum) improved performance significantly. Qualitative 10-K MD&A summaries improved performance less. Numbers beat narratives.
The same paper found something more uncomfortable. Adding firm-specific information to risk-averse investor profiles increased portfolio risk. The paper offers two possible mechanisms: firm-specific data may have reduced ambiguity aversion (prompting allocation to more stocks), or it may have triggered an implicit-request / desirability effect where the LLM read the additional information as a signal to broaden the universe. The authors note these cannot be cleanly distinguished. Either way, the resulting allocations include stocks inappropriate for the stated risk profile. For MiFID-regulated systems, this is a suitability misalignment risk that has to be designed against, not just monitored.
Assumption 3: one retrieval path is enough.
FinSage (Wang et al., SimpleWay.AI / McGill / Toronto, April 2025) achieved 92.51% recall on expert-curated financial questions using four parallel retrieval paths:
- BM25 for exact-match terminology, ISIN codes, fund names, KPI labels.
- BGE-M3 dense embeddings for semantic similarity.
- BGE-M3 over metadata, headings and section summaries embedded separately.
- HyDE instruction-tuned on financial filings.
Each chunk is stored as a 5-tuple: text, metadata, dense embedding, sparse embedding and metadata embedding. Co-references are resolved at chunk level: pronouns are replaced with their antecedents before embedding.
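The 5-tuple can be written down as a small record type. The field names below are illustrative (the paper does not publish a schema); only the five components and the coreference-resolved text are from the source:

```python
# Sketch of FinSage-style per-chunk storage: text, metadata, and three
# separately maintained representations (dense, sparse, metadata embedding).
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkRecord:
    text: str                        # co-references already resolved (pronoun -> antecedent)
    metadata: dict                   # headings, section summaries, fund identifiers
    dense_vec: tuple[float, ...]     # e.g. BGE-M3 dense embedding of the text
    sparse_vec: dict[int, float]     # lexical term weights for BM25-style matching
    metadata_vec: tuple[float, ...]  # embedding of metadata + headings, stored separately
```

Keeping the metadata embedding separate is what enables the third retrieval path above: a query about "risk disclosures in section 4" can match on structure even when the body text does not.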
The counter-signal in the same paper is more revealing. GraphRAG, currently heavily marketed, scored a 42.5% pass rate against FinSage's 82.67%. LightRAG scored 13.67%. The graph-RAG hype has not survived contact with real financial filings.
Assumption 4: let the model decide where to retrieve.
In a general-purpose chatbot, dynamic LLM-driven retrieval routing is a feature. In regulated finance, it is a compliance anti-pattern. A data protection officer who has to audit your system will ask one question: which corpus kinds were accessible for which intent? If the answer is "the model decides," the answer is unacceptable.
The correct design is a hardcoded intent → corpus_kinds whitelist: fully auditable, deterministic, and reviewable by a compliance officer who has never read an AI paper. This is not a simplification of dynamic routing. It is the architecture that lets you ship under DORA.
Source: converged practice under DORA / MiFID II, no single paper makes this argument; it is what every audited regulated-AI system in the EU has had to settle on.
Assumption 5: retrieval succeeds in one shot.
Vaghefi et al. (EACL 2026, University of Zurich / WMO / Swiss Finance Institute) built an agent-based RAG system for tracking Early Warning System investments in heterogeneous MDB climate-finance documents: PDFs with nested tables, scattered numeric evidence, non-standardized layouts. Their agent runs a self-healing loop: when retrieved context falls below a coverage threshold, it generates a sub-query and re-queries automatically. Their ablation shows that disabling this loop reduces both evidence F1 and total-amount accuracy, particularly on documents with fragmented tables.
Their glass-box agent reaches 87% accuracy, outperforming Gemini 2.5 Flash and OpenAI Assistants by 8–14 percentage points on the same documents. The finding that matters most: glass-box wins decisively wherever exact financial figures are involved; black-box only wins where numeric cues are buried in narrative prose. For a financial AI system, the documents with exact figures are precisely the ones that matter.
The architecture findings, end to end
Stepping back from the broken assumptions: what does the recent literature actually converge on?
On retrieval architecture
Four design choices converge across multiple independent papers:
- Hybrid sparse+dense beats dense-only on long financial documents (Multi-HyDE ablation).
- Reciprocal Rank Fusion with k=60 appears in Cheng et al. (IEEE '26) and EW4All as the fusion mechanism for hybrid sparse+dense retrieval. FinSage converges on the same hybrid sparse+dense direction but through multi-path retrieval plus reranking rather than RRF specifically.
- Neural reranking is no longer optional: it is the single highest-leverage addition to a financial RAG pipeline.
- Adaptive cutoffs (cumulative probability ≥55% or score-cliff drop >0.15) prevent context dilution from naive top-k selection.
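The adaptive cutoffs in the last bullet fit in one function: keep reranked candidates until their softmax mass reaches 55%, or stop early at a score cliff (a drop greater than 0.15 between consecutive scores). The thresholds are from the papers above; the implementation is our sketch:

```python
# Adaptive cutoff over reranker scores, replacing a naive fixed top-k.
import math

def adaptive_cutoff(scores, mass=0.55, cliff=0.15):
    """scores: reranker scores sorted descending. Returns how many hits to keep."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    cum = 0.0
    for i, s in enumerate(scores):
        if i > 0 and scores[i - 1] - s > cliff:  # score cliff: stop before the drop
            return i
        cum += exps[i] / total                   # softmax probability mass so far
        if cum >= mass:                          # enough mass retained
            return i + 1
    return len(scores)
```

A flat top-k would pass the low-scoring tail straight into the context window; either stopping rule exists to prevent exactly that dilution.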
The single most actionable number comes from the IEEE March 2026 ablation. On a hybrid SQLite FTS5 + FAISS pipeline over 500 S&P 500 10-K reports, with 1,500 queries split across 5 independent test groups on the FinDER benchmark, adding JinaReranker v2 (278M params, multilingual) over the top-30 fused candidates improved correctness from 33.5% to 49.0%, a +15.5pp gain.
Critical errors (Score=1) dropped from 35.3% to 22.5%. Consistent across all five test groups. Without RAG entirely, GPT-4-Turbo scores 9% on the same benchmark.
On corpus design
Numbers beat narratives. Generic finance theory is wasted token budget. Quantitative firm-specific metrics are the highest-value content per byte. Qualitative summaries help, but less. This is the corpus-composition finding from Hornuf et al., and it directly contradicts how most teams build financial RAG corpora: by stuffing in every PDF they can find.
On agentic architecture, from practitioners
BlackRock published AlphaAgents in August 2025 (Zhao, Lyu, Jones, Garber, Pasquali, Mehta). Three GPT-4o agents, each with their own toolkit:
- Fundamental: 10-K RAG over the yfinance API.
- Sentiment: Bloomberg news with reflection-enhanced summarisation.
- Valuation: OHLCV calculator with mathematical volatility / return computation.
Microsoft AutoGen GroupChat orchestrates round-robin debate until consensus. Risk tolerance is embedded via role prompting: risk-averse vs risk-neutral.
The risk-neutral profile outperformed an equal-weight benchmark over a 4-month back-test on 15 tech stocks. The risk-averse profile underperformed: it correctly excluded high-volatility stocks, which during a tech bull market were exactly the winners. This is a concrete caution against treating prompt-based risk profiling as fail-safe.
They also observed something subtler: risk-seeking and risk-neutral prompts produced near-identical outputs. Adjacent risk profiles do not differentiate meaningfully under prompt engineering alone, a finding that should temper any team building MiFID suitability with prompts and hope.
Ang, Azimbayev and Kim (April 2026) push the same architecture much further. Their Self-Driving Portfolio deploys ~50 specialized agents in a six-stage pipeline:
- Macro Agent: regime classification (expansion / late-cycle / recession / recovery) with confidence scores.
- Asset Class Agents: run in parallel, each applying multiple CMA methods plus a CMA judge.
- Covariance Agent: historical data plus macro forecasts feed an asset-class covariance matrix.
- Portfolio Construction Agents: 20+ methods in parallel, including equal-weight, MVO, risk parity, HRP, Black-Litterman, max diversification, min variance and max entropy.
- Strategy Review: Borda-count voting, randomised peer review, adversarial diversifier, CRO risk assessment.
- CIO Ensemble: combines proposals and produces the final portfolio plus a board memo with dissenting views.
A meta-agent closes the feedback loop. It compares past forecasts against realised returns and rewrites agent code and prompts to improve future performance.
On governance
The Self-Driving Portfolio paper introduces the framing that ties everything together: the Investment Policy Statement is the operational design domain for the agent fleet, a framing borrowed from the SAE J3016 levels of driving automation. L3 systems recommend within IPS parameters and require human approval. L4 systems operate autonomously within constrained domains. The IPS does for portfolio agents what a compliance profile does for retrieval agents: it encodes institutional constraints as machine-readable rules.
This reframed how we think about the compliance_profile kind in Nexqion's corpus. It is not metadata to retrieve when the user asks about compliance; it is the operational design domain that constrains the entire pipeline. The kinds whitelist is one mechanism that enforces it; the audit log is another; tenant scope is a third. The IPS-as-ODD framing makes it considerably easier to defend that architecture to a regulator.
What we built at Nexqion
Here is what that converged stack looks like when you actually have to ship it.
The Nexqion product is the Alpha Quant Agent, a planning and analysis system that reasons over a client's complete investment context: fund structure, performance history, methodology documentation, compliance constraints, past report edits, communication style. The goal is to generate the kind of analysis that normally requires a senior analyst spending half a day pulling sources together.
Building it required answering, stage by stage, how a glass-box RAG pipeline should be designed under regulatory pressure. The pattern that emerges across the recent literature is a six-stage flow with three compliance gates, universal in skeleton, varied in instantiation. Different institutions will tailor the gates to their own regulatory regime; DORA, MiFID, GDPR and SEC each push different design choices on top of the same backbone.
How each stage should be designed
The recommendations below describe what each stage should do and why. They are the converged pattern across the recent literature, not specific to any one institution.
| Stage | Design recommendation |
|---|---|
| 01 | Intent classification. Map the user query to one of a finite, code-defined enumeration of intents (five to ten is enough for most institutions). Avoid free-form LLM routing: a compliance officer must be able to read the intent taxonomy in a single file and understand it without prior AI knowledge. Ship golden examples per intent so the classifier is testable. |
| 02 | Authorization gate. A hardcoded intent → permitted corpus kinds mapping. Defined in code, reviewable in one file, never decided at runtime by an LLM. This is what makes "which corpus kinds were accessible to this user on date X" a one-line answer instead of an unanswerable question. The default for an unrecognised intent should be the most-restrictive kind set, not the most-permissive. |
| 03 | Query embedding. Normalise vectors to unit length so dot-product equals cosine similarity without renormalising downstream. Provider-abstract the embedding call so per-tenant residency overrides (DACH strict, US default, etc.) become an env-var change, not a refactor. Fail-soft on provider errors: degrade to no-RAG-hints rather than returning a hard error. |
| 04 | Scoped retrieval. Tenant scope is enforced at the storage layer (SQL WHERE clause, row-level security, or equivalent), never the application layer. Application-layer scoping fails the moment a developer writes one query that forgets the filter. The kind filter from stage 02 is applied alongside scope. Hybrid sparse+dense retrieval and a neural reranker (per IEEE 2026, +15.5pp) are the highest-leverage upgrades once basic retrieval works. |
| 05 | Context assembly. Group hits by kind, not by raw score. Apply a per-kind char or token cap so one chatty hit cannot crowd everything else out. Render kinds in a priority order encoded by the institution's evidence preferences (Hornuf et al.: quantitative beats qualitative). Under a global cap, evict the lowest-priority sections wholesale rather than truncating mid-section; this preserves prompt-cache determinism (same hits → same string). |
| 06 | Grounded generation + audit. Log every retrieval event with: query hash (not raw text, per GDPR), tenancy keys (split across user, org and scope), kinds filter, top-k record IDs, model, latency, timestamp. Retain for the regulatory window the institution operates under. The audit log itself must be erasable on a tenant-erasure request: the GDPR right to erasure cascades through it, not just through the source documents. |
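Stage 03's two recommendations (unit-normalise, fail soft) fit in a few lines. `provider_embed` below is a stand-in for any embedding API client; the function name is ours:

```python
# Sketch of stage 03: L2-normalise query embeddings so dot product equals
# cosine similarity downstream, and degrade to "no RAG hints" rather than
# erroring when the provider call fails.
import numpy as np

def embed_query(text, provider_embed):
    """Return a unit-length vector, or None to signal fail-soft degradation."""
    try:
        v = np.asarray(provider_embed(text), dtype=float)
    except Exception:
        return None                      # caller proceeds without RAG hints
    n = np.linalg.norm(v)
    return v / n if n > 0 else None

# Downstream, with stored vectors also unit-length, dot product *is* cosine:
#   score = stored_unit_vec @ embed_query(query, client)
```

Returning `None` instead of raising is the whole point: the planning step still runs, just flagged as having reduced contextual grounding.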
An instantiation, how Nexqion implements it
The recommendations above are the framework. Below is one specific instantiation: how the Alpha Quant Agent answers each design question today, what it ships, and what closes in the next cycle.
System Architecture, Client-Aware Planner Context

Embeddings: text-embedding-3-small (1536-dim), with a sentence-transformers all-MiniLM-L6-v2 local fallback (384-dim) for DACH strict-residency tenants. All vectors are L2-normalised before storage, and the pipeline fails soft on provider error.

Vector store: pgvector IVFFlat with lists scaled to data size, and an SQLite-numpy fallback for local dev. Every row carries a scope_key enforced at the SQL layer; the Postgres role used by the API has no permission to query without a scope predicate.

Three of the design questions above are answered today; two close in the next cycle.
Stage 04, Tenant isolation at the SQL layer. The scope_key is a column on every embeddings row, every retrieval query filters at the WHERE clause, and the Postgres role used by the API has no permission to query without a scope predicate. Application-layer scoping was rejected as a design: it fails the moment a developer writes one query that forgets the filter.
Stage 03, Fail-soft on embedding errors. If the embedding call fails for any reason, the planning step runs without RAG hints rather than returning a hard error. The user still gets an answer, just one flagged as having reduced contextual grounding. This mirrors the EW4All self-healing posture for the partial-failure case.
Stage 05, Kind-grouped context assembly with priority eviction. This is where the most interesting piece of running code lives. Hits come back from the vector search ungrouped. The prompt-block formatter groups them by kind, applies a per-kind character cap, renders the kinds in priority order, then enforces a global 4 KB total cap by evicting the lowest-priority sections wholesale rather than truncating mid-section.
One data structure governs all of it:
```python
# client_knowledge_base.py, what governs the 4 KB context budget
_KIND_DISPLAY = (
    ("org_profile", "Firm profile", 800),        # highest priority
    ("compliance_profile", "Compliance", 800),
    ("methodology", "Methodology", 800),
    ("fund_profile", "Funds", 1000),
    ("style_sample", "House style", 1200),
    ("run", "Past runs", 700),
    ("report_edit_pattern", "Edit patterns", 500),
    ("dataset_summary", "Dataset notes", 500),
    ("schedule", "Schedules", 400),
    ("reference_report", "Reference packs", 600),  # lowest
)
_PROMPT_BLOCK_TOTAL_CAP = 4096
```
Why wholesale eviction of low-priority sections rather than partial truncation? Two reasons. First, prompt-cache determinism: the same retrieval-hit set must produce the same context string every turn so OpenAI's prompt cache keeps hitting; partial truncation produces near-identical-but-not-identical strings on every turn and destroys the cache. Second, losing a partial methodology section confuses the planner more than losing a low-signal schedule section entirely.
Hornuf et al.'s finding that quantitative firm-specific data outperforms qualitative narratives is encoded directly in the priority order: fund_profile and compliance_profile sit near the top; style_sample and reference_report sit at the bottom. The corpus is not a flat bag of chunks. It is a budgeted, priority-ranked, deterministic ledger.
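The kind-grouped assembly with wholesale eviction can be sketched as follows. This is our reconstruction of the behavior described above, not the actual Nexqion code; the join-separator overhead is ignored for simplicity:

```python
# Sketch of stage-05 context assembly: group by kind, cap per kind, render in
# priority order, then evict whole low-priority sections under a global cap.
def assemble_context(hits_by_kind, kind_display, total_cap=4096):
    """hits_by_kind: {kind: [text, ...]}. kind_display: ((kind, label, cap), ...),
    highest priority first. Returns a deterministic context string."""
    sections = []
    for kind, label, cap in kind_display:            # priority order
        texts = hits_by_kind.get(kind)
        if not texts:
            continue
        body = "\n".join(texts)[:cap]                # per-kind character cap
        sections.append(f"## {label}\n{body}")
    # Evict lowest-priority sections wholesale until under the global cap:
    # same hit set -> same string, so the prompt cache keeps hitting.
    while sections and sum(len(s) for s in sections) > total_cap:
        sections.pop()                               # last item = lowest priority
    return "\n\n".join(sections)
```

Note that nothing here is score-dependent: once the hits are fixed, the output string is fully determined by the priority table.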
Stages 01–02, Intent classification and authorization gate. These are the next-cycle close. The data structure is settled: a deterministic intent → kinds mapping defined in one file, with a storage layer that already accepts a kind filter. What remains is the orchestrator passing the right list into the retrieval call. Cycle 8.
```python
# client_knowledge_base.py, the next-cycle close (cycle 8)
KINDS_BY_INTENT: dict[ChatIntent, list[CorpusKind]] = {
    "performance": ["run", "fund_profile", "dataset_summary"],
    "compliance": ["compliance_profile", "org_profile", "methodology"],
    "methodology": ["methodology", "fund_profile", "reference_report"],
    "narrative": ["style_sample", "report_edit_pattern", "reference_report"],
    "attribution": ["run", "methodology", "fund_profile"],
}
```
The storage layer already supports kind filtering at the SQL layer (WHERE kind IN (…)); the missing piece is the orchestrator passing kind=KINDS_BY_INTENT[intent] into store.search(). When this lands, the six-stage design above is accurate end-to-end at Nexqion.

The roadmap, grounded in numbers
Four changes follow directly from the findings above:
- Switch IVFFlat → HNSW once tenant scale crosses the index threshold. Today, IVFFlat is created lazily once a tenant has ≥1,000 vectors, with `lists` scaled to `min(sqrt(count), 100)`; below that we sequential-scan inside the scope, which is faster anyway. That is correct for current scale but suboptimal once tenants reach 10K+ vectors. HNSW is more accurate at every size below 100K rows with comparable query latency; pgvector supports HNSW since v0.5.0.
- Add neural reranking (JinaReranker v2). The IEEE March 2026 ablation makes the case unambiguous: +15.5pp correctness from a single cross-encoder pass over the top-30 candidates. 278M parameters, multilingual (108 languages, relevant for DACH clients). Adaptive cutoffs (cumulative probability ≥55% or score cliff >0.15) prevent context dilution.
- Implement hybrid BM25 + vector with RRF. Pure cosine similarity loses on exact-match terminology: fund names, ISIN codes, regulatory references, KPI labels. The RRF formula `1/(60 + r_FTS) + 1/(60 + r_semantic)` is implementable directly against the existing pgvector + Postgres FTS stack.
- Build the evaluation harness. Without a way to measure Recall@5, MRR and faithfulness against held-out user sessions, every change above is invisible. This is the prerequisite for everything else. The metrics framework from FinSage (faithfulness evaluation, mean score, pass rate, response time) is the template.
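The roadmap's RRF fusion is small enough to show in full. This is a generic implementation of the standard formula with k=60, over two ranked ID lists (FTS and semantic); the function name and argument names are ours:

```python
# Reciprocal Rank Fusion: each document scores 1/(k + rank) per ranking it
# appears in, summed across rankings. k=60 is the conventional constant.
def rrf_fuse(fts_ranked, semantic_ranked, k=60):
    """Each argument is a list of doc IDs, best first. Returns fused IDs, best first."""
    scores = {}
    for ranking in (fts_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF operates on ranks rather than raw scores, the BM25 and cosine lists need no score calibration before fusing, which is exactly why it drops cleanly onto an existing pgvector + Postgres FTS stack.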
Where this is going
The Self-Driving Portfolio paper's framing is the trajectory. The Investment Policy Statement is the operational design domain for an agent fleet. The compliance profile in Nexqion is the same construct one layer down: it constrains the retrieval and analysis agents the way an IPS constrains a portfolio-construction agent fleet.
Glass-box modular RAG with intent whitelists, scope-keyed tenancy, evidence grounding, audit-trail retrieval logging, and fail-soft degradation is not a constraint on what we can build. It is what makes the system trustworthy enough to put in front of a fund manager. The technical infrastructure and the compliance posture are the same thing.
The architecture is governed. The roadmap is specific. What's deliberately missing from this issue: the runtime numbers. We are not publishing tenant count, vector count, or p95 retrieval latency until the eval harness in Issue 05 makes those measurements honest. A "From Genesis to Production" article without measurements would be marketing; we would rather earn the title in the next issue than claim it now.