Issue 04 · Engineering

RAG for Finance:
From Genesis
to Production

From the original 2020 RAG paper to a glass-box agentic implementation in regulated finance, what the research proves, and what we built at Nexqion.

Every RAG demo looks good. You throw a few PDFs into a vector store, wire up OpenAI embeddings, retrieve the top-5 chunks, and the LLM produces a coherent answer. Five minutes of work. Convincing output.

Then you try to do it in regulated finance. Suddenly the schema that took twenty minutes to design needs to survive a data protection impact assessment.

The constraints are concrete:

  • Annual reports that look semantically near-identical from one year to the next.
  • GDPR erasure obligations that cascade through your entire knowledge base.
  • DORA audit requirements where every retrieval event needs a provenance trail.
  • MiFID suitability rules where the wrong portfolio recommendation is a regulatory event, not bad UX.
  • Tenant isolation that has to hold under cross-fund query attacks.

I have been building for these constraints for the past few months as part of Nexqion, an AI platform for institutional portfolio management. This article traces how six years of RAG research shaped the architecture: from the original 2020 paper that gave us the term, through the survey papers that mapped the field, into the finance-specific work that exposed which assumptions break under regulatory pressure, and what we built in response.

Where RAG started

In 2020, before ChatGPT existed, Patrick Lewis and eleven co-authors at Facebook AI Research, with collaborators at UCL and NYU, published a paper called Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. They were not trying to build a chatbot. They were trying to solve a different problem: how do you give a language model access to facts it can update, inspect and cite, without retraining the model from scratch every time the world changes?

Their answer separated knowledge into two memories. Parametric memory lives inside the model's weights. It is dense, opaque, expensive to update, and prone to hallucination because the model cannot tell you which training example produced a given output. Non-parametric memory lives in an external document index. It is sparse, transparent, cheap to update, and every retrieval can be traced back to a specific source passage.

The killer feature in the original paper was not accuracy. It was a small experiment buried in the results section. They built one Wikipedia index from December 2016 and another from December 2018, then asked the same RAG model questions about world leaders.

With the 2016 index: 70% correct on 2016 questions, 4% on 2018 questions. With the 2018 index, the numbers inverted: 12% on 2016, 68% on 2018. Same model. Different index. The world's knowledge had been hot-swapped.

"For a chatbot, hot-swappable knowledge is convenient. For a regulated financial product, it is the difference between a viable architecture and a non-starter."

When a client exercises their GDPR right to erasure, you delete rows from a database, you do not retrain a model. When fund performance changes quarterly, you ingest new run summaries, you do not fine-tune. When a regulator asks which document grounded a particular recommendation, you point to a row with a hash. RAG is not just a technique. It is the architecture whose properties (hot-swappable knowledge, traceable retrieval, deletable rows) survive contact with EU financial compliance most cleanly. Other architectures can be made to comply, but at considerably higher cost.

Lewis et al. '20
Lewis, Perez, Piktus et al. · Facebook AI Research / UCL / NYU
Introduced RAG: pre-trained DPR retriever over a Wikipedia index, BART seq2seq generator, end-to-end fine-tuned. Set state-of-the-art on Natural Questions, WebQuestions and CuratedTREC, and on TriviaQA's T5-comparable split. Demonstrated index hot-swapping: replace the non-parametric memory and the model's world knowledge updates without any retraining. The conceptual foundation every later RAG system builds on.

Six years and 100 papers later

The six years between Lewis et al.'s paper and this one were busy. BlackRock-affiliated authors published a multi-agent equity-research RAG architecture last August. Altbridge published the Self-Driving Portfolio framework in April. UZH and the WMO deployed an agentic RAG system in production for climate-finance tracking. The research has caught up with the constraints, and most of it points in the same direction.

In 2024, Yunfan Gao and colleagues at Tongji University and Fudan published a survey reviewing more than 100 RAG studies. Their technology tree shows the field's branches (pre-training-augmented variants, fine-tuning-augmented variants, inference-time variants) all proliferating after ChatGPT made retrieval-augmented architectures suddenly relevant to industry. The survey distilled three paradigms.

Naïve RAG is the demo: BM25 or TF-IDF keyword retrieval, static datasets, retrieve-read workflow. Easy to ship. Falls over on anything that requires semantic understanding.

Advanced RAG adds dense retrieval (DPR), neural reranking, and multi-hop retrieval. This is where most production systems land, or claim to.

Modular RAG decomposes the pipeline into composable components: hybrid sparse+dense retrievers, query rewriters, rerankers, tool integrations, evaluators. Each module can be swapped without rebuilding the whole system.

Gao et al. drew a quadrant chart that should be on the wall of every team building with LLMs. One axis is external knowledge required; the other is model adaptation required. Prompt engineering sits at the bottom-left: low knowledge, low adaptation. Fine-tuning sits at the top-right: high adaptation, knowledge internalized in weights. RAG sits in between: high external knowledge, low-to-moderate model adaptation.

Most teams treat the choice between RAG and fine-tuning as a binary. The survey makes clear it is not. They are complementary. To, Bui and Le (RAG-IT, arXiv:2412.08179, submitted December 2024; latest revision December 2025) proved this concretely: they used a teacher LLM to generate 800 instruction Q&A pairs from chunked NVIDIA and AMD earnings reports, then fine-tuned Llama-2-7b with LoRA on the generated data. The fine-tuned model scored 4.6/10 on correctness against an unseen Broadcom earnings report, vs Llama-2's baseline 2.8 and GPT-3.5's 5.3. RAG was not the inference architecture. RAG was the data factory.

For a private financial model deployed inside a tenant boundary, this changes the conversation. The question is not "RAG or fine-tune?" The question is where on the quadrant your use case lives, and finance lives at the high-external-knowledge, high-compliance end where both axes matter.

We hit this fork in week three of building Nexqion. Fine-tuning was tempting on inference cost: a private Llama-class model is materially cheaper per turn than calling GPT-4o, and a self-hosted model carries less DPA-renegotiation risk than an external embedding API. The GDPR-erasure cost killed it before we wrote the migration: every weight trained against client data is a row you cannot delete. RAG-as-data-factory is the eventual path; today the non-parametric memory carries everything.

Gao et al. '24
Gao, Xiong, Gao et al. · Tongji / Fudan · arXiv:2312.10997
Reviews 100+ RAG studies. Distills three paradigms, Naïve, Advanced, Modular, with a quadrant chart contrasting RAG, fine-tuning, and prompt engineering on external-knowledge and model-adaptation axes. Establishes that RAG and fine-tuning are not mutually exclusive but compose along orthogonal dimensions.
To et al. '25
Hai-Thien To, Tien-Cuong Bui & Van-Duc Le · arXiv:2412.08179 (v3)
Uses RAG to generate fine-tuning datasets, not for inference. GPT-3.5-turbo generates 10 Q&A pairs per chunked earnings-report context across 6 instruction types. Llama-2-7b + LoRA/QLoRA on 800 samples reaches 4.6/10 correctness, vs Llama-2's 2.8 and GPT-3.5's 5.3. Closes the gap with a commercial model at a fraction of the cost.
Singh et al. '25
Singh, Ehtesham, Kumar, Khoei, Vasilakos · arXiv:2501.09136 · January 2025
First principled taxonomy of Agentic RAG along four dimensions: agent cardinality, control structure, autonomy, knowledge representation. Four design patterns (Reflection, Planning, Tool Use, Multi-agent Collaboration) and five workflow patterns (Prompt Chaining, Routing, Parallelization, Orchestrator-Workers, Evaluator-Optimizer). Finance explicitly identified as a primary domain.

Five assumptions that fail in finance

Most RAG tutorials are built on assumptions that hold for general knowledge bases and break in regulated finance. Each assumption has a corresponding research finding that exposes the gap.

Assumption 1 · Documents are distinguishable

True for Wikipedia. False for fund reports. The annual report for Q4 2023 and Q4 2024 of the same fund will share 80% of their language verbatim (boilerplate, methodology, governance disclosures), and the 20% that differs is exactly what the user is asking about. Standard HyDE retrieval generates one hypothetical document per query, embeds it, finds the nearest chunk, and confidently returns the wrong year.

Multi-HyDE (Srinivasan et al., IIT Madras, FinNLP EMNLP 2025) was built specifically to solve this. It generates N non-equivalent hypothetical documents per query, not similar variants but contextually distinct ones (e.g. one about fraud investigations, one about criminal cases, derived from the same query). Combined with BM25 and a cross-encoder reranker, it lifts retrieval recall on the financial-qa-10K subset from 68.85% to 84.84% (Multi-HyDE + BM25 + cross-encoder configuration), and reduces hallucinations by 15% on FinanceBench/ConvFinQA in human evaluation.

We hit this exact failure on a client fund's Q4 2023 vs Q4 2024 reports: the first prototype retrieved the wrong year roughly 40% of the time on questions like "what was last quarter's drawdown?" Multi-HyDE-style disambiguation is in the next-cycle retrieval upgrade.

Assumption 2 · Retrieval quality is the main variable

The most counterintuitive finding in the entire literature comes from a controlled experiment by Hornuf, Streich and Töllich (CESifo Working Paper 11862, April 2025). They tested 7 LLMs on a 12-stock portfolio allocation task across four information conditions, against a benchmark of US financial advisors recruited via incentivized survey.

The result: generic finance theory had zero effect on LLM recommendations, it was already in the pre-training data. Quantitative firm-specific metrics (beta, size, book-to-market, ESG scores, momentum) improved performance significantly. Qualitative 10-K MD&A summaries improved performance less. Numbers beat narratives.

The same paper found something more uncomfortable. Adding firm-specific information to risk-averse investor profiles increased portfolio risk. The paper offers two possible mechanisms: firm-specific data may have reduced ambiguity aversion (prompting allocation to more stocks), or it may have triggered an implicit-request / desirability effect where the LLM read the additional information as a signal to broaden the universe. The authors note these cannot be cleanly distinguished. Either way, the resulting allocations include stocks inappropriate for the stated risk profile. For MiFID-regulated systems, this is a suitability misalignment risk that has to be designed against, not just monitored.

Assumption 3 · One retrieval path is enough

FinSage (Wang et al., SimpleWay.AI / McGill / Toronto, April 2025) achieved 92.51% recall on expert-curated financial questions using four parallel retrieval paths:

  • BM25 for exact-match terminology, ISIN codes, fund names, KPI labels.
  • BGE-M3 dense embeddings for semantic similarity.
  • BGE-M3 over metadata, headings and section summaries embedded separately.
  • HyDE instruction-tuned on financial filings.

Each chunk is stored as a 5-tuple: text, metadata, dense embedding, sparse embedding, metadata embedding. Co-references are resolved at chunk level: pronouns are replaced with antecedents before embedding.
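As a sketch, a FinSage-style chunk record could be modelled like this. The field names are our own illustration; the paper specifies the five components, not this schema:

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    """Illustrative FinSage-style 5-tuple chunk storage (field names ours)."""
    text: str                    # co-reference-resolved chunk text
    metadata: dict               # headings, section path, fund/year tags
    dense: list[float]           # dense embedding (e.g. BGE-M3)
    sparse: dict[str, float]     # sparse term weights for BM25-style matching
    metadata_dense: list[float]  # separate embedding of metadata/headings

    def as_tuple(self) -> tuple:
        """The 5-tuple the four retrieval paths index against."""
        return (self.text, self.metadata, self.dense,
                self.sparse, self.metadata_dense)
```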

The counter-signal in the same paper is more revealing. GraphRAG, currently heavily marketed, scored a 42.5% pass rate against FinSage's 82.67%. LightRAG scored 13.67%. The graph-RAG hype has not survived contact with real financial filings.

Assumption 4 · The system can decide what to retrieve

In a general-purpose chatbot, dynamic LLM-driven retrieval routing is a feature. In regulated finance, it is a compliance anti-pattern. A data protection officer who has to audit your system will ask one question: which corpus kinds were accessible for which intent? If the answer is "the model decides," the answer is unacceptable.

The correct design is a hardcoded intent → corpus_kinds whitelist, fully auditable, deterministic, surveyable by a compliance officer who has never read an AI paper. This is not a simplification of dynamic routing. It is the architecture that lets you ship under DORA.

Source: converged practice under DORA / MiFID II. No single paper makes this argument; it is what every audited regulated-AI system in the EU has had to settle on.

Assumption 5 · If retrieval fails, the system errors

Vaghefi et al. (EACL 2026, University of Zurich / WMO / Swiss Finance Institute) built an agent-based RAG system for tracking Early Warning System investments in heterogeneous MDB climate-finance documents, PDFs with nested tables, scattered numeric evidence, non-standardized layouts. Their agent runs a self-healing loop: when retrieved context falls below a coverage threshold, it generates a sub-query and re-queries automatically. Their ablation shows that disabling this loop reduces both evidence F1 and total-amount accuracy, particularly on documents with fragmented tables.

Their glass-box agent reaches 87% accuracy, outperforming Gemini 2.5 Flash and OpenAI Assistants by 8–14 percentage points on the same documents. The finding that matters most: glass-box wins decisively wherever exact financial figures are involved; black-box only wins where numeric cues are buried in narrative prose. For a financial AI system, the documents with exact figures are precisely the ones that matter.

The architecture findings, end to end

Stepping back from the broken assumptions: what does the recent literature actually converge on?

On retrieval architecture

Four design choices converge across multiple independent papers:

  • Hybrid sparse+dense beats dense-only on long financial documents (Multi-HyDE ablation).
  • Reciprocal Rank Fusion with k=60 appears in Cheng et al. (IEEE '26) and EW4All as the fusion mechanism for hybrid sparse+dense retrieval. FinSage converges on the same hybrid sparse+dense direction but through multi-path retrieval plus reranking rather than RRF specifically.
  • Neural reranking is no longer optional: it is the single highest-leverage addition to a financial RAG pipeline.
  • Adaptive cutoffs (cumulative probability ≥55% or score-cliff drop >0.15) prevent context dilution from naive top-k selection.
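The RRF fusion step is simple enough to sketch directly. This is a minimal, generic implementation of the k=60 formula the papers converge on, not any one paper's code:

```python
def rrf_fuse(sparse_ranking: list[str], dense_ranking: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).

    k=60 is the conventional constant reported in the papers above."""
    scores: dict[str, float] = {}
    for ranking in (sparse_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; documents found by both paths rise.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked second by both paths will usually beat one ranked first by only a single path, which is exactly the behaviour that makes RRF robust to one retriever misfiring.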

The single most actionable number comes from the IEEE March 2026 ablation. On a hybrid SQLite FTS5 + FAISS pipeline over 500 S&P 500 10-K reports, with 1,500 queries split across 5 independent test groups on the FinDER benchmark, adding JinaReranker v2 (278M params, multilingual) over the top-30 fused candidates improved correctness from 33.5% to 49.0%, a +15.5pp gain.

Critical errors (Score=1) dropped from 35.3% to 22.5%. Consistent across all five test groups. Without RAG entirely, GPT-4-Turbo scores 9% on the same benchmark.

+15.5pp
Correctness gain from a single reranking pass (Cheng et al., ICECET '26)
92.5%
Recall on expert financial QA with 4-path retrieval (Wang et al., FinSage '25)
9%
GPT-4-Turbo accuracy on closed-book financial QA (FinDER baseline)
87%
Glass-box agent accuracy on MDB filings (UZH / WMO, EACL '26)

On corpus design

Numbers beat narratives. Generic finance theory is wasted token budget. Quantitative firm-specific metrics are the highest-value content per byte. Qualitative summaries help, but less. This is the corpus-composition finding from Hornuf et al., and it directly contradicts how most teams build financial RAG corpora: stuffing in every PDF they can find.

On agentic architecture, from practitioners

BlackRock published AlphaAgents in August 2025 (Zhao, Lyu, Jones, Garber, Pasquali, Mehta). Three GPT-4o agents, each with its own toolkit:

  • Fundamental: 10-K RAG over the yfinance API.
  • Sentiment: Bloomberg news with reflection-enhanced summarisation.
  • Valuation: OHLCV calculator with mathematical volatility / return computation.

Microsoft AutoGen GroupChat orchestrates Round Robin debate until consensus. Risk tolerance is embedded via role prompting, risk-averse vs risk-neutral.

The risk-neutral profile outperformed an equal-weight benchmark over a 4-month back-test on 15 tech stocks. The risk-averse profile underperformed: it correctly excluded high-volatility stocks, which during a tech bull market were exactly the winners. This is a concrete caution against treating prompt-based risk profiling as fail-safe.

They also observed something subtler: risk-seeking and risk-neutral prompts produced near-identical outputs. Adjacent risk profiles do not differentiate meaningfully under prompt engineering alone, a finding that should temper any team building MiFID suitability with prompts and hope.

Ang, Azimbayev and Kim (April 2026) push the same architecture much further. Their Self-Driving Portfolio deploys ~50 specialized agents in a six-stage pipeline:

  1. Macro Agent, regime classification (expansion / late-cycle / recession / recovery) with confidence scores.
  2. Asset Class Agents, in parallel, each running multiple CMA methods plus a CMA judge.
  3. Covariance Agent, historical data plus macro forecasts feed an asset-class covariance matrix.
  4. Portfolio Construction Agents, 20+ methods in parallel: equal-weight, MVO, risk parity, HRP, Black-Litterman, max diversification, min variance, max entropy.
  5. Strategy Review, Borda-count voting, randomised peer review, adversarial diversifier, CRO risk assessment.
  6. CIO Ensemble, combines proposals; produces final portfolio plus board memo with dissenting views.

A meta-agent closes the feedback loop. It compares past forecasts against realised returns and rewrites agent code and prompts to improve future performance.

On governance

The Self-Driving Portfolio paper introduces the framing that ties everything together. The Investment Policy Statement is the operational design domain for the agent fleet, a concept borrowed by analogy from SAE J3016's levels of autonomous-vehicle autonomy. L3 systems recommend within IPS parameters and require human approval. L4 systems operate autonomously within constrained domains. The IPS does for portfolio agents what a compliance profile does for retrieval agents: it encodes institutional constraints as machine-readable rules.

This reframed how we think about the compliance_profile kind in Nexqion's corpus. It is not metadata to retrieve when the user asks about compliance; it is the operational design domain that constrains the entire pipeline. The kinds whitelist is one mechanism that enforces it; the audit log is another; tenant scope is a third. The IPS-as-ODD framing makes it considerably easier to defend that architecture to a regulator.

Srinivasan et al. '25
Srinivasan et al. · IIT Madras · EMNLP 2025
Multi-HyDE generates multiple non-equivalent hypothetical documents per query, solves the year-disambiguation problem in long-document financial corpora. Hybrid BM25+dense + cross-encoder reranker. +11.2% accuracy over baseline; recall 84.84% vs 68.85% on financial-qa-10K (cross-encoder configuration); -15% hallucinations on FinanceBench/ConvFinQA in human eval.
Hornuf et al. '25
Hornuf, Streich & Töllich · CESifo Working Paper 11862 · April 2025
Pre-registered controlled experiment: 7 LLMs, 12 stocks, 4 information conditions vs human financial advisors. Quantitative firm-specific data improves recommendations significantly more than qualitative 10-K summaries. Generic finance theory has zero effect, already in pre-training data. Risk-averse profiles see increased portfolio risk when firm-specific info is added, a MiFID suitability concern.
Wang et al. '25
Wang et al. · SimpleWay.AI / McGill / Toronto · arXiv:2504.14493
Four-path retrieval (BM25 + BGE-M3 + Metadata + HyDE) with DPO-tuned reranker. 92.51% recall on expert questions; +24.06% on FinanceBench. GraphRAG: 42.5% pass rate. LightRAG: 13.67%. Production-deployed serving 1,200+ users.
Cheng et al. '26
Cheng, Lai, Liu, Cheng & Qi · ICECET 2026 · arXiv:2603.16877 · FinDER benchmark, 1,500 queries
Adding JinaReranker v2 to a hybrid FTS + semantic pipeline improved correctness from 33.5% to 49.0%, +15.5pp. Critical errors dropped 12.8pp. Consistent across 5 independent test groups. The most actionable number in the literature for production RAG.
Vaghefi et al. '26
Vaghefi, Hachcham, Grasso et al. · UZH / WMO / SFI · EACL 2026
Glass-box agent on heterogeneous MDB documents: hybrid dense+BM25F, context-augmented chunking, self-healing retrieval loop, evidence grounding. 87% accuracy, outperforms Gemini 2.5 Flash and OpenAI Assistants by 8–14pp. Key finding: glass-box wins on documents with exact figures; black-box only wins where numbers are buried in narrative prose.
Zhao et al. '25
Zhao, Lyu, Jones et al. · BlackRock · arXiv:2508.11152
Three GPT-4o agents (Fundamental + Sentiment + Valuation) via AutoGen Round Robin debate. Risk-neutral outperforms benchmark; risk-averse underperforms in tech bull market, a concrete caution about prompt-based risk profiling. Risk-seeking and risk-neutral prompts produce near-identical outputs.
Ang et al. '26
The Self-Driving Portfolio: Agentic Architecture for Institutional Asset Management
Ang, Azimbayev & Kim · April 1, 2026
~50 specialized agents, Borda-count voting across 20+ portfolio construction methods, meta-agent rewrites code/prompts after comparing forecasts to realized returns. The Investment Policy Statement is the operational design domain, L3/L4 autonomy framing borrowed from SAE J3016. The intellectual frame for governed agentic finance.

What we built at Nexqion

Here is what that converged stack looks like when you actually have to ship it.

The Nexqion product is the Alpha Quant Agent, a planning and analysis system that reasons over a client's complete investment context: fund structure, performance history, methodology documentation, compliance constraints, past report edits, communication style. The goal is to generate the kind of analysis that normally requires a senior analyst spending half a day pulling sources together.

Building it required answering, stage by stage, how a glass-box RAG pipeline should be designed under regulatory pressure. The pattern that emerges across the recent literature is a six-stage flow with three compliance gates, universal in skeleton, varied in instantiation. Different institutions will tailor the gates to their own regulatory regime; DORA, MiFID, GDPR and SEC each push different design choices on top of the same backbone.

Reference Architecture, Glass-Box RAG for Regulated Finance
A real glass-box RAG. Four structured source categories (workflow outputs, the data catalog with explicit data profiles, the reference library, and compliance config) feed the indexing pipeline (left, dashed). Queries flow through an intent classifier and authorisation gate before retrieval (right, solid). Hybrid sparse+dense retrieval (BM25 ⊕ vectors via RRF) returns top-N candidates; a cross-encoder reranker selects top-K (the +15.5pp single biggest correctness gain in the literature). The kind-aware context assembler then hands the LLM a deterministic, budgeted prompt. Every retrieval event is captured by the audit log. Compliance gates (in code, not policy) sit at the authorisation step, the tenant-scoped SQL filter, and the audit-log capture.

How each stage should be designed

The recommendations below describe what each stage should do and why. They are the converged pattern across the recent literature, not specific to any one institution.

Stage 01 · Intent classification. Map the user query to one of a finite, code-defined enumeration of intents; five to ten is enough for most institutions. Avoid free-form LLM routing: a compliance officer must be able to read the intent taxonomy in a single file and understand what it means without prior AI knowledge. Ship golden examples per intent so the classifier is testable.

Stage 02 · Authorization gate. A hardcoded intent → permitted corpus kinds mapping. Defined in code, reviewable in one file, never decided at runtime by an LLM. This is what makes "which corpus kinds were accessible to this user on date X" a one-line answer instead of an unanswerable question. The default for an unrecognised intent should be the most-restrictive kind set, not the most-permissive.

Stage 03 · Query embedding. Normalise vectors to unit length so dot product equals cosine similarity without renormalising downstream. Provider-abstract the embedding call so per-tenant residency overrides (DACH strict, US default, etc.) become an env-var change, not a refactor. Fail-soft on provider errors: degrade to no-RAG-hints rather than returning a hard error.

Stage 04 · Scoped retrieval. Tenant scope is enforced at the storage layer (SQL WHERE clause, row-level security, or equivalent), never the application layer. Application-layer scoping fails the moment a developer writes one query that forgets the filter. The kind filter from stage 02 is applied alongside scope. Hybrid sparse+dense retrieval and a neural reranker (per IEEE 2026, +15.5pp) are the highest-leverage upgrades once basic retrieval works.

Stage 05 · Context assembly. Group hits by kind, not by raw score. Apply a per-kind char or token cap so one chatty hit cannot crowd everything else out. Render kinds in a priority order encoded by the institution's evidence preferences (Hornuf et al.: quantitative beats qualitative). Under a global cap, evict the lowest-priority sections wholesale rather than truncating mid-section; this preserves prompt-cache determinism (same hits → same string).

Stage 06 · Grounded generation + audit. Log every retrieval event with: query hash (not raw text, for GDPR), tenancy keys (split across user, org and scope), kinds filter, top-k record IDs, model, latency, timestamp. Retain for the regulatory window the institution operates under. The audit log itself must be erasable on a tenant-erasure request; the GDPR right to erasure cascades through it, not just the source documents.
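The stage-06 record can be sketched as a plain dict. Field names here are illustrative, not Nexqion's actual log schema:

```python
import hashlib
import time

def audit_record(query: str, user_id: str, org_id: str, scope_key: str,
                 kinds: list[str], hit_ids: list[str],
                 model: str, latency_ms: float) -> dict:
    """One retrieval event, shaped for the audit log.

    The raw query is never stored, only its hash, so the log satisfies
    data minimisation while still letting an auditor match events."""
    return {
        "query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest(),
        "user_id": user_id,        # tenancy keys split across
        "org_id": org_id,          # user, org and scope
        "scope_key": scope_key,
        "kinds": kinds,            # the stage-02 whitelist actually applied
        "hit_ids": hit_ids,        # top-k record IDs returned
        "model": model,
        "latency_ms": latency_ms,
        "ts": time.time(),
    }
```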

An instantiation, how Nexqion implements it

The recommendations above are the framework. Below is one specific instantiation: how the Alpha Quant Agent answers each design question today, what it ships, and what closes in the next cycle.

System Architecture, Client-Aware Planner Context

embeddings.py
Provider abstraction for text-embedding-3-small (1536-dim) with sentence-transformers all-MiniLM-L6-v2 local fallback (384-dim) for DACH strict-residency tenants. All vectors L2-normalised before storage. Fail-soft on provider error.
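The normalisation invariant is small but load-bearing: once every stored vector is unit length, an inner product is a cosine similarity. A pure-Python sketch of the property:

```python
import math

def l2_normalise(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors pass through unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# For unit vectors, dot product == cosine similarity, so downstream
# ranking never has to renormalise.
```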
rag_store.py
pgvector (Postgres) backend with lazy IVFFlat index, created once a scope holds ≥1,000 vectors, lists scaled to data size. SQLite-numpy fallback for local dev. Every row carries a scope_key enforced at the SQL layer; the Postgres role used by the API has no permission to query without a scope predicate.
client_knowledge_base.py
Corpus manager defining ten document kinds with priority-based context assembly under a 4,096-char budget. Top-5 retrieval. Kind-aware intent whitelist lands in the next cycle.
org_profile
compliance_profile
methodology
fund_profile
style_sample
run
report_edit_pattern
dataset_summary
schedule
reference_report

Three of the design questions above are answered today; two close in the next cycle.

Stage 04, Tenant isolation at the SQL layer. The scope_key is a column on every embeddings row, every retrieval query filters at the WHERE clause, and the Postgres role used by the API has no permission to query without a scope predicate. Application-layer scoping was rejected as a design: it fails the moment a developer writes one query that forgets the filter.
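The shape of that enforcement can be sketched as a query builder through which a scope-free query simply cannot be expressed. This is an illustrative shape, not Nexqion's actual SQL; `embedding <=> %s` is pgvector's cosine-distance operator:

```python
def scoped_search_sql(kinds: list[str]) -> str:
    """Build the retrieval query: scope_key is a mandatory predicate and
    every value is a bound parameter.

    Illustrative sketch only, not Nexqion's actual SQL."""
    if not kinds:
        raise ValueError("an empty kind whitelist must not widen the search")
    placeholders = ", ".join(["%s"] * len(kinds))
    return (
        "SELECT id, kind, text, embedding <=> %s AS distance "
        "FROM embeddings "
        f"WHERE scope_key = %s AND kind IN ({placeholders}) "
        "ORDER BY distance LIMIT %s"
    )
```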

Stage 03, Fail-soft on embedding errors. If the embedding call fails for any reason, the planning step runs without RAG hints rather than returning a hard error. The user still gets an answer, just one flagged as having reduced contextual grounding. This mirrors the EW4All self-healing posture for the partial-failure case.
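The fail-soft posture reduces to a small wrapper. A sketch, under the assumption that `embed` and `search` stand in for the provider call and the vector search:

```python
def rag_hints_or_empty(embed, search, query: str) -> list[str]:
    """Fail-soft retrieval: any provider or store failure degrades to
    'no RAG hints' instead of a hard error. The caller is expected to
    flag the resulting answer as having reduced contextual grounding.

    Illustrative sketch; `embed` and `search` are stand-ins."""
    try:
        return search(embed(query))
    except Exception:
        return []  # planner runs without hints; the user still gets an answer
```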

Stage 05, Kind-grouped context assembly with priority eviction. This is where the most interesting piece of running code lives. Hits come back from the vector search ungrouped. The prompt-block formatter groups them by kind, applies a per-kind character cap, renders the kinds in priority order, then enforces a global 4 KB total cap by evicting the lowest-priority sections wholesale rather than truncating mid-section.

One data structure governs all of it:

# client_knowledge_base.py, what governs the 4 KB context budget
_KIND_DISPLAY = (
    ("org_profile",         "Firm profile",      800),  # highest priority
    ("compliance_profile",  "Compliance",        800),
    ("methodology",         "Methodology",       800),
    ("fund_profile",        "Funds",            1000),
    ("style_sample",        "House style",      1200),
    ("run",                 "Past runs",         700),
    ("report_edit_pattern", "Edit patterns",     500),
    ("dataset_summary",     "Dataset notes",     500),
    ("schedule",            "Schedules",         400),
    ("reference_report",    "Reference packs",   600),  # lowest
)
_PROMPT_BLOCK_TOTAL_CAP = 4096
Each tuple defines a corpus kind, its display label, and its per-kind char cap. Order is priority, sections render top-down and are evicted bottom-up under the global cap. A change to any of these is a one-line diff reviewable by a non-engineer.

Why wholesale eviction of low-priority sections rather than partial truncation? Two reasons. First, prompt-cache determinism: the same retrieval-hit set must produce the same context string every turn so OpenAI's prompt cache keeps hitting; partial truncation produces near-identical-but-not-identical strings on every turn and destroys the cache. Second, losing a partial methodology section confuses the planner more than losing a low-signal schedule section entirely.

Hornuf et al.'s finding that quantitative firm-specific data outperforms qualitative narratives is encoded directly in the priority order: fund_profile and compliance_profile sit near the top; style_sample and reference_report sit at the bottom. The corpus is not a flat bag of chunks. It is a budgeted, priority-ranked, deterministic ledger.
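The assembly loop that the _KIND_DISPLAY table drives can be sketched as follows (simplified; the real formatter's headers and separators differ):

```python
def assemble_context(hits_by_kind: dict[str, list[str]],
                     kind_display: tuple, total_cap: int) -> str:
    """Group hits by kind, cap each section, render in priority order,
    then evict whole sections bottom-up until the global cap holds.

    Deterministic: the same hit set always yields the same string,
    which keeps the prompt cache hitting."""
    sections = []
    for kind, label, cap in kind_display:
        hits = hits_by_kind.get(kind)
        if not hits:
            continue
        body = "\n".join(hits)[:cap]  # per-kind char cap
        sections.append(f"## {label}\n{body}")
    # Wholesale eviction of the lowest-priority sections, never mid-section.
    while sections and len("\n\n".join(sections)) > total_cap:
        sections.pop()
    return "\n\n".join(sections)
```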

Stages 01–02, Intent classification and authorization gate. These are the next-cycle close. The data structure is settled, a deterministic intent → kinds mapping defined in one file, and the storage layer already accepts a kind filter. What remains is the orchestrator passing the right list into the retrieval call. Cycle 8.

# client_knowledge_base.py, the next-cycle close (cycle 8)
KINDS_BY_INTENT: dict[ChatIntent, list[CorpusKind]] = {
    "performance":   ["run", "fund_profile", "dataset_summary"],
    "compliance":    ["compliance_profile", "org_profile", "methodology"],
    "methodology":   ["methodology", "fund_profile", "reference_report"],
    "narrative":     ["style_sample", "report_edit_pattern", "reference_report"],
    "attribution":   ["run", "methodology", "fund_profile"],
}
The whitelist format that lands in cycle 8. The store API already supports kind filtering at the SQL layer (WHERE kind IN (…)); the missing piece is the orchestrator passing kind=KINDS_BY_INTENT[intent] into store.search(). When this lands, the figure above is accurate end-to-end at Nexqion.

The roadmap, grounded in numbers

Four changes follow directly from the findings above:

  • 1
    Switch IVFFlat → HNSW once tenant scale crosses the index threshold. Today, IVFFlat is created lazily once a tenant has ≥1,000 vectors, with lists scaled to min(sqrt(count), 100); below that we sequential-scan inside the scope, which is faster anyway. That is correct for current scale but suboptimal once tenants reach 10K+ vectors. HNSW is more accurate at every size below 100K rows with comparable query latency. pgvector supports HNSW since v0.5.0.
  • 2
    Add neural reranking, JinaReranker v2. The IEEE March 2026 ablation makes the case unambiguous: +15.5pp correctness from a single cross-encoder pass over top-30 candidates. 278M parameters, multilingual (108 languages, relevant for DACH clients). Adaptive cutoffs (cumulative probability ≥55% or score cliff >0.15) prevent context dilution.
  • 3
    Implement hybrid BM25 + vector with RRF. Pure cosine similarity loses on exact-match terminology, fund names, ISIN codes, regulatory references, KPI labels. The RRF formula 1/(60 + r_FTS) + 1/(60 + r_semantic) is implementable directly against the existing pgvector + Postgres FTS stack.
  • 4
    Build the evaluation harness. Without a way to measure Recall@5, MRR, and faithfulness against held-out user sessions, every change above is invisible. This is the prerequisite for everything else. The metrics framework from FinSage (Faithful evaluation, mean score, pass rate, response time) is the template.
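Roadmap item 2's adaptive cutoff is mechanical enough to sketch now. The thresholds (≥55% cumulative mass, >0.15 score cliff) are the ones cited above; the implementation is our own, assuming reranker scores are non-negative and sorted descending:

```python
def adaptive_cutoff(scores: list[float], mass: float = 0.55,
                    cliff: float = 0.15) -> int:
    """Return how many reranked candidates to keep.

    Stop at a score cliff (drop > `cliff` between adjacent candidates),
    or once the kept candidates carry >= `mass` of the normalised score
    mass. Assumes scores are sorted descending."""
    total = sum(scores) or 1.0
    cum = 0.0
    for i, score in enumerate(scores):
        if i > 0 and scores[i - 1] - score > cliff:
            return i  # cliff: everything below it is context dilution
        cum += score / total
        if cum >= mass:
            return i + 1  # enough probability mass already
    return len(scores)
```

On a flat score distribution this keeps just over half the candidates; on a cliffed one it keeps only the head, which is the anti-dilution behaviour naive top-k cannot provide.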

Where this is going

The Self-Driving Portfolio paper's framing is the trajectory. The Investment Policy Statement is the operational design domain for an agent fleet. The compliance profile in Nexqion is the same construct one layer down: it constrains the retrieval and analysis agents the way an IPS constrains a portfolio construction agent fleet.

Glass-box modular RAG with intent whitelists, scope-keyed tenancy, evidence grounding, audit-trail retrieval logging, and fail-soft degradation is not a constraint on what we can build. It is what makes the system trustworthy enough to put in front of a fund manager. The technical infrastructure and the compliance posture are the same thing.

The architecture is governed. The roadmap is specific. What's deliberately missing from this issue: the runtime numbers. We are not publishing tenant count, vector count, or p95 retrieval latency until the eval harness in Issue 05 makes those measurements honest. A "From Genesis to Production" article without measurements would be marketing; we would rather earn the title in the next issue than claim it now.