What Is RAG for Data Engineering? Retrieval, Context & Agent Accuracy
Retrieval-augmented generation (RAG) augments an LLM prompt with retrieved facts — in data engineering, usually schema fragments, metric definitions, and historical SQL — before the model generates an answer or query. For text-to-SQL and data agents, RAG is how context scales beyond a single prompt window. This glossary entry defines RAG, how it differs from dumping your whole warehouse into chat, and how it connects to a persistent context engine.
Disclosure: Datus is a data engineering agent platform. This article explains RAG as a general retrieval architecture, referencing Datus alongside other tools and approaches in the category. See the end for more detail.
TL;DR
- RAG = retrieve relevant documents → inject into prompt → generate SQL or explanations.
- In data engineering, retrieval targets schemas, semantic models, reference SQL, and runbooks — not generic web pages.
- Bad retrieval causes confident wrong SQL — worse than admitting "I don't know."
- RAG in a session is not enough; production systems need durable indexes that update with feedback.
- Production data RAG uses structured context navigation and vector search over institutional SQL to inject the right schema, metrics, and reference queries into each generation — not a one-size-fits-all retrieval dump.
1. RAG: a working definition
RAG was introduced for knowledge-heavy QA: retrieve passages from a corpus, add them to the model context, then generate an answer grounded in those passages.
For data engineering, adapt the definition:
RAG for data engineering is retrieving structured and semi-structured data context — metadata, metrics, validated SQL, policies — and injecting it into an LLM workflow so generation is anchored in your warehouse, not the model's training prior.
Generic chatbots hallucinate table names. Data RAG grounds generation in objects that exist and queries that already succeeded. To understand the difference concretely, consider what happens without data RAG. An analyst asks "monthly active users by plan tier for Q1." A general-purpose LLM, unaided by retrieval, generates SQL referencing tables it invented — users, subscriptions, plans — using column names it inferred from common schema patterns. The SQL looks plausible but it was written against imaginary tables. A data RAG system with access to the actual warehouse retrieves dim_user, fact_subscription, dim_plan, and fiscal_calendar — tables that actually exist — along with the correct join keys and grain definitions. The same model, with retrieval, produces SQL that runs correctly against real tables. Without retrieval, it produces SQL that looks correct but references nothing.
2. RAG vs fine-tuning vs long context
| Approach | What it changes | Fit for data engineering |
|---|---|---|
| RAG | What the model sees this turn | Best for volatile schema and fresh SQL history |
| Fine-tuning | Model weights | Expensive; stale when schema changes weekly |
| Long context | Bigger prompt window | Still noisy if you paste 4,000 tables |
Schema drift favors retrieval over baking warehouse details into weights. Fine-tuning can complement RAG for style and dialect, but it does not replace live metadata. The practical test is simple: your data engineering team deploys a new fact_subscription_v2 table on Tuesday morning. A RAG system, re-indexed overnight, retrieves the new table for subscription questions on Wednesday. A fine-tuned model continues generating queries against fact_subscription_v1 because its weights were frozen two months ago — and no one will notice the error until a quarterly report shows subscription counts that differ from the billing system by 8%. RAG propagates schema changes as fast as the index updates. Fine-tuning propagates them as fast as the retraining pipeline runs — which, in most organizations, is measured in quarters, not days.
3. What to retrieve for text-to-SQL and agents
High-value retrieval sources:
| Source | Retrieved content | Helps with |
|---|---|---|
| Table/column metadata | Names, types, keys | Schema linking |
| Semantic layer exports | Metric YAML, dimensions | Business terms → logic |
| Reference SQL | Past correct queries | Join paths, filters, grain |
| Issue / feedback notes | Corrections, deprecations | Institutional memory |
| Subagent scope manifest | Allowed tables/metrics | Safety and focus |
Low-value retrieval: entire data dictionaries with no ranking, outdated Confluence exports, or random Slack messages without validation. The difference between high-value and low-value retrieval is not the content type — it's whether the content carries executable signal. A data dictionary that says fact_orders.amount_usd: Order amount in USD is low-value because it tells the model what the column is called but not how to use it. A reference SQL query that says SUM(fact_orders.amount_usd) WHERE status != 'refunded' is high-value because it shows the model exactly how this column was used to produce a correct answer last quarter. The distinction between descriptive metadata and executable context is the single most important design decision in building a data RAG pipeline.
4. The data RAG pipeline
A typical production loop:
- Index — chunk schema docs, metric specs, SQL files; embed or keyword-index
- Query — user question → retrieval query (often hybrid: semantic + keyword)
- Rank & filter — top-k chunks; apply ACL and Subagent scope
- Compose prompt — system rules + retrieved context + user question
- Generate — SQL or multi-step plan
- Validate — execute, compare to expectations, capture feedback
- Re-index — promoted corrections become tomorrow's retrieval
Step 7 separates agent RAG from demo RAG. Without feedback into the index, retrieval repeats the same mistakes.
Let's trace this pipeline with a real scenario. An analyst asks "month-over-month new customer ARPU by acquisition channel." The indexing step (1) has already chunked the Finance domain's schema, the Marketing domain's attribution tables, and 40 validated reference SQL queries tagged with "customer" and "ARPU." The query step (2) runs a hybrid search: keyword search matches "ARPU" against metadata and SQL text, while vector search finds semantically similar questions like "average revenue per user by signup source." The ranking step (3) surfaces the top 15 chunks — including a reference SQL query from last month that computed new customer ARPU using fact_subscription.first_payment / COUNT(DISTINCT dim_user.user_id) — and filters out chunks from the Sales domain that use a different ARPU definition. The compose step (4) assembles a prompt with the Finance domain rules, the retrieved reference SQL as few-shot examples, and the analyst's question. The generate step (5) produces SQL that uses the same join path and filter logic as the reference example. The validate step (6) executes the query and compares the result count against a cached approximate answer from the BI dashboard. The re-index step (7) stores this validated query as a new retrieval candidate for future ARPU questions — so next month's ARPU question retrieves an even richer set of examples.
Contextual data engineering describes that evolution loop as the operating system for agents.
5. Failure modes specific to data RAG
Retrieval misses the right table
The embedding model associates "channel" with marketing tables when finance meant sales territory. Mitigation: subject-tree organization, aliases, and reference SQL priors. The root cause is semantic proximity in embedding space — "channel" in a marketing context is near dim_marketing_channel, but in a sales context it maps to dim_sales_territory. A flat vector index has no mechanism to distinguish context. A hierarchical subject tree does: the query is first routed to the Finance domain's index, where "channel" embeddings are dominated by sales territory references, not marketing channel references.
Retrieval includes contradictory snippets
Two "official" definitions of revenue from different eras. Mitigation: versioning, certification flags, and feedback-weighted ranking. The 2024 definition of revenue (pre-refund, pre-chargeback) and the 2025 definition (post-refund, post-chargeback) both appear in the index. Without version metadata, the model sees both and must guess — and it often guesses wrong because the older definition appears in more documents. Feedback-weighted ranking solves this by deprioritizing snippets from queries that were corrected, so the 2024 definition gradually sinks in retrieval rank as corrections accumulate.
Over-retrieval blows the context budget
Too many chunks dilute attention. Mitigation: scoped Subagents and hierarchical navigation instead of flat dumps. A global retrieval over 4,000 tables returns 25 candidate chunks, of which 15 are irrelevant to the analyst's domain. The model's attention is split across 15 noise chunks before it even begins generating SQL. A scoped Subagent for the Finance domain retrieves from 50 tables and 200 reference SQL queries — returning 8 chunks, all relevant, all within the model's effective attention window.
Stale index
New column not indexed; model never sees it. Mitigation: catalog sync jobs and incremental re-embedding after pipeline deploys. The failure is silent: the new column exists in the warehouse, but queries against it fail because the model cannot retrieve it. The monitoring signal is a spike in "column not found" execution errors after a known schema deployment — if the deployment happened Monday and errors spike Tuesday, the index sync is broken.
No execution grounding
Model cites retrieved SQL but does not run it. Mitigation: agent tools that execute, inspect errors, and retry — as in build your first data engineering agent. The retrieval provides a plausible join path; execution validates whether that path actually works on today's data. Without execution, the RAG pipeline is a citation engine, not a verification engine — it tells the model what someone did last month without checking whether that approach still applies.
6. RAG and the semantic layer
A semantic layer is a high-quality retrieval source for certified metrics. It is incomplete for ad-hoc validated SQL and tribal rules that never became YAML.
| Context source | Strength | Gap |
|---|---|---|
| Semantic layer | Governed KPIs | Slow PR cycle for edge cases |
| Data catalog | Discovery metadata | Not executable logic |
| Reference SQL RAG | Production-proven joins | Needs curation and ACL |
| Feedback store | Corrections at speed | Needs governance to promote |
The Context Engine pattern treats retrieval as a unified surface over catalog metadata, semantic layer exports, and reference SQL — not a separate vector database beside the chat window. Physical and subject trees structure what gets retrieved; vector search finds similar historical SQL when wording differs across domains.
7. Vector search vs keyword search in data RAG
Keyword search excels for exact table names (fact_orders_v2). Vector search excels when users say "sales by geography" but stored SQL says region_name. Production systems often use hybrid retrieval plus metadata filters (database, domain, owner).
The hybrid approach addresses a failure mode that pure vector search and pure keyword search each suffer in isolation. Vector search retrieves "monthly invoice total" → SUM(fact_invoices.amount) because "invoice" and "bill" are semantically close. But it misses fact_billing.cycle_total because "billing" and "invoice" have moderate semantic similarity and cycle_total has no retrieval relationship to "monthly." Keyword search catches fact_billing.cycle_total because "billing" is in the table name, but it misses fact_invoices.amount because "monthly invoice total" has no keyword overlap with amount. A hybrid system retrieves both candidates and ranks them together, surfacing the correct table whether the naming convention follows the analyst's vocabulary or diverges from it.
Common vector backends used in data RAG pipelines include LanceDB, pgvector, and Milvus — chosen based on whether the deployment prioritizes embedded (LanceDB), PostgreSQL-native (pgvector), or distributed scale (Milvus) retrieval. Navigable catalog and subject interfaces complement vector search for engineers who know which domain and topic to scope their query to before running retrieval.
8. RAG inside MCP and multi-agent setups
When a general-purpose agent calls a data tool via MCP, the data side still needs RAG — the host model does not magically know your warehouse. The data tool must expose the same retrieval index to the MCP client that it uses for its own generations.
See MCP and data engineering for the integration pattern.
9. A RAG failure timeline: how stale retrieval produces wrong SQL
9:03 AM. A product analyst asks "new user retention by cohort for Q1." The company's text-to-SQL system runs a hybrid retrieval query against the Product domain's index. The index was last rebuilt three weeks ago.
9:03:05. Retrieval returns eight chunks. The top-ranked chunk is a reference SQL query from December 2025 that computed retention using the dim_user.first_payment_date column as the cohort anchor. The model generates SQL following that pattern. The SQL executes and returns retention numbers — 68% day-7 retention for the January cohort, aligning with expectations.
9:04. The analyst checks the numbers against a Looker dashboard and finds a discrepancy: the dashboard shows 71% for the same cohort. The difference is small enough to attribute to data latency or rounding, so the analyst moves on.
What went wrong. On January 15 — after the RAG index was last rebuilt — the data engineering team deployed a schema change: they added a cohort_date column to dim_user that correctly handles users who upgraded from free to paid mid-cohort, reassigning their cohort start to the upgrade date. The old first_payment_date column still exists but now undercounts retention for the free-to-paid segment by 3–5 percentage points. The RAG index, three weeks stale, never indexed the new cohort_date column or the reference SQL queries that use it. The model generated SQL using the only retrieval result it saw — the December query — and produced a plausible but wrong answer.
The fix. After a quarterly audit catches the 3% discrepancy, the team configures an index sync job that triggers on schema deployments: when dbt Cloud reports a successful model build, an event fires that re-indexes the affected tables, columns, and reference SQL queries within the same domain. The next "retention by cohort" query retrieves the cohort_date column and the updated reference SQL using it. The accuracy gap closes — not because the model improved, but because the context caught up with the schema.
This pattern — stale index producing confident wrong answers — is the most common RAG failure mode in production data engineering. It is also the most invisible because the symptoms (small discrepancies, plausible results) rarely trigger alarms until a quarterly reconciliation catches them.
Conclusion
RAG for data engineering is not "connect a PDF to ChatGPT." It is a governed retrieval layer over schema, semantics, and validated SQL that makes text-to-SQL and agents accurate at scale. The differentiator is not retrieval itself — everyone has embeddings in 2026 — but whether retrieved context updates with production feedback and respects domain scope. That is the move from demo RAG to contextual data engineering.
Frequently asked questions
How do I know if my RAG pipeline is the problem?
Run the same question five times on five different days and compare the generated SQL. If the model produces different table choices across runs — sometimes fact_orders, sometimes fact_payments for the same "revenue" question — your retrieval is returning inconsistent or noisy chunks. A healthy RAG pipeline should produce consistent schema choices for the same business question, because the retrieval should consistently return the same high-quality reference SQL and metric definitions. Inconsistent retrieval is almost always a ranking or scoping problem, not an embedding problem.
How is data RAG different from enterprise search?
Enterprise search helps humans find documents — an analyst types "customer churn definition" and gets back a Confluence page. Data RAG feeds machines executable context — schemas, metrics, SQL — with ranking tuned for generation quality and ACLs tuned for automation. The ranking considerations are fundamentally different: enterprise search optimizes for human relevance (did the analyst find the right doc?), while data RAG optimizes for generation accuracy (did the model produce correct SQL after seeing these chunks?). A chunk that is highly relevant to a human reader — like a detailed explanation of churn methodology — may be useless for generation if it contains no executable SQL or column mappings.
Does RAG replace a semantic layer?
No. Semantic layers define governed metrics; RAG selects which definitions and SQL examples apply to this question. They are complementary layers with different failure modes: a semantic layer fails when the question does not map to a certified metric (the ad-hoc 80%), while RAG fails when retrieval returns the wrong context (garbage-in, garbage-out). The combination — RAG that preferentially retrieves semantic layer artifacts when they exist, and falls back to reference SQL when they do not — covers both paths.
Why do RAG demos fail in production?
Usually stale or noisy indexes, no scope, and no feedback loop — not because retrieval is the wrong idea. The demo RAG pipeline is a static index built once from a clean schema. The production RAG pipeline inherits three problems the demo never faced: (1) the schema changed last week and the index was not rebuilt, (2) the retrieval returns chunks from three different domains with conflicting definitions, and (3) no feedback mechanism exists to suppress chunks that have caused errors in the past. Fixing these requires operational infrastructure — index sync jobs, domain scoping, and feedback-weighted ranking — that the demo never needed because the demo ran exactly once.
Related articles
- What is text-to-SQL? — generation task RAG supports
- What is schema linking? — retrieval quality determines link quality
- How a context engine improves agent accuracy — RAG as part of persistent context
Disclosure: Datus is a data engineering agent platform. This glossary entry explains RAG as a general retrieval architecture and how Datus implements it — through Context Engine structures, hybrid retrieval, and feedback-driven index evolution.