Contextual Data Engineering: Why Every Data Engineering Agent Needs Evolvable Context
The largest cost in data engineering is not writing SQL. It is re-learning the same tables, the same metrics, the same business rules, over and over, every time a new question arrives. A data engineer joins a team and spends weeks absorbing context that exists only in people's heads. An analyst files a ticket, waits three days, and receives a query that the engineer wrote last quarter for someone else and cannot find. A new hire asks "what does 'revenue' mean here?" and gets three different answers from three different people.
This is not a tooling problem. It is a context problem. And it is the problem that contextual data engineering is designed to solve.
If you have read what a data engineering agent is, you already know the rough shape of what follows. That article argued that the deepest split between today's data engineering agents is how they handle context. This article gives the idea its own name and its own definition.
TL;DR
- Contextual data engineering treats data context—schemas, metrics, validated SQL, business rules, and feedback—as a first-class, evolvable asset, not a one-time modeling exercise.
- It sits beneath a data engineering agent the way an operating system sits beneath an application: the agent runs on top of it, and every run makes the context stronger.
- Traditional data engineering models context once and consumes it statically. Contextual data engineering runs a feedback loop: agent operates → users correct or confirm → context evolves → next run is more accurate.
- Context operates in three layers: schema metadata (what exists), business semantics (what it means), and institutional memory (what we learned the hard way). Most tools stop at layer one.
1. The problem: context is the bottleneck, not compute
Consider a realistic scenario. A data engineer at a mid-stage company receives this Slack message:
"Hey, can you pull weekly net revenue by region for Q2? Exclude test accounts. And use the same definition the finance team uses for their board deck."
This is a five-minute task if the engineer already knows:
- Which table holds revenue transactions (
fact_ordersorfact_revenue_v2? One was deprecated in March.) - How "net revenue" is defined by finance (gross minus refunds minus chargebacks, in USD, at the FX rate locked on the close date—not the transaction date.)
- Which column identifies test accounts (
account_type = 'test'oris_test_account = true? Depends on the source system.) - Which join path is authoritative for region mapping (three possible foreign keys, one is correct, the other two produce silently wrong duplicates.)
Without that context, the same task takes hours: reading table schemas, tracing join keys, digging through old SQL, pinging three people on Slack, and writing test queries to verify assumptions. The engineer spends four hours to produce one query. And next week, someone asks a slightly different question about the same table, and the cycle repeats.
This is the central observation of contextual data engineering: the bottleneck is not query execution speed, and it is not even LLM intelligence. It is the availability of reliable, reusable, evolvable context at the moment a question is asked.
2. A working definition
Contextual data engineering is the practice of treating data context as a first-class engineering artifact—one that is continuously generated, validated, updated, and consumed by agents and humans—rather than as a one-time documentation exercise that goes stale the week after it is written.
Four properties distinguish it from traditional approaches:
- Context is alive. It is not a static document or a YAML file committed once. It is generated from real usage (historical SQL, user feedback, schema inspection), updated continuously, and versioned so teams can trace how definitions evolved.
- Context is multi-layered. It covers schema metadata (physical tables and columns), business semantics (metrics, dimensions, join rules), and institutional memory (validated SQL, deprecation notes, feedback history). Most tools stop at layer one.
- Context is consumed by agents, not only by humans. A data catalog helps a person find a table. A context engine helps an agent generate a correct query—grounded in schema, metrics, reference SQL, and business rules—without a human translating every column name.
- Context compounds. Every successful agent run becomes training data for the next one. Every correction flows back into the system. Over weeks, the agent stops being a clever stranger and starts being a junior engineer who has actually worked on your codebase.
In short: contextual data engineering is what turns a data engineering agent from a chat window into a system that accumulates institutional knowledge.
3. Three layers of context, and why most tools stop at one
The most useful way to think about context is as a stack of three layers. Each layer depends on the one below it, and most tools in the modern data stack address only the first.
Layer 1: Schema metadata — "what tables and columns exist"
This is the foundation. A data catalog crawls your warehouse and builds an inventory: databases, schemas, tables, columns, types, primary keys, foreign keys. Tools like DataHub, Atlan, and Select Star do this well. Platform-embedded agents like Google's BigQuery Data Engineering Agent lean heavily on this layer, using the platform's own Knowledge Catalog to ground queries.
Schema metadata tells an agent that fact_orders has a column called amount_usd of type DECIMAL(18,2). It does not tell the agent that amount_usd means net revenue after refunds, at locked FX rates, excluding test accounts—and that another column called revenue_usd in the same table means gross revenue before any adjustments.
Layer 2: Business semantics — "what the data means"
This is the layer where a semantic layer lives: metrics, dimensions, entities, join rules, and business logic. Tools like MetricFlow (dbt), Cube, and LookML operate here. A well-built semantic layer tells an agent that "net revenue" is sum(amount_usd) from fact_orders filtered by order_status != 'cancelled', joinable to dim_region via region_id, sliceable by region_name and fiscal_quarter.
Most organizations stop here and consider the job done. The semantic layer is modeled, reviewed in a PR, deployed, and consumed by BI tools. It is static, governed, and correct—at the moment it was written.
Layer 3: Institutional memory — "what we learned the hard way"
This is the layer that contextual data engineering adds. It captures information that never makes it into a formal semantic model but is essential for correct answers:
- Validated SQL that has been tested in production and proven correct—not every useful query becomes a formal metric, but every successful query is evidence of how to use the data correctly.
- Deprecation notes ("don't use
statusbefore March 2026; usestatus_v2") that live in Slack threads and engineering brains, not in any system. - Feedback history—which past agent outputs were upvoted, which were corrected, and why. This is training data for accuracy.
- Business rules that are "obvious" to the team but invisible to an agent: test accounts are filtered by
account_type != 'test'in the source system but byis_internal = falsein the warehouse; revenue for the board deck uses locked FX rates while operational reports use spot rates.
Layer 3 is where contextual data engineering earns its name. Without it, even a perfect semantic layer produces wrong answers when the question touches institutional knowledge that was never formalized. With it, the agent learns what the team knows—including the things nobody remembered to write down.
A schema tells an agent what exists. A semantic layer tells an agent what it means. Institutional memory tells an agent what to trust—and what to avoid.
4. The feedback loop: how context evolves
Traditional data engineering follows a linear model: model the data → build pipelines → write documentation → consume. Context is a snapshot. It ages from the moment it is created.
Contextual data engineering replaces the linear model with a feedback loop:
Agent operates (generates SQL, answers a question)
→ User confirms (upvote) or corrects (issue report)
→ Feedback flows back into the context store
→ Context is updated (validated SQL added, rule refined, deprecation noted)
→ Next run is more accurate
→ RepeatEach cycle makes the context marginally stronger. Over weeks of usage, the agent accumulates:
- A growing library of validated SQL patterns it can adapt instead of generating from scratch.
- A set of business rules it will never violate because past violations were corrected.
- An understanding of which tables, columns, and join paths are trusted by the team, based on actual usage rather than documentation claims.
This is not a theoretical model. In Datus's Lakehouse deployment narrative, the feedback loop is the mechanism credited with improving self-service usage and reducing repeated ad-hoc query work. Treat the reported business outcomes as case-specific evidence, not a universal promise: the agent did not get smarter because the LLM improved; it got smarter because the context it operated on improved.
5. Why this matters specifically for data engineering agents
General-purpose coding agents treat every task as a fresh problem. They are brilliant at writing one-off scripts. They are not designed to accumulate domain knowledge about a specific warehouse across months of usage.
Data engineering is different from general-purpose coding in ways that make context accumulation essential:
| General coding task | Data engineering task |
|---|---|
| Write a function to sort a list | Query weekly revenue by region |
| Correctness is objective (does it pass tests?) | Correctness is contextual (which definition of revenue? which FX rate? which test account filter?) |
| Context is the codebase (files, imports, types) | Context is invisible (business definitions, institutional rules, historical SQL patterns) |
| One correct answer | Multiple "correct" answers depending on which team is asking |
A general coding agent can write syntactically perfect SQL. It cannot know that fact_orders.amount_usd is the column the finance team trusts and fact_revenue_v2.revenue_usd is the one operations uses—unless someone has already captured that distinction in a context layer it can retrieve.
This is the practical argument for contextual data engineering: without it, every data engineering agent is a first-day hire. The agent may be brilliant, but it has no institutional knowledge. And data engineering is a discipline where institutional knowledge is the difference between a correct answer and a confidently wrong one.
6. Contextual data engineering vs. documentation, catalogs, and semantic layers
It is worth being precise about what contextual data engineering is not, because the term sits close to several established ideas.
It is not documentation. Documentation is human-authored, human-consumed, and static. It tells a person what a table means. It does not tell an agent, at query time, which of three possible join paths is the authoritative one for region mapping. And it goes stale the week after it is written. Contextual data engineering produces machine-readable, queryable, continuously updated context—documentation is a byproduct, not the product.
It is not a data catalog. A catalog inventories assets. It answers "what tables exist?" Contextual data engineering answers "how do I use this table correctly, right now, given everything the team has learned about it?" Catalogs are the foundation (Layer 1). Contextual data engineering builds on top of them.
It is not a semantic layer—but it reinforces one. A semantic layer defines governed metrics, dimensions, and joins. It is essential infrastructure. But it is static at runtime: defined in Git, deployed on a schedule, consumed by BI tools. Contextual data engineering adds a fast feedback buffer on top: validated ad-hoc SQL, deprecation notes, usage patterns, and corrections that have not yet been promoted to the formal semantic layer. The two are complementary: the semantic layer provides the governed foundation; the context layer captures the institutional knowledge that moves faster than governance cycles.
For more on this relationship, see what is a semantic layer.
7. What a contextual data engineering system looks like in practice
A concrete system that implements contextual data engineering typically includes these components:
A context engine that organizes information along two dimensions: a physical catalog tree (databases → schemas → tables, annotated with semantic models) and a logical subject tree (business domains → topics → metrics, carrying reference SQL and external knowledge). Commands like /gen_semantic_model and /gen_metrics bootstrap context from existing schemas and historical SQL; user feedback refines it over time.
A feedback mechanism that captures signals from actual usage. Every upvote strengthens the pattern. Every correction updates the rules. Every issue report flags something to fix. This is not a separate reporting dashboard—it is built into the agent's normal operation so feedback is captured at the moment it occurs, not in a retrospective meeting.
Delivery units—scoped chatbots called subagents—that package a subset of context (roughly 10 tables, 20 metrics, 30 validated SQL references) for a specific domain. A finance subagent knows finance's definition of revenue. An operations subagent knows operations' definition. Both draw from the same context engine but are constrained to the scope their users need.
Versioning and auditability so teams can trace how context evolved, who changed what, and why. If a metric definition changes in week 12, a team can see what queries were generated against the old definition and whether they need to be re-run.
This architecture is not hypothetical. It is the design behind the open-source Datus agent—an Apache 2.0 licensed data engineering agent built around a persistent Context Engine that treats every successful run as training data for the next one.
8. The competitive dimension: why this is the differentiator
The data engineering agent space in 2026 is splitting along exactly this line. Platform-embedded agents (Google BigQuery DEA, Adobe DEA) lean on their host platform's catalog for Layer 1 context and some Layer 2 semantics, but they are bound to a single environment. Prompt-based agents (Claude Code subagents, Cursor) have no persistent context at all—every session starts from zero.
Between those poles, the agents that will define the category are the ones that own their context layer. They are the ones that can answer not just "what query should I write?" but "given everything this team has learned about this warehouse over the past six months, what is the right query to write?"
This is also why the term "contextual data engineering" matters. It gives a name to the thing that separates a data engineering agent from a chat window. Without the name, the distinction is invisible to buyers who see "data engineering agent" on four different product pages and assume they are the same thing. With the name, the conversation changes: "Does your agent do contextual data engineering—or does it start from zero every time?"
Conclusion
Contextual data engineering addresses the real bottleneck in data work: not compute speed, not LLM quality, but the availability of reliable, reusable, evolvable context at the moment a question is asked. It treats context as a first-class engineering artifact with three layers—schema metadata, business semantics, and institutional memory—and it runs a feedback loop that makes every agent interaction strengthen the context for the next one.
For teams evaluating data engineering agents, the most useful question is not "which agent writes the best SQL?" It is "where does the agent's context live, how does it evolve, and does it outlast a single session?" The answer determines whether you are buying a feature or building an asset.
Try Datus Studio—connect a warehouse, ask a question, and watch the Context Engine build context that compounds.
Frequently asked questions
What is contextual data engineering, in one sentence?
Contextual data engineering is the practice of treating data context—schemas, metrics, validated SQL, business rules, and feedback—as a first-class, evolvable asset rather than a one-time documentation exercise, so that data engineering agents and teams accumulate institutional knowledge instead of rediscovering it on every query.
How is contextual data engineering different from building a semantic layer?
A semantic layer defines governed metrics, dimensions, and join rules—it is the foundation (Layer 2 in the three-layer model). Contextual data engineering adds Layer 3: institutional memory—validated ad-hoc SQL, deprecation notes, feedback history, and business rules that move faster than governance cycles. The two are complementary: the semantic layer provides the governed baseline; contextual data engineering captures the knowledge that accrues between governance cycles and feeds it back into the system.
Do I need contextual data engineering if I already have a data catalog?
A data catalog tells you what tables and columns exist (Layer 1). It does not tell an agent which join path is authoritative, which column was deprecated last month, or which SQL patterns the team has validated in production. Contextual data engineering builds on top of the catalog—it consumes catalog metadata and adds the layers that turn discovery into correct execution.
What does a team need to start practicing contextual data engineering?
Three things: a system that can inspect and store schema context (a context engine), a feedback mechanism that captures corrections and validations from actual usage, and a commitment to treating context as a living asset rather than a documentation deliverable. The first two are tooling; the third is cultural. The cultural piece is the harder one—it requires the team to believe that capturing context during normal work is worth the small upfront cost, because it compounds into dramatically faster work later.
Does this replace my existing data documentation?
No—it makes your documentation more accurate over time. Traditional documentation is a snapshot that ages. Contextual data engineering captures what the team actually does (validated SQL, corrections, usage patterns) and surfaces it as living documentation. The docs become a reflection of reality rather than a record of what was true when someone last had time to update a wiki page.
Related articles
- What is a data engineering agent? — the category definition this article builds on
- What is a semantic layer? — the governed semantics that contextual systems extend
- How a Context Engine improves accuracy — the architecture behind context retrieval and feedback