Skip to content

What Is Text-to-SQL? Definition, How It Works & Why Context Matters

Text-to-SQL is the task of turning a natural-language question into an executable SQL query against a real database — the core generation step inside most NL2SQL products and data engineering agents. This glossary entry defines how text-to-SQL systems work, where they fail without grounded context, and how they differ from a one-shot SQL copilot that forgets every session.

Disclosure: Datus is a data engineering agent platform. This article explains text-to-SQL as a general concept, referencing Datus alongside other tools and architectures in the category. See the end for more detail.

TL;DR

  • Text-to-SQL maps questions like "weekly net revenue by region" into SELECT statements grounded in a specific schema — not generic SQL tutorials.
  • Modern systems use LLMs plus schema context, few-shot examples, and sometimes semantic layers — accuracy depends more on context quality than model size.
  • NL2SQL is the product category name; text-to-SQL is the underlying capability. Many "agents" are text-to-SQL with extra orchestration.
  • Failure modes cluster around schema linking (wrong columns), grain errors (wrong joins), and stale definitions (metrics that changed in production but not in metadata).
  • Treating text-to-SQL as a persistent capability — where corrections survive sessions and context improves with usage — separates durable infrastructure from stateless chat.

1. Text-to-SQL: a working definition

Text-to-SQL answers a narrow but high-value question: given a database (or warehouse) and a user question in plain language, produce SQL that runs correctly and returns what the user meant.

Consider a concrete example. A marketing analyst asks "what was net revenue by channel last quarter, excluding test accounts?" A correct text-to-SQL system must do more than parse the words. It must know that "net revenue" maps to fact_orders.net_revenue_usd (not gross_amount), that "channel" means dim_marketing_channel.name (not dim_sales_channel), that "last quarter" requires the fiscal calendar table (fiscal_calendar not calendar_date), and that "excluding test accounts" translates to WHERE account_type != 'TEST'. A model that gets any of these four mappings wrong will produce fluent, syntactically correct SQL that returns numbers — wrong numbers.

The gap between "correct SQL" and "correct answer" is the central puzzle text-to-SQL systems must solve. And it compounds with warehouse maturity. A two-year-old startup with 80 tables managed by one analytics engineer can sustain accuracy with careful prompting alone — the schema is small enough to dump into the prompt, and the one person who knows every table is usually in the room. A ten-year-old company with 3,000 tables, 15 data-producing teams, and five different definitions of "active user" faces a different problem entirely: the system must disambiguate which team the question came from, which definition applies, and which of three possible join paths matches the intended grain. This is why "just use a bigger model" advice breaks down at scale — no model can disambiguate five conflicting definitions of the same term without context that tells it which one matters in this moment.

A useful working definition:

Text-to-SQL is software that accepts natural-language intent plus database-grounding context (schemas, metrics, examples, policies) and outputs executable SQL against that environment.

That definition excludes pure SQL tutors that never touch your tables. It also excludes copilots that generate plausible SQL without credentials or side effects — those are drafting assistants, not production text-to-SQL.

The pipeline most teams implement — explicitly or inside a vendor product — has four stages:

StageJobTypical failure
Intent parsingUnderstand question, time range, filtersAmbiguous "last week" or "revenue"
Schema linkingMap terms to tables and columnsPicks amount instead of amount_usd_net
SQL synthesisBuild joins, aggregates, filtersWrong grain (user-level vs order-level)
Execution & validationRun query, check results, iterateSilent wrong numbers that "look reasonable"

The four-stage decomposition matters because different failure modes demand different fixes. Intent parsing errors need better disambiguation prompts or user clarification loops. Schema linking errors — the dominant failure category in production — need richer metadata, aliases, and reference SQL, not a bigger language model. Synthesis errors need structural constraints like join authority graphs. Validation errors need execution loops and result inspection. A team that treats all four failure categories as "the model got it wrong" will spend cycles on model upgrades when the real bottleneck lives in stages 2 and 4.

For a deep dive on the second stage, see what is schema linking.

2. Text-to-SQL vs NL2SQL vs SQL copilot

These terms overlap in marketing but differ in scope:

TermMeaning
Text-to-SQLThe task — NL in, SQL out
NL2SQLThe product category — tools and research benchmarks built around text-to-SQL
SQL copilotIDE or chat assist that suggests SQL; often no persistent schema memory or execution loop
Data engineering agentBroader system: text-to-SQL plus context persistence, workflows, feedback, and multi-step data work

A SQL copilot can produce excellent one-off queries. A data engineering agent treats text-to-SQL as a repeatable production capability — the same question next month should benefit from corrections made last month.

The distinction becomes operational quickly. If a VP of Sales asks "month-over-month pipeline growth by region" and gets the right answer in March, she expects the same question in April to produce a comparable answer — not a different SQL that happened to feel right to the model that day. Copilots optimized for assistive drafting don't carry corrections forward. Agents optimized for repeatable answers must persist every validated mapping, join path, and filter as state. This persistence requirement — not model quality — is what separates the two categories in production.

For how agents differ from copilots at the product level, read data engineering agent vs. SQL copilot.

3. What context text-to-SQL systems need

Benchmarks like Spider and BIRD popularized text-to-SQL research on fixed schemas. Production warehouses are messier: renamed columns, three join paths, tribal knowledge about test accounts, and metrics that exist only in Slack threads.

Minimum context for reliable text-to-SQL:

Context typeWhy it matters
Physical schemaTable and column names, types, keys
Semantic definitionsWhat "net revenue" means — filters, grain, currency
Reference SQLQueries that were validated in production
Business rulesDeprecations, PII boundaries, "never use table X"
Feedback historyCorrections from prior runs

Without semantics, models guess from column names — amt_usd_net_v2 is not self-explanatory. A semantic layer supplies governed metrics; a context engine adds validated ad-hoc SQL and usage-based corrections on top.

The hierarchy matters here. A physical schema tells the model fact_orders has a column called amount_usd. A semantic definition tells the model that "net revenue" means SUM(amount_usd) WHERE status != 'refunded' AND account_type != 'TEST'. Reference SQL tells the model that last week's CFO asked a similar question and the answer used fiscal_calendar_v2, not calendar_date. Feedback history tells the model that three months ago, Marketing complained that "channel" mapped to dim_sales_channel instead of dim_marketing_channel, and the correction was recorded. Each layer of context addresses failure modes the layer below cannot catch.

Retrieval-augmented generation (RAG) is how many systems inject that context at query time. See RAG for data engineering.

4. How accuracy is measured (and why benchmarks mislead)

Research metrics include execution accuracy (does the query run?) and result accuracy (does the result match gold?). Production teams care about additional dimensions:

  • Semantic correctness — numbers match the business definition of the metric
  • Schema usage — query touches the right tables, not convenient shortcuts
  • Safety — no accidental full-table scans or PII exposure

Leaderboard scores on clean academic schemas rarely predict behavior on a ten-year-old warehouse with undocumented views. The gap is almost always context, not parameter count.

This gap is quantitative. The Spider benchmark, which drove much of the NL2SQL research resurgence, uses ~200 databases averaging 27 tables each with clean, academic schemas — column names like student_id and course_name that map trivially to natural-language questions. A production warehouse at a midsize SaaS company might have 2,000+ tables, 40,000+ columns, and naming conventions that evolved across three CTOs. Column names like rev_v3_final_2 carry zero semantic signal. Even if the Spider leaderboard reports 85% execution accuracy for the top model, that score was earned on schemas where song_name unambiguously means a song's name. Drop the same model on fact_orders_v2.am_usd_n with no aliases, and execution accuracy often falls below 40%.

The "last mile" from benchmark accuracy to production accuracy is bridged entirely by context infrastructure — how you organize, retrieve, and feed schema metadata and business definitions into the generation prompt. Teams that benchmark-shop for the highest Spider score while neglecting their metadata hygiene are optimizing the wrong variable.

Production systems track evaluation signals — exact SQL match, result count, schema usage, semantic correctness — and feed corrections back into context through a feedback loop, so text-to-SQL improves with usage rather than only with model upgrades.

5. Common text-to-SQL architectures in 2026

Prompt + schema dump

The baseline: paste CREATE TABLE statements or JSON schema into the prompt. Fast to prototype; breaks on large warehouses and stale docs. This architecture treats every question as a fresh encounter — no memory of what worked last time, no understanding of which tables are deprecated, and no mechanism to learn from mistakes. An engineer who prototypes text-to-SQL on a 20-table staging database and declares success will discover the architecture collapses at 200 tables, when the prompt exceeds the model's context window and the model's attention dilutes across irrelevant schema fragments.

Semantic layer–grounded generation

Tools connect to MetricFlow, Cube, or LookML so generation targets certified metrics. Strong for governed KPIs; weak for knowledge that never became a formal metric. This approach excels when the question fits neatly into a pre-modeled metric — "monthly recurring revenue" or "customer churn rate" — where the semantic layer has already encoded the correct calculation logic. It fails when an analyst asks "what percentage of trial users who completed onboarding in Q1 converted to paid by end of Q2," a query that spans three domains and requires join logic no single metric definition captures.

RAG over metadata and SQL history

Retrieve relevant tables, metric docs, and past queries via embeddings or keyword search, then generate. Strength: scales to large estates; weakness: retrieval must surface the right snippets — garbage context produces confident wrong SQL. The retrieval quality problem is subtle: an embedding model may associate "revenue by product line" with prod_line in the ERP schema when the finance team's actual definition uses dim_product_family in the warehouse. The model gets confident wrong context and generates confident wrong SQL — a worse outcome than admitting ignorance.

Agent loops with tools

The model plans, calls list_tables, run_sql, inspects errors, retries. Closer to how engineers work. Cost and latency rise; governance must constrain which tools and tables are in scope. The agent-loop architecture introduces a new failure mode that none of the simpler architectures face: cascading errors. If the first run_sql call uses a wrong join path and returns plausible-looking numbers, the model may treat those numbers as confirmation and deepen the wrong approach in subsequent iterations rather than backtracking. Mitigation requires execution guardrails — result shape checks, cardinality assertions, and diff comparisons against cached correct results — that add complexity to the agent loop but are necessary for production reliability.

Datus combines catalog and subject trees, vector retrieval over reference SQL, semantic model generation commands, and scoped Subagents so text-to-SQL runs inside a bounded, evolvable context — not a global schema dump.

6. Text-to-SQL inside a data engineering agent

Text-to-SQL alone does not make an agent. Agents add:

  • Persistence — corrections survive the session. When an analyst corrects "revenue" from gross_amount to net_revenue_usd, that mapping is stored and applied to every future question in the domain.
  • Delivery — domain-scoped interfaces package context for analysts or APIs, rather than forcing every user through the same global prompt.
  • Integration — MCP clients, orchestrators, CI workflows. Text-to-SQL embedded in a dbt pull request can validate that a model change does not break downstream SQL.
  • Governance — scoped tables, rules, audit trails (especially in enterprise deployments). Every query leaves a trace from question to SQL to tables used to result.

Contextual data engineering describes the operating model: every text-to-SQL run strengthens context for the next run. In practice, this means the system's accuracy graph slopes upward over time within a domain — not because the model was upgraded, but because every correction is stored and every validated query is indexed as a retrieval candidate for future questions. A stateless copilot's accuracy graph is flat forever.

7. A text-to-SQL failure, step by step

The best way to understand where text-to-SQL breaks is to trace a real-looking failure from question to wrong answer and then to correct answer.

The question. A product manager at a SaaS company asks: "month-over-month growth in active users by plan tier for Q1 2026." She sends this to the company's text-to-SQL system.

What the model generates — wrong answer. The SQL joins dim_user to fact_event on user_id, filters for event_name = 'login', groups by dim_plan.tier_name, and computes month-over-month growth. The numbers look plausible: a 6% monthly growth rate aligned with what the product team expects from recent feature launches. But the numbers are wrong. The query counts every login event, meaning a single user who logs in five times in a month contributes five to the "active users" count. The product manager wanted unique users who logged in at least once — a different grain.

Why it failed. The failure traces to stages 2 and 3 of the pipeline. Stage 2 (schema linking) correctly mapped "active users" to the fact_event table — the events table does contain login activity. But it did not retrieve the company's business rule that "active users" means COUNT(DISTINCT user_id) at the monthly grain, not COUNT(*) over events. Stage 3 (SQL synthesis) then compounded the error by building a COUNT(*) aggregation over fact_event without applying a DISTINCT. The error is semantic, not syntactic — the SQL runs perfectly, returns a number, and that number is plausible but wrong.

What context would have fixed it. Three pieces of context, if retrieved and injected into the generation prompt, would have prevented this failure. First, the company's semantic layer defines "monthly active users" as COUNT(DISTINCT dim_user.user_id) WHERE event_name = 'login' grouped by calendar month — a definition that encodes the DISTINCT requirement. Second, a reference SQL query from last quarter's board deck computed MAU growth using exactly this logic and was validated by the CFO's team. Third, a feedback note from three months ago records that a product analyst corrected a similar query for using COUNT(*) instead of COUNT(DISTINCT user_id). Any one of these three pieces, if surfaced by retrieval, would have guided the model to the correct aggregation. None were surfaced — because the company's RAG index was not configured to retrieve semantic layer definitions alongside schema metadata.

The correct answer. The corrected SQL uses COUNT(DISTINCT dim_user.user_id) over fact_event joined to dim_plan, grouped by DATE_TRUNC('month', event_date) and plan_tier_name, with a month-over-month growth calculation in a CTE. The result shows 3.2% monthly growth — nearly half the original 6% estimate. The original estimate's plausibility made it dangerous: a number aligned with expectations, generated by fluent SQL, that no one questioned until a quarterly audit.

This failure pattern — correct syntax, plausible numbers, wrong aggregation grain — accounts for roughly 40% of text-to-SQL production errors across deployments. It cannot be fixed by a bigger model because the model already wrote syntactically perfect SQL. It can only be fixed by better context retrieval that surfaces the business definition, the reference example, or the past correction — any one of which would have steered the model to DISTINCT.

8. When text-to-SQL is enough — and when it is not

Text-to-SQL–first tools fit when:

  • Schema is small and stable (fewer than ~100 tables, infrequent migrations)
  • Questions are mostly ad-hoc SELECTs from a single knowledgeable user
  • One power user owns all corrections informally

You outgrow text-to-SQL–only products when:

  • Multiple teams define the same metric differently and the system has no mechanism to serve different definitions by domain
  • Agents must share context across Chat, CLI, and API — three entry points that a stateless generator cannot synchronize
  • Enterprise buyers require traceability from question → SQL → tables used, a compliance requirement that session-scoped tools cannot satisfy
  • Corrections made in one session must survive to the next session without manual re-entry

The upgrade path from NL2SQL demo to data engineering agent platform occurs when the organization stops asking "can the model generate SQL?" and starts asking "can the system remember what we corrected last quarter?" The first question is answered by every text-to-SQL product on the market. The second question is answered by the products that treat context as persistent infrastructure rather than as a disposable prompt prefix.

This transition matters for procurement in particular. A team evaluating text-to-SQL tools in a one-hour demo — asking ten questions against a demo schema — will see near-identical performance across products. The differences only surface over weeks of real usage, when corrections pile up, schemas drift, and cross-team metric conflicts emerge. Evaluating text-to-SQL on a single session is like evaluating a database by running ten queries without checking whether writes persist after restart.

Conclusion

Text-to-SQL is the mechanism that turns questions into queries. Reliability comes from schema linking, semantic definitions, reference SQL, and feedback — not from swapping LLMs alone. Treat text-to-SQL as a capability to embed in governed, evolvable context, and it becomes durable infrastructure. Treat it as stateless chat, and you rebuild the same wrong query every quarter.

Frequently asked questions

How do I know if my text-to-SQL setup is production-ready?

Test it on ten questions your team actually asked last month — not benchmark queries. If the model gets correct answers on fewer than seven without human correction, your context infrastructure (schemas, metrics, reference SQL) is the bottleneck, not the model. The fastest signal: pick a question a VP asked six months ago that was correctly answered in a dashboard, ask your text-to-SQL system the same question, and check whether the answer matches. If it does not, trace the discrepancy to whether the model picked the wrong column, wrong join path, wrong filter, or wrong grain — that trace points directly at which context layer is missing.

Can text-to-SQL work without a semantic layer?

Yes for exploratory schemas with a single knowledgeable user who catches errors informally. But for organization-wide KPI consistency — where Marketing, Finance, and Product all query the same warehouse and expect "net revenue" to mean the same thing every time — a semantic layer is the only scalable way to encode metric governance. The compromise path: start with a minimal semantic layer covering the top 20 KPIs that cross team boundaries, and let the text-to-SQL system's feedback loop handle ad-hoc knowledge for everything else. Over time, promote frequently validated ad-hoc queries into the semantic layer.

Why do text-to-SQL tools fail on real warehouses?

Most failures are context failures — wrong column linked, wrong join grain, or outdated metric logic — not inability to write syntactically valid SQL. The model almost always produces correct SELECT ... FROM ... WHERE syntax. It just selects from the wrong table or filters with the wrong condition. Fixing this requires investment in metadata quality, reference SQL libraries, and feedback loops — the unglamorous infrastructure work that is harder to demo but determines whether the system is trustworthy or just impressive-looking.

Can text-to-SQL handle complex SQL features like window functions, CTEs, and subqueries?

Yes, modern LLM-based text-to-SQL systems can generate window functions (ROW_NUMBER() OVER (PARTITION BY ...)), CTEs (WITH monthly_totals AS (...)), and nested subqueries. The model's ability to produce complex SQL syntax is rarely the bottleneck — most leading models can handle advanced SQL features with high syntactic accuracy. The bottleneck, as with simple queries, is whether the system has enough context to know which window function, which partitioning key, and which CTE structure matches the business intent. A model can fluently write LAG(revenue, 1) OVER (PARTITION BY region ORDER BY month) — but if the business definition of month-over-month growth uses fiscal_month rather than calendar_month, the syntactically perfect window function computes the wrong comparison.

What's the difference between text-to-SQL and asking ChatGPT to write SQL?

ChatGPT generates SQL from its training data — it knows what SELECT syntax looks like and can produce plausible queries for well-known schemas like the Microsoft AdventureWorks sample. It has no access to your actual tables, no knowledge of your metric definitions, and no memory of what worked last time. The generated SQL may look correct and may even run if you copy it into your environment, but it was written without grounding in your schema, your business rules, or your data. Text-to-SQL tools, at minimum, inject your real schema into the prompt. Agent-grade text-to-SQL adds persistent context — validated SQL, metric definitions, and feedback — so the system knows not just what your tables are named but what your data actually means.


Disclosure: Datus is a data engineering agent platform. This glossary entry explains text-to-SQL as a general concept and how Datus approaches it — alongside other tools and architectures in the category.

Built with VitePress