What Is Apache Hudi? Upserts, Copy-on-Write vs Merge-on-Read & CDC

Q: What is Apache Hudi used for?

Hudi is used to build mutable, incrementally updated data lake tables , most commonly for change-data-capture and streaming ingestion. It applies a continuous stream of inserts, updates, and deletes to lake tables efficiently using a record-level index, tracks every change on a timeline for audit and rollback, and lets downstream jobs read only what changed via incremental queries — workloads that append-only lake formats handle poorly.

Q: What is the difference between Copy-on-Write and Merge-on-Read in Hudi?

Copy-on-Write (CoW) rewrites data files on each update, so reads are fast but writes are heavier — ideal for read-heavy tables with moderate update rates. Merge-on-Read (MoR) writes lightweight delta logs and defers merging to a background compaction, so ingestion is fast and near-real-time but reads must merge base and delta files until compaction runs. CoW optimizes read latency; MoR optimizes write latency for streaming and CDC.

Q: Hudi vs Iceberg — when should I pick Hudi?

Pick Hudi when your dominant workload is high-frequency upserts, CDC ingestion, or incremental processing , where its record-level index and Merge-on-Read design were purpose-built to apply changes cheaply. Choose Iceberg when engine neutrality, broad multi-engine support, and avoiding vendor lock-in matter more than streaming-update performance. Many teams pick one primary format and document exceptions per domain rather than running both everywhere.

Apache Hudi is an open table format and ingestion framework that brings database-style upserts, deletes, and incremental processing to data lakes built on object storage. Where most lake tables were designed to be appended to, Hudi was designed from the start to be mutated: it can apply a stream of changed records to an existing table efficiently, track every change on a timeline, and let downstream jobs read only what changed since the last run. This glossary entry defines what Hudi is, how its two table types work, how record-level upserts and incremental queries differ from append-only formats, how it compares to Iceberg and Delta Lake, and why AI agents querying a Hudi lakehouse need format-aware context.

Disclosure: Datus is a data engineering agent platform. This article explains Apache Hudi as a general concept, referencing Datus alongside other tools and architectures in the category. See the end for more detail.

TL;DR

Apache Hudi is an open table format and incremental data platform that adds record-level upserts, deletes, change tracking, and incremental queries to data lakes on object storage.
Its defining capability is mutability: applying a continuous stream of inserts and updates — for example from CDC — to lake tables without full rewrites, which append-only formats handle poorly.
Hudi offers two table types: Copy-on-Write (CoW) optimizes read performance, while Merge-on-Read (MoR) optimizes write/ingest latency by deferring compaction.
A record-level index lets Hudi locate and update individual records quickly, which is what makes high-frequency upsert workloads practical at lake scale.
Against Iceberg and Delta Lake, Hudi’s sweet spot is streaming ingest and CDC-heavy pipelines; agents querying Hudi must understand table type and query type, or they will read stale or partial data.

1. Apache Hudi: a working definition

Hudi was created at Uber to solve a problem append-only lake formats handled badly: ingesting a high-volume stream of updates, not just new rows. Uber’s data — trips, payments, driver and rider state — changed constantly after first landing, and rewriting entire partitions to apply a handful of updated records was prohibitively expensive. Hudi (the name comes from “Hadoop Upserts Deletes and Incrementals”) was donated to the Apache Software Foundation and is now a top-level project. The original problem still defines its character: Hudi treats a lake table as something you continuously merge changes into, with a managed timeline recording every commit, so the table can be both mutated and audited.

The mechanics follow from that goal. Every write is recorded on the Hudi timeline as an instant with a state, which gives the table a consistent, ordered history and enables rollback and time travel. To make updates cheap, Hudi maintains a record-level index that maps a record key to the file group holding it, so an upsert can rewrite or append to the right file instead of scanning everything. This index is the feature most responsible for Hudi’s reputation in upsert-heavy workloads, and it is the thing most absent from formats that began as append-only and added updates later.

Consider a concrete pipeline. A change-data-capture stream from an operational Postgres delivers orders events — new orders, status changes, cancellations — every few minutes. With an append-only table you would accumulate a log and reconstruct current state with expensive window functions at query time. With Hudi, you define order_id as the record key and continuously upsert the stream into an orders table; the table always reflects current state, late updates land in the right file group via the index, and a deletion propagates as a real delete rather than a tombstone you must remember to filter. The same table also supports an incremental query that returns only records changed since a given commit, so a downstream Silver-layer job processes deltas instead of rescanning the whole table.

A useful working definition:

Apache Hudi is an open table format and incremental processing framework for data lakes that provides record-level upserts and deletes, a managed commit timeline, time travel, and incremental queries over open file formats on object storage. It is built for mutable, continuously updated lake tables — especially streaming and change-data-capture ingestion — exposing files as transactional tables that engines such as Spark, Flink, and Trino can read and write.

That definition draws a few boundaries. Hudi is not only a file format — it ships ingestion tooling (the streaming ingest utility, formerly DeltaStreamer) and table services like compaction and clustering, making it more of an incremental data platform than a passive spec. It is not a query engine — Spark, Flink, Presto/Trino, and Hive execute the SQL. And it is not the right default for purely append-only, read-optimized analytics, where a simpler lakehouse table format may carry less operational overhead.

2. Copy-on-Write vs Merge-on-Read

The single most important Hudi concept to understand before querying or designing a table is its table type, because it determines the trade-off between write latency and read latency. The choice is not cosmetic — it changes how fresh data appears and how much work each query does.

Aspect	Copy-on-Write (CoW)	Merge-on-Read (MoR)
On write	Rewrites affected files with merged data	Writes lightweight delta logs; defers merge
On read	Reads columnar files directly — fast	Merges base files + delta logs at read time
Write latency	Higher	Lower
Read latency	Lower	Higher (until compaction)
Best for	Read-heavy, moderate update rate	High-frequency ingest, streaming
Freshness	Visible after each commit	Near-real-time via delta logs

Copy-on-Write rewrites the data files touched by an update so that every read hits clean, columnar files with no merge cost — excellent for read-heavy tables where updates arrive at a moderate pace. The price is paid on write: applying many small updates means repeatedly rewriting files, which becomes expensive under high-frequency ingestion.

Merge-on-Read inverts the trade-off. Updates are written as compact delta log files alongside the base files, so ingestion is fast and near-real-time, but readers must merge base and delta files on the fly until a background compaction folds the logs into new base files. MoR is the natural choice for streaming and CDC pipelines where ingest latency matters most, provided the team actually runs compaction on a schedule rather than letting delta logs pile up. Choosing CoW for a high-velocity CDC stream — or MoR without scheduled compaction — is a common, self-inflicted performance problem, which is exactly why the table type belongs in any documentation an analyst or agent consults.

3. Record-level upserts and incremental queries

Two capabilities distinguish Hudi from append-first formats in day-to-day use: record-level writes and incremental reads. Understanding them is the difference between using Hudi as intended and fighting it.

On the write side, Hudi’s record-level index is what makes upserts and deletes efficient. Each record has a key, and the index knows which file group holds that key, so applying an update touches only the relevant files rather than rewriting a whole partition. This is why Hudi handles late-arriving and mutating data — corrections, status transitions, GDPR-style deletes — without the partition-rewrite tax that append-only formats incur when forced to update. Deletes are first-class: a delete removes the record, which matters for compliance workflows where a tombstone you must remember to filter is not acceptable.

On the read side, incremental queries let a consumer ask “give me only the records that changed since commit X.” Instead of rescanning an entire table to find what is new, a downstream job pulls the delta and processes it — the lake-native equivalent of reading a change feed. This composes naturally with medallion architecture: a Bronze table ingests raw CDC, and a Silver job incrementally consumes only changed Bronze records to maintain conformed tables, cutting compute dramatically versus full reprocessing. The official Apache Hudi documentation details the query types, and the distinction between a snapshot query (current state) and an incremental query (changes only) is one an agent must respect to avoid double-counting or missing rows.

4. Hudi vs Iceberg vs Delta Lake

Hudi sits alongside two other dominant open table formats — Apache Iceberg and Delta Lake — and the comparison is less about who has which feature — they have converged substantially — and more about which workload each was built around. The table below answers which format leans toward which problem.

Dimension	Apache Hudi	Apache Iceberg	Delta Lake
Origin / steward	Uber → Apache	Netflix → Apache	Databricks → Linux Foundation
Core design bet	Upserts, CDC, incremental	Engine-neutral metadata	Spark/Databricks integration
Update model	CoW and MoR, record-level index	CoW / MoR, hidden partitioning	CoW; deletion vectors
Incremental reads	First-class incremental queries	Snapshot-based incremental	Change Data Feed
Sweet spot	Streaming ingest, mutable tables	Multi-engine analytics, vendor exit	Databricks-centric lakehouse
Engine breadth	Spark, Flink, Hive, Presto/Trino	Broadest	Strongest in Spark/Databricks

Read this by sweet spot, because feature-by-feature the three formats keep narrowing the gaps. Hudi’s strongest argument is mutable, high-frequency ingestion: if your central problem is applying a relentless CDC stream to lake tables with low latency and reading only deltas downstream, Hudi’s record-level index and MoR design target exactly that. Apache Iceberg is the safer default when engine neutrality and avoiding vendor lock-in dominate, given its broad engine support. Delta Lake is hard to beat inside the Spark and Databricks ecosystem where its integration is deepest.

The honest framing is that there is no universal winner, and a team optimizing purely for read-heavy, multi-engine analytics may find Iceberg simpler to operate than Hudi, whose richer ingest machinery carries more moving parts. The right move is to match the format to the dominant workload, write the decision down, and make it discoverable — because implicit format choices are where both humans and agents go wrong.

5. Hudi and AI agents: format-aware context

Hudi’s mutability is powerful for pipelines but raises the bar for AI agents and text-to-SQL systems, because correct results depend on concepts a column list never reveals: table type, query type, and the meaning of “current.” An agent that knows table and column names but nothing about MoR semantics will confidently return stale or partial data.

Context gap	Agent symptom	What it needs injected
Table type unknown	Assumes CoW freshness on a MoR table	Table type (CoW/MoR) and compaction status
Query type confusion	Uses snapshot read when an incremental delta was intended	Query-type policy (snapshot vs incremental)
Record key blindness	Re-aggregates a log instead of reading current state	Record key and precombine field semantics
Partition + index	Full scan, ignores record-level index	Partition spec and indexed key documentation
Engine divergence	Spark-only SQL on a Trino/Hive reader	Engine-scoped dialect and supported operations

This is why schema linking on a Hudi estate is harder than on a plain relational database: the same table answers differently depending on whether you read it as a snapshot or incrementally, and on whether compaction has run. A data engineering agent needs evolvable context that captures table type and query semantics, not a static DDL dump that says nothing about how the table is meant to be read.

Open-source agents aimed at cross-stack work address this by reading lake tables through a catalog and carrying table-format metadata into the model’s context. As of June 2026, Datus includes a Spark adapter (v0.2.6) that connects to Hudi and Delta Lake tables and builds a catalog-backed context tree, so generated SQL references governed tables and their semantics rather than guessing at raw files. That is one implementation pattern among several; the general requirement — context that reflects how a mutable table is actually read — holds regardless of tool, and acknowledging that a managed Databricks or warehouse path may suit some teams better is part of choosing honestly.

6. Production-readiness checklist for Hudi + agents

Before pointing analysts or an agent at Hudi tables, a short checklist predicts most failures, because Hudi’s flexibility is also its sharpest edge.

Table type documented — Is each table CoW or MoR, and is the choice justified by its workload?
Compaction scheduled — For MoR tables, does compaction run on a schedule so reads do not degrade?
Record keys and precombine — Are record keys and the precombine (ordering) field documented per table?
Query-type policy — When should consumers use snapshot vs incremental queries?
Catalog registration — Are tables registered with owners, partitions, and physical/logical mappings?
Reference SQL — Are vetted queries tagged by engine and query type so agents imitate correct patterns?

Missing items 1–2 produce the stale-read and slow-query failures most often blamed on “the lakehouse,” while missing items 3–6 are where agents specifically generate wrong results. The list is engine-agnostic by design — it holds whether your compute is Spark, Flink, or Trino.

Conclusion

Apache Hudi is best understood as the open table format that made data lakes mutable: record-level upserts and deletes, a managed timeline for audit and rollback, and incremental queries that turn a lake table into a change feed. Its defining bet against Iceberg and Delta Lake is streaming and CDC ingestion, anchored by a record-level index and the Copy-on-Write versus Merge-on-Read trade-off. For AI agents, Hudi moves the hard problems from syntax to table-type, query-type, and freshness context — the same shift that separates a demo from a production-grade data engineering agent.

Frequently asked questions

What is Apache Hudi used for?

Hudi is used to build mutable, incrementally updated data lake tables, most commonly for change-data-capture and streaming ingestion. It applies a continuous stream of inserts, updates, and deletes to lake tables efficiently using a record-level index, tracks every change on a timeline for audit and rollback, and lets downstream jobs read only what changed via incremental queries — workloads that append-only lake formats handle poorly.

What is the difference between Copy-on-Write and Merge-on-Read in Hudi?

Copy-on-Write (CoW) rewrites data files on each update, so reads are fast but writes are heavier — ideal for read-heavy tables with moderate update rates. Merge-on-Read (MoR) writes lightweight delta logs and defers merging to a background compaction, so ingestion is fast and near-real-time but reads must merge base and delta files until compaction runs. CoW optimizes read latency; MoR optimizes write latency for streaming and CDC.

Hudi vs Iceberg — when should I pick Hudi?

Pick Hudi when your dominant workload is high-frequency upserts, CDC ingestion, or incremental processing, where its record-level index and Merge-on-Read design were purpose-built to apply changes cheaply. Choose Iceberg when engine neutrality, broad multi-engine support, and avoiding vendor lock-in matter more than streaming-update performance. Many teams pick one primary format and document exceptions per domain rather than running both everywhere.

Can a text-to-SQL agent query Hudi tables reliably?

Only with format-aware context. On a small POC, pasting DDL may work, but on production Hudi tables an agent must know each table’s type (CoW or MoR), the intended query type (snapshot vs incremental), and the record key, or it will return stale or partial results that look valid. Supply table-type and query-type metadata through a catalog before relying on agent-generated SQL.

What is Apache Iceberg? — the engine-neutral table format Hudi is often compared to
What is a lakehouse? — the architecture Hudi tables make possible
What is a data engineering agent? — why table-format-aware context matters

Disclosure: Datus is a data engineering agent platform. This glossary entry explains Apache Hudi as a general concept and how cross-stack agents approach lakehouse context — alongside other tools and architectures in the category.