Deletes in Apache Iceberg
Apache Iceberg has a few ways to handle deletes and updates. This short post expands on the previous introduction to the Iceberg format spec and focuses on deletes.
Here are the three ways to handle deletes:
1. Copy-on-write is the simplest. It creates a new file that excludes deleted data, removes the old file, and adds the new file in a manifest update.
2. Equality deletes add a delete file with values of the equality-id columns. Deleted rows are filtered out on read.
3. V3 deletion vectors add positions of deleted rows to a roaring bitmap file. Deleted rows are filtered out on read.
   - (V2 positional deletes used a different format; this post does not cover them.)
Through a series of examples, you'll learn how Iceberg represents deletes in metadata files.
Let's explore 2 and 3 in detail in this post.
Equality Deletes
Let's start with the following table schema:
- id: int (required)
- name: string (required)
- value: double (required)
- timestamp: timestamp (required)
Example table rows (4 rows, ids 1-4):
| id | name | value | timestamp |
|---|---|---|---|
| 1 | User-1 | 100.0 | 2024-12-03 08:15:22 |
| 2 | User-2 | 200.0 | 2024-12-07 14:32:45 |
| 3 | User-3 | 300.0 | 2024-12-15 19:08:11 |
| 4 | User-4 | 400.0 | 2024-12-21 23:55:30 |
Let's delete ids 1 and 3. The table metadata file after the delete has two snapshots: one that added 4 rows and one that deleted 2 rows.
format-version: 2
table-uuid: dc7237f8-f426-4907-a579-be12807dab78
location: demo_output/iceberg/default/users_equality_delete_demo
last-sequence-number: 2
current-snapshot-id: 6000271643905118120
snapshots:
  - sequence-number: 1
    snapshot-id: 346476157697557815
    summary:
      operation: append
      added-data-files: '1'
      added-records: '4'
  - sequence-number: 2
    snapshot-id: 6000271643905118120
    summary:
      operation: delete
      added-equality-delete-files: '1'
      added-delete-files: '1'
      added-equality-deletes: '2'
      total-delete-files: '1'
The manifest list for the delete snapshot has two manifests: the previous data manifest (content = 0) and a new delete manifest (content = 1).
records:
  - content: 0
    manifest_path: demo_output/iceberg/default/users_equality_delete_demo/metadata/95018c92-07d2-4d88-b659-e2c870e6aa38-m0.avro
    added_rows_count: 4
  - content: 1
    manifest_path: demo_output/iceberg/default/users_equality_delete_demo/metadata/0ddb6c10-afb2-4eba-b647-6d8ab8fbadc8-m0.avro
    added_rows_count: 2
The delete manifest entry for the equality delete file looks like this. A few things to note:
- equality_ids: field ids of the columns used for deletion. It is the id (== 1) column in this example.
- content = 2 indicates equality deletes. (This is the content in the manifest, not the manifest list.)
- The .parquet delete file has ids 1 and 3.
records:
  - data_file:
      content: 2
      file_format: PARQUET
      file_path: demo_output/iceberg/default/users_equality_delete_demo/data/eq-delete-db033e4b-e1ae-434d-8f4d-337ddb3667d3.parquet
      equality_ids:
        - 1
      record_count: 2
      lower_bounds:
        - key: 1
          value: AQAAAA==
      upper_bounds:
        - key: 1
          value: AwAAAA==
    snapshot_id: 6000271643905118120
    status: 1
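The lower and upper bounds above are shown as base64 strings. As a minimal sketch, assuming the dump base64-encodes Iceberg's binary single-value form (a 4-byte little-endian integer for an int column), they can be decoded as follows. The helper name `decode_int_bound` is hypothetical and not part of any Iceberg library.

```python
import base64
import struct


def decode_int_bound(encoded: str) -> int:
    """Decode a base64 string holding a 4-byte little-endian signed int."""
    raw = base64.b64decode(encoded)
    (value,) = struct.unpack("<i", raw)
    return value


# Values copied from the delete manifest entry above (key 1 is the id column).
lower = decode_int_bound("AQAAAA==")
upper = decode_int_bound("AwAAAA==")
print(lower, upper)  # 1 3
```

The decoded bounds match the deleted ids 1 and 3, which is what lets a reader decide whether a delete file can affect a given data file.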
Updates with equality deletes
An UPDATE is a delete + insert. Both the equality delete file and the new data file are written at the same sequence number (seq=3):
| seq | file type | content |
|---|---|---|
| 1 | data file | id=1,2,3,4 |
| 3 | equality delete | id=1 |
| 3 | data file | id=1, value=999 |
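The table above can be replayed in a few lines. This is a simplified sketch, not Iceberg's reader: the `DataFile` and `EqualityDelete` classes and the `scan` function are hypothetical names, and the only rule modeled is that an equality delete applies to data files with a strictly smaller sequence number.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    seq: int      # sequence number the file was written at
    rows: list    # list of row dicts


@dataclass
class EqualityDelete:
    seq: int      # sequence number of the delete file
    column: str   # equality column name
    values: set   # deleted values of that column


def scan(data_files, deletes):
    """Return rows that survive all applicable equality deletes."""
    live = []
    for f in data_files:
        for row in f.rows:
            # A delete applies only to files written strictly before it.
            deleted = any(
                f.seq < d.seq and row[d.column] in d.values for d in deletes
            )
            if not deleted:
                live.append(row)
    return live


data_files = [
    DataFile(seq=1, rows=[{"id": i, "value": i * 100.0} for i in (1, 2, 3, 4)]),
    DataFile(seq=3, rows=[{"id": 1, "value": 999.0}]),
]
deletes = [EqualityDelete(seq=3, column="id", values={1})]

result = scan(data_files, deletes)
print([(r["id"], r["value"]) for r in result])
# [(2, 200.0), (3, 300.0), (4, 400.0), (1, 999.0)]
```

The old id=1 row is dropped because 1 < 3, while the rewritten row survives because its file shares the delete's sequence number.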
An equality delete written at sequence number S suppresses rows where seq_written < S. The old row is deleted (1 < 3), and the new row is preserved (3 < 3 is false).
write.delete.mode property
write.delete.mode is a table property hint. It tells writers whether to:
- copy-on-write (default for V1/V2): rewrite the entire data file, excluding deleted rows.
- merge-on-read: write a sidecar delete file (a positional or equality delete file for V2, a DV for V3) and leave the data file intact.
Writers can override this hint on a per-operation basis. The property sets the default trade-off: fast writes (merge-on-read) vs. fast reads (copy-on-write).
V3 Deletion Vector
Positional deletes mark a set of (file name, position) pairs as deleted. Readers skip those rows on read. Deletion vectors use a roaring bitmap to encode these positions efficiently.
Let's explore this by creating a V3 table and deleting the same rows. This produces one Puffin DV file and references it in the delete manifest.
format-version: 3
properties:
  write.delete.mode: merge-on-read
current-snapshot-id: 505169285673218063
snapshots:
  - sequence-number: 1
    summary:
      operation: append
  - sequence-number: 2
    summary:
      operation: delete
      added-delete-files: '1'
      added-dvs: '1'
      added-position-deletes: '2'
The following delete-manifest entry references a Puffin file and identifies the exact data file it applies to. A few fields to note:
- referenced_data_file: exact data file this DV applies to.
- content_offset: 4: byte offset into the Puffin file where this DV blob starts. This exists because a single Puffin file can contain multiple deletion vectors.
- content_size_in_bytes: 44: byte length of the DV blob.
records:
  - data_file:
      content: 1
      file_format: PUFFIN
      file_path: demo_output/iceberg/default/users_v3_deletion_vector_demo/data/00001-1-89bef231-4cbd-4b07-9dea-b5596102da4c-00001.puffin
      referenced_data_file: demo_output/iceberg/default/users_v3_deletion_vector_demo/data/seed-rows-1-4-b08d704d-aec2-414c-8b5e-d2d041784b2e.parquet
      content_offset: 4
      content_size_in_bytes: 44
      record_count: 2
    snapshot_id: 505169285673218063
    status: 1
Deletion vectors are stored as Roaring Bitmaps inside Puffin files. Iceberg uses this format because bitmaps are compact and support fast membership checks and merges. For more detail, see this overview of roaring bitmaps.
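To make "membership checks and merges" concrete, here is a toy sketch that uses a plain Python integer as an uncompressed bitset. It is a stand-in only: it is not a roaring bitmap and omits the compression that makes roaring bitmaps compact. All function names are hypothetical.

```python
def make_bitmap(positions):
    """Build a bitset where bit p is set for each deleted position p."""
    bits = 0
    for pos in positions:
        bits |= 1 << pos
    return bits


def contains(bitmap, pos):
    """Membership check: is row position pos marked as deleted?"""
    return (bitmap >> pos) & 1 == 1


def merge(a, b):
    """Merging two sets of deleted positions is a bitwise OR."""
    return a | b


dv_old = make_bitmap([0, 2])   # earlier deletes
dv_new = make_bitmap([2, 5])   # deletes from a later operation
merged = merge(dv_old, dv_new)

positions = [p for p in range(merged.bit_length()) if contains(merged, p)]
print(positions)  # [0, 2, 5]
```

Both operations are single bitwise steps here, which is the property that makes bitmaps attractive for delete tracking.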
Writing Deletion Vectors
DVs require read-before-write. The writer must scan data files to discover row positions before building the bitmap. For DELETE WHERE id > 100, the engine first finds matching row ordinals, then writes the DV.
Each worker handles a disjoint set of data files independently:
Worker 1 → scans data_file_A → positions {0, 2} → writes dv_A.puffin
Worker 2 → scans data_file_B → positions {5, 9} → writes dv_B.puffin
Worker 3 → scans data_file_C → no matches → writes nothing
Each worker writes a different Puffin file that includes previous deletes, if any. In this demo flow, that yields one current deletion vector per data file. A coordinator then collects produced DVs and commits them in one snapshot.
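The write flow above can be sketched as follows, under loud assumptions: data files are in-memory lists of rows, a deletion vector is a plain Python set of positions, and a sequential loop stands in for parallel workers. The names `find_positions` and `write_dvs` are hypothetical.

```python
def find_positions(rows, predicate):
    """Read-before-write: scan a data file to find matching row ordinals."""
    return {pos for pos, row in enumerate(rows) if predicate(row)}


def write_dvs(data_files, previous_dvs, predicate):
    """Produce one new DV per touched data file, including earlier deletes."""
    new_dvs = {}
    for path, rows in data_files.items():   # each iteration models one worker
        matches = find_positions(rows, predicate)
        if not matches:
            continue                        # no matches: write nothing
        new_dvs[path] = previous_dvs.get(path, set()) | matches
    return new_dvs                          # a coordinator would commit these together


data_files = {
    "data_file_A": [{"id": 101}, {"id": 5}, {"id": 150}],
    "data_file_B": [{"id": i} for i in (1, 2, 3, 4, 5, 200, 6, 7, 8, 300)],
    "data_file_C": [{"id": 1}, {"id": 2}],
}
previous_dvs = {"data_file_B": {0}}         # position 0 was deleted earlier

new_dvs = write_dvs(data_files, previous_dvs, lambda row: row["id"] > 100)
print(new_dvs["data_file_A"] == {0, 2})     # True
print(new_dvs["data_file_B"] == {0, 5, 9})  # True
print("data_file_C" in new_dvs)             # False
```

Folding the previous positions into the new set is what keeps a single current deletion vector per data file.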
Reading Deletion Vectors
Unlike equality deletes (which use sequence numbers), deletion vectors link directly to a data file. Each manifest entry carries referenced_data_file, content_offset, and content_size_in_bytes, so readers can locate the exact bitmap blob without scanning unrelated files.
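The lookup can be sketched with an in-memory stand-in for a Puffin file. Note the assumptions: the blob here is encoded as little-endian 32-bit positions purely for illustration, which is not the real DV blob format, and the surrounding header and footer bytes are placeholders. The function names are hypothetical.

```python
import io
import struct


def read_dv_positions(puffin, content_offset, content_size_in_bytes):
    """Seek to the blob, read exactly its bytes, and decode positions."""
    puffin.seek(content_offset)
    blob = puffin.read(content_size_in_bytes)
    count = len(blob) // 4
    # Illustrative encoding only: a run of little-endian uint32 positions.
    return set(struct.unpack(f"<{count}I", blob))


def scan_with_dv(rows, deleted_positions):
    """Iterate rows and skip those whose ordinal position is deleted."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


blob = struct.pack("<2I", 0, 2)                    # positions 0 and 2
puffin = io.BytesIO(b"HEAD" + blob + b"FOOTER")    # placeholder file layout

deleted = read_dv_positions(puffin, content_offset=4, content_size_in_bytes=len(blob))
rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
live = scan_with_dv(rows, deleted)
print([row["id"] for row in live])  # [2, 4]
```

Because the offset and size come from the manifest entry, the reader never parses bytes belonging to other deletion vectors in the same file.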
1. Find the DV entry in the delete manifest (content=1, file_format=PUFFIN). It contains referenced_data_file, content_offset, and content_size_in_bytes.
2. Seek to content_offset in the Puffin file and read content_size_in_bytes bytes to decode the roaring bitmap.
3. Open referenced_data_file, iterate rows, and skip rows whose ordinal position is set in the bitmap.
Streaming and Deletes
Equality deletes are simpler for writers — they can mark content as deleted without knowing the row's position in a data file.
One common streaming pattern, followed by Flink, is for the writer to:
- Maintain an id -> (file, position) map in memory for the rows it writes.
- On delete, use the position from this map to write a positional delete.
- Fall back to an equality delete if the mapping doesn't exist.
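A minimal sketch of this pattern follows. It is not Flink's implementation: the `StreamingDeleteWriter` class, its methods, and the file name are hypothetical, and delete files are modeled as plain lists.

```python
class StreamingDeleteWriter:
    """Toy writer that prefers positional deletes and falls back to equality deletes."""

    def __init__(self):
        self.index = {}                 # id -> (file, position) for rows written here
        self.positional_deletes = []    # list of (file, position)
        self.equality_deletes = []      # list of ids

    def record_write(self, row_id, file, position):
        """Remember where a row landed so a later delete can target it."""
        self.index[row_id] = (file, position)

    def delete(self, row_id):
        location = self.index.pop(row_id, None)
        if location is not None:
            # Position is known: write a positional delete.
            self.positional_deletes.append(location)
        else:
            # Position is unknown: fall back to an equality delete.
            self.equality_deletes.append(row_id)


writer = StreamingDeleteWriter()
writer.record_write(1, "file_a.parquet", 0)
writer.record_write(2, "file_a.parquet", 1)
writer.delete(2)    # known row
writer.delete(42)   # row written elsewhere

print(writer.positional_deletes)  # [('file_a.parquet', 1)]
print(writer.equality_deletes)    # [42]
```

The map only covers rows the writer itself produced, which is why the equality-delete fallback is still needed.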
As noted earlier, equality deletes can add reader overhead. Improving positional-delete support in streaming engines is one of the goals of the secondary indexes proposal (discussion as of May 2026).
RisingWave, a streaming database, handles deletion and compaction with two branches: ingestion writes continuously to an ingestion branch (likely with equality deletes), and periodic compaction coalesces delete files and consolidates data files on the main branch to improve read performance.
Conclusion
We covered deletes in Iceberg. Merge-on-read schemes are efficient for writers, but they shift work to readers.
If you want to inspect these flows in detail, or ask additional questions, try out TableFormatsExplorer!