Deletes in Apache Iceberg

Apache Iceberg has a few ways to handle deletes and updates. This short post expands on the previous introduction to the Iceberg format spec and focuses on deletes.

Here are the three ways to handle deletes:

  1. Copy-on-write is the simplest. It creates a new file that excludes deleted data, removes the old file, and adds the new file in a manifest update.
  2. Equality deletes add a delete file with values of the equality-id columns. Deleted rows are filtered out on read.
  3. V3 deletion vectors add positions of deleted rows to a roaring bitmap file. Deleted rows are filtered out on read.
    • (V2 positional deletes used a different format, which this post does not cover.)
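As a rough mental model, the three approaches differ in what the writer produces. The sketch below is a toy, not Iceberg's API: a data file is a plain Python list, and the function names (`write_copy_on_write`, `write_equality_delete`, `write_position_delete`, `read`) are made up for illustration.

```python
# Toy model only: a "data file" is a list of row dicts, not a Parquet file.
data_file = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
ids_to_delete = {1, 3}


def write_copy_on_write(rows, ids):
    """Writer rewrites the data file without the deleted rows."""
    return [row for row in rows if row["id"] not in ids]


def write_equality_delete(ids):
    """Writer emits a delete file holding only the key values."""
    return [{"id": i} for i in sorted(ids)]


def write_position_delete(rows, ids):
    """Writer scans the data file and records positions of deleted rows."""
    return {pos for pos, row in enumerate(rows) if row["id"] in ids}


def read(rows, equality_deletes=(), deleted_positions=frozenset()):
    """Reader filters out rows matched by either kind of delete."""
    deleted_keys = {d["id"] for d in equality_deletes}
    return [
        row
        for pos, row in enumerate(rows)
        if row["id"] not in deleted_keys and pos not in deleted_positions
    ]


cow_rows = read(write_copy_on_write(data_file, ids_to_delete))
eq_rows = read(data_file, equality_deletes=write_equality_delete(ids_to_delete))
pos_rows = read(
    data_file, deleted_positions=write_position_delete(data_file, ids_to_delete)
)
```

All three reads return the same rows. What changes is who pays: copy-on-write rewrites data at write time, while the other two leave the data file alone and filter at read time.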

Through a series of examples, you'll learn how Iceberg represents deletes in metadata files.

Let's explore 2 and 3 in detail in this post.

Equality Deletes

Let's start with the following table schema:

Table schema
- id: int (required)
- name: string (required)
- value: double (required)
- timestamp: timestamp (required)

Example table rows (4 rows, ids 1-4):

id  name    value  timestamp
1   User-1  100.0  2024-12-03 08:15:22
2   User-2  200.0  2024-12-07 14:32:45
3   User-3  300.0  2024-12-15 19:08:11
4   User-4  400.0  2024-12-21 23:55:30

Let's delete ids 1 and 3. The table metadata file after the delete has two snapshots: one that added 4 rows and one that deleted 2 rows.

Table metadata after equality delete
format-version: 2
table-uuid: dc7237f8-f426-4907-a579-be12807dab78
location: demo_output/iceberg/default/users_equality_delete_demo
last-sequence-number: 2
current-snapshot-id: 6000271643905118120
snapshots:
- sequence-number: 1
  snapshot-id: 346476157697557815
  summary:
    operation: append
    added-data-files: '1'
    added-records: '4'
- sequence-number: 2
  snapshot-id: 6000271643905118120
  summary:
    operation: delete
    added-equality-delete-files: '1'
    added-delete-files: '1'
    added-equality-deletes: '2'
    total-delete-files: '1'

The manifest list for the delete snapshot has two manifests: the previous data manifest (content = 0) and a new delete manifest (content = 1).

Manifest list entries for delete snapshot
records:
- content: 0
  manifest_path: demo_output/iceberg/default/users_equality_delete_demo/metadata/95018c92-07d2-4d88-b659-e2c870e6aa38-m0.avro
  added_rows_count: 4
- content: 1
  manifest_path: demo_output/iceberg/default/users_equality_delete_demo/metadata/0ddb6c10-afb2-4eba-b647-6d8ab8fbadc8-m0.avro
  added_rows_count: 2

The delete manifest (equality delete file entry) looks like this. A few things to note:

  1. equality_ids: the field ids of the columns used to match deleted rows. In this example it is the id column (field id 1).
  2. content = 2 indicates equality deletes. (This is the content field of the manifest entry, not of the manifest list.)
  3. The .parquet delete file contains ids 1 and 3.

Equality delete manifest entry
records:
- data_file:
    content: 2
    file_format: PARQUET
    file_path: demo_output/iceberg/default/users_equality_delete_demo/data/eq-delete-db033e4b-e1ae-434d-8f4d-337ddb3667d3.parquet
    equality_ids:
    - 1
    record_count: 2
    lower_bounds:
    - key: 1
      value: AQAAAA==
    upper_bounds:
    - key: 1
      value: AwAAAA==
  snapshot_id: 6000271643905118120
  status: 1
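The lower_bounds and upper_bounds values in this dump are base64 text. Iceberg serializes an int bound as 4 little-endian bytes, so they can be decoded with the standard library. This is a small sketch; `decode_int_bound` is a helper name chosen here, and the base64 wrapping is assumed to be how this demo renders binary values.

```python
import base64
import struct


def decode_int_bound(b64_value: str) -> int:
    """Decode a base64-rendered Iceberg int bound (4 bytes, little-endian)."""
    raw = base64.b64decode(b64_value)
    (value,) = struct.unpack("<i", raw)
    return value


lower = decode_int_bound("AQAAAA==")
upper = decode_int_bound("AwAAAA==")
print(lower, upper)  # 1 3
```

The bounds cover ids 1 through 3, which matches the two deleted ids (1 and 3) stored in the delete file.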

Updates with equality deletes

An UPDATE is a delete + insert. Both the equality delete file and the new data file are written at the same sequence number (seq=3):

seq  file type        content
1    data file        id=1,2,3,4
3    equality delete  id=1
3    data file        id=1, value=999

Update write sequence map
seq = 1: original data file contains id=1,2,3,4.
↓ update writes at seq=3
seq = 3: equality delete file marks id=1.
seq = 3: new data file writes updated row id=1, value=999.
Rule: a delete at sequence S suppresses rows where seq_written < S.
Result: the old row is suppressed (1 < 3); the new row is preserved (3 < 3 is false).
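The sequence-number rule can be sketched in a few lines. This is an in-memory illustration, not a real reader: `DataFile`, `EqualityDelete`, and `visible_rows` are hypothetical names, and only the id column is used as the equality key.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    seq: int      # sequence number the file was written at
    rows: list    # list of row dicts


@dataclass
class EqualityDelete:
    seq: int      # sequence number the delete was written at
    ids: set      # values of the equality column (id)


def visible_rows(data_files, equality_deletes):
    """Apply each delete only to data files with a strictly lower seq."""
    result = []
    for data_file in data_files:
        for row in data_file.rows:
            suppressed = any(
                data_file.seq < delete.seq and row["id"] in delete.ids
                for delete in equality_deletes
            )
            if not suppressed:
                result.append(row)
    return result


original = DataFile(seq=1, rows=[
    {"id": 1, "value": 100.0},
    {"id": 2, "value": 200.0},
    {"id": 3, "value": 300.0},
    {"id": 4, "value": 400.0},
])
updated = DataFile(seq=3, rows=[{"id": 1, "value": 999}])
delete = EqualityDelete(seq=3, ids={1})

rows = visible_rows([original, updated], [delete])
print(sorted((r["id"], r["value"]) for r in rows))
# [(1, 999), (2, 200.0), (3, 300.0), (4, 400.0)]
```

The strict comparison is what lets the delete and the new row share a sequence number: the delete reaches back to the seq=1 file but does not touch the row written alongside it.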

write.delete.mode property

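write.delete.mode is a table property hint. It tells writers whether to rewrite the affected data files (copy-on-write) or to write delete files that readers apply later (merge-on-read).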

Writers can override this hint on a per-operation basis. The property sets the default trade-off: fast writes (merge-on-read) vs. fast reads (copy-on-write).

V3 Deletion Vector

Positional deletes mark the set of (file name, position) pairs to be deleted. Readers skip those rows on read. Deletion vectors use a roaring bitmap to encode these positions efficiently.

Let's explore this by creating a V3 table and deleting the same rows. This produces one Puffin DV file and references it in the delete manifest.

Table metadata after V3 delete
format-version: 3
properties:
  write.delete.mode: merge-on-read
current-snapshot-id: 505169285673218063
snapshots:
- sequence-number: 1
  summary:
    operation: append
- sequence-number: 2
  summary:
    operation: delete
    added-delete-files: '1'
    added-dvs: '1'
    added-position-deletes: '2'

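The following delete-manifest entry references a Puffin file and identifies the exact data file it applies to. The fields to note are content (1 means position deletes), referenced_data_file, content_offset, and content_size_in_bytes.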

Deletion vector manifest entry (Puffin)
records:
- data_file:
    content: 1
    file_format: PUFFIN
    file_path: demo_output/iceberg/default/users_v3_deletion_vector_demo/data/00001-1-89bef231-4cbd-4b07-9dea-b5596102da4c-00001.puffin
    referenced_data_file: demo_output/iceberg/default/users_v3_deletion_vector_demo/data/seed-rows-1-4-b08d704d-aec2-414c-8b5e-d2d041784b2e.parquet
    content_offset: 4
    content_size_in_bytes: 44
    record_count: 2
  snapshot_id: 505169285673218063
  status: 1

Deletion vectors are stored as Roaring Bitmaps inside Puffin files. Iceberg uses this format because bitmaps are compact and support fast membership checks and merges. For more detail, see this overview of roaring bitmaps.

Writing Deletion Vectors

DVs require read-before-write. The writer must scan data files to discover row positions before building the bitmap. For DELETE WHERE id > 100, the engine first finds matching row ordinals, then writes the DV.

Each worker handles a disjoint set of data files independently:

Per-worker DV write outcomes
Worker 1 → scans data_file_A → positions {0, 2} → writes dv_A.puffin
Worker 2 → scans data_file_B → positions {5, 9} → writes dv_B.puffin
Worker 3 → scans data_file_C → no matches   → writes nothing

Each worker writes a different Puffin file that includes previous deletes, if any. In this demo flow, that yields one current deletion vector per data file. A coordinator then collects produced DVs and commits them in one snapshot.
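The per-worker flow above can be sketched as follows. This is a simplified model under stated assumptions: a frozenset of positions stands in for the roaring bitmap, each loop iteration stands in for a worker, and `build_dv` is a hypothetical helper rather than an engine API.

```python
def build_dv(rows, predicate, previous_dv=frozenset()):
    """Scan one data file and return its new DV, or None if nothing matched.

    The new DV carries forward positions from the previous DV, if any.
    """
    new_positions = {pos for pos, row in enumerate(rows) if predicate(row)}
    if not new_positions:
        return None  # this worker writes nothing
    return frozenset(new_positions | previous_dv)


def matches(row):
    # Stand-in for: DELETE WHERE id > 100
    return row["id"] > 100


data_files = {
    "data_file_A": [{"id": 101}, {"id": 1}, {"id": 102}, {"id": 2}],
    "data_file_B": [{"id": i} for i in (1, 2, 3, 4, 5, 150, 6, 7, 8, 160)],
    "data_file_C": [{"id": 1}, {"id": 2}, {"id": 3}],
}

# Each iteration models one worker handling its own data file.
produced = {}
for name, rows in data_files.items():
    dv = build_dv(rows, matches)
    if dv is not None:
        produced[name] = dv

# The coordinator commits all produced DVs in one snapshot.
snapshot = {"operation": "delete", "dvs": produced}

# A later delete on data_file_A merges with the DV already in place.
second_dv = build_dv(
    data_files["data_file_A"],
    lambda row: row["id"] == 1,
    previous_dv=produced["data_file_A"],
)
```

Here `produced` holds DVs for data_file_A and data_file_B only, and `second_dv` contains the earlier positions plus the newly deleted one, so data_file_A still has a single current DV.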

Reading Deletion Vectors

Unlike equality deletes (which use sequence numbers), deletion vectors link directly to a data file. Each manifest entry carries referenced_data_file, content_offset, and content_size_in_bytes, so readers can locate the exact bitmap blob without scanning unrelated files.

DV to data-file mapping
Manifest entry: contains referenced_data_file, content_offset, and content_size_in_bytes.
Puffin blob: the reader seeks to the byte offset and reads exactly the DV bitmap bytes.
Referenced data file: rows are scanned and positions present in the bitmap are skipped.

Step 1: Load the manifest list, find the delete manifests (content=1), and pick the entries with file_format=PUFFIN.
Step 2: Seek to content_offset in the Puffin file and read content_size_in_bytes bytes to decode the roaring bitmap.
Step 3: Open referenced_data_file, iterate over its rows, and skip rows whose ordinal position is set in the bitmap.
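Steps 2 and 3 come down to a seek, a bounded read, and a positional filter. The sketch below shows that shape with an in-memory file. It makes one loud simplification: the blob is encoded as plain little-endian uint32 positions, which is a stand-in and not the real roaring bitmap encoding. The helper names and the 4 placeholder header bytes are also assumptions.

```python
import io
import struct


def encode_positions(positions):
    """Stand-in blob encoding (NOT roaring): one little-endian uint32 each."""
    return b"".join(struct.pack("<I", p) for p in sorted(positions))


def decode_positions(blob):
    """Inverse of encode_positions."""
    return {p for (p,) in struct.iter_unpack("<I", blob)}


def read_dv(puffin, content_offset, content_size_in_bytes):
    """Step 2: seek to the blob and read exactly its bytes."""
    puffin.seek(content_offset)
    blob = puffin.read(content_size_in_bytes)
    if len(blob) != content_size_in_bytes:
        raise ValueError("blob is truncated")
    return decode_positions(blob)


def scan(rows, deleted_positions):
    """Step 3: skip rows whose ordinal position is in the DV."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


# Fake Puffin file: 4 placeholder header bytes, the blob, then trailing bytes.
blob = encode_positions({0, 2})
puffin = io.BytesIO(b"\x00" * 4 + blob + b"trailing-bytes")
entry = {"content_offset": 4, "content_size_in_bytes": len(blob)}

rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
deleted = read_dv(puffin, entry["content_offset"], entry["content_size_in_bytes"])
live = scan(rows, deleted)
print([row["id"] for row in live])  # [2, 4]
```

Because the manifest entry carries the offset and size, the reader never has to parse the rest of the Puffin file to find the bitmap for a given data file.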

Streaming and Deletes

Equality deletes are simpler for writers: they can mark rows as deleted without knowing the row's position in a data file.

One common streaming pattern, followed by Flink, works as follows. The writer:

  1. Maintains an id -> (file, position) map in memory for the rows it writes.
  2. On delete, uses the position from this map to write a positional delete.
  3. Falls back to equality delete if the mapping doesn't exist.
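The three steps above can be sketched as a small in-memory writer. This is an illustration of the pattern only, not Flink's implementation: `StreamingWriter` and its fields are hypothetical names, and files and delete files are plain Python lists.

```python
class StreamingWriter:
    """Toy writer that prefers positional deletes for rows it wrote itself."""

    def __init__(self, file_name):
        self.file_name = file_name
        self.rows = []               # rows written to the current data file
        self.index = {}              # id -> (file, position)
        self.position_deletes = []   # list of (file, position)
        self.equality_deletes = []   # list of {"id": ...}

    def insert(self, row):
        # Step 1: remember where each row landed.
        self.index[row["id"]] = (self.file_name, len(self.rows))
        self.rows.append(row)

    def delete(self, row_id):
        location = self.index.pop(row_id, None)
        if location is not None:
            # Step 2: the position is known, so write a positional delete.
            self.position_deletes.append(location)
        else:
            # Step 3: the row was written elsewhere; fall back to equality.
            self.equality_deletes.append({"id": row_id})


writer = StreamingWriter("data-0.parquet")
writer.insert({"id": 1, "name": "User-1"})
writer.insert({"id": 2, "name": "User-2"})
writer.delete(2)    # written by this writer: positional delete
writer.delete(99)   # unknown to this writer: equality delete
```

After these calls, `writer.position_deletes` holds `("data-0.parquet", 1)` and `writer.equality_deletes` holds `{"id": 99}`. The map only covers rows the writer has seen, which is why the equality fallback is still needed.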

As noted earlier, equality deletes can add reader overhead. Improving positional-delete support in streaming engines is one of the goals of the secondary indexes proposal (discussion as of May 2026).

RisingWave, a streaming database, handles deletion and compaction with two branches: ingestion writes continuously to an ingestion branch (likely with equality deletes), and periodic compaction coalesces delete files and consolidates data files on the main branch to improve read performance.

Conclusion

We covered deletes in Iceberg. Merge-on-read schemes are efficient for writers, but they shift work to readers.

If you want to inspect these flows in detail, or ask additional questions, try out TableFormatsExplorer!