#65: Forcing a Terabyte Rewrite to Add a Column Is an Architectural Failure

March 20, 2026

In most data lakes built on raw files, a schema change means a full table rewrite. Add a risk score to a billion-row spatial dataset? Rewrite all files. Rename a legacy attribute? Rewrite all files. This is why raw Parquet lakes, and many traditional databases, grind to a halt when schemas evolve.

Iceberg decouples schema from data by tracking columns by unique ID, not name. Add a column: pure metadata operation. The underlying files don’t change. The new column is available immediately, and rows in existing files read it as null. No compute, no rewrite.
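To make the idea concrete, here is a minimal sketch of ID-based column tracking. It is a toy model in plain Python, not Iceberg's API: the `Table` class, its methods, and the sample rows are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy table: schema maps field ID -> name; files store values by field ID."""
    schema: dict = field(default_factory=dict)
    files: list = field(default_factory=list)
    last_id: int = 0

    def add_column(self, name):
        # Metadata-only change: assign a fresh ID, touch no data file.
        self.last_id += 1
        self.schema[self.last_id] = name
        return self.last_id

    def scan(self):
        # Resolve every current column by ID; IDs absent from a file read as None.
        for rows in self.files:
            for row in rows:
                yield {name: row.get(fid) for fid, name in self.schema.items()}


table = Table()
parcel_id = table.add_column("parcel_id")
zoning = table.add_column("zoning")

# A file written before the schema change (hypothetical sample data).
old_file = [{parcel_id: 101, zoning: "R1"}]
table.files.append(old_file)

# Schema change only: old_file is not rewritten.
risk = table.add_column("risk_score")
table.files.append([{parcel_id: 102, zoning: "C2", risk: 0.7}])

rows = list(table.scan())
print(rows)
```

The row from the old file comes back with `risk_score` set to `None`, and `old_file` itself is unchanged, which is the whole point: the cost of adding the column is independent of how much data the table holds.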

Why this matters for spatial data: Spatial datasets are massive and schema changes are constant. Classification systems get updated. New attributes get added (risk scores, climate projections). Legacy GIS platforms have terrible naming. A production zoning layer with 500 GB of history can’t afford a full rewrite every time you refine a model.

Iceberg’s ID-based tracking makes this safe. Rename owner_name to property_owner in the schema. The underlying Parquet files still call it owner_name, but readers resolve the column by its field ID, so the name stored in the file no longer matters. No data moved. Reads of old and new files keep working; only queries that spell out the old name need to switch to the new one.
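The rename described above can be sketched the same way. This is again a toy model under stated assumptions rather than Iceberg's implementation: `rename_column`, `read`, the field IDs, and the owner values are hypothetical.

```python
# Data file as written: values keyed by field ID. The name recorded at
# write time (owner_name) is incidental once readers resolve by ID.
file_schema = {1: "parcel_id", 2: "owner_name"}
file_rows = [{1: 101, 2: "A. Rivera"}, {1: 102, 2: "B. Chen"}]


def rename_column(schema, old, new):
    """Return a new schema with one name changed; field IDs are untouched."""
    if old not in schema.values():
        raise KeyError(old)
    return {fid: (new if name == old else name) for fid, name in schema.items()}


def read(rows, schema):
    """Project rows through the current schema, matching columns by field ID."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


# The table's current schema starts as a copy of what the file was written with.
table_schema = rename_column(dict(file_schema), "owner_name", "property_owner")

result = read(file_rows, table_schema)
print(result)
```

The file-level schema still says `owner_name`, yet the read returns `property_owner` with the original values, because field ID 2 is what ties the two together.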

Dropping columns is equally instant. Remove an obsolete field from the schema metadata. The files retain the bytes (a small amount of wasted space) until they are next rewritten. New readers ignore them. Older files stay readable because field IDs never change and are never reused.
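A short sketch of why dropping is safe when IDs are never reused. The `Schema` class and the `legacy_code` column are hypothetical; the model only illustrates the principle and does not mirror any real library.

```python
class Schema:
    """Toy schema: current columns by field ID, plus a counter that never goes back."""

    def __init__(self):
        self.fields = {}   # field ID -> name, current columns only
        self.last_id = 0   # highest ID ever assigned

    def add(self, name):
        self.last_id += 1
        self.fields[self.last_id] = name
        return self.last_id

    def drop(self, name):
        # Metadata-only: forget the field, leave the files alone.
        fid = next(f for f, n in self.fields.items() if n == name)
        del self.fields[fid]
        return fid


def read(rows, schema):
    """Project rows through the current schema, matching columns by field ID."""
    return [{n: r.get(f) for f, n in schema.fields.items()} for r in rows]


schema = Schema()
pid = schema.add("parcel_id")
legacy = schema.add("legacy_code")
rows = [{pid: 101, legacy: "Z-9"}]    # file contents, never modified below

schema.drop("legacy_code")
after_drop = read(rows, schema)       # the stored bytes are simply ignored

new_id = schema.add("legacy_code")    # same name, fresh ID
after_readd = read(rows, schema)      # old values do not resurface

print(after_drop, new_id, after_readd)
```

Re-adding a column with the same name gets a new ID, so the stale values still sitting in the file are not mistaken for the new column's data.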

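Raw Parquet data lakes can’t do this. Without a table format, columns are matched by name, so schema drift across partitions surfaces as silent nulls or read-time failures such as NullPointerException. Renaming requires ETL. Adding columns means rewriting files that already exist or leaning on engine-specific schema merging.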
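For contrast, here is a sketch of what name-based resolution does when partitions drift. The partition names and rows are made up, and real engines differ in how they react (some raise, some return nulls); this toy reader takes the silent-null path.

```python
# Two partitions written at different times, with columns keyed by name only.
partition_2023 = [{"parcel_id": 101, "owner_name": "A. Rivera"}]
partition_2025 = [{"parcel_id": 102, "property_owner": "B. Chen"}]


def read_by_name(partitions, columns):
    """Name-based projection: a column missing from a file reads as None."""
    out = []
    for rows in partitions:
        for row in rows:
            out.append({c: row.get(c) for c in columns})
    return out


result = read_by_name(
    [partition_2023, partition_2025],
    ["parcel_id", "property_owner"],
)

# Parcels whose owner was lost because the old file uses the old name.
missing = [r["parcel_id"] for r in result if r["property_owner"] is None]
print(missing)
```

The 2023 owner is still in the file, but nothing connects `owner_name` to `property_owner`, so the value is dropped at read time. Fixing that without a table format means rewriting the old partition.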

The rule: Evolve your schema; don’t rewrite your data. Track attributes by ID, not legacy names. If adding a field requires massive compute, you need a table format.

Updated on Mar 13, 2026

Matt Forrest
