#42: The Fastest Queries Are Engineered During the Write Process

Share

Most teams write GeoParquet with geopandas.to_parquet() and assume it’s optimized. It’s not. You’ve traded read performance for write convenience.

The problem: GeoPandas loads the entire dataset into memory before writing. Works for 1GB. Fails for 100GB. Memory runs out. The query engine reading that file later can’t benefit from spatial clustering because the data was unsorted during the write.

Streaming writes are the alternative. Instead of loading everything, batch the data and serialize row-by-row. PyArrow’s RecordBatch API or DuckDB’s COPY ... TO PARQUET handle this natively. You can write terabytes without touching RAM limits.

During writes, two expensive things happen: WKB serialization (converting raw coordinates to binary), and BBOX calculation (computing min/max bounds for every row group to populate the metadata footer). This metadata is what enables predicate pushdown during reads. You’re buying query speed with CPU cost at write time.

Schema enforcement amplifies this. A column with mixed geometry types (Points and Polygons) prevents vector optimizations in readers. A column with strict single types unlocks faster decompression and spatial filtering.

For small data (a few GB), GeoPandas is fine. For production pipelines, use PyArrow or DuckDB with explicit schema and batch sizing. Sort by space (Hilbert curve) during the write to maximize predicate pushdown later.

The rule: Stream your writes to scale your reads. Never hold the world in RAM. The metadata footer—calculated during write—determines query performance months later.

Whenever you’re ready, here are 4 ways I can help you grow in GIS & spatial data:

​Spatial Lab​ – My private community where GIS professionals, data engineers, and analysts connect, swap workflows, and build repeatable systems together.

​Modern GIS Accelerator​ – A guided program to help you break out of legacy GIS habits and learn modern, scalable workflows.

​Career Compass​ – A career-focused program designed to help GIS pros navigate the job market, sharpen their pitch, and find roles beyond traditional GIS paths.

​Sponsorship​: Interested in sponsoring this newsletter (or other content)? ​Learn more here​ and fill out the form to get in touch!

#41: Compression Is a Performance Dial, Not Just Storage Savings

Prev

43: True Scalability Isn’t Bigger Servers, It’s Removing the Server Entirely

Next
Comments
Add a comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Get every update, in your inbox.
Get every update, in your inbox.
Get every update, in your inbox.
One tip, every day
Get every update, in your inbox.
Subscribe below and join 11,000+ others learning modern GIS.