Delta Lake Optimization Cheatsheet
Highly actionable SQL and PySpark snippets to keep Delta tables fast and healthy: file sizes, partitions, Z-ORDER/Liquid Clustering, OPTIMIZE, VACUUM, and health checks.
Quick Rules of Thumb
File Sizes & Layout
- Write fewer, bigger files: target ~128–512 MB data files.
- Partition by low-cardinality, frequently filtered columns (e.g., date), not by high-cardinality columns (e.g., user_id).
- Use Z-ORDER for high-cardinality columns used in filters/joins. Prefer ≤4 columns.
- Prefer Liquid Clustering over periodic Z-ORDER for large, evolving tables.
- Enable optimizeWrite and autoCompact at table/cluster level to curb small files from day one.
- Schedule OPTIMIZE daily and VACUUM weekly (≥7 days retention in prod).
- Always test VACUUM with DRY RUN first; don’t break time travel requirements.
When to Act
- Small files growing? Enable optimizeWrite + autoCompact; run OPTIMIZE on hot partitions.
- Slow point lookups/joins on high-card columns? Z-ORDER or enable Liquid Clustering on those columns.
- Heavy daily appends on big tables? Prefer Liquid Clustering over recurring Z-ORDER.
- Queries always filter by date? Partition by date and include date filter in queries.
- Considering bucketing? Don’t on Delta in Databricks; use LC/Z-ORDER instead.
0) Minimal Table Definition You’ll Reuse (SQL)
-- Managed table (dt is a generated partition column derived from ts, so filters on ts still prune)
CREATE TABLE IF NOT EXISTS my_catalog.my_schema.sales (
  id BIGINT,
  ts TIMESTAMP,
  country STRING,
  amount DOUBLE,
  dt DATE GENERATED ALWAYS AS (CAST(ts AS DATE))
) USING DELTA
PARTITIONED BY (dt)
TBLPROPERTIES (
  delta.minReaderVersion = '2',
  delta.minWriterVersion = '7',
  delta.appendOnly = 'false',
  delta.autoOptimize.optimizeWrite = 'true',
  delta.autoOptimize.autoCompact = 'true',
  delta.logRetentionDuration = '30 days',
  delta.deletedFileRetentionDuration = '7 days'
);
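A rough PySpark equivalent using the DeltaTable builder API (same columns and properties as above; adjust the names to your environment):
from delta.tables import DeltaTable

# Sketch: the same reusable table definition via the Python builder
(DeltaTable.createIfNotExists(spark)
    .tableName("my_catalog.my_schema.sales")
    .addColumn("id", "BIGINT")
    .addColumn("ts", "TIMESTAMP")
    .addColumn("country", "STRING")
    .addColumn("amount", "DOUBLE")
    .addColumn("dt", "DATE", generatedAlwaysAs="CAST(ts AS DATE)")
    .partitionedBy("dt")
    .property("delta.autoOptimize.optimizeWrite", "true")
    .property("delta.autoOptimize.autoCompact", "true")
    .property("delta.logRetentionDuration", "30 days")
    .property("delta.deletedFileRetentionDuration", "7 days")
    .execute())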
1) Optimize Write + Auto Compaction
Table-level (recommended)
ALTER TABLE my_catalog.my_schema.sales SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = 'true',
  delta.autoOptimize.autoCompact = 'true'
);
Session-level toggles
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;
Python (Spark) session options
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
When to use: Always enable on write-heavy tables (batch or streaming) to reduce small-file proliferation at the source.
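With the properties above set, an ordinary batch append needs nothing extra; file sizing happens on write. A minimal sketch (staging.new_sales is a hypothetical source):
# Sketch: plain append; optimizeWrite/autoCompact handle small files automatically
(spark.table("staging.new_sales")   # hypothetical staging source
    .write.format("delta")
    .mode("append")
    .saveAsTable("my_catalog.my_schema.sales"))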
2) OPTIMIZE + Z-ORDER (Query Speed via File/Locality Layout)
Compact small files and co-locate related data
OPTIMIZE my_catalog.my_schema.sales
ZORDER BY (ts, id);
-- Ad-hoc optimize on recent data only (the WHERE clause must filter on partition columns)
OPTIMIZE my_catalog.my_schema.sales
WHERE dt >= date_sub(current_date(), 30)
ZORDER BY (ts);
Python API
from delta.tables import DeltaTable
delta_table = DeltaTable.forName(spark, "my_catalog.my_schema.sales")
delta_table.optimize().executeCompaction()            # compaction only
delta_table.optimize().executeZOrderBy("ts", "id")    # or Z-ORDER (also compacts)
- Use after large batch writes or when many small files accumulate.
- Z-ORDER high-cardinality, frequently filtered/joined columns (≤4 cols).
- Don’t Z-ORDER low-cardinality columns (country/boolean) - partition those instead.
3) Liquid Clustering (Self-Healing Layout at Scale)
Enable or modify clustering
ALTER TABLE my_catalog.my_schema.sales CLUSTER BY (ts, id);
Inspect clustering
DESCRIBE DETAIL my_catalog.my_schema.sales; -- check the clusteringColumns field
SHOW TBLPROPERTIES my_catalog.my_schema.sales; -- review clustering-related properties
Apply maintenance (reclustering runs under OPTIMIZE)
OPTIMIZE my_catalog.my_schema.sales; -- respects liquid clustering
Turn off Liquid Clustering
ALTER TABLE my_catalog.my_schema.sales CLUSTER BY NONE;
When to use: Large/hot tables with continuous writes and diverse query patterns. Prefer LC over repeated Z-ORDER for simpler ops and more stable performance. Note that Liquid Clustering replaces partitioning and Z-ORDER; a table cannot combine CLUSTER BY with PARTITIONED BY or ZORDER BY.
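Clustering can also be declared at creation time instead of via ALTER TABLE. A minimal sketch reusing the columns from section 0 (sales_lc is an illustrative table name; note there is no PARTITIONED BY):
# Sketch: create a table with Liquid Clustering from day one
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_catalog.my_schema.sales_lc (
    id BIGINT,
    ts TIMESTAMP,
    country STRING,
    amount DOUBLE
  ) USING DELTA
  CLUSTER BY (ts, id)
""")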
4) Partitioning (Pruning, Not a Silver Bullet)
Create with partitioning (low-cardinality columns only)
CREATE TABLE my_catalog.my_schema.events USING DELTA
PARTITIONED BY (dt) AS
SELECT *, CAST(ts AS DATE) AS dt FROM source_view;
Query with partition filters for pruning
SELECT * FROM my_catalog.my_schema.events WHERE dt BETWEEN '2025-01-01' AND '2025-01-31';
Changing partitioning = CTAS into a new table
CREATE OR REPLACE TABLE my_catalog.my_schema.events_repart USING DELTA
PARTITIONED BY (dt)
AS SELECT *, CAST(ts AS DATE) AS dt FROM my_catalog.my_schema.events;
- Choose partitions you almost always filter by (date, region). Keep the total partition count manageable and each partition reasonably large.
- Avoid high-cardinality partitions (user_id) - leads to too many tiny files/directories.
- Combine with LC or Z-ORDER to improve within-partition data skipping.
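For PySpark pipelines, a rough equivalent of the partitioned CTAS above (source_view as in the SQL example; events_py is an illustrative target name):
from pyspark.sql import functions as F

# Sketch: derive the partition column, then write a partitioned Delta table
(spark.table("source_view")
    .withColumn("dt", F.to_date("ts"))
    .write.format("delta")
    .partitionBy("dt")
    .saveAsTable("my_catalog.my_schema.events_py"))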
5) Bucketing (Don’t for Delta on Databricks)
- Classic Hive-style bucketing is not supported for Delta Lake on Databricks and generally not recommended.
- Prefer partitioning + Liquid Clustering or Z-ORDER.
- If you see CLUSTER BY inside a CTAS/INSERT query body, that's a write-time shuffle/ordering hint, not persistent bucketing (and not the table-level CLUSTER BY used by Liquid Clustering); see the illustrative snippet below.
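For reference, a query-level CLUSTER BY only redistributes and sorts that single write (a sketch; staging.events_raw is a hypothetical source matching the target's schema):
# Sketch: CLUSTER BY in the query body is a one-off shuffle/sort hint, not bucketing
spark.sql("""
  INSERT INTO my_catalog.my_schema.events
  SELECT * FROM staging.events_raw
  CLUSTER BY id
""")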
6) VACUUM (Safe File Cleanup)
Always start with a dry run
VACUUM my_catalog.my_schema.sales DRY RUN;
Safe cleanup (default minimum retention = 7 days)
VACUUM my_catalog.my_schema.sales RETAIN 168 HOURS; -- 7 days
Lower retention for dev only (disable safety check)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM my_catalog.my_schema.sales RETAIN 24 HOURS;
- Keep ≥7 days in prod to preserve time travel SLAs and avoid data loss during long-running jobs.
- Schedule weekly during maintenance windows; test with DRY RUN.
- Align delta.deletedFileRetentionDuration with your recovery requirements.
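Python equivalent via the DeltaTable API (retention is given in hours):
from delta.tables import DeltaTable
# Sketch: vacuum with an explicit 7-day (168-hour) retention window
DeltaTable.forName(spark, "my_catalog.my_schema.sales").vacuum(168)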
7) Health Checks & When to Act
Table details & history
DESCRIBE DETAIL my_catalog.my_schema.sales;
DESCRIBE HISTORY my_catalog.my_schema.sales;
Basic signal: file count vs. average file size (numFiles and sizeInBytes in DESCRIBE DETAIL output)
ANALYZE TABLE my_catalog.my_schema.sales COMPUTE STATISTICS; -- refresh optimizer statistics
Heuristic: tables with many files and small avg file size
-- Adjust the FROM clause to whatever metadata source in your workspace exposes
-- per-table num_files and size_in_bytes (e.g., a view built from DESCRIBE DETAIL output).
SELECT
  table_schema,
  table_name,
  num_files,
  size_in_bytes / NULLIF(num_files, 0) AS avg_file_bytes,
  CASE
    WHEN num_files > 1000
     AND (size_in_bytes / NULLIF(num_files, 0)) < 134217728 -- avg < 128 MB
    THEN 'NEEDS_OPTIMIZE'
    ELSE 'OK'
  END AS status
FROM system.information_schema.table_storage
WHERE table_catalog = 'my_catalog' AND table_schema = 'my_schema'
ORDER BY num_files DESC
LIMIT 100;
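If no such metadata view is available, a per-table check built on DESCRIBE DETAIL works anywhere (a sketch; the thresholds mirror the heuristic above):
# Sketch: flag tables whose average file size falls below ~128 MB
tables = ["my_catalog.my_schema.sales", "my_catalog.my_schema.events"]
for t in tables:
    d = spark.sql(f"DESCRIBE DETAIL {t}").select("numFiles", "sizeInBytes").first()
    avg_bytes = d.sizeInBytes / d.numFiles if d.numFiles else 0
    status = "NEEDS_OPTIMIZE" if d.numFiles > 1000 and avg_bytes < 128 * 1024 * 1024 else "OK"
    print(t, d.numFiles, int(avg_bytes), status)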
8) Maintenance Jobs (Templates)
# Daily OPTIMIZE (off-peak)
for table in ["catalog.schema.table1", "catalog.schema.table2"]:
    spark.sql(f"OPTIMIZE {table}")

# Weekly VACUUM (30 days retention)
for table in ["catalog.schema.table1", "catalog.schema.table2"]:
    spark.sql(f"VACUUM {table} RETAIN 720 HOURS")

# With Liquid Clustering enabled, a simple OPTIMIZE will recluster as needed.
9) Copy/Paste Fast Start Block (SQL)
-- Enable table-level auto optimizations
ALTER TABLE my_catalog.my_schema.sales SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = 'true',
  delta.autoOptimize.autoCompact = 'true'
);
-- Optimize + Z-ORDER (or rely on Liquid Clustering if enabled)
OPTIMIZE my_catalog.my_schema.sales ZORDER BY (ts, id);
-- Turn on Liquid Clustering (optional, modern default for big tables; replaces partitioning/Z-ORDER)
ALTER TABLE my_catalog.my_schema.sales CLUSTER BY (ts, id);
-- Weekly cleanup
VACUUM my_catalog.my_schema.sales RETAIN 168 HOURS; -- 7 days
-- Health
DESCRIBE DETAIL my_catalog.my_schema.sales;
DESCRIBE HISTORY my_catalog.my_schema.sales;