Delta Lake Maintenance Cheatsheet
Practical SQL and PySpark snippets for healthy Delta tables: safe DDL, schema evolution, housekeeping, time travel, CDF, monitoring, cleanup, and admin-lite ops.
1) Create/Register Tables Safely
Managed table (Unity Catalog recommended)
CREATE TABLE IF NOT EXISTS my_catalog.my_schema.sales (
id BIGINT GENERATED ALWAYS AS IDENTITY,
ts TIMESTAMP,
country STRING,
amount DOUBLE,
-- Useful derived column for pruning
dt DATE GENERATED ALWAYS AS (CAST(ts AS DATE))
) USING DELTA
TBLPROPERTIES (
delta.minReaderVersion = '2',
delta.minWriterVersion = '7',
-- Enables safe column renames/drops with name-based column mapping
delta.columnMapping.mode = 'name',
-- Retention and housekeeping (see §3)
delta.logRetentionDuration = '30 days',
delta.deletedFileRetentionDuration = '7 days'
);
External table (register an existing Delta location)
CREATE TABLE my_catalog.my_schema.sales_ext
USING DELTA LOCATION 's3://your-bucket/path/to/delta-table';
Convert existing Parquet folder (one-time)
-- Add PARTITIONED BY (col TYPE, ...) if the source folder is partitioned
CONVERT TO DELTA parquet.`s3://your-bucket/path/to/parquet-folder`;
2) Schema Enforcement & Evolution
Explicit changes
ALTER TABLE my_catalog.my_schema.sales ADD COLUMNS (channel STRING);
ALTER TABLE my_catalog.my_schema.sales RENAME COLUMN channel TO sales_channel;
ALTER TABLE my_catalog.my_schema.sales DROP COLUMN sales_channel;
-- Widening/compatible type change (only succeeds where Delta supports the
-- conversion; newer runtimes gate this behind the type widening table feature)
ALTER TABLE my_catalog.my_schema.sales ALTER COLUMN amount SET DATA TYPE DECIMAL(18,2);
Automatic schema merging (session)
SET spark.databricks.delta.schema.autoMerge.enabled = true;
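If you prefer not to flip the session-wide flag, schema merging can also be requested per write via the DataFrame writer's mergeSchema option. A minimal PySpark sketch, reusing the sales_updates source from the upsert example below:
# Per-write schema merge: new columns in the source are added to the target table.
updates_df = spark.table("my_catalog.my_schema.sales_updates")
(updates_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("my_catalog.my_schema.sales"))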
Upsert pattern (SCD Type 0/1-style)
MERGE INTO my_catalog.my_schema.sales as t
USING my_catalog.my_schema.sales_updates as s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
3) Housekeeping Table Properties
Adjust retention and CDF
ALTER TABLE my_catalog.my_schema.sales SET TBLPROPERTIES (
delta.logRetentionDuration = '30 days', -- keep transaction log
delta.deletedFileRetentionDuration = '7 days', -- keep deleted files
delta.checkpointInterval = 10, -- commits between checkpoints
delta.enableChangeDataFeed = true, -- enable CDF
delta.appendOnly = 'false' -- set to 'true' for append-only tables
);
SHOW TBLPROPERTIES my_catalog.my_schema.sales;
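Note that delta.deletedFileRetentionDuration only takes effect when the table is vacuumed. A minimal PySpark sketch of that housekeeping step, assuming it runs in a scheduled maintenance job:
# Preview which unreferenced files would be removed, then actually remove files
# older than the deleted-file retention window configured above.
spark.sql("VACUUM my_catalog.my_schema.sales DRY RUN").show(truncate=False)
spark.sql("VACUUM my_catalog.my_schema.sales")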
4) Constraints, Identity, Defaults, Generated Columns
Data quality checks
ALTER TABLE my_catalog.my_schema.sales ADD CONSTRAINT non_negative_amount CHECK (amount >= 0);
-- Drop constraint
ALTER TABLE my_catalog.my_schema.sales DROP CONSTRAINT non_negative_amount;
NOT NULL & DEFAULT values
ALTER TABLE my_catalog.my_schema.sales ALTER COLUMN country SET NOT NULL;
-- Requires the column-defaults table feature, e.g.
-- ALTER TABLE ... SET TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported')
ALTER TABLE my_catalog.my_schema.sales ALTER COLUMN country SET DEFAULT 'UNK';
Identity column (example)
id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
5) Time Travel, Restore, and Clones
Query historical data
SELECT * FROM my_catalog.my_schema.sales VERSION AS OF 10;
SELECT * FROM my_catalog.my_schema.sales TIMESTAMP AS OF '2025-01-01 00:00:00';
Restore table (atomic)
RESTORE TABLE my_catalog.my_schema.sales TO VERSION AS OF 10;
RESTORE TABLE my_catalog.my_schema.sales TO TIMESTAMP AS OF '2025-01-01 00:00:00';
Cloning for dev/test
-- A shallow clone creates a cheap, fast copy of the metadata
CREATE OR REPLACE TABLE my_catalog.my_schema.sales_dev SHALLOW CLONE my_catalog.my_schema.sales;
-- A deep clone makes a full physical copy of the data at the time of the clone
CREATE OR REPLACE TABLE my_catalog.my_schema.sales_backup DEEP CLONE my_catalog.my_schema.sales;
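Clones can also be taken as of a historical version, which is useful for reproducing an issue without touching production. A sketch, assuming version 10 exists in the history and using an illustrative target name:
# Metadata-only copy of the table as it looked at version 10.
spark.sql("""
  CREATE OR REPLACE TABLE my_catalog.my_schema.sales_v10_debug
  SHALLOW CLONE my_catalog.my_schema.sales VERSION AS OF 10
""")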
6) Change Data Feed (CDF)
Query changes with SQL
SELECT *
FROM table_changes('my_catalog.my_schema.sales', 100, 120); -- versions [start, end]
Read CDF in PySpark (batch)
from pyspark.sql.functions import col
df = (spark.read
.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 100)
.option("endingVersion", 120)
.table("my_catalog.my_schema.sales"))
display(df)
Read CDF in Structured Streaming
sdf = (spark.readStream
.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", "latest")
.table("my_catalog.my_schema.sales"))
query = (sdf.writeStream
.format("delta")
.option("checkpointLocation", "/chk/sales_cdf_consumer")
.toTable("my_catalog.my_schema.sales_cdc_out"))
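# In a job (outside a notebook), keep the stream running with query.awaitTermination()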
7) Table Health, Monitoring, and Stats
Quick health checks
DESCRIBE DETAIL my_catalog.my_schema.sales;
DESCRIBE HISTORY my_catalog.my_schema.sales;
Compute/refresh stats
ANALYZE TABLE my_catalog.my_schema.sales COMPUTE STATISTICS;
ANALYZE TABLE my_catalog.my_schema.sales COMPUTE STATISTICS FOR COLUMNS id, ts, amount;
Identify tables with long history
-- DESCRIBE HISTORY returns one row per commit; the `version` column of the newest
-- row is the table's latest version number. Run this per table, or loop over a
-- schema to flag tables with unusually long histories (see the sketch below).
DESCRIBE HISTORY my_catalog.my_schema.sales LIMIT 1;
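To scan a whole schema rather than one table, a minimal PySpark sketch (assuming you can run DESCRIBE HISTORY on every table; the 500-commit threshold is arbitrary):
# Flag Delta tables in a schema whose latest version number exceeds a threshold.
schema = "my_catalog.my_schema"
long_history = []
for t in spark.sql(f"SHOW TABLES IN {schema}").collect():
    name = f"{schema}.{t.tableName}"
    try:
        latest = spark.sql(f"DESCRIBE HISTORY {name} LIMIT 1").collect()[0]["version"]
    except Exception:
        continue  # skip views and non-Delta tables
    if latest > 500:
        long_history.append((name, latest))
for name, latest in sorted(long_history, key=lambda r: -r[1]):
    print(name, latest)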
Manifest for external readers (Presto/Athena/Trino)
GENERATE symlink_format_manifest FOR TABLE my_catalog.my_schema.sales;
8) Data Cleanup & Dedup Patterns
Targeted delete
DELETE FROM my_catalog.my_schema.sales WHERE dt < date_sub(current_date(), 365);
Null normalization
UPDATE my_catalog.my_schema.sales SET country = 'UNK' WHERE country IS NULL;
Idempotent dedup (keep latest per id)
CREATE OR REPLACE TABLE my_catalog.my_schema.sales_dedup AS
WITH ranked AS (
SELECT *,
row_number() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
FROM my_catalog.my_schema.sales
)
SELECT * FROM ranked WHERE rn = 1;
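The same keep-latest-per-id pattern in PySpark, for pipelines that already work with DataFrames (writes to the same sales_dedup target as above):
from pyspark.sql import functions as F, Window

# Rank rows per id by recency and keep only the newest one.
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
deduped = (spark.table("my_catalog.my_schema.sales")
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn"))
deduped.write.format("delta").mode("overwrite").saveAsTable("my_catalog.my_schema.sales_dedup")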
9) Location & Ownership Ops (Admin-lite)
Change table location (external tables; this re-points the table, it does not move data files)
ALTER TABLE my_catalog.my_schema.sales SET LOCATION 's3://new-bucket/new/path';
Ownership and grants (Unity Catalog)
ALTER TABLE my_catalog.my_schema.sales OWNER TO `data_engineering_team`;
GRANT SELECT, MODIFY ON TABLE my_catalog.my_schema.sales TO `analyst_role`;
10) Minimal Playbooks
A) New table rollout (safe defaults)
- Create the table with identity, generated date, column mapping, and retention policies.
- Add constraints (NOT NULL, CHECK) and default values.
- Enable CDF if downstream consumers need changes.
- Compute statistics.
B) Upsert pipeline (MERGE) with evolving schema
- Set spark.databricks.delta.schema.autoMerge.enabled = true for the session.
- Use MERGE with UPDATE SET * / INSERT *.
- Add explicit DDL for intentional type changes.
- Use DESCRIBE HISTORY to audit writes.
C) Incident recovery
- Validate data with time travel: SELECT ... VERSION AS OF.
- RESTORE TABLE to a known-good version (see the sketch below).
- Re-compute stats and notify stakeholders.
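A minimal PySpark sketch of the recovery flow, assuming version 10 is the known-good version:
good_version = 10  # assumption: identified via DESCRIBE HISTORY

# 1) Validate: compare the current table against the known-good snapshot.
current = spark.table("my_catalog.my_schema.sales")
known_good = spark.read.option("versionAsOf", good_version).table("my_catalog.my_schema.sales")
print("rows now:", current.count(), "| rows at known-good version:", known_good.count())

# 2) Restore atomically, then refresh statistics.
spark.sql(f"RESTORE TABLE my_catalog.my_schema.sales TO VERSION AS OF {good_version}")
spark.sql("ANALYZE TABLE my_catalog.my_schema.sales COMPUTE STATISTICS")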
11) Rules of Thumb
- Keep delta.logRetentionDuration ≥ delta.deletedFileRetentionDuration.
- Enable delta.columnMapping.mode = 'name' early to allow renames.
- Use NOT NULL, CHECK, DEFAULT, and GENERATED to encode contracts.
- Prefer identity columns over client-generated surrogate keys.
- Enable CDF only when you have active consumers.
- After structural changes or large merges, run ANALYZE TABLE.
- Use SHALLOW CLONE for dev copies and DEEP CLONE for backups.
- Document every ALTER TABLE and track changes via DESCRIBE HISTORY.
12) PySpark Helper Snippets
from delta.tables import DeltaTable
# Table handle
dt = DeltaTable.forName(spark, "my_catalog.my_schema.sales")
# Simple upsert
(dt.alias("t")
.merge(
spark.table("my_catalog.my_schema.sales_updates").alias("s"),
"t.id = s.id"
)
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
# Inspect history
display(spark.sql("DESCRIBE HISTORY my_catalog.my_schema.sales"))
# Read Change Data Feed (batch)
cdf = (spark.read
.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 100)
.table("my_catalog.my_schema.sales"))
display(cdf)