Delta Lake Maintenance Cheatsheet

Practical SQL and PySpark snippets for healthy Delta tables: safe DDL, schema evolution, housekeeping, time travel, CDF, monitoring, cleanup, and admin-lite ops.


1) Create/Register Tables Safely

Managed table (Unity Catalog recommended)

CREATE TABLE IF NOT EXISTS my_catalog.my_schema.sales (
    id BIGINT GENERATED ALWAYS AS IDENTITY,
    ts TIMESTAMP,
    country STRING,
    amount DOUBLE,
    -- Useful derived column for pruning
    dt DATE GENERATED ALWAYS AS (CAST(ts AS DATE))
) USING DELTA
TBLPROPERTIES (
    delta.minReaderVersion = '2',
    delta.minWriterVersion = '7',
    -- Enables safe column renames/drops with name-based column mapping
    delta.columnMapping.mode = 'name',
    -- Retention and housekeeping (see §3)
    delta.logRetentionDuration = '30 days',
    delta.deletedFileRetentionDuration = '7 days'
);

External table (register an existing Delta location)

CREATE TABLE my_catalog.my_schema.sales_ext
USING DELTA LOCATION 's3://your-bucket/path/to/delta-table';

Convert existing Parquet folder (one-time)

CONVERT TO DELTA parquet.`s3://your-bucket/path/to/parquet-folder`;
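
If the Parquet folder is partitioned, CONVERT TO DELTA needs the partition schema spelled out or it will misread the layout. A minimal sketch, assuming a hypothetical date partition column dt:

# Hypothetical partitioned layout: .../parquet-folder/dt=2025-01-01/...
spark.sql("""
  CONVERT TO DELTA parquet.`s3://your-bucket/path/to/parquet-folder`
  PARTITIONED BY (dt DATE)
""")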

2) Schema Enforcement & Evolution

Explicit changes

ALTER TABLE my_catalog.my_schema.sales ADD COLUMNS (channel STRING);

ALTER TABLE my_catalog.my_schema.sales RENAME COLUMN channel TO sales_channel;

ALTER TABLE my_catalog.my_schema.sales DROP COLUMN sales_channel;

-- Type changes are limited to widening conversions (e.g. INT -> BIGINT) and, on
-- recent runtimes, require the delta.enableTypeWidening table property; anything
-- else (e.g. DOUBLE -> DECIMAL) means rewriting the column or table instead.
ALTER TABLE my_catalog.my_schema.sales ALTER COLUMN quantity SET DATA TYPE BIGINT; -- assumes a hypothetical INT column

Automatic schema merging (session)

SET spark.databricks.delta.schema.autoMerge.enabled = true;
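
The session flag above enables automatic schema evolution for the whole session, most notably for MERGE. For a single DataFrame append, the per-write mergeSchema option scopes the same behavior to one write; a minimal sketch, where updates_df is an illustrative DataFrame whose schema may contain new columns:

# Allow new columns on this append only, without changing session settings
(updates_df.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")  # merge extra columns into the table schema
  .saveAsTable("my_catalog.my_schema.sales"))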

Upsert pattern (SCD Type 1-style: overwrite on match)

MERGE INTO my_catalog.my_schema.sales AS t
USING my_catalog.my_schema.sales_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

3) Housekeeping Table Properties

Adjust retention and CDF

ALTER TABLE my_catalog.my_schema.sales SET TBLPROPERTIES (
    delta.logRetentionDuration = '30 days', -- keep transaction log
    delta.deletedFileRetentionDuration = '7 days', -- keep deleted files
    delta.checkpointInterval = 10, -- commits between checkpoints
    delta.enableChangeDataFeed = true, -- enable CDF
    delta.appendOnly = 'false' -- set to 'true' for append-only tables
);

SHOW TBLPROPERTIES my_catalog.my_schema.sales;
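
These retention settings mostly pay off at cleanup time: deletedFileRetentionDuration bounds both time travel and what a vacuum is allowed to delete. A minimal housekeeping pass, assuming the values set above:

# Compact small files, then drop data files unreferenced for longer than
# delta.deletedFileRetentionDuration (7 days in the example above)
spark.sql("OPTIMIZE my_catalog.my_schema.sales")
spark.sql("VACUUM my_catalog.my_schema.sales")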

4) Constraints, Identity, Defaults, Generated Columns

Data quality checks

ALTER TABLE my_catalog.my_schema.sales ADD CONSTRAINT non_negative_amount CHECK (amount >= 0);

-- Drop constraint
ALTER TABLE my_catalog.my_schema.sales DROP CONSTRAINT non_negative_amount;

NOT NULL & DEFAULT values

-- Fails if existing rows violate the constraint (i.e. NULLs already present)
ALTER TABLE my_catalog.my_schema.sales ALTER COLUMN country SET NOT NULL;

-- On Databricks, column defaults require the table feature
-- 'delta.feature.allowColumnDefaults' = 'supported'; the default applies to new rows only
ALTER TABLE my_catalog.my_schema.sales ALTER COLUMN country SET DEFAULT 'UNK';

Identity column (example)

id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)

5) Time Travel, Restore, and Clones

Query historical data

SELECT * FROM my_catalog.my_schema.sales VERSION AS OF 10;

SELECT * FROM my_catalog.my_schema.sales TIMESTAMP AS OF '2025-01-01 00:00:00';

Restore table (atomic)

RESTORE TABLE my_catalog.my_schema.sales TO VERSION AS OF 10;

RESTORE TABLE my_catalog.my_schema.sales TO TIMESTAMP AS OF '2025-01-01 00:00:00';

Cloning for dev/test

-- A shallow clone creates a cheap, fast copy of the metadata
CREATE OR REPLACE TABLE my_catalog.my_schema.sales_dev SHALLOW CLONE my_catalog.my_schema.sales;

-- A deep clone makes a full physical copy of the data at the time of the clone
CREATE OR REPLACE TABLE my_catalog.my_schema.sales_backup DEEP CLONE my_catalog.my_schema.sales;

6) Change Data Feed (CDF)

Query changes with SQL

SELECT *
FROM table_changes('my_catalog.my_schema.sales', 100, 120); -- versions [start, end]

Read CDF in PySpark (batch)

from pyspark.sql.functions import col
df = (spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 100)
  .option("endingVersion", 120)
  .table("my_catalog.my_schema.sales"))
display(df)
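
Each change row carries the metadata columns _change_type, _commit_version, and _commit_timestamp. A common follow-up is to keep only deletes and post-update images:

from pyspark.sql.functions import col

# _change_type values: insert, delete, update_preimage, update_postimage
changes = df.filter(col("_change_type").isin("delete", "update_postimage"))
display(changes)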

Read CDF in Structured Streaming

sdf = (spark.readStream
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", "latest")
  .table("my_catalog.my_schema.sales"))

query = (sdf.writeStream
  .format("delta")
  .option("checkpointLocation", "/chk/sales_cdf_consumer")
  .toTable("my_catalog.my_schema.sales_cdc_out"))

7) Table Health, Monitoring, and Stats

Quick health checks

DESCRIBE DETAIL my_catalog.my_schema.sales;
DESCRIBE HISTORY my_catalog.my_schema.sales;

Compute/refresh stats

ANALYZE TABLE my_catalog.my_schema.sales COMPUTE STATISTICS;
ANALYZE TABLE my_catalog.my_schema.sales COMPUTE STATISTICS FOR COLUMNS id, ts, amount;

Identify tables with long history (information_schema has no Delta history view, so loop over DESCRIBE HISTORY)

# Flag tables in a schema that still retain more than 500 versions
for row in spark.sql("SHOW TABLES IN my_catalog.my_schema").collect():
    name = f"my_catalog.my_schema.{row.tableName}"
    try:
        versions = spark.sql(f"DESCRIBE HISTORY {name}").count()
    except Exception:
        continue  # skip views and non-Delta objects
    if versions > 500:
        print(name, versions)

Manifest for external readers (Presto/Athena/Trino)

GENERATE symlink_format_manifest FOR TABLE my_catalog.my_schema.sales;
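
To avoid regenerating the manifest by hand after every write, Delta can keep it in sync automatically via a table property (a sketch; confirm support on your runtime):

# Regenerate the symlink manifest automatically on each table write
spark.sql("""
  ALTER TABLE my_catalog.my_schema.sales
  SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)
""")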

8) Data Cleanup & Dedup Patterns

Targeted delete

DELETE FROM my_catalog.my_schema.sales WHERE dt < date_sub(current_date(), 365);

Null normalization

UPDATE my_catalog.my_schema.sales SET country = 'UNK' WHERE country IS NULL;

Idempotent dedup (keep latest per id)

CREATE OR REPLACE TABLE my_catalog.my_schema.sales_dedup AS
WITH ranked AS (
  SELECT *,
    row_number() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
  FROM my_catalog.my_schema.sales
)
SELECT * EXCEPT (rn) FROM ranked WHERE rn = 1; -- EXCEPT drops the helper column from the result

9) Location & Ownership Ops (Admin-lite)

Change table location

-- External tables only: repoints the metadata to the new path; it does not move or copy data
ALTER TABLE my_catalog.my_schema.sales SET LOCATION 's3://new-bucket/new/path';

Ownership and grants (Unity Catalog)

ALTER TABLE my_catalog.my_schema.sales OWNER TO `data_engineering_team`;
GRANT SELECT, MODIFY ON TABLE my_catalog.my_schema.sales TO `analyst_role`;

10) Minimal Playbooks

A) New table rollout (safe defaults)

  1. Create the table with identity, generated date, column mapping, and retention policies.
  2. Add constraints (NOT NULL, CHECK) and default values.
  3. Enable CDF if downstream consumers need changes.
  4. Compute statistics (see the combined sketch below).
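
Steps 2-4 as one script (a sketch that reuses the sales table from section 1 and assumes step 1 has already run):

# Hypothetical rollout script for the sales table created in section 1
fqn = "my_catalog.my_schema.sales"
spark.sql(f"ALTER TABLE {fqn} ADD CONSTRAINT non_negative_amount CHECK (amount >= 0)")  # step 2
spark.sql(f"ALTER TABLE {fqn} ALTER COLUMN country SET NOT NULL")                       # step 2
spark.sql(f"ALTER TABLE {fqn} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")   # step 3
spark.sql(f"ANALYZE TABLE {fqn} COMPUTE STATISTICS")                                    # step 4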

B) Upsert pipeline (MERGE) with evolving schema

  1. Set spark.databricks.delta.schema.autoMerge.enabled = true for the session.
  2. Use MERGE with UPDATE SET * / INSERT *.
  3. Add explicit DDL for intentional type changes.
  4. Use DESCRIBE HISTORY to audit writes (see the sketch below).
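
The same pipeline in PySpark (a sketch; it reuses the sales_updates staging table from section 2):

from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")  # step 1

(DeltaTable.forName(spark, "my_catalog.my_schema.sales").alias("t")        # step 2
  .merge(spark.table("my_catalog.my_schema.sales_updates").alias("s"), "t.id = s.id")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute())

# step 3: run explicit ALTER TABLE DDL separately for intentional type changes

display(spark.sql("DESCRIBE HISTORY my_catalog.my_schema.sales LIMIT 5"))  # step 4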

C) Incident recovery

  1. Validate data with time travel SELECT ... VERSION AS OF.
  2. RESTORE TABLE to a known-good version.
  3. Re-compute stats and notify stakeholders (see the sketch below).
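
As a script (a sketch; known_good_version is a hypothetical value your investigation identifies):

fqn = "my_catalog.my_schema.sales"
known_good_version = 10  # hypothetical: the last version that passed validation

# 1. Validate with time travel before touching the live table
display(spark.sql(f"SELECT count(*) AS row_count, sum(amount) AS total FROM {fqn} VERSION AS OF {known_good_version}"))

# 2. Roll back atomically
spark.sql(f"RESTORE TABLE {fqn} TO VERSION AS OF {known_good_version}")

# 3. Refresh statistics (stakeholder notification is up to you)
spark.sql(f"ANALYZE TABLE {fqn} COMPUTE STATISTICS")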

11) Rules of Thumb

  • Keep delta.logRetentionDuration ≥ delta.deletedFileRetentionDuration.
  • Enable delta.columnMapping.mode = 'name' early to allow renames.
  • Use NOT NULL, CHECK, DEFAULT, and GENERATED to encode contracts.
  • Prefer identity columns over client-generated surrogate keys.
  • Enable CDF only when you have active consumers.
  • After structural changes or large merges, run ANALYZE TABLE.
  • Use SHALLOW CLONE for dev copies and DEEP CLONE for backups.
  • Document every ALTER TABLE and track via DESCRIBE HISTORY.

12) PySpark Helper Snippets

from delta.tables import DeltaTable

# Table handle
dt = DeltaTable.forName(spark, "my_catalog.my_schema.sales")

# Simple upsert
(dt.alias("t")
  .merge(
    spark.table("my_catalog.my_schema.sales_updates").alias("s"),
    "t.id = s.id"
  )
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute())

# Inspect history
display(spark.sql("DESCRIBE HISTORY my_catalog.my_schema.sales"))

# Read Change Data Feed (batch)
cdf = (spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 100)
  .table("my_catalog.my_schema.sales"))
display(cdf)