Databricks PySpark Configuration Cheatsheet

Essential spark.conf.set() configurations for Databricks data engineers to optimize performance, memory management, and job execution.

1. How to Set Configurations

# Basic syntax
spark.conf.set("configuration.key", "value")

# Example
spark.conf.set("spark.sql.adaptive.enabled", "true")

2. Performance Optimization 🚀

Adaptive Query Execution (AQE)

Enable dynamic query optimization at runtime

AQE rewrites query plans at runtime using observed statistics to coalesce partitions, switch join strategies, and mitigate skew. Use it to stabilize performance when data volumes vary or joins/aggregations suffer from stragglers.

# Enable AQE (enabled by default in Spark 3.2+)
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Enable dynamic partition coalescing
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Prefer local shuffle readers when AQE converts sort-merge joins to broadcast joins
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")

# Enable skew join optimization
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

Shuffle Optimization

Critical for join and aggregation performance

Shuffle settings control parallelism and task sizes for data exchanges that dominate joins and aggregations. Tune partitions to match cluster cores and data size to avoid overhead (too many tiny tasks) or stragglers (too few large tasks).

# Adjust shuffle partitions (default: 200)
# Rule of thumb: Use number of cores in cluster or 2-4x cores
spark.conf.set("spark.sql.shuffle.partitions", "400")

# For small datasets, reduce partitions
spark.conf.set("spark.sql.shuffle.partitions", "50")

# Target partition size for AQE
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "134217728")  # 128MB

Broadcast Join Optimization

Speed up joins with small tables

Broadcasting sends a small table to all executors to eliminate shuffles and speed up joins. Use when one side comfortably fits under the threshold; disable or lower the threshold if you see driver/executor memory pressure.

# Increase broadcast threshold (default: 10MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "104857600")  # 100MB

# For clusters with more memory, increase further
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "209715200")  # 200MB

# Disable auto broadcast (use -1)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

3. Memory Management 🧠

Executor Memory Configuration

Balance memory allocation between storage and execution

These settings influence how much memory is available for caching vs. shuffles/sorts. Increase execution memory for heavy joins/aggregations; adjust storage when you rely on caching hot datasets.

# Memory fraction for storage and execution (default: 0.6)
spark.conf.set("spark.memory.fraction", "0.7")

# Storage fraction within unified memory (default: 0.5)
spark.conf.set("spark.memory.storageFraction", "0.3")

# Off-heap memory for better GC performance
spark.conf.set("spark.memory.offHeap.enabled", "true")
spark.conf.set("spark.memory.offHeap.size", "2g")

Driver Memory Limits

Prevent driver OOM errors

Limits the size of results materialized on the driver and reserves overhead for JVM/native usage. Raise when using actions that collect large results or when driver-side libraries need more headroom.

# Maximum result size returned to driver
spark.conf.set("spark.driver.maxResultSize", "4g")

# Driver memory overhead
spark.conf.set("spark.driver.memoryOverhead", "1g")

4. Serialization & Performance ⚡

Kryo Serializer

Faster and more compact than the default Java serializer

Kryo reduces serialization time and object size during shuffles, caching, and checkpointing. Enable when processing complex rows or high shuffle volumes to reduce CPU and network overhead.

# Enable Kryo serializer
spark.conf.set("spark.serializer", "org.apache.spark.serialzeer.KryoSerializer")

# Buffer sizes for Kryo
spark.conf.set("spark.kryoserializer.buffer", "64k")
spark.conf.set("spark.kryoserializer.buffer.max", "64m")

Dynamic Resource Allocation

Automatically scale executors based on workload

Lets Spark acquire and release executors as load changes to improve throughput and save cost. Use on autoscaling clusters or bursty pipelines; Databricks manages the shuffle service for you.

# Enable dynamic allocation
spark.conf.set("spark.dynamicAllocation.enabled", "true")

# Minimum and maximum executors
spark.conf.set("spark.dynamicAllocation.minExecutors", "1")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "20")

# Executor idle timeout
spark.conf.set("spark.dynamicAllocation.executorIdleTimeout", "60s")

5. Data Format & I/O Optimization 📊

File Reading Configuration

Optimize file reading performance

Controls how input files are split into read partitions to balance parallelism against per-task overhead: maxPartitionBytes caps the data per partition, and openCostInBytes models the fixed cost of opening each file. Increase partition bytes when inputs consist of many small files; raise the open cost when reading millions of files.

# Maximum bytes per partition when reading files
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456")  # 256MB

# Cost to open a file (for partition planning)
spark.conf.set("spark.sql.files.openCostInBytes", "8388608")  # 8MB

# Path count above which partition discovery runs in parallel (default: 32)
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "16")

Columnar Storage Optimization

Improve caching and compression

Columnar caching compresses and batches data to accelerate repeated scans from memory. Enable and tune batch size when using cache()/persist() on datasets accessed multiple times.

# Enable columnar storage compression
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")

# Batch size for columnar caching
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")

6. SQL & Query Optimization 🎯

SQL Mode Configuration

Control SQL behavior and error handling

ANSI mode enforces stricter SQL semantics, timezone ensures consistent timestamp handling, and CBO uses stats for better plans. Use to make ETL behavior predictable across environments and catch silent errors early.

# Enable ANSI SQL compliance (stricter error handling)
spark.conf.set("spark.sql.ansi.enabled", "true")

# Set timezone for consistent timestamp handling
spark.conf.set("spark.sql.session.timeZone", "UTC")

# Enable cost-based optimization
spark.conf.set("spark.sql.cbo.enabled", "true")

Predicate Pushdown

Push filters closer to data source

Pushdown applies filters and column pruning inside the Parquet reader to reduce IO and speed scans. Keep enabled for analytics workloads; only disable when debugging or working around a data source quirk.

# Enable predicate pushdown for Parquet
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

# Enable nested column pruning (top-level column pruning is applied automatically)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

7. Development & Debugging 🔧

Memory Profiling

Monitor memory usage in UDFs

Captures Python memory profiles during UDF execution to surface leaks and hot spots. Use selectively in development since it adds overhead, then inspect dumps to guide fixes.

# Enable memory profiling for Python UDFs
spark.conf.set("spark.python.profile.memory", "true")

# Directory where profile dumps are written
spark.conf.set("spark.python.profile.dump", "/tmp/pyspark-profile")

Logging & Monitoring

Better visibility into Spark operations

Event logs feed the Spark History Server and metrics tools; AQE log level helps inspect adaptive decisions. Enable when tuning performance or diagnosing job instability.

# Enable event logging
spark.conf.set("spark.eventLog.enabled", "true")

# Set log level
spark.conf.set("spark.sql.adaptive.logLevel", "INFO")

8. Common Databricks-Specific Configurations 💾

Delta Lake Optimizations

Optimize Delta table operations

Auto Optimize compacts small files and writes efficiently, improving read performance and lowering costs. Disable retention checks only for short-lived test cycles—keep them on in production to protect data retention guarantees.

# Enable auto-optimize for Delta tables
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Vacuum retention check
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

Photon Engine

Enable Databricks' native query engine

Photon uses a vectorized C++ execution engine to accelerate SQL/DataFrame workloads. Enable on supported Databricks runtimes for lower latency and better price/performance on analytics queries.

# Enable Photon (if available in your Databricks plan)
spark.conf.set("spark.databricks.photon.enabled", "true")

9. Configuration Best Practices 📝

1. Start with Defaults

Most configurations work well with Databricks defaults. Only tune when you identify specific bottlenecks.

2. Monitor Before Tuning

Use Spark UI and Databricks monitoring to identify performance issues before changing configurations.

3. Test Incrementally

Change one configuration at a time to understand its impact.

4. Consider Data Size

  • Small datasets (< 1GB): Reduce shuffle partitions to 50–100
  • Medium datasets (1–100GB): Use default or slightly higher partitions
  • Large datasets (> 100GB): Increase partitions to 2–4x cluster cores
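
A minimal sketch of these guidelines as a helper function (the thresholds simply mirror the bullets above; estimate the input size however suits your pipeline):

# Pick shuffle partitions from a rough input-size estimate and cluster cores
def pick_shuffle_partitions(input_gb, cluster_cores):
    if input_gb < 1:
        return 50
    if input_gb <= 100:
        return max(200, cluster_cores)   # default or slightly higher
    return cluster_cores * 4             # large data: 2-4x cores

spark.conf.set("spark.sql.shuffle.partitions",
               str(pick_shuffle_partitions(250, spark.sparkContext.defaultParallelism)))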

5. Memory Guidelines

  • Leave 10–15% memory overhead for system operations
  • For memory‑intensive operations, increase spark.memory.fraction
  • Use off‑heap storage for better garbage collection performance

10. Common Mistakes to Avoid ⚠️

  1. Setting too many shuffle partitions for small data
  2. Increasing broadcast threshold beyond driver/executor memory capacity
  3. Using withColumn() in loops instead of select() with multiple columns (see the sketch after this list)
  4. Not enabling AQE in Spark 3.0+ environments
  5. Ignoring data skew in join operations
  6. Over‑allocating executor memory without considering overhead
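
For mistake 3, a short illustration (assumes df exists with columns a, b, c; the names are illustrative). Each withColumn() call in the loop adds another projection to the plan, while a single select() builds all derived columns at once:

from pyspark.sql import functions as F

cols_to_double = ["a", "b", "c"]

# Anti-pattern: one extra projection per iteration
looped = df
for c in cols_to_double:
    looped = looped.withColumn(f"{c}_doubled", F.col(c) * 2)

# Preferred: a single select() with all derived columns
selected = df.select("*", *[(F.col(c) * 2).alias(f"{c}_doubled") for c in cols_to_double])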

11. Quick Troubleshooting 🔍

Problem → Configuration Solution
Slow joins → Increase spark.sql.autoBroadcastJoinThreshold
Driver OOM → Increase spark.driver.maxResultSize
Many small tasks → Reduce spark.sql.shuffle.partitions
Skewed data → Enable spark.sql.adaptive.skewJoin.enabled
GC pressure → Enable spark.memory.offHeap.enabled
Slow file reads → Increase spark.sql.files.maxPartitionBytes

Remember: Always test configurations in a development environment before applying to production workloads!
