Databricks PySpark Configuration Cheatsheet
Essential spark.conf.set() configurations for Databricks data engineers to optimize performance, memory management, and job execution.
1. How to Set Configurations
# Basic syntax
spark.conf.set("configuration.key", "value")
# Example
spark.conf.set("spark.sql.adaptive.enabled", "true")
2. Performance Optimization 🚀
Adaptive Query Execution (AQE)
Enable dynamic query optimization at runtime
AQE rewrites query plans at runtime using observed statistics to coalesce partitions, switch join strategies, and mitigate skew. Use it to stabilize performance when data volumes vary or joins/aggregations suffer from stragglers.
# Enable AQE (enabled by default in Spark 3.2+)
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Enable dynamic partition coalescing
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Use local shuffle readers when joins are converted to broadcast at runtime
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
# Enable skew join optimization
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
Shuffle Optimization
Critical for join and aggregation performance
Shuffle settings control parallelism and task sizes for data exchanges that dominate joins and aggregations. Tune partitions to match cluster cores and data size to avoid overhead (too many tiny tasks) or stragglers (too few large tasks).
# Adjust shuffle partitions (default: 200)
# Rule of thumb: Use number of cores in cluster or 2-4x cores
spark.conf.set("spark.sql.shuffle.partitions", "400")
# For small datasets, reduce partitions
spark.conf.set("spark.sql.shuffle.partitions", "50")
# Target partition size for AQE
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "134217728") # 128MB
Broadcast Join Optimization
Speed up joins with small tables
Broadcasting sends a small table to all executors to eliminate shuffles and speed up joins. Use when one side comfortably fits under the threshold; disable or lower the threshold if you see driver/executor memory pressure.
# Increase broadcast threshold (default: 10MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "104857600") # 100MB
# For clusters with more memory, increase further
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "209715200") # 200MB
# Disable auto broadcast (use -1)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
3. Memory Management 🧠
Executor Memory Configuration
Balance memory allocation between storage and execution
These settings influence how much memory is available for caching vs. shuffles/sorts. Increase execution memory for heavy joins/aggregations; adjust storage when you rely on caching hot datasets.
# Memory fraction for storage and execution (default: 0.6)
spark.conf.set("spark.memory.fraction", "0.7")
# Storage fraction within unified memory (default: 0.5)
spark.conf.set("spark.memory.storageFraction", "0.3")
# Off-heap memory for better GC performance
spark.conf.set("spark.memory.offHeap.enabled", "true")
spark.conf.set("spark.memory.offHeap.size", "2g")
Driver Memory Limits
Prevent driver OOM errors
Limits the size of results materialized on the driver and reserves overhead for JVM/native usage. Raise when using actions that collect large results or when driver-side libraries need more headroom.
# Maximum result size returned to driver
spark.conf.set("spark.driver.maxResultSize", "4g")
# Driver memory overhead
spark.conf.set("spark.driver.memoryOverhead", "1g")
4. Serialization & Performance ⚡
Kryo Serializer
Faster and more compact than default Java serializer
Kryo reduces serialization time and object size during shuffles, caching, and checkpointing. Enable when processing complex rows or high shuffle volumes to reduce CPU and network overhead.
# Enable Kryo serializer
spark.conf.set("spark.serializer", "org.apache.spark.serialzeer.KryoSerializer")
# Buffer sizes for Kryo
spark.conf.set("spark.kryoserializer.buffer", "64k")
spark.conf.set("spark.kryoserializer.buffer.max", "64m")
Dynamic Resource Allocation
Automatically scale executors based on workload
Lets Spark acquire and release executors as load changes to improve throughput and save cost. Use on autoscaling clusters or bursty pipelines; Databricks manages the shuffle service for you.
# Enable dynamic allocation
spark.conf.set("spark.dynamicAllocation.enabled", "true")
# Minimum and maximum executors
spark.conf.set("spark.dynamicAllocation.minExecutors", "1")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "20")
# Executor idle timeout
spark.conf.set("spark.dynamicAllocation.executorIdleTimeout", "60s")
5. Data Format & I/O Optimization 📊
File Reading Configuration
Optimize file reading performance
Controls how input files are partitioned and how expensive listings are modeled to balance parallelism vs. overhead. Increase partition bytes for many small files; raise open cost when listing millions of files.
# Maximum bytes per partition when reading files
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456") # 256MB
# Cost to open a file (for partition planning)
spark.conf.set("spark.sql.files.openCostInBytes", "8388608") # 8MB
# Path-count threshold above which partition discovery is parallelized across the cluster
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "16")
Columnar Storage Optimization
Improve caching and compression
Columnar caching compresses and batches data to accelerate repeated scans from memory. Enable and tune batch size when using cache()/persist() on datasets accessed multiple times.
# Enable columnar storage compression
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
# Batch size for columnar caching
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")
6. SQL & Query Optimization 🎯
SQL Mode Configuration
Control SQL behavior and error handling
ANSI mode enforces stricter SQL semantics, timezone ensures consistent timestamp handling, and CBO uses stats for better plans. Use to make ETL behavior predictable across environments and catch silent errors early.
# Enable ANSI SQL compliance (stricter error handling)
spark.conf.set("spark.sql.ansi.enabled", "true")
# Set timezone for consistent timestamp handling
spark.conf.set("spark.sql.session.timeZone", "UTC")
# Enable cost-based optimization
spark.conf.set("spark.sql.cbo.enabled", "true")
Predicate Pushdown
Push filters closer to data source
Pushdown applies filters and column pruning inside the Parquet reader to reduce IO and speed scans. Keep enabled for analytics workloads; only disable when debugging or working around a data source quirk.
# Enable predicate pushdown for Parquet
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
# Enable nested column (schema) pruning; top-level column pruning is always applied
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
7. Development & Debugging 🔧
Memory Profiling
Monitor memory usage in UDFs
Captures Python memory profiles during UDF execution to surface leaks and hot spots. Use selectively in development since it adds overhead, then inspect dumps to guide fixes.
# Enable memory profiling for Python UDFs
spark.conf.set("spark.python.profile.memory", "true")
# Directory where profile dumps are written
spark.conf.set("spark.python.profile.dump", "/tmp/pyspark-profile")
Logging & Monitoring
Better visibility into Spark operations
Event logs feed the Spark History Server and metrics tools; AQE log level helps inspect adaptive decisions. Enable when tuning performance or diagnosing job instability.
# Enable event logging
spark.conf.set("spark.eventLog.enabled", "true")
# Log level for AQE plan-change messages (trace, debug, info, warn, error)
spark.conf.set("spark.sql.adaptive.logLevel", "INFO")
8. Common Databricks-Specific Configurations 💾
Delta Lake Optimizations
Optimize Delta table operations
Auto Optimize compacts small files and writes efficiently, improving read performance and lowering costs. Disable retention checks only for short-lived test cycles—keep them on in production to protect data retention guarantees.
# Enable auto-optimize for Delta tables
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
# Vacuum retention check
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
Photon Engine
Enable Databricks' native query engine
Photon uses a vectorized C++ execution engine to accelerate SQL/DataFrame workloads. Enable on supported Databricks runtimes for lower latency and better price/performance on analytics queries.
# Enable Photon (if available in your Databricks plan)
spark.conf.set("spark.databricks.photon.enabled", "true")
9. Configuration Best Practices 📝
1. Start with Defaults
Most configurations work well with Databricks defaults. Only tune when you identify specific bottlenecks.
2. Monitor Before Tuning
Use Spark UI and Databricks monitoring to identify performance issues before changing configurations.
3. Test Incrementally
Change one configuration at a time to understand its impact.
4. Consider Data Size
- Small datasets (< 1GB): Reduce shuffle partitions to 50–100
- Medium datasets (1–100GB): Use default or slightly higher partitions
- Large datasets (> 100GB): Increase partitions to 2–4x cluster cores
5. Memory Guidelines
- Leave 10–15% memory overhead for system operations
- For memory-intensive operations, increase spark.memory.fraction
- Use off-heap storage for better garbage collection performance
10. Common Mistakes to Avoid ⚠️
- Setting too many shuffle partitions for small data
- Increasing broadcast threshold beyond driver/executor memory capacity
- Using withColumn() in loops instead of a single select() with multiple columns (see the sketch after this list)
- Not enabling AQE in Spark 3.0+ environments
- Ignoring data skew in join operations
- Over‑allocating executor memory without considering overhead
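A sketch of the withColumn()-in-a-loop anti-pattern and the single select() alternative, using a hypothetical df and column list:
from pyspark.sql import functions as F
cols_to_scale = ["price", "tax", "shipping"]   # hypothetical columns
# Anti-pattern: each withColumn() call adds another projection to the plan
for c in cols_to_scale:
    df = df.withColumn(c, F.col(c) * 100)
# Better: build all expressions once and apply a single select()
df = df.select(
    *[c for c in df.columns if c not in cols_to_scale],
    *[(F.col(c) * 100).alias(c) for c in cols_to_scale],
)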
11. Quick Troubleshooting 🔍
| Problem | Configuration Solution |
|---|---|
| Slow joins | Increase spark.sql.autoBroadcastJoinThreshold |
| Driver OOM | Increase spark.driver.maxResultSize |
| Many small tasks | Reduce spark.sql.shuffle.partitions |
| Skewed data | Enable spark.sql.adaptive.skewJoin.enabled |
| GC pressure | Enable spark.memory.offHeap.enabled |
| Slow file reads | Increase spark.sql.files.maxPartitionBytes |
Remember: Always test configurations in a development environment before applying to production workloads!