Soda Core Cheatsheet

The essentials for data quality with Soda Core on Databricks: write SodaCL checks, configure your connection, and run scans from the CLI or notebooks. Copy-paste-ready examples in YAML, SQL, and Python.

1. The Core Idea: Test Your Data πŸ§ͺ

Soda Core lets you run tests on your data tables, much like a developer runs unit tests on their code. You define rules in a YAML file, and Soda runs queries against your Databricks tables to see if they pass. The process involves three key parts:

  1. configuration.yml: Tells Soda how to connect to your Databricks SQL Warehouse.
  2. checks.yml: Contains your data quality rules, telling Soda what to test.
  3. soda scan: The command you run to execute the tests (see the minimal sketch below).
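
To see how the three parts fit together before diving into details, here is a minimal sketch; the data source name and table are placeholders, and the real connection settings are covered in section 4.

# checks.yml -- the rules to test
for each dataset in [customers]:
  - row_count > 0

# configuration.yml -- the connection (section 4 shows the full file)
data_source my_databricks_wh:
  type: databricks

# Run both together from a terminal:
#   soda scan -d my_databricks_wh -c configuration.yml checks.yml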

2. How to Write a Check

Anatomy of a checks.yml File

The file is organized into blocks, where each block targets a specific table (which Soda calls a "dataset").

# checks.yml

# Use a "for each dataset" block to target a table.
for each dataset in [customers]:
  # Indent your list of checks for this table here.
  - row_count > 0
  - missing_count(email) = 0

for each dataset in [orders]:
  # A separate list of checks for the 'orders' table.
  - duplicate_count(order_id) = 0

Anatomy of a Single Check

A check has two main parts: the metric (what you measure) and the threshold (the condition for success). A few more metric/threshold combinations are sketched below.

Example: missing_count(email) = 0

  • missing_count(email): This is the metric. Soda calculates the number of NULL values in the email column.
  • = 0: This is the threshold. The check passes only if the result of the metric is 0.
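
The same metric-and-threshold pattern applies to every built-in metric, and thresholds can be equalities, comparisons, or ranges. A short illustrative sketch (the table and column names are placeholders):

for each dataset in [customers]:
  - missing_count(email) = 0            # metric: missing_count(email), threshold: = 0
  - duplicate_count(id) < 10            # threshold as an upper bound
  - row_count between 100 and 1000000   # threshold as a range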

Alert Levels: warn vs. fail

For any check, you can set two levels of alerts: warning 🟑 and failure πŸ”΄. This lets you distinguish between minor issues and critical, pipeline-blocking problems.

  • warn: The check result is a "warning". The scan overall is still considered "successful with warnings", so the pipeline keeps running; use this level for non-blocking notifications.
  • fail: The check result is a "failure" and the scan itself fails. Use this level to stop a data pipeline or trigger an urgent alert.

for each dataset in [customers]:
  - missing_count(email):
      # Issue a warning if more than 5 emails are missing.
      warn: when > 5
      # Fail the scan if more than 50 emails are missing.
      fail: when > 50

3. SodaCL Syntax Cheatsheet πŸ“–

Row Count & Freshness

# Checks for table size and recent updates
- row_count > 1000
- row_count between 5000 and 10000
- freshness(event_timestamp) < 2d 12h # Fails if latest timestamp is > 2.5 days old

Missing, Invalid & Duplicate Values

# Checks for completeness, validity, and uniqueness
- missing_count(column_name) = 0
- missing_percent(column_name) < 5%
- duplicate_count(id) = 0
- invalid_count(status) = 0:
    valid values: ["active", "inactive", "pending"]
- invalid_count(notes) = 0:
    invalid values: ["-", "N/A", "none"]
- invalid_count(email_column) = 0:
    valid format: email

Numeric Distribution & Range

# Validate the distribution and range of numeric data
- avg(temperature) between -10 and 40
- min(price) >= 0
- percentile(latency_ms, 0.99) < 2000 # 99th percentile latency under 2 seconds
- stddev(score) < 5.5
- anomaly score for order_value < 3 # Outlier detection based on metric history (requires Soda Cloud)
- invalid_count(ratings) = 0:
    valid min: 1
    valid max: 5

Text & String Patterns

# Validate text-based columns using length, format, or regex
- min_length(name) > 1
- max_length(comment) < 500
- invalid_count(product_sku) = 0:
    valid regex: "[A-Z]{2,4}-\d{5}"
- invalid_count(user_id) = 0:
    starts with: "user_"
- invalid_count(log_message) = 0:
    ends with: "[END]"
- invalid_count(hostname) = 0:
    contains: ".com"

Date & Time Formats

# Validate date and time string formats (uses Java's SimpleDateFormat)
- invalid_count(created_at) = 0:
    valid format: "yyyy-MM-dd HH:mm:ss"

Schema

# Verify column names, types, and order
- schema:
    fail:
      when required column missing: [id, created_at]
      when forbidden column present: [password_hash]
      when wrong column type:
        id: int
        revenue: decimal(10, 2)
      when wrong column index:
        id: 0
        email: 1

Column Comparisons & Row-Level Checks

# Write custom rules that apply to each row
- failed rows:
    name: "End date must be after start date"
    fail condition: end_date <= start_date

- failed rows:
    name: "Discount cannot exceed price"
    fail condition: discount_amount > price

Cross-Table & Referential Integrity

# Ensure consistency between two tables
- values in (customer_id) must exist in customers (id)
- row_count same as customers # Compare row counts of two tables

Change-Over-Time

# Note: These checks compare current values against previously recorded measurements, which requires Soda Cloud to store metric history.
- change for row_count < 10%
- change for avg(order_value) vs last month < 5%

User-Defined SQL Checks

# For complex business logic, write your own SQL query
# 🟩 SQL Metric Check: The query must return a single numeric value
- max_price < 1000:
    name: "Max price should be under 1000"
    max_price query: SELECT MAX(price) FROM __table_name__

# 🟩 Failed Rows Check: The query must return rows that violate a rule
- failed rows:
    name: "Shipped orders must have a shipped_date"
    fail query: |
      SELECT * FROM __table_name__
      WHERE status = 'shipped' AND shipped_date IS NULL

Applying Checks to Multiple Columns

# Use a "for each column" block to avoid repetition
for each column in [email, phone_number, address]:
  - missing_count = 0

Grouped Checks

# Run checks on segments of your data
- group by:
    group_fields: [country_code]
    checks:
      - row_count > 100 # Fails for any country with <= 100 customers
      - avg(order_value) between 10 and 500 # Checks avg order value per country

4. Setup & Execution

Step 1: Create configuration.yml

This file connects Soda to your Databricks SQL Warehouse. Get the Host, HTTP Path, and a Personal Access Token from your Warehouse's Connection Details.

# configuration.yml
data_source my_databricks_wh:
  type: databricks
  connection:
    host: "dbc-xxxxxxxx-xxxx.cloud.databricks.com"
    http_path: "/sql/1.0/warehouses/xxxxxxxxxxxxxxxx"
    token: "dapixxxxxxxxxxxxxxxxxxxxxxxx"
  schema: "default"

Step 2: Run the Scan (CLI)

Execute your checks from the command line.

# Install the library first
pip install soda-core-databricks

# Run the scan
soda scan -d my_databricks_wh -c configuration.yml checks.yml

5. Programmatic Scans (Databricks Notebook)

Integrate Soda directly into your data pipelines by running it from a notebook.

# In a Databricks notebook cell

# Install the library if not already on the cluster
%pip install soda-core-databricks

from soda.scan import Scan

# 1. Configure the scan
scan = Scan()
scan.set_data_source_name("my_databricks_wh")
# Read the token from a Databricks secret scope, then interpolate it with an
# f-string so the dbutils.secrets call is actually evaluated
databricks_token = dbutils.secrets.get("soda", "databricks_token")
scan.add_configuration_yaml_str(
  f"""
  data_source my_databricks_wh:
    type: databricks
    connection:
      host: "..."
      http_path: "..."
      token: "{databricks_token}"
  """
)

# 2. Add your SodaCL checks file
scan.add_sodacl_yaml_file("checks.yml")

# 3. Execute the scan
scan.execute()

# 4. Check the results and take action
if scan.has_check_fails():
  print("🚨 Soda Scan FAILED with critical issues!")
  scan.assert_no_checks_fail() # Raises an error, failing the notebook cell
elif scan.has_checks_warn_or_fail():
  print("🟑 Soda Scan PASSED with warnings.")
else:
  print("βœ… All data quality checks passed.")

# You can also inspect the full scan results as a dictionary
print(scan.get_scan_results())

6. Complete Example: checks.yml

Here is a sample file with a variety of checks across several related tables to show how they work together.

#=============== CHECKS FOR CUSTOMER DATA ===============
for each dataset in [customers]:
  # --- Table-level checks ---
  - row_count > 0:
      name: "Table is not empty"
  - freshness(updated_at) < 24h:
      name: "Customer data is fresh"
  - duplicate_count(id) = 0:
      name: "Customer ID is unique"

  # --- Schema validation ---
  - schema:
      name: "Customer schema is correct"
      fail:
        when required column missing: [id, email, country_code, created_at]
        when wrong column type: {id: string, age: integer, created_at: timestamp}

  # --- Column-level data quality checks ---
  - missing_percent(email) < 1%:
      name: "Email address is mostly present"
  - invalid_count(email) = 0:
      name: "Email format is valid"
      valid format: email
  - invalid_count(country_code) = 0:
      name: "Country code is 2 characters"
      valid length: 2
  - failed rows:
      name: "Customers must be 18 or older"
      fail condition: age < 18

#=============== CHECKS FOR PRODUCT CATALOG ===============
for each dataset in [products]:
  # --- Table-level checks ---
  - duplicate_count(sku) = 0:
      name: "SKU is unique"
  - missing_count(name) = 0:
      name: "All products have a name"
  
  # --- Column-level data quality checks ---
  - invalid_count(category) = 0:
      name: "Category is from an approved list"
      valid values: ["electronics", "books", "home goods", "apparel"]
  - negative_price_count = 0:
      name: "Price must be positive"
      negative_price_count query: SELECT count(*) FROM __table_name__ WHERE price < 0
  - anomaly score for price < 3:
      name: "Product price is not an outlier"

#=============== CHECKS FOR ORDERS FACT TABLE ===============
for each dataset in [orders]:
  # --- Table-level checks ---
  - row_count > 0
  - freshness(order_date) < 2h

  # --- Referential integrity ---
  - values in (customer_id) must exist in customers (id):
      name: "Order links to a valid customer"
  - values in (product_sku) must exist in products (sku):
      name: "Order links to a valid product"

  # --- Row-level business logic ---
  - failed rows:
      name: "Shipped date cannot be before order date"
      fail condition: shipped_date < order_date
  
  # --- Checks on multiple columns ---
  for each column in [order_id, customer_id, product_sku, order_value]:
    - missing_count = 0

  # --- Segmented data quality checks ---
  - group by:
      group_fields: [status]
      checks:
        - row_count > 0
        - avg(order_value) > 0:
            name: "Average order value is positive for each status"