Apache Spark 4.0: What Actually Changed

A major release with real breaking changes. Here's what's new, what broke, and whether you should migrate yet.

Spark 4.0 is the first major version bump since 3.0 in 2020. That's roughly five years of accumulated decisions about what to keep, what to clean up, and what to finally cut loose. The headline features are genuinely useful. The breaking changes are genuinely painful if you don't see them coming.

This is a field guide — not a press release. Here's what changed and why it matters for teams running Spark in production.

What's new

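Spark Connect is now a first-class option

Spark Connect, the client-server architecture introduced in 3.4, is now a first-class way to interact with Spark. It is not switched on for you: classic mode remains the default, and you opt in per application with the new spark.api.mode=connect setting or by pointing a session at a remote server. Instead of embedding the Spark driver in your application, you talk to a remote Spark server over gRPC.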

This is a bigger deal than it sounds. It means:

Your Python/Scala application is decoupled from the Spark version. You can upgrade the cluster without touching client code.
PySpark finally works properly in Jupyter — no more driver-in-notebook hacks.
The client is thin and portable. You can run PySpark from environments where spinning up a full Spark context was impractical.

The tradeoff: some older APIs that depended on being co-located with the driver don't work over Connect. If you use SparkContext directly (RDD-based operations), you'll hit this.
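Here's a minimal sketch of what the client side looks like, assuming a Connect server is already running and reachable. The host name and the two helpers (connect_url, open_remote_session) are made up for illustration; SparkSession.builder.remote() and the sc:// URL scheme are the real PySpark pieces, and 15002 is the default Connect port.

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect URL (15002 is the default server port)."""
    return f"sc://{host}:{port}"


def open_remote_session(host: str, port: int = 15002):
    """Open a session against an already-running Spark Connect server."""
    # Imported lazily so connect_url() stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(connect_url(host, port)).getOrCreate()


def smoke_test(spark) -> None:
    """Run a trivial DataFrame query; the work happens on the server."""
    spark.range(5).selectExpr("id * 2 AS doubled").show()
```

Calling smoke_test(open_remote_session("localhost")) is enough to confirm the round trip works. Note that nothing here touches SparkContext, which is exactly the part that doesn't cross the wire.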

Variant type for semi-structured data

Spark 4.0 introduces a native VARIANT type — a first-class way to store and query semi-structured data like JSON without having to define a fixed schema upfront.

Why this matters

Anyone who's worked with Spark and JSON at scale knows the pattern: you have fields that vary between rows, you don't know the schema ahead of time, so you store it as a string and parse it at query time. It's slow and painful.

Variant stores the data in an efficient binary format and lets you pull fields out by path with variant_get() and try_variant_get(), without re-parsing the whole document on every query. It's essentially what Snowflake's VARIANT type has offered for years, now natively in Spark.
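A rough sketch of the access pattern, assuming a Spark 4 session is passed in. The JSON payload and its field names are invented for the example; parse_json(), variant_get(), and try_variant_get() are the built-in functions doing the work.

```python
# The payload and its field names are hypothetical sample data.
VARIANT_QUERY = """
SELECT
  variant_get(payload, '$.user.id', 'int')          AS user_id,
  try_variant_get(payload, '$.user.plan', 'string') AS plan
FROM (
  SELECT parse_json('{"user": {"id": 42, "plan": "pro"}}') AS payload
)
"""


def run_variant_demo(spark):
    """Parse a JSON string into a VARIANT and extract two fields by path."""
    return spark.sql(VARIANT_QUERY).collect()
```

The try_ variant returns NULL when the path can't be cast to the requested type, which is usually what you want for fields that really do vary between rows.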

Collation support

Spark now supports proper string collation — rules for how strings are compared and sorted. This sounds like a footnote but it's been a real pain point for anyone doing case-insensitive filtering, international text, or anything where "Apple" and "apple" should be treated as the same value.

You can now specify collation at the column level: STRING COLLATE UNICODE_CI for case-insensitive Unicode comparison. No more workarounds with lower() everywhere.
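As a small illustration, here is a sketch that applies the collation inline in a filter instead of on the column definition. The inline fruits data and the helper name are assumptions for the example; the intent is that the comparison treats 'Apple', 'apple', and 'APPLE' as equal.

```python
# Inline sample rows; in a real table the collation could live on the column.
COLLATION_QUERY = """
SELECT name
FROM VALUES ('Apple'), ('apple'), ('Banana') AS fruits(name)
WHERE name COLLATE UNICODE_CI = 'APPLE'
"""


def run_collation_demo(spark):
    """Filter case-insensitively via an explicit collation, without lower()."""
    return spark.sql(COLLATION_QUERY).collect()
```

Declaring the collation once on the column is the cleaner option for tables you own, since every comparison and sort then picks it up automatically.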

ANSI SQL compliance improvements

Spark 4.0 continues the push toward proper ANSI SQL behavior. Division by zero now throws an error by default instead of returning null. Integer overflow throws instead of silently wrapping. String-to-number casts fail loudly on bad input.

This will break some existing jobs that were quietly swallowing errors. Which is — arguably — the point. Silent nulls from bad data are worse than loud errors.
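If you want the old null-on-failure behavior for a specific expression, the try_ functions give it back explicitly. The sketch below assumes a Spark 4 session is passed in; the column names and sample rows are invented. try_divide(), try_cast(), and the spark.sql.ansi.enabled setting are real.

```python
# First row: orders is 0 and raw_amount is not a number, so both
# expressions should come back NULL instead of raising under ANSI mode.
SAFE_QUERY = """
SELECT
  try_divide(revenue, orders) AS revenue_per_order,
  try_cast(raw_amount AS INT) AS amount
FROM VALUES (100.0, 0, 'abc'), (90.0, 3, '7') AS t(revenue, orders, raw_amount)
"""


def run_safe_query(spark):
    """Run the tolerant versions of a division and a cast."""
    return spark.sql(SAFE_QUERY).collect()


def disable_ansi(spark) -> None:
    """Session-wide escape hatch: restore the pre-4.0 non-ANSI behavior."""
    spark.conf.set("spark.sql.ansi.enabled", "false")
```

Treat disable_ansi() as a stopgap to keep a job running while you fix it, not as the fix. The per-expression try_ wrappers at least make it visible where you've decided a null is acceptable.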

Structured Streaming improvements

Streaming gets several quality-of-life upgrades: better support for stateful processing, improved watermark handling, and a new transformWithState operator, the successor to mapGroupsWithState and flatMapGroupsWithState, that gives you more explicit control over state in arbitrary stateful operations.

If you're running streaming pipelines with complex session windows or custom state logic, this is worth reading carefully.

What broke

Spark 4.0 drops a lot of things that were deprecated in 3.x. If your codebase has been ignoring deprecation warnings, you'll find out now.

Python 3.8 support dropped

Python 3.9 is the minimum. Most teams are already past this, but check your base Docker images and Lambda environments — they sometimes lag.
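One cheap way to catch a lagging image is a guard at the top of the job entry point or in CI. This is a sketch, not anything Spark ships; the function name and constant are made up, and the 3.9 floor is the one stated above.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum Python version stated above for Spark 4.0


def python_ok(version=None, minimum=MIN_PYTHON) -> bool:
    """Return True if the interpreter version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def require_python() -> None:
    """Fail fast with a readable message on an interpreter that is too old."""
    if not python_ok():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise SystemExit(f"Python {found} is too old; need >= 3.9 for Spark 4.0")
```

Call require_python() before importing pyspark so the failure points at the base image rather than at some unrelated import error further down.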

Scala 2.12 support dropped

Scala 2.13 only. If you have Spark jobs written in Scala and you're still on 2.12, this is your forcing function. The migration from 2.12 → 2.13 is usually smooth but build tooling (SBT plugins, dependencies) sometimes lags.

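RDD-based MLlib APIs are a dead end

The spark.mllib package (RDD-based) has been in maintenance mode since 2.0 and, like the rest of the RDD API, is not available over Spark Connect. spark.ml (DataFrame-based) is where all new work happens. If you have ML pipelines on the old API, moving them requires real rewriting, not just an import change.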

Deprecated DataFrame methods removed

Several APIs that were deprecated in 3.x are gone, most visibly a batch of pandas API on Spark methods that upstream pandas had already dropped, such as DataFrame.append() and DataFrame.iteritems(). The official migration guide has the full list. Run your jobs against Spark 4 in a dev environment before you promote anything.

Default behavior changes (ANSI mode)

As noted above — division by zero, integer overflow, and bad casts now throw by default. Existing jobs that relied on Spark silently producing nulls for bad operations will fail. Search your codebases for patterns like col / other_col where other_col could be zero.
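That search can be roughed out in a few lines of Python. This is a heuristic sketch, not a parser: it flags lines with a "/" between two operands or an explicit cast, and it will produce false positives (file paths in strings, for instance). The function names and the choice of .py and .sql files are assumptions.

```python
import re
from pathlib import Path

# Heuristics only: "/" between two operands (but not "//"), or an explicit cast.
DIVISION = re.compile(r"[\w\)\]]\s*/\s*[\w\(]")
CAST = re.compile(r"\bcast\s*\(", re.IGNORECASE)


def flag_line(line: str) -> list[str]:
    """Return the ANSI-risk labels that apply to one line of source."""
    code = line.split("#", 1)[0]  # ignore trailing Python comments
    labels = []
    if DIVISION.search(code):
        labels.append("division")
    if CAST.search(code):
        labels.append("cast")
    return labels


def scan(root: str, suffixes=(".py", ".sql")):
    """Yield (path, line_number, labels) for every flagged line under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                labels = flag_line(line)
                if labels:
                    yield str(path), number, labels
```

Looping over scan("jobs/") and printing each hit gives you a review list. Calls already written as try_cast() are skipped on purpose, since those won't throw.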

How to migrate

The Spark project provides an official migration guide, but here's the practical order of operations for a production team:

Run your existing job suite against Spark 4 in dev. Don't read the migration guide first. Just run the jobs. The errors you see are your actual migration list — not a theoretical one from a changelog.
Fix the Python/Scala version issues first. These are environment-level, and everything else blocks on them.
Handle ANSI mode breakage. Search for division operations and explicit casts. Decide whether to fix the upstream data or add try_divide() / try_cast() wrappers. The former is better long-term.
Migrate RDD MLlib code. If you have it, budget real time. This is a rewrite, not a find-and-replace.
Test Spark Connect compatibility. If your jobs use SparkContext directly or use RDD operations in ways that conflict with Connect, you'll see errors. Consider whether you want to migrate fully to Connect or stay on classic mode, where the driver is still embedded in your application.

When to upgrade

The honest answer depends on where you are:

Upgrade now if...

You're starting a new project — there's no reason to start on 3.5
You need Variant type — semi-structured data handling at scale is a real pain point this solves
You want Spark Connect's decoupled architecture — especially if your team runs PySpark from diverse environments
Your Databricks / EMR platform has already moved to a Spark 4 runtime

Wait if...

You have RDD-based MLlib code that hasn't been migrated — do that work first on 3.5
Your platform hasn't certified Spark 4 yet — don't get ahead of your infrastructure
You're in the middle of a large pipeline rewrite — finish what you're doing, then upgrade
Your team is small and your current 3.5 jobs are stable — the risk/reward isn't there until you need a feature

The bottom line

Spark 4.0 is a good release that cleans up real debt. The Variant type alone makes it worth migrating for teams doing heavy semi-structured data work. Spark Connect maturing into a first-class option is the right architectural direction.

But "good release" doesn't mean "free upgrade." The ANSI mode changes and any lingering RDD-based MLlib code will catch teams that haven't been keeping up with deprecation warnings. Run your jobs against it in dev before you commit anything to production timelines.

The teams that will have the smoothest migrations are the ones that kept dependencies current on 3.x. The teams that will have the hardest time are the ones running 3.2 or 3.3 jobs that haven't been touched in two years. You know which one you are.

Disclosure: Written by Gautam Marya with AI assistance.