Snowflake builds Spark clients for its own analytics engine
(2025/08/07)
- Reference: 1754573473
- News link: https://www.theregister.co.uk/2025/08/07/snowflake_builds_spark_clients_for/
- Source link:
Snowflake is launching a client connector to run Apache Spark code directly in its cloud warehouse - no cluster setup required.
This is designed to avoid provisioning and maintaining a cluster running the popular open source analytics engine.
Apache Spark was first [1]introduced in 2014 to help solve big data problems on the Hadoop distributed file system, but it has continued to grow in the cloud era, owing to its popularity in analytics and data preparation.
Snowflake, on the other hand, began life as an RDBMS data warehouse, part of the generation separating storage and compute for the cloud.
Even Netflix struggles to identify and understand the cost of its AWS estate [3]READ MORE
Chris Child, veep of product management at Snowflake, said that customers have been using Spark for a long time, in a lot of cases to process data and get it ready for use in analytics, or in AI.
While some customers pointed out the burden of running separate systems – two different compute engines, two types of infrastructure, and two layers of governance – the effort of rewriting Spark code, usually written in Java, Python or Scala, into the ubiquitous database language SQL made migrating Spark workloads too much to contemplate.
"The feedback we got was it's often very hard to rewrite the type of transformations that people have built," Child told us.
Then the Apache Spark community introduced [6]Spark Connect, which adopts a client-server architecture allowing client applications to connect to remote Spark clusters.
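The client-server split can be sketched in a few lines of PySpark: the client holds only a thin stub, builds logical plans locally, and ships them over gRPC to whatever server answers at an `sc://` endpoint. A minimal sketch, assuming pyspark 3.4 or later; 15002 is Spark Connect's default port, and the host name below is a placeholder, not a real server:

```python
# Sketch of the Spark Connect client-server split: the client builds a
# logical plan locally and ships it to a remote endpoint addressed with
# an "sc://" URL. Port 15002 is Spark Connect's default.

def remote_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect endpoint URL, e.g. sc://host:15002."""
    return f"sc://{host}:{port}"

def connect(host: str, port: int = 15002):
    """Open a thin-client session against a remote Spark-compatible
    server. Requires pyspark>=3.4 with the connect extras installed,
    and a server actually listening at the endpoint."""
    from pyspark.sql import SparkSession  # lazy import: optional dependency
    return SparkSession.builder.remote(remote_url(host, port)).getOrCreate()

# Usage (assumes a reachable server; the host name is a placeholder):
#   spark = connect("spark-server.example.com")
#   spark.range(10).filter("id % 2 = 0").show()
```

Because the client never runs an executor, anything that speaks the same protocol can sit on the server side of that URL, which is the opening Snowflake is exploiting.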
With its new Snowpark Connect, Snowflake promises Spark users the same ability to run Spark code in a Spark client, but this time linked to a Snowflake analytics engine as the server rather than a separate Spark cluster. The company also continues to contribute to the open source Spark project.
How Apache Spark lit up the tech world and outshone its big data brethren [8]READ MORE
"The customers who've been running this in our pre-launch preview have seen an average of 5.6 times faster performance — run the exact same code on the exact same data — and they're also seeing about a 40 percent cost savings versus traditional Spark," Child claimed.
Snowflake claims it allows customers to use its vectorized engine for their Spark code while avoiding the complexity of maintaining or tuning separate Spark environments — including managing dependencies, version compatibility and upgrades. "You can now run all modern Spark DataFrame, Spark SQL and user-defined function code with Snowflake," it said.
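To make that claim concrete, here is the kind of portable PySpark code in question: standard DataFrame operations plus a Python UDF, with nothing backend-specific in it. The data and column names are made up for illustration; only `build_report` touches Spark, and the UDF body is plain Python, so it runs against whatever session the connector hands you:

```python
# Plain-Python UDF body: testable without any cluster or session.
def domain_of(email: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return email.split("@", 1)[-1].lower()

def build_report(spark):
    """Run one pipeline -- DataFrame ops plus a UDF -- on any
    Spark-compatible session: a classic cluster, Spark Connect,
    or (per Snowflake's claim) its own engine."""
    from pyspark.sql import functions as F, types as T
    domain_udf = F.udf(domain_of, T.StringType())
    df = spark.createDataFrame(
        [("a@x.com", 3), ("b@y.com", 5), ("c@x.com", 2)],
        ["email", "orders"],
    )
    return (
        df.withColumn("domain", domain_udf("email"))
          .groupBy("domain")
          .agg(F.sum("orders").alias("total_orders"))
    )
```

The point of the architecture is that `build_report` never learns which engine executes it; only the session handed in changes.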
Two become one
The move is part of a broader consolidation across what had been two distinct markets: data lakes for machine learning and ad hoc analytics; data warehousing for repeatable, query-optimized, high-concurrency BI and analytics.
Databricks was built around Spark from its inception to provide data lakes, but has spent the last five years branching out to combine data lakes and data warehouses under the "lakehouse" concept. Snowflake, meanwhile, has branched out into providing data lakes on its data platform.
Both approaches have had their critics. In 2021, [9]Gartner pointed out data lakes can struggle to support the number of concurrent users handled by "traditional" data warehouses. Databricks has since stated that it has improved concurrency with its SQL Serverless, designed to provide instant compute to users for their BI and SQL workloads.
[10]Snowflake and Databricks bank PostgreSQL acquisitions to bring transactions onto their platforms
[11]Industry reacts to DuckDB's radical rethink of Lakehouse architecture
[12]Vector search is the new black for enterprise databases
[13]Delta Lake and Iceberg communities collide – in a good way
Snowflake has also received criticism for surprising users with unexpected costs as compute resources flex to support more users. It has spent recent years trying to solve the problem with an optimization strategy that helps customers reduce bills, to the extent that a prominent user — grocery delivery company Instacart — [14]surprised market watchers by saying it was slashing tens of millions of dollars off its Snowflake bills over three years, amid unmerited speculation that it was cutting Snowflake usage.
In tandem, Snowflake has been trying to execute a strategy aimed at getting customers to use its compute engines to work on data, no matter where it is stored. Child told us customers want to store a lot more data than they necessarily want to put in Snowflake.
"We've made a huge investment in [15]Apache Iceberg to make it a lot easier for people to do that. We heard from a lot of our customers that they want to process data, not just in SQL, but in other ways. And so we've made a big investment in both Snowpark Connect to make sure that they can bring that code however they want," he said. ®
[1] https://www.theregister.com/2024/06/14/ten_years_apache_spark/
[3] https://www.theregister.com/2024/12/18/netflix_aws_management_tools/
[6] https://spark.apache.org/spark-connect/
[8] https://www.theregister.com/2024/06/14/ten_years_apache_spark/
[9] https://www.theregister.com/2021/05/24/data_lakes_struggle_with_sql_gartner/
[10] https://www.theregister.com/2025/06/10/snowflake_and_databricks_bank_postgresql/
[11] https://www.theregister.com/2025/06/05/ducklake_db_industry_reacts/
[12] https://www.theregister.com/2025/04/24/database_vector_search/
[13] https://www.theregister.com/2025/04/15/iceberg_delta_lake_collab/
[14] https://www.theregister.com/2023/08/31/snowflake_instacart_payments/
[15] https://www.theregister.com/2024/10/03/apache_iceberg_russell_spitzer_interview/