Automatic Data Platform Optimization

Phil Dakin
Published in ITNEXT
Nov 4, 2023 · 7 min read

It is critically important to introduce automated optimization mechanisms that continuously tune large data platform deployments.

Arena

Large-scale deployments of general-purpose analytics platforms are rarely well optimized for performance and cost. As both the number and the average size of these deployments grow, it is critical to introduce automated optimization mechanisms that continuously tune each deployment to meet its workload's needs efficiently. These optimization opportunities exist at many layers of the data stack. Users can apply AI to rewrite SQL queries more efficiently, and adopt systems that continuously evaluate platform workloads to suggest data model optimizations. On the configuration side, analyzing the platform's workload enables more efficient sizing of compute and better tuning of per-job and global platform parameters. Behind the user interface, machine learning can auto-optimize internal data structures based on usage patterns. This article surveys automatic data platform optimization techniques along with the most exciting companies currently offering solutions.

Techniques

Continuous optimization can be implemented through the public interfaces of a data platform, or as a feature internal to the platform itself.

Some techniques using the user interface are:

  • Analysis Optimization — Rewriting portions of the analysis code (SQL, Spark, etc.).
  • Data Model Optimization — Restructuring platform entities (tables, views, indices).
  • Parameter Tuning — Changing configurations exposed by the data platform.
  • Usage-Based Cleanup — Eliminating platform entities that are not in-use.

Internally to the data platform, optimizations available include:

  • Learned Index Structures — Analyze user workloads to determine the best way to cluster data.

Using Platform Interfaces

Analysis Optimization

Analysis optimization involves statically analyzing analysis code and suggesting improvements. Let’s take a look at an example in SQL. We’ll use a 1 GB TPC-H dataset loaded into Postgres, and a sample Tableau query courtesy of Espresso AI’s website. Here’s the original query:

SELECT
  t0.N_NAME AS N_NAME,
  SUM(1) AS "cnt:C_NAME:ok"
FROM
  CUSTOMER CUSTOMER
  INNER JOIN (
    SELECT
      CUSTOMER.C_NATIONKEY AS C_NATIONKEY,
      NATION.N_NAME AS N_NAME
    FROM
      CUSTOMER CUSTOMER
      LEFT JOIN NATION NATION ON (CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY)
    GROUP BY
      1,
      2
  ) t0 ON (CUSTOMER.C_NATIONKEY = t0.C_NATIONKEY)
WHERE
  (
    CASE
      WHEN (
        (CUSTOMER.C_MKTSEGMENT IN ('AUTOMOBILE'))
        OR (CUSTOMER.C_MKTSEGMENT IS NULL)
      ) THEN FALSE
      ELSE TRUE
    END
  )
GROUP BY
  1;

And its corresponding plan on Postgres:

An automated tool could simplify this query by eliminating the CASE expression and the unnecessary extra left join.

SELECT
  NATION.N_NAME AS N_NAME,
  COUNT(1) AS "cnt:C_NAME:ok"
FROM
  CUSTOMER CUSTOMER
  INNER JOIN NATION NATION ON (CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY)
WHERE
  CUSTOMER.C_MKTSEGMENT NOT IN ('AUTOMOBILE')
  AND CUSTOMER.C_MKTSEGMENT IS NOT NULL
GROUP BY
  NATION.N_NAME;

After performing such an optimization, we’ll see that the query plan is simplified as well:

Though many optimizations of this sort are detected automatically by the data platform’s optimizer, in some cases static analysis can find improvements that the platform’s heuristic optimizer misses.
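One way to sanity-check that a rewrite like this preserves semantics is to run both forms against a small fixture. Below is a minimal sketch using SQLite (standing in for Postgres) with hypothetical miniature CUSTOMER/NATION tables; since SQLite predicates are numeric, the CASE returns 0/1 rather than FALSE/TRUE:

```python
import sqlite3

# Hypothetical miniature CUSTOMER/NATION tables standing in for the TPC-H data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NATION (N_NATIONKEY INT, N_NAME TEXT);
CREATE TABLE CUSTOMER (C_CUSTKEY INT, C_NATIONKEY INT, C_MKTSEGMENT TEXT);
INSERT INTO NATION VALUES (1, 'FRANCE'), (2, 'PERU');
INSERT INTO CUSTOMER VALUES
  (10, 1, 'BUILDING'), (11, 1, 'AUTOMOBILE'),
  (12, 2, 'MACHINERY'), (13, 2, NULL), (14, 2, 'BUILDING');
""")

original = """
SELECT t0.N_NAME, SUM(1)
FROM CUSTOMER
INNER JOIN (
  SELECT CUSTOMER.C_NATIONKEY, NATION.N_NAME
  FROM CUSTOMER
  LEFT JOIN NATION ON CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY
  GROUP BY 1, 2
) t0 ON CUSTOMER.C_NATIONKEY = t0.C_NATIONKEY
WHERE CASE
  WHEN CUSTOMER.C_MKTSEGMENT IN ('AUTOMOBILE')
    OR CUSTOMER.C_MKTSEGMENT IS NULL THEN 0 ELSE 1 END
GROUP BY 1
"""

rewritten = """
SELECT NATION.N_NAME, COUNT(1)
FROM CUSTOMER
INNER JOIN NATION ON CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY
WHERE CUSTOMER.C_MKTSEGMENT NOT IN ('AUTOMOBILE')
  AND CUSTOMER.C_MKTSEGMENT IS NOT NULL
GROUP BY NATION.N_NAME
"""

a = sorted(conn.execute(original).fetchall())
b = sorted(conn.execute(rewritten).fetchall())
assert a == b  # the rewrite preserves the query's result
print(a)  # → [('FRANCE', 1), ('PERU', 2)]
```

A real analysis-optimization tool would prove the equivalence symbolically rather than by testing on a fixture, but this kind of differential check is a useful guardrail for any automated rewriter.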

Data Model Optimization

In addition to inspecting individual queries for optimization, a continuous optimization system can inspect the analysis workload in aggregate to determine whether the underlying data model is optimal. As a trivial example, consider the following two queries, which join against an identical subquery over the orders table:

select
  c_nationkey,
  avg(c_avgspend) as avgspend
from
  customer
  inner join (
    select
      o_custkey,
      sum(o_totalprice) as c_totalspend,
      avg(o_totalprice) as c_avgspend
    from
      orders
    group by
      o_custkey
  ) as a on o_custkey = c_custkey
group by
  c_nationkey;

select
  c_mktsegment,
  max(c_totalspend) as max_totalspend
from
  customer
  inner join (
    select
      o_custkey,
      sum(o_totalprice) as c_totalspend,
      avg(o_totalprice) as c_avgspend
    from
      orders
    group by
      o_custkey
  ) as a on o_custkey = c_custkey
group by
  c_mktsegment
order by
  max_totalspend asc
limit
  5;

A system providing automated data model optimization would analyze this workload, notice that the same joined subquery is recomputed by every query, and suggest remodeling the underlying data so that this intermediate result can be precomputed once and shared across queries. Notably, AWS Redshift is already providing a basic version of this type of analysis.
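A crude way to surface such candidates is to fingerprint subqueries across the workload and count repeats. The sketch below is purely illustrative (naive regex extraction; a production tool would parse the SQL into an AST and canonicalize it before hashing):

```python
import hashlib
import re
from collections import Counter

def fingerprint(sql: str) -> str:
    """Normalize whitespace and case so textually equivalent SQL hashes identically."""
    canonical = re.sub(r"\s+", " ", sql.strip().lower())
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def repeated_subqueries(workload: list[str], min_count: int = 2) -> list[str]:
    """Return fingerprints of parenthesized subqueries seen at least min_count times."""
    counts = Counter()
    for query in workload:
        # Naive extraction: any "( select ..." span up to the first close paren.
        for match in re.findall(r"\(\s*select.*?\)", query, re.IGNORECASE | re.DOTALL):
            counts[fingerprint(match)] += 1
    return [fp for fp, n in counts.items() if n >= min_count]

# Two hypothetical workload queries sharing one aggregating subquery over orders:
queries = [
    "SELECT c_nationkey FROM customer JOIN (SELECT o_custkey, SUM(o_totalprice) t"
    " FROM orders GROUP BY o_custkey) a ON a.o_custkey = c_custkey",
    "SELECT c_mktsegment FROM customer JOIN (SELECT o_custkey, SUM(o_totalprice) t"
    " FROM orders GROUP BY o_custkey) a ON a.o_custkey = c_custkey",
]
shared = repeated_subqueries(queries)
print(len(shared))  # → 1
```

Each flagged fingerprint is a candidate for precomputation, for example as a materialized view that both queries can read instead of recomputing the aggregate.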

Parameter Tuning

Parameter tuning is a broad category that encompasses changes to performance-relevant data platform configurations. These configurations fall into two categories:

  1. Resource Sizing — Controls the compute resources available to given analytics workloads.
  2. Behavioral — Controls how the platform utilizes the compute resources it has available.

Resource sizing configurations are tuned to ensure that the utilized resources are an appropriate size for the given workload. On Spark-based systems like Databricks, this means determining when to aggregate jobs onto shared clusters, and what size clusters to use. On Snowflake, this means determining how much work to allocate to each warehouse, and what size warehouses to use. Resource sizing can have major effects on cost. For instance, Vestiaire was able to reduce their Snowflake costs by 10% by implementing a warehouse allocation scheme. In a Spark setting, Insider was able to reduce their AWS EMR bill by 25% utilizing Sync Computing’s resource sizing optimization platform, Gradient.
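Snowflake warehouse sizes roughly double both capacity and credit consumption at each step, so a toy right-sizing rule can be sketched as follows (the `recommend_size` helper, the target threshold, and the utilization trace are all hypothetical; real platforms like Gradient model far richer signals):

```python
# Warehouse sizes, each step doubling capacity (and cost) relative to the last.
SIZES = ["XS", "S", "M", "L", "XL"]

def recommend_size(current: str, utilization: list[float], target: float = 0.8) -> str:
    """Pick the smallest size whose capacity keeps peak utilization under target.

    utilization is a trace of per-interval utilization as a fraction of the
    *current* warehouse's capacity.
    """
    peak = max(utilization)
    # Express peak demand in XS-equivalent units.
    demand = peak * (2 ** SIZES.index(current))
    for size in SIZES:
        if demand <= target * (2 ** SIZES.index(size)):
            return size
    return SIZES[-1]

# E.g. a Large warehouse peaking at 15% utilization is heavily oversized:
print(recommend_size("L", [0.10, 0.15, 0.08]))  # → S
```

Because cost doubles with each size step, downsizing from L to S in this toy model would cut the warehouse's credit burn by 4x while still leaving headroom under the 80% target.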

Behavioral configurations are tuned to ensure the platform is using the resources it has wisely. For instance, on Postgres, configurations like the huge page settings and the write-ahead log settings can have major impacts on system performance. On Redshift, changes to compression settings and workload management configuration can also affect performance. Companies like OtterTune seek to encode knowledge of these configurations, and continually tune them for optimal performance.
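As an illustration of the kind of starting point such a tuner might compute, the sketch below applies widely cited community rules of thumb for Postgres memory settings (shared_buffers around 25% of RAM, effective_cache_size around 75%). The function name and interface are hypothetical, and a system like OtterTune refines such settings from live metrics rather than fixed ratios:

```python
def suggest_memory_settings(ram_gb: int) -> dict[str, str]:
    """Rule-of-thumb Postgres memory settings for a dedicated database host."""
    return {
        "shared_buffers": f"{ram_gb // 4}GB",        # ~25% of RAM
        "effective_cache_size": f"{ram_gb * 3 // 4}GB",  # ~75% of RAM
    }

print(suggest_memory_settings(64))
# → {'shared_buffers': '16GB', 'effective_cache_size': '48GB'}
```

Static heuristics like these are only a baseline; the value of a continuous tuner is in adjusting them as the workload's read/write mix and working set change over time.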

Usage-Based Cleanup

As an organization’s data analysis platform grows, unused data and jobs begin to accumulate. One path to cost reduction is to implement an automated usage-based cleanup system. For instance, Meta’s SCARF system is a daemon that continuously eliminates unused entities from their data platform and “has removed petabytes of unused data across 12.8M different data types stored in 21 different data systems” since its release. Though most data platforms offer observability into usage patterns, systematically analyzing this information and acting on it is not straightforward. Companies like ChaosGenius provide this type of analysis as part of their product.
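The core loop of such a system can be sketched as below, assuming a hypothetical last-access log keyed by entity name (a real system like SCARF must additionally reason about lineage, downstream consumers, and retention policy before deleting anything):

```python
from datetime import datetime, timedelta

def stale_entities(last_access: dict[str, datetime],
                   now: datetime,
                   retention_days: int = 90) -> list[str]:
    """Return entities not accessed within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, ts in last_access.items() if ts < cutoff)

# Hypothetical access log; table names and timestamps are illustrative.
now = datetime(2023, 11, 1)
access_log = {
    "daily_sales_v2": datetime(2023, 10, 28),
    "tmp_backfill_2021": datetime(2021, 6, 3),
    "abandoned_experiment": datetime(2023, 1, 15),
}
print(stale_entities(access_log, now))
# → ['abandoned_experiment', 'tmp_backfill_2021']
```

In practice the output would feed a review or soft-delete queue rather than immediate deletion, so that a false positive costs a restore instead of lost data.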

Internal Optimizations

Learned Index Structures

An interesting area of current research is the intersection of ML with database systems. These techniques allow the data platform to automatically learn from user workloads. A key application of these techniques is in learned index structures. Redshift already offers some basic automated index tuning, and AWS recently hired ML-Systems legend Tim Kraska to lead the new Learned Systems Group to further develop these types of optimizations. As Amazon continues to push this sort of optimization forward, other database services will need to provide similar offerings to stay competitive. There is currently no learned index structure equivalent in the Spark ecosystem or at Snowflake.
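To make the idea concrete, here is a minimal single-model sketch in the spirit of the recursive model index of Kraska et al.: fit a linear model from key to position over sorted data, record the worst-case prediction error, and answer lookups with a search bounded by that error. All names here are illustrative; production learned indexes use hierarchies of models and handle updates:

```python
import bisect

class LearnedIndex:
    """A toy one-model learned index over a sorted list of integer keys."""

    def __init__(self, keys: list[int]):
        self.keys = keys  # must be sorted
        n = len(keys)
        # Least-squares fit of position i against key value.
        mean_k = sum(keys) / n
        mean_i = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(keys))
        var = sum((k - mean_k) ** 2 for k in keys) or 1.0
        self.slope = cov / var
        self.intercept = mean_i - self.slope * mean_k
        # Worst-case prediction error bounds the search window at lookup time.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key: int) -> int:
        return round(self.slope * key + self.intercept)

    def lookup(self, key: int) -> int:
        """Return the position of key, searching only within the error bound."""
        guess = self._predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        return bisect.bisect_left(self.keys, key, lo, hi)

idx = LearnedIndex(list(range(0, 1000, 10)))  # keys 0, 10, ..., 990
print(idx.lookup(420))  # → 42
```

On regularly distributed keys like these the model is near-perfect, so each lookup touches a constant-size window instead of traversing a B-tree; the learned-index literature shows the same effect on many real-world key distributions.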

Companies

OtterTune

OtterTune is an AI-powered optimizer for Amazon RDS and Aurora databases founded by CMU database researchers Andy Pavlo, Dana Van Aken, and Bohan Zhang. OtterTune connects to the database and begins logging runtime metrics, configuration information, table/index-level metrics, and query histories. ML models then predict which of this information is most relevant and output updates to the configuration. For further detail, see their paper. OtterTune raised $12 million in early 2022 to build this product.

Chaos Genius

Chaos Genius is an observability and optimization platform focused on Snowflake, providing both parameter tuning and analysis optimizations. Chaos Genius was founded by Preeti Shrimal, and their last fundraising was a $3.3 million seed round in January 2023.

Select

Select is an observability and optimization platform focused on Snowflake, providing lightweight parameter tuning optimizations as part of their offering. Select was founded by Ian Whitestone and Niall Woodward in mid-2022. No information about Select’s fundraising is available online.

BlueSky

BlueSky is an optimization platform that provides parameter tuning and analysis optimization for Snowflake deployments. Founded in March 2022 by Mingsheng Hong, the company has raised $8.8 million in seed funding.

Keebo

Keebo is an optimization platform that provides parameter tuning and analysis optimization for Snowflake deployments. Founded in 2020 by Barzan Mozafari and Yongjoo Park, Keebo is powered by a late 2022 $15 million Series A.

EverSQL

EverSQL is a multi-database optimization platform that provides analysis and data model optimizations. It was founded in 2019 by Oded Valin and Tomer Shay.

EspressoAI

EspressoAI is a platform focused on analysis optimization for Snowflake. It was founded in 2023 by Juri Ganitkevitch, Ben Lerner, Alex Kouzemtchenko, and Nima Badizadegan.

Sync Computing

Sync Computing is focused on parameter tuning for Spark deployments. It was founded in 2019 by Jeff Chou and Suraj Bramhavar, and is propelled by its recent mid-2022 $15.5 million fundraising round.

Unravel

Unravel is an observability and optimization platform that supports Databricks, Snowflake, BigQuery, and Amazon EMR. Founded in 2016 by Kunal Agarwal and Shivnath Babu, Unravel closed a $50 million Series D round in late 2022.
