Liquid Clustering

An initial analysis of the new Delta Lake data management technique

Gustavo Martins, Closer Consulting

Current Challenge

When designing your lakehouse tables, defining the partition strategy can be challenging.

The general rules for choosing partitioning and ZORDER columns are well known [1], but data requirements, growth, and usage patterns frequently change over time.

Such changes can invalidate a previously defined, fixed data layout, making workloads inefficient.

Proposed Solution

Databricks has announced a new feature for Delta Lake 3.0 called Liquid Clustering [2].

This new data management technique can adapt the data layout based on changing patterns, making table design and management easier [2]. It also clusters new data incrementally [3].

In practice, it is only necessary to choose as clustering keys the columns that are queried most often.
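
Moreover, the clustering keys can be changed later as access patterns evolve, without redefining the table. A minimal sketch, with illustrative table and column names:

-----------------------------------------------------------------------------------
-- Define clustering keys at creation time
CREATE TABLE events (...) CLUSTER BY (event_date);

-- Later, adapt the layout to new query patterns
ALTER TABLE events CLUSTER BY (event_date, user_id);
-----------------------------------------------------------------------------------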

Besides the configuration benefits (both initial and ongoing), Databricks also claims a 2.5x faster ingestion time with a 1 TB table [2].

Since it is not explicit how this metric is calculated, it is assumed here to be the elapsed time of the previous method divided by the elapsed time of the newer method (e.g., 25 minutes for the old method versus 10 minutes for the new one yields 2.5x).


Liquid Clustering Benchmark


To compare both methods, a realistic scenario was replicated using synthetic data. For the older method, columns reference_date and client_id were used as the partitioning and ZORDER columns, respectively:

-----------------------------------------------------------------------------------
CREATE TABLE table_part (...) PARTITIONED BY (reference_date);
OPTIMIZE table_part ZORDER BY (client_id);
-----------------------------------------------------------------------------------

Although it is stated that the clustering keys can be defined in any order for the Liquid Clustering (LC) method, two tables were created to verify this:

-----------------------------------------------------------------------------------
CREATE TABLE table_clust (...) CLUSTER BY (reference_date, client_id);
CREATE TABLE table_clust_inv (...) CLUSTER BY (client_id, reference_date);
-----------------------------------------------------------------------------------
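
Note that for the LC tables there is no ZORDER step; instead, clustering is triggered by a plain OPTIMIZE, which, per the documentation, clusters only the data that needs it, i.e., incrementally [3]:

-----------------------------------------------------------------------------------
OPTIMIZE table_clust;
-----------------------------------------------------------------------------------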

Lastly, 10 additional payload columns were added. If this number is modified, the number of rows changes accordingly (assuming the dataset's overall size is kept constant), so the setup can later be tweaked to approximate other scenarios. The notebook can be found here:

https://github.com/GustavoBMG/liquid_clustering

Due to budget and GCP quota limitations, the chosen dataset sizes were 12.5 GB, 25 GB, and 50 GB.

An n1-highmem-64 single-node cluster was used.


Ingestion Performance

Workloads were ingested into the respective tables, with 1 GB per partition and 15,000 distinct values for the ZORDER column:

Table 1: ingestion times (1 GB per partition, 15,000 distinct ZORDER values)
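
The exact data generation logic is in the linked notebook; the sketch below only illustrates how the two cardinality knobs could be controlled in Spark SQL (all column expressions and row counts are hypothetical):

-----------------------------------------------------------------------------------
-- Hypothetical sketch: modulo arithmetic caps each key's cardinality.
-- id % 25 spreads rows over 25 date partitions (~1 GB each on a 25 GB run);
-- id % 15000 yields 15,000 distinct values for the ZORDER/clustering column.
INSERT INTO table_part
SELECT
  date_add('2023-01-01', CAST(id % 25 AS INT)) AS reference_date,
  CAST(id % 15000 AS INT)                      AS client_id,
  rand()                                       AS extra_col_1 -- ...plus the remaining payload columns
FROM range(100000000);
-----------------------------------------------------------------------------------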

Adjusting Cardinality

Two additional tests were run, changing the cardinality of the partition column to 0.5 GB per partition (Table 2) and of the ZORDER column to 75,000 distinct values (Table 3):


Table 2: ingestion times with 0.5 GB per partition

Table 3: ingestion times with 75,000 distinct ZORDER values

Read Performance

Two queries were performed, one with a WHERE clause on the partition column (Table 4) and another on the ZORDER column (Table 5):

Table 4: read times filtering on the partition column

Table 5: read times filtering on the ZORDER column
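
The exact queries are in the linked notebook; a minimal sketch of their shape, with illustrative filter values:

-----------------------------------------------------------------------------------
-- WHERE clause on the partition / date column (Table 4)
SELECT count(*) FROM table_part WHERE reference_date = '2023-01-10';

-- WHERE clause on the ZORDER / client column (Table 5)
SELECT count(*) FROM table_part WHERE client_id = 42;
-----------------------------------------------------------------------------------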

Summary

From Tables 1, 2, and 3 it is clear that there is a significant gain in ingestion time. The measured speedup was below the claimed 2.5x for most of the tests, but it will most likely approach that figure at larger dataset sizes, as the 50 GB results already suggest.

The order of the clustering keys seems to make no significant difference in ingestion time, as shown in Tables 1, 2, and 3, although a test with a bigger dataset and a higher number of distinct values would be worth trying.

As expected, Liquid Clustering performs comparatively better when the cardinality of the key columns is higher, as per Table 2. More precisely, it is the old method that degrades in that scenario, since the LC results vary little across Tables 1, 2, and 3.

For read performance there is a gain as well, as shown in Tables 4 and 5, but it seems less pronounced. There also appear to be performance differences between the tables with LC keys in a different order, but this is not conclusive; it may be that the dataset is not sufficiently large, or that LC needs more time to adapt to the queries.

An important caveat for the read performance test is that only a simple query was used (again, due to budget and quota limitations); this point deserves a more in-depth analysis, perhaps with TPC-DS.

In summary, ingestion clearly improves with LC, especially for datasets above 50 GB or with a high number of partitions.

...
Do you want to know more? Schedule a meeting with us here.

We will be glad to share our experience and assist you on your journey.

...

References:

[1] Databricks, When to partition tables on Databricks, Databricks documentation

[2] Databricks, Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering, Databricks blog

[3] Databricks, Use liquid clustering for Delta tables, Databricks documentation
