Introduction to Data Shuffling in Distributed SQL Engines

August 11, 2022

tag1

tag2

Alexey Goncharuk

Founder, General Management

Vladimir Ozerov

Founder, General Management

Abstract

Distributed SQL engines process queries on several nodes. Nodes may need to exchange tuples during query execution to ensure correctness and maintain a high degree of parallelism. This blog post discusses the concept of data shuffling in distributed query engines.

Streams

SQL engines convert a query string to a sequence of operators, which will call an execution plan. We assume that operators in a plan are organized in a tree. Every operator consumes data from zero, one or more child operators, and produces an output that a single parent operator consumes. Practical engines may use DAGs, where several parent operators consume the operator's output, but we ignore such cases for simplicity.

Some caption for the picture

Summary

Distributed SQL engines execute queries on several nodes. To ensure the correctness of results, engines reshuffle operator outputs to meet the requirements of parent operators. Two common shuffling strategies are partitioned and broadcast shuffles.

Both query planner and executor use shuffles. Planner uses distribution metadata to find the optimal placement of shuffle operators. The executor tracks the state of data streams, routes tuples to the proper physical nodes, and may also override planner decisions in the case of data skew.

In future blog posts, we will discuss how query planners decide on the optimal placement of shuffle operators.