In database design, the idea of hierarchical data represents relationships between entities as a tree-like structure. This type of data model is widely used in many domains, such as file systems, organizational structure, etc. When dealing with hierarchical data, it is crucial to efficiently query and extract information about the relationships between entities. In this post, we will explore two powerful SQL tools for querying hierarchical data: recursive Common Table Expressions (CTEs) and the CONNECT BY clause.
Hierarchical Data
Hierarchical data represents a natural parent-child relationship that is often visualized in the form of a tree structure. Imagine a family tree: grandparents on top, parents in the middle, and you and siblings at the bottom, all connected. That’s hierarchical data! It organizes information in levels, making it easy to understand how things are related. The most popular real-life example of hierarchical data is employee-manager relationships. Every employee is managed by a manager. And that manager is also an employee and (again) is managed by another manager who is also an employee.
Hierarchical Data representation in SQL
Relational models perform best with flat tables with rows and columns, not tree-like structures. However, people developed techniques to represent hierarchical data in SQL. The most common approach is referencing using foreign key. In the above example, we will add a column manager_id referring to the person who managed this employee to the employee table.
EMPLOYEE_ID | NAME | SALARY | MANAGER_ID |
---|---|---|---|
1 | Adam | 60000 | NULL |
2 | John | 30000 | 1 |
… | … | … | … |
*Example SQL table representing hierarchical data structure
Querying Hierarchical Data
When querying hierarchical data, we often want to understand the relationship in both directions: who manages whom and who is managed by whom. However, querying hierarchical data is tricky because we don’t know the depth of the tree, i.e. how many levels of hierarchy there are. Before we look at how to do this in SQL, let’s prepare some data to work with. Note that all SQL code in this post is Oracle, as it natively supports CONNECT BY. Other RDBMS SQL should be similar.
Example data
EMPLOYEE_ID NAME SALARY MANAGER_ID 1 Adam 60000 null 2 Sarah 70000 null 3 David 50000 1 4 Emily 55000 2 5 Michael 45000 1 6 Jessica 50000 2 7 Ben 35000 3 8 Olivia 37000 4 9 Charles 32000 5 10 Sophia 33000 6 11 Alex 37000 3 12 Maya 38000 4 13 Daniel 35000 5 14 Isabella 36000 6 15 Ryan 25000 7 16 Chloe 26000 8 17 Noah 24000 9 18 Mia 25000 10 19 Liam 26000 11 20 Evelyn 27000 12 21 William 25000 13 22 Charlotte 26000 14 23 Ethan 27000 7 24 Ava 28000 8 25 Lucas 26000 9 26 Amelia 27000 10 27 Mason 28000 11 28 Harper 29000 12 29 Logan 27000 13 30 Sofia 28000 14
CREATE TABLE EMPLOYEES (
EMPLOYEE_ID NUMBER PRIMARY KEY,
NAME VARCHAR2(50) NOT NULL,
SALARY NUMBER,
MANAGER_ID NUMBER REFERENCES EMPLOYEES(EMPLOYEE_ID)
);
INSERT INTO EMPLOYEES VALUES
(1, 'Adam', 60000, NULL),
(2, 'Sarah', 70000, NULL),
(3, 'David', 50000, 1),
(4, 'Emily', 55000, 2),
(5, 'Michael', 45000, 1),
(6, 'Jessica', 50000, 2),
(7, 'Ben', 35000, 3),
(8, 'Olivia', 37000, 4),
(9, 'Charles', 32000, 5),
(10, 'Sophia', 33000, 6),
(11, 'Alex', 37000, 3),
(12, 'Maya', 38000, 4),
(13, 'Daniel', 35000, 5),
(14, 'Isabella', 36000, 6),
(15, 'Ryan', 25000, 7),
(16, 'Chloe', 26000, 8),
(17, 'Noah', 24000, 9),
(18, 'Mia', 25000, 10),
(19, 'Liam', 26000, 11),
(20, 'Evelyn', 27000, 12),
(21, 'William', 25000, 13),
(22, 'Charlotte', 26000, 14),
(23, 'Ethan', 27000, 7),
(24, 'Ava', 28000, 8),
(25, 'Lucas', 26000, 9),
(26, 'Amelia', 27000, 10),
(27, 'Mason', 28000, 11),
(28, 'Harper', 29000, 12),
(29, 'Logan', 27000, 13),
(30, 'Sofia', 28000, 14);
Problem statement
We want to look at it from two directions:
- Problem 1: Start with individual employees and follow the ladder, revealing who manages them, their manager’s manager, and so on, all the way to the top.
- Problem 2: Stand at the highest level and look down the hierarchy. For each employee, calculate the total salary of everyone under their direct or indirect management, like a salary pyramid.
When looking at direct relationships (who manages who), a simple join clause will work. But things get tricky when we include multiple levels in the hierarchy.
Recursive CTE
A recursive CTE (Common Table Expression) is a valuable feature in SQL. By referencing itself within the CTE, it allows you to query hierarchical data by repeating a process until it reaches every corner, making it an effective tool for querying and analyzing hierarchical data.
The recursive CTE syntax is not too different from non-recursive one.
Non-recursive CTE:
WITH CTE_NAME (column_1, column2, ...) AS
(
-- CTE_QUERY_DEFINITION (SELECT ... FROM ... WHERE)
)
Recursive CTE:
WITH CTE_NAME (column_1, column2, ...) AS
(
-- ANCHOR_MEMBER (SELECT ... FROM ... WHERE BASE_LEVEL)
UNION (ALL)
-- RECURSIVE_MEMBER (SELECT ... FROM (referrence to CTE_NAME) WHERE ...)
)
The definition of a recursive CTE consists of two parts. The anchor member or the initial query, which is executed once. This defines the starting point of the execution. The recursive query has the reference to the CTE itself and is executed recursively until it returns no result. And we need UNION or UNION ALL to combine the results. You will see how it work when use it to solve real problems.
Problem 1
For each employee, we get the direct manager, the path from the highest-level manager, and the employee’s current level.
WITH company_hierarchy (EMPLOYEE_ID, NAME, MANAGER, MANAGER_PATH, EMPLOYEE_LEVEL) AS
(
-- Base query. Select ALL employees with no manager
SELECT
emp.EMPLOYEE_ID
,emp.NAME
,NULL AS MANAGER
,'' AS MANAGER_PATH
,1 AS EMPLOYEE_LEVEL
FROM EMPLOYEES emp
WHERE emp.MANAGER_ID IS NULL
UNION ALL
-- Recursive query which refer to the CTE itself.
-- Query all employee who is managed directly by previous level in company_hierarchy
SELECT
emp.EMPLOYEE_ID
,emp.NAME
,com.NAME AS MANAGER
,com.MANAGER_PATH || com.NAME || '/' AS MANAGER_PATH
,com.EMPLOYEE_LEVEL + 1 AS EMPLOYEE_LEVEL
FROM EMPLOYEES emp
INNER JOIN company_hierarchy com
ON emp.MANAGER_ID = com.EMPLOYEE_ID
)
SELECT *
FROM company_hierarchy
Recursive CTE query result
EMPLOYEE_ID NAME MANAGER MANAGER_PATH EMPLOYEE_LEVEL 1 Adam null null 1 2 Sarah null null 1 3 David Adam Adam/ 2 5 Michael Adam Adam/ 2 4 Emily Sarah Sarah/ 2 6 Jessica Sarah Sarah/ 2 7 Ben David Adam/David/ 3 11 Alex David Adam/David/ 3 9 Charles Michael Adam/Michael/ 3 13 Daniel Michael Adam/Michael/ 3 8 Olivia Emily Sarah/Emily/ 3 12 Maya Emily Sarah/Emily/ 3 10 Sophia Jessica Sarah/Jessica/ 3 14 Isabella Jessica Sarah/Jessica/ 3 15 Ryan Ben Adam/David/Ben/ 4 23 Ethan Ben Adam/David/Ben/ 4 19 Liam Alex Adam/David/Alex/ 4 27 Mason Alex Adam/David/Alex/ 4 17 Noah Charles Adam/Michael/Charles/ 4 25 Lucas Charles Adam/Michael/Charles/ 4 21 William Daniel Adam/Michael/Daniel/ 4 29 Logan Daniel Adam/Michael/Daniel/ 4 16 Chloe Olivia Sarah/Emily/Olivia/ 4 24 Ava Olivia Sarah/Emily/Olivia/ 4 20 Evelyn Maya Sarah/Emily/Maya/ 4 28 Harper Maya Sarah/Emily/Maya/ 4 18 Mia Sophia Sarah/Jessica/Sophia/ 4 26 Amelia Sophia Sarah/Jessica/Sophia/ 4 22 Charlotte Isabella Sarah/Jessica/Isabella/ 4 30 Sofia Isabella Sarah/Jessica/Isabella/ 4
The below picture illustrates how the recursive CTE works:
First, the anchor member or initial query was executed. company_hierarchy
CTE contained 2 members with MANAGER_ID IS NULL
and EMPLOYEE_LEVEL=1
.
Then the recursive query was executed, joining the EMPLOYEES
table with company_hierarchy
to get all members managed by 2 employees currently in company_hierarchy
. company_hierarchy
was then set to all returned employees (EMPLOYEE_LEVEL=2).
We kept doing so until the recursive query returned nothing.
At the end, we UNION ALL
all the results.
Problem 2
This is a bit more difficult. Let’s reread the problem statement and try yourself before reading my answer.
WITH company_explode (MANAGER_ID, EMPLOYEE_ID, SALARY, MANAGEMENT_DISTANCE) AS
(
-- Base query. Select ALL employees
SELECT
emp.MANAGER_ID
,emp.EMPLOYEE_ID
,emp.SALARY
,1 AS MANAGEMENT_DISTANCE
FROM EMPLOYEES emp
UNION ALL
-- Recursive query. Select ALL managers manages the employees
SELECT
mgr.MANAGER_ID
,com.EMPLOYEE_ID
,com.SALARY
,com.MANAGEMENT_DISTANCE+1 AS MANAGEMENT_DISTANCE
FROM EMPLOYEES mgr
INNER JOIN company_explode com
ON mgr.EMPLOYEE_ID = com.MANAGER_ID
)
SELECT MANAGER_ID, SUM(SALARY) AS TOTAL_SALARY_MANAGED
FROM company_explode
GROUP BY MANAGER_ID
Recursive CTE result
MANAGER_ID TOTAL_SALARY_MANAGED null 1037000 1 442000 2 465000 3 178000 4 185000 5 169000 6 175000 7 52000 8 54000 9 50000 10 52000 11 54000 12 56000 13 52000 14 54000
The idea behind this solution is simple. First, explode the hierarchical structure of the data into a flattened table of all manager-employee pairs. Then calculate the sum with a simple GROUP BY
clause.
START WITH … CONNECT BY …
CONNECT BY clause achieves similar functionality as a recursive Common Table Expression (CTE), but with a much shorter syntax. Let’s take a look at it in action.
Problem 1
SELECT
emp.EMPLOYEE_ID
,emp.NAME
,mgr.NAME AS MANAGER
,TRIM(LEADING '/' FROM SYS_CONNECT_BY_PATH(mgr.NAME, '/')) AS MANAGER_PATH
,LEVEL AS EMPLOYEE_LEVEL
FROM EMPLOYEES emp
LEFT JOIN EMPLOYEES mgr
ON emp.MANAGER_ID = mgr.EMPLOYEE_ID
START WITH emp.MANAGER_ID IS NULL
CONNECT BY PRIOR emp.EMPLOYEE_ID = emp.MANAGER_ID;
CONNECT BY query result
EMPLOYEE_ID NAME MANAGER MANAGER_PATH EMPLOYEE_LEVEL 1 Adam null null 1 3 David Adam Adam 2 7 Ben David Adam/David 3 15 Ryan Ben Adam/David/Ben 4 23 Ethan Ben Adam/David/Ben 4 11 Alex David Adam/David 3 19 Liam Alex Adam/David/Alex 4 27 Mason Alex Adam/David/Alex 4 5 Michael Adam Adam 2 9 Charles Michael Adam/Michael 3 17 Noah Charles Adam/Michael/Charles 4 25 Lucas Charles Adam/Michael/Charles 4 13 Daniel Michael Adam/Michael 3 21 William Daniel Adam/Michael/Daniel 4 29 Logan Daniel Adam/Michael/Daniel 4 2 Sarah null null 1 4 Emily Sarah Sarah 2 8 Olivia Emily Sarah/Emily 3 16 Chloe Olivia Sarah/Emily/Olivia 4 24 Ava Olivia Sarah/Emily/Olivia 4 12 Maya Emily Sarah/Emily 3 20 Evelyn Maya Sarah/Emily/Maya 4 28 Harper Maya Sarah/Emily/Maya 4 6 Jessica Sarah Sarah 2 10 Sophia Jessica Sarah/Jessica 3 18 Mia Sophia Sarah/Jessica/Sophia 4 26 Amelia Sophia Sarah/Jessica/Sophia 4 14 Isabella Jessica Sarah/Jessica 3 22 Charlotte Isabella Sarah/Jessica/Isabella 4 30 Sofia Isabella Sarah/Jessica/Isabella 4
Problem 2
SELECT
emp.MANAGER_ID
,SUM(CONNECT_BY_ROOT emp.SALARY) AS TOTAL_SALARY_MANAGED
FROM EMPLOYEES emp
CONNECT BY PRIOR emp.MANAGER_ID = emp.EMPLOYEE_ID
GROUP BY emp.MANAGER_ID
CONNECT BY query result
MANAGER_ID TOTAL_SALARY_MANAGED null 1037000 1 442000 2 465000 3 178000 4 185000 5 169000 6 175000 7 52000 8 54000 9 50000 10 52000 11 54000 12 56000 13 52000 14 54000
Like magic, isn’t it! The implementation details (what goes on behind the scenes) may be a bit different. But the way we reason about the query is the same as it is with Recursive CTEs. The START WITH
clause provides the initial filter. The query is first executed with this filter, just like the anchor member in recursive CTEs. No START WITH
clause means no filter is needed and the entire query is included in the first step. Then the CONNECT BY
clauses specify how to connect between steps/levels in the hierarchical structure, just like recursive CTEs refer to themselves. Note that the PRIOR
keyword implies that the following value is from the previous recursive step.
One of the main differences between the two approaches is how we select data. With CONNECT BY
, we have to rely on built-in functions and operators to select the data we want. In the examples, we use the SYS_CONNECT_BY_PATH
function to construct the path and the CONNECT_BY_ROOT
operator to access the data in the first step.
My final thoughts
Querying hierarchical data presents unique challenges that require specialized techniques to extract meaningful information. Recursive CTEs and the CONNECT BY clause offer powerful solutions for navigating and analyzing hierarchical data in SQL. One interesting fact is that CONNECT BY was actually around before Recursive CTEs.
While both techniques solve the same problem, we can only use one. Which one should you use? Well, it depends on you. If you hate subqueries and CTEs, and you like cool short magic queries, use CONNECT BY. However, being less verbose makes CONNECT BY harder to reason about. It’s harder to write out that magic stuff, and harder to know why it doesn’t work. Also, by writing the recursion yourself, recursive CTEs give you more control and flexibility. And note that not all RDBMS (SQL Server for example) support the CONNECT BY clause, even though it has been around for a long time.
* You can find the execution of SQL in this post at https://dbfiddle.uk/hXEzBJkX