In database design, the idea of hierarchical data represents relationships between entities as a tree-like structure. This type of data model is widely used in many domains, such as file systems, organizational structure, etc. When dealing with hierarchical data, it is crucial to efficiently query and extract information about the relationships between entities. In this post, we will explore two powerful SQL tools for querying hierarchical data: recursive Common Table Expressions (CTEs) and the CONNECT BY clause.

Hierarchical Data

Hierarchical data represents a natural parent-child relationship that is often visualized in the form of a tree structure. Imagine a family tree: grandparents on top, parents in the middle, and you and siblings at the bottom, all connected. That’s hierarchical data! It organizes information in levels, making it easy to understand how things are related. The most popular real-life example of hierarchical data is employee-manager relationships. Every employee is managed by a manager. And that manager is also an employee and (again) is managed by another manager who is also an employee.

Hierarchical Data example Employee-Manager relationship

Hierarchical Data representation in SQL

Relational models perform best with flat tables with rows and columns, not tree-like structures. However, people developed techniques to represent hierarchical data in SQL. The most common approach is referencing using foreign key. In the above example, we will add a column manager_id referring to the person who managed this employee to the employee table.

EMPLOYEE_IDNAMESALARYMANAGER_ID
1Adam60000NULL
2John300001

*Example SQL table representing hierarchical data structure

Querying Hierarchical Data

When querying hierarchical data, we often want to understand the relationship in both directions: who manages whom and who is managed by whom. However, querying hierarchical data is tricky because we don’t know the depth of the tree, i.e. how many levels of hierarchy there are. Before we look at how to do this in SQL, let’s prepare some data to work with. Note that all SQL code in this post is Oracle, as it natively supports CONNECT BY. Other RDBMS SQL should be similar.

Example data
EMPLOYEE_IDNAMESALARYMANAGER_ID
1Adam60000null
2Sarah70000null
3David500001
4Emily550002
5Michael450001
6Jessica500002
7Ben350003
8Olivia370004
9Charles320005
10Sophia330006
11Alex370003
12Maya380004
13Daniel350005
14Isabella360006
15Ryan250007
16Chloe260008
17Noah240009
18Mia2500010
19Liam2600011
20Evelyn2700012
21William2500013
22Charlotte2600014
23Ethan270007
24Ava280008
25Lucas260009
26Amelia2700010
27Mason2800011
28Harper2900012
29Logan2700013
30Sofia2800014

CREATE TABLE EMPLOYEES (
  EMPLOYEE_ID NUMBER PRIMARY KEY,
  NAME VARCHAR2(50) NOT NULL,
  SALARY NUMBER,
  MANAGER_ID NUMBER REFERENCES EMPLOYEES(EMPLOYEE_ID)
);
INSERT INTO EMPLOYEES VALUES
(1, 'Adam', 60000, NULL),
(2, 'Sarah', 70000, NULL),
(3, 'David', 50000, 1),
(4, 'Emily', 55000, 2),
(5, 'Michael', 45000, 1),
(6, 'Jessica', 50000, 2),
(7, 'Ben', 35000, 3),
(8, 'Olivia', 37000, 4),
(9, 'Charles', 32000, 5),
(10, 'Sophia', 33000, 6),
(11, 'Alex', 37000, 3),
(12, 'Maya', 38000, 4),
(13, 'Daniel', 35000, 5),
(14, 'Isabella', 36000, 6),
(15, 'Ryan', 25000, 7),
(16, 'Chloe', 26000, 8),
(17, 'Noah', 24000, 9),
(18, 'Mia', 25000, 10),
(19, 'Liam', 26000, 11),
(20, 'Evelyn', 27000, 12),
(21, 'William', 25000, 13),
(22, 'Charlotte', 26000, 14),
(23, 'Ethan', 27000, 7),
(24, 'Ava', 28000, 8),
(25, 'Lucas', 26000, 9),
(26, 'Amelia', 27000, 10),
(27, 'Mason', 28000, 11),
(28, 'Harper', 29000, 12),
(29, 'Logan', 27000, 13),
(30, 'Sofia', 28000, 14);

Hierarchical Data example in SQL query

Problem statement

We want to look at it from two directions:

  • Problem 1: Start with individual employees and follow the ladder, revealing who manages them, their manager’s manager, and so on, all the way to the top.
  • Problem 2: Stand at the highest level and look down the hierarchy. For each employee, calculate the total salary of everyone under their direct or indirect management, like a salary pyramid.

When looking at direct relationships (who manages who), a simple join clause will work. But things get tricky when we include multiple levels in the hierarchy.

Recursive CTE

A recursive CTE (Common Table Expression) is a valuable feature in SQL. By referencing itself within the CTE, it allows you to query hierarchical data by repeating a process until it reaches every corner, making it an effective tool for querying and analyzing hierarchical data.

The recursive CTE syntax is not too different from non-recursive one.

Non-recursive CTE:

WITH CTE_NAME (column_1, column2, ...) AS 
(
    -- CTE_QUERY_DEFINITION (SELECT ... FROM ... WHERE)
)

Recursive CTE:

WITH CTE_NAME (column_1, column2, ...) AS
(
    -- ANCHOR_MEMBER (SELECT ... FROM ... WHERE BASE_LEVEL)
    
    UNION (ALL)
    
    -- RECURSIVE_MEMBER (SELECT ... FROM (referrence to CTE_NAME) WHERE ...)
)

The definition of a recursive CTE consists of two parts. The anchor member or the initial query, which is executed once. This defines the starting point of the execution. The recursive query has the reference to the CTE itself and is executed recursively until it returns no result. And we need UNION or UNION ALL to combine the results. You will see how it work when use it to solve real problems.

Problem 1

For each employee, we get the direct manager, the path from the highest-level manager, and the employee’s current level.

WITH company_hierarchy (EMPLOYEE_ID, NAME, MANAGER, MANAGER_PATH, EMPLOYEE_LEVEL) AS
(
    -- Base query. Select ALL employees with no manager
    SELECT
        emp.EMPLOYEE_ID
        ,emp.NAME
        ,NULL AS MANAGER
        ,'' AS MANAGER_PATH
        ,1 AS EMPLOYEE_LEVEL
    FROM EMPLOYEES emp
    WHERE emp.MANAGER_ID IS NULL
    
    UNION ALL
    
    -- Recursive query which refer to the CTE itself.
    -- Query all employee who is managed directly by previous level in company_hierarchy
    SELECT
        emp.EMPLOYEE_ID
        ,emp.NAME
        ,com.NAME AS MANAGER
        ,com.MANAGER_PATH || com.NAME || '/' AS MANAGER_PATH
        ,com.EMPLOYEE_LEVEL + 1 AS EMPLOYEE_LEVEL 
    FROM EMPLOYEES emp
    INNER JOIN company_hierarchy com
        ON emp.MANAGER_ID = com.EMPLOYEE_ID    
)

SELECT *
FROM company_hierarchy

Recursive CTE query result
EMPLOYEE_IDNAMEMANAGERMANAGER_PATHEMPLOYEE_LEVEL
1Adamnullnull1
2Sarahnullnull1
3DavidAdamAdam/2
5MichaelAdamAdam/2
4EmilySarahSarah/2
6JessicaSarahSarah/2
7BenDavidAdam/David/3
11AlexDavidAdam/David/3
9CharlesMichaelAdam/Michael/3
13DanielMichaelAdam/Michael/3
8OliviaEmilySarah/Emily/3
12MayaEmilySarah/Emily/3
10SophiaJessicaSarah/Jessica/3
14IsabellaJessicaSarah/Jessica/3
15RyanBenAdam/David/Ben/4
23EthanBenAdam/David/Ben/4
19LiamAlexAdam/David/Alex/4
27MasonAlexAdam/David/Alex/4
17NoahCharlesAdam/Michael/Charles/4
25LucasCharlesAdam/Michael/Charles/4
21WilliamDanielAdam/Michael/Daniel/4
29LoganDanielAdam/Michael/Daniel/4
16ChloeOliviaSarah/Emily/Olivia/4
24AvaOliviaSarah/Emily/Olivia/4
20EvelynMayaSarah/Emily/Maya/4
28HarperMayaSarah/Emily/Maya/4
18MiaSophiaSarah/Jessica/Sophia/4
26AmeliaSophiaSarah/Jessica/Sophia/4
22CharlotteIsabellaSarah/Jessica/Isabella/4
30SofiaIsabellaSarah/Jessica/Isabella/4

The below picture illustrates how the recursive CTE works:

Recursive CTE in SQL Flowchart

First, the anchor member or initial query was executed. company_hierarchy CTE contained 2 members with MANAGER_ID IS NULL and EMPLOYEE_LEVEL=1.

Then the recursive query was executed, joining the EMPLOYEES table with company_hierarchy to get all members managed by 2 employees currently in company_hierarchy. company_hierarchy was then set to all returned employees (EMPLOYEE_LEVEL=2).

We kept doing so until the recursive query returned nothing.

At the end, we UNION ALL all the results.

Problem 2

This is a bit more difficult. Let’s reread the problem statement and try yourself before reading my answer.

WITH company_explode (MANAGER_ID, EMPLOYEE_ID, SALARY, MANAGEMENT_DISTANCE) AS
(
    -- Base query. Select ALL employees
    SELECT 
        emp.MANAGER_ID
        ,emp.EMPLOYEE_ID
        ,emp.SALARY
        ,1 AS MANAGEMENT_DISTANCE
    FROM EMPLOYEES emp
    
    UNION ALL
    
    -- Recursive query. Select ALL managers manages the employees
    SELECT
        mgr.MANAGER_ID
        ,com.EMPLOYEE_ID
        ,com.SALARY
        ,com.MANAGEMENT_DISTANCE+1 AS MANAGEMENT_DISTANCE
    FROM EMPLOYEES mgr
    INNER JOIN company_explode com
        ON mgr.EMPLOYEE_ID = com.MANAGER_ID
)

SELECT MANAGER_ID, SUM(SALARY) AS TOTAL_SALARY_MANAGED
FROM company_explode
GROUP BY MANAGER_ID

Recursive CTE result
MANAGER_IDTOTAL_SALARY_MANAGED
null1037000
1442000
2465000
3178000
4185000
5169000
6175000
752000
854000
950000
1052000
1154000
1256000
1352000
1454000

The idea behind this solution is simple. First, explode the hierarchical structure of the data into a flattened table of all manager-employee pairs. Then calculate the sum with a simple GROUP BY clause.

START WITH … CONNECT BY …

CONNECT BY clause achieves similar functionality as a recursive Common Table Expression (CTE), but with a much shorter syntax. Let’s take a look at it in action.

Problem 1

SELECT
    emp.EMPLOYEE_ID
    ,emp.NAME
    ,mgr.NAME AS MANAGER
    ,TRIM(LEADING '/' FROM SYS_CONNECT_BY_PATH(mgr.NAME, '/')) AS MANAGER_PATH
    ,LEVEL AS EMPLOYEE_LEVEL
FROM EMPLOYEES emp
LEFT JOIN EMPLOYEES mgr
    ON emp.MANAGER_ID = mgr.EMPLOYEE_ID
START WITH emp.MANAGER_ID IS NULL
CONNECT BY PRIOR emp.EMPLOYEE_ID = emp.MANAGER_ID;

CONNECT BY query result
EMPLOYEE_IDNAMEMANAGERMANAGER_PATHEMPLOYEE_LEVEL
1Adamnullnull1
3DavidAdamAdam2
7BenDavidAdam/David3
15RyanBenAdam/David/Ben4
23EthanBenAdam/David/Ben4
11AlexDavidAdam/David3
19LiamAlexAdam/David/Alex4
27MasonAlexAdam/David/Alex4
5MichaelAdamAdam2
9CharlesMichaelAdam/Michael3
17NoahCharlesAdam/Michael/Charles4
25LucasCharlesAdam/Michael/Charles4
13DanielMichaelAdam/Michael3
21WilliamDanielAdam/Michael/Daniel4
29LoganDanielAdam/Michael/Daniel4
2Sarahnullnull1
4EmilySarahSarah2
8OliviaEmilySarah/Emily3
16ChloeOliviaSarah/Emily/Olivia4
24AvaOliviaSarah/Emily/Olivia4
12MayaEmilySarah/Emily3
20EvelynMayaSarah/Emily/Maya4
28HarperMayaSarah/Emily/Maya4
6JessicaSarahSarah2
10SophiaJessicaSarah/Jessica3
18MiaSophiaSarah/Jessica/Sophia4
26AmeliaSophiaSarah/Jessica/Sophia4
14IsabellaJessicaSarah/Jessica3
22CharlotteIsabellaSarah/Jessica/Isabella4
30SofiaIsabellaSarah/Jessica/Isabella4

Problem 2

SELECT
    emp.MANAGER_ID
    ,SUM(CONNECT_BY_ROOT emp.SALARY) AS TOTAL_SALARY_MANAGED
FROM EMPLOYEES emp
CONNECT BY PRIOR emp.MANAGER_ID = emp.EMPLOYEE_ID
GROUP BY emp.MANAGER_ID

CONNECT BY query result
MANAGER_IDTOTAL_SALARY_MANAGED
null1037000
1442000
2465000
3178000
4185000
5169000
6175000
752000
854000
950000
1052000
1154000
1256000
1352000
1454000

Like magic, isn’t it! The implementation details (what goes on behind the scenes) may be a bit different. But the way we reason about the query is the same as it is with Recursive CTEs. The START WITH clause provides the initial filter. The query is first executed with this filter, just like the anchor member in recursive CTEs. No START WITH clause means no filter is needed and the entire query is included in the first step. Then the CONNECT BY clauses specify how to connect between steps/levels in the hierarchical structure, just like recursive CTEs refer to themselves. Note that the PRIOR keyword implies that the following value is from the previous recursive step.

One of the main differences between the two approaches is how we select data. With CONNECT BY, we have to rely on built-in functions and operators to select the data we want. In the examples, we use the SYS_CONNECT_BY_PATH function to construct the path and the CONNECT_BY_ROOT operator to access the data in the first step.

My final thoughts

Querying hierarchical data presents unique challenges that require specialized techniques to extract meaningful information. Recursive CTEs and the CONNECT BY clause offer powerful solutions for navigating and analyzing hierarchical data in SQL. One interesting fact is that CONNECT BY was actually around before Recursive CTEs.

While both techniques solve the same problem, we can only use one. Which one should you use? Well, it depends on you. If you hate subqueries and CTEs, and you like cool short magic queries, use CONNECT BY. However, being less verbose makes CONNECT BY harder to reason about. It’s harder to write out that magic stuff, and harder to know why it doesn’t work. Also, by writing the recursion yourself, recursive CTEs give you more control and flexibility. And note that not all RDBMS (SQL Server for example) support the CONNECT BY clause, even though it has been around for a long time.

* You can find the execution of SQL in this post at https://dbfiddle.uk/hXEzBJkX