<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Dat⋂Engineer</title><link>https://note.datengineer.dev/</link><description>Recent content on Dat⋂Engineer</description><image><title>Dat⋂Engineer</title><url>https://note.datengineer.dev/images/cover.png</url><link>https://note.datengineer.dev/images/cover.png</link></image><generator>Hugo -- 0.147.5</generator><language>en-us</language><lastBuildDate>Wed, 18 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://note.datengineer.dev/index.xml" rel="self" type="application/rss+xml"/><item><title>Handling Concurrent Inserts: From Single Database to Distributed</title><link>https://note.datengineer.dev/posts/handling-concurrent-inserts-from-single-database-to-distributed/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/handling-concurrent-inserts-from-single-database-to-distributed/</guid><description>Description</description><content:encoded><![CDATA[<p>During a recent technical meeting, one of my colleagues raised a concern about two concurrent inserts for the same object from different processes hitting the system at once. After explaining this to him, I realized that even among experienced engineers, the inner workings of the database are often treated as a black box. There is a persistent uncertainty about what the database guarantees and what it leaves up to developers.</p>
<h2 id="problem-statement">Problem Statement</h2>
<h3 id="user-registration-flow">User Registration Flow</h3>
<p>To understand the concurrent inserts problem, the most common example is the user registration flow. When a user reaches our system, they first select a username. The system must guarantee that this username belongs to one and only one person.</p>
<p><img alt="User Registration Flow" loading="lazy" src="/posts/handling-concurrent-inserts-from-single-database-to-distributed/images/database-concurrent-insert-demo-user-registration-flow.png"></p>
<h3 id="check-then-act">Check-Then-Act</h3>
<p>After you see the requirement, you sit down to code immediately. You simply follow a linear logical path. You first check if the username exists in the database. If it exists, you return an error to the user. Otherwise, you insert a new user record into the database. With that, you have just implemented what is known as the Check-Then-Act pattern. The code usually looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Pseudo Python code</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">register</span><span class="p">(</span><span class="n">username</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Step 1: The Check</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># We query the database to see if the username is already claimed.</span>
</span></span><span class="line"><span class="cl">    <span class="n">existing_user</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&#34;SELECT id FROM users WHERE username = </span><span class="si">%s</span><span class="s2">&#34;</span><span class="p">,</span> <span class="p">(</span><span class="n">username</span><span class="p">,))</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">existing_user</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 2: The Act</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># If the result is empty, we assume the coast is clear and perform the write.</span>
</span></span><span class="line"><span class="cl">        <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&#34;INSERT INTO users (username) VALUES (</span><span class="si">%s</span><span class="s2">)&#34;</span><span class="p">,</span> <span class="p">(</span><span class="n">username</span><span class="p">,))</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Registration successful.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Error: Username already exists.&#34;</span>
</span></span></code></pre></div><p>This pattern is incredibly common because it is intuitive. It is how we think and act in real life: we check that a parking spot is empty before we pull the car in. It is how we are taught to code at school. And it mirrors the requirements exactly as they were handed to us.</p>
<h3 id="the-race-condition">The Race Condition</h3>
<p>The fatal flaw in &ldquo;Check-Then-Act&rdquo; is the tiny window of time between the &ldquo;Check&rdquo; and the &ldquo;Act&rdquo;. Imagine that two users, Alice and Bob, both try to claim the username &ldquo;user1&rdquo; at the exact same millisecond.</p>
<ul>
<li>Thread A (Alice) checks the database and sees that &ldquo;user1&rdquo; is available.</li>
<li>Thread B (Bob) checks the database a fraction of a second later, but before Thread A has committed its write. Thread B also sees that &ldquo;user1&rdquo; is available.</li>
<li>Both threads then proceed to Insert.</li>
</ul>
<p>You now have two rows in your table with the username &ldquo;user1&rdquo;. Your authentication and authorization logic could be broken.</p>
<p><img alt="User Registration Race Condition Example" loading="lazy" src="/posts/handling-concurrent-inserts-from-single-database-to-distributed/images/database-concurrent-insert-conflict-race-condition-example.png"></p>
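<p>The interleaving above can be reproduced deterministically. Below is a minimal sketch (not from the original post) that replaces the database with a plain in-memory list and uses a barrier to force both threads past the &ldquo;Check&rdquo; before either one performs the &ldquo;Act&rdquo;:</p>

```python
import threading

usernames = []                    # stands in for the users table; no locking, by design
barrier = threading.Barrier(2)    # forces both threads past the Check before either Acts

def register(username):
    exists = username in usernames    # Step 1: the Check
    barrier.wait()                    # both threads reach this point before continuing
    if not exists:
        usernames.append(username)    # Step 2: the Act

threads = [threading.Thread(target=register, args=("user1",)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(usernames)  # ['user1', 'user1'] -- both checks passed before either write
```

<p>In production the window is microseconds rather than a barrier, but the outcome is the same: both checks pass, both acts proceed, and you end up with a duplicate.</p>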
<p>The &ldquo;Check-Then-Act&rdquo; pattern is even more fragile in a production environment because a single database can hardly handle massive real-world loads. For this reason, we often distribute the load across multiple databases. In the simplest setup, all writes are routed to a leader database, and all reads are distributed among replicas. It means that the &ldquo;Check&rdquo; may interact with one database while the &ldquo;Act&rdquo; is performed on another. And the replication lag between these databases makes the race condition more likely to happen.</p>
<h2 id="how-to-insert">How to <code>INSERT</code>?</h2>
<p>To find the solution, we first need to understand one important aspect of <code>INSERT</code>. When inserting data, we usually provide values for all columns except the primary key, relying on the database to generate an auto-increment value for us. Two update statements can block each other if they touch the same primary key. But two insert statements never do, because they always receive different primary keys. Therefore, two concurrent inserts always succeed, and the result is duplication.</p>
<p>The above behavior gives us an insightful clue. No matter how many concurrent inserts hit your database, it guarantees that every new row gets a distinct ID. And the fact that the database assigns IDs in auto-increment order (e.g., 1, 2, 3, 4, 5, etc.) tells us it has a mechanism to process these IDs sequentially, even when insert requests arrive at the same time.</p>
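<p>You can observe this guarantee with any database at hand. The sketch below uses SQLite purely for illustration (the post does not name a specific engine): five threads insert concurrently, and every row still receives a distinct, sequential ID.</p>

```python
import os
import sqlite3
import tempfile
import threading

# Shared on-disk database so every thread sees the same table.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, username TEXT)")
conn.commit()
conn.close()

def insert(n):
    # The timeout makes the connection wait for the write lock instead of failing.
    c = sqlite3.connect(path, timeout=10)
    c.execute("INSERT INTO users (username) VALUES (?)", (f"user{n}",))
    c.commit()
    c.close()

threads = [threading.Thread(target=insert, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

ids = [row[0] for row in sqlite3.connect(path).execute("SELECT id FROM users ORDER BY id")]
print(ids)  # [1, 2, 3, 4, 5] -- five concurrent inserts, five distinct sequential IDs
```

<p>Which username lands on which ID depends on scheduling, but the IDs themselves are always distinct: the database serializes ID assignment internally.</p>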
<h3 id="unique-constraint">Unique Constraint</h3>
<p>The problem is not that the database cannot handle simultaneous inserts. The problem is that we must tell the database that usernames require the same strict treatment as IDs. We can solve this by simply adding a UNIQUE constraint to the username column.</p>
<p>Thanks to the unique constraint, duplication is no longer a concern. The database will ensure that, at any given time, no two users share the same username.</p>
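<p>To see the constraint doing the work for us, here is a small illustration, again using SQLite only for convenience (any engine with UNIQUE behaves the same way). The second insert of the same username is rejected by the database itself:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE)")

db.execute("INSERT INTO users (username) VALUES (?)", ("user1",))      # Alice: accepted
try:
    db.execute("INSERT INTO users (username) VALUES (?)", ("user1",))  # Bob: rejected
    duplicate_allowed = True
except sqlite3.IntegrityError as err:
    duplicate_allowed = False
    print("Rejected:", err)

row_count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(row_count)  # 1 -- the constraint, not application code, enforced uniqueness
```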
<h3 id="just-insert">Just <code>INSERT</code></h3>
<p>Once the UNIQUE constraint has been set up, you can insert data without hesitation. We no longer need existence checks beforehand. We abandon the &ldquo;Check-Then-Act&rdquo; pattern entirely. We just insert.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Pseudo Python code</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">register</span><span class="p">(</span><span class="n">username</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Just try to insert. </span>
</span></span><span class="line"><span class="cl">        <span class="c1"># The database is the only one who knows the truth.</span>
</span></span><span class="line"><span class="cl">        <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&#34;INSERT INTO users (username) VALUES (</span><span class="si">%s</span><span class="s2">)&#34;</span><span class="p">,</span> <span class="p">(</span><span class="n">username</span><span class="p">,))</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Success&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="n">UniqueViolationError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># The database rejected us. </span>
</span></span><span class="line"><span class="cl">        <span class="c1"># This is a valid business logic outcome, not a system crash.</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Error: Username already exists.&#34;</span>
</span></span></code></pre></div><p>The database will handle our concurrent inserts as if they were executed sequentially. Suppose Alice and Bob both try to register the username &ldquo;user1&rdquo; and their requests arrive at the same time. One request will be executed first and succeed. The other will wait, then violate the unique constraint and fail. We catch that failure and prompt the user to choose another username.</p>
<h2 id="multi-leader">Multi-Leader</h2>
<p>The &ldquo;Just Insert&rdquo; approach is clean. It works perfectly in a single-leader setup. So, we ship it and life is good.</p>
<p>Then, the product grows.</p>
<h3 id="why-multi-leader">Why Multi-Leader?</h3>
<p>A single leader database can only handle a certain number of writes per second. When users register, post, and update their profiles all at once, the database becomes the bottleneck. You also start thinking about geography. A user in Tokyo shouldn&rsquo;t have to wait for a round trip to a server in London just to register. Latency adds up. Users leave.</p>
<p>So, you scale out. You promote multiple nodes to accept writes, giving each region its own leader database. Writes go to the nearest leader, and the leaders synchronize with each other in the background. This is called multi-leader replication.</p>
<p>Reads are fast. Writes are fast. Everything feels great. Until you think about usernames again.</p>
<h3 id="why-unique-not-work">Why Doesn&rsquo;t UNIQUE Work?</h3>
<p>The UNIQUE guarantee is only valid when there is a single database. In a multi-leader setup, Leader A in Tokyo and Leader B in London both accept writes independently. They do not talk to each other before committing. Alice registers &ldquo;user1&rdquo; on Leader A. Bob registers &ldquo;user1&rdquo; on Leader B. Both leaders check their own local data. Both see no conflict. Both succeed. Both return a success response to their respective users.</p>
<p>Now both leaders have a row for &ldquo;user1&rdquo;. The UNIQUE constraint on each node was never violated locally. The violation only becomes visible when the leaders later sync with each other.</p>
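<p>A toy simulation makes the failure mode concrete. Here each leader is modeled as a plain Python set with its own local uniqueness check, which is of course a heavy simplification of real replication:</p>

```python
# Two leaders, each enforcing UNIQUE against its own local data only.
leader_a, leader_b = set(), set()   # Tokyo, London

def register(leader, username):
    if username in leader:          # local check: this leader's data only
        return "Error: Username already exists."
    leader.add(username)
    return "Success"

print(register(leader_a, "user1"))  # Alice in Tokyo  -> Success
print(register(leader_b, "user1"))  # Bob in London   -> Success

# Background replication: only now does the duplicate become visible.
conflicts = leader_a & leader_b
print(conflicts)  # {'user1'}
```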
<h3 id="solutions">Solutions</h3>
<p>There is no magic answer here. Every solution is a trade-off between consistency, availability, and complexity. You have to pick the one that works best for you.</p>
<p>The simplest method is optimistic conflict resolution. You allow conflicts. No locks. No coordination. No waiting. Every INSERT succeeds instantly, and you adjust the business rules to resolve conflicts after the fact.</p>
<p>The most common strategy is Last-Write-Wins. For the user registration problem, you might prefer First-Write-Wins instead. The idea is simple: you keep one write and discard the others. You can also keep them all, but with an additional discriminator to enforce uniqueness. Discord used to take this approach: they appended a 4-digit numeric discriminator to every username, so that, in theory, 10,000 users could share the username &ldquo;user1&rdquo;, ranging from &ldquo;user1#0000&rdquo; to &ldquo;user1#9999&rdquo;.</p>
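<p>One possible discriminator-assignment scheme can be sketched as follows. The <code>assign_discriminator</code> helper is hypothetical, and Discord&rsquo;s actual assignment logic was not necessarily first-free; this only illustrates the shape of the idea:</p>

```python
def assign_discriminator(existing, username):
    """Return 'username#NNNN', picking the first free 4-digit discriminator."""
    taken = {d for (u, d) in existing if u == username}
    for d in range(10_000):
        if d not in taken:
            existing.add((username, d))
            return f"{username}#{d:04d}"
    raise ValueError("all 10,000 discriminators for this username are taken")

existing = set()
print(assign_discriminator(existing, "user1"))  # user1#0000
print(assign_discriminator(existing, "user1"))  # user1#0001
```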
<p>The fundamental problem with this approach is the user experience. In most cases, such as when a user updates their profile image from both their laptop and phone at the same time, it is safe to keep the latest write. It is not that easy with uniqueness-sensitive data such as usernames. You definitely do not want to tell a user that they have successfully claimed a username, and then, the next time they log in, apologize because someone else had taken that name first. The discriminator is a poor user experience too: no one remembers a random number attached to their name. Discord eventually decided to move away from that approach.</p>
<p>Another method is to stop fighting the multi-leader setup and route uniqueness-sensitive data to a dedicated service backed by a single-leader database. You keep everything else in the multi-leader system for performance, but send user registration to the single-leader database. Your main database stays distributed and fast. Only the username is centralized, because it is the only part that genuinely needs to be.</p>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>The moment you start treating the database as a partner rather than a black box, a lot of problems either disappear entirely or become much easier to reason about. You stop being surprised by behaviors that are actually well-documented. You start asking better questions. Understanding your tools in depth is especially important in the era of AI. AI can hand you code that looks correct, but deciding whether that code is actually correct is still your job.</p>
]]></content:encoded></item><item><title>Learn the basics in depth</title><link>https://note.datengineer.dev/posts/learn-the-basics-in-depth/</link><pubDate>Thu, 22 May 2025 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/learn-the-basics-in-depth/</guid><description>Going deep on the basics makes you a stronger, better engineer - and helps you stay relevant in the AI era.</description><content:encoded><![CDATA[<p>I often receive questions from aspiring data engineers. Some are fresh grads, others are switching from software or analytics roles. And the same question comes up again and again:</p>
<blockquote>
<p>What tool should I learn in depth?</p></blockquote>
<p>I understand why people keep asking this question. The tech world moves extremely fast. Every few months, there is a new framework, a new orchestration tool, a shiny feature in a cloud, or some article filled with buzzwords that makes you feel like you are already behind. The pressure to keep up is real. But here is something I have learned over the years, and I want to say it clearly: <em><strong>focus on learning the basics in depth.</strong></em> Tools will change; the zeros and ones will not.</p>
<p>Years ago, everyone was talking about HDFS, MapReduce, Pig, Hive. Fast-forward a few years, Spark took over. Then cloud-native pipelines. Now we have got real-time streaming, feature stores, vector databases, and AI-generated pipelines. If I had spent all my time chasing tools, I would be exhausted. Always playing catch-up - and probably still behind.</p>
<p>Instead, I focused on understanding how data actually works: how it moves; how it is stored and transformed; how to model it for clarity, flexibility, and performance; how SQL engines work under the hood; how replication and partitioning affect performance; how to keep a data schema clean and extensible. Those things have not changed. These lessons apply whether you are working with BigQuery or Snowflake, self-hosted DBT or Azure Data Factory in the cloud, or something that does not even exist yet.</p>
<p>Data Engineering is developing in a direction where engineers write less and less code. Unlike several years ago, a lot of tools are now publicly available. For a new project, you are handed a comprehensive toolbox, and your primary task becomes picking the right tools and making them work seamlessly together. Writing less code, however, makes it harder for engineers to understand the technology behind the scenes. I began my technology journey writing a lot of code in Pascal, my favorite programming language, and I still write some in my free time. That coding experience helps me a lot: when I work with a tool today, I can often imagine the actual code running on the machine.</p>
<p>Yes, you are not required to be able to write the tools from scratch. But you are required to understand the core technology behind them. Because even if you are just picking items from your toolbox, you still need to pick the right ones. And picking the right tools for the right jobs remains a highly advanced skill in our industry today.</p>
<p>Here is another undeniable truth we cannot ignore: AI is becoming incredibly good at repeating what it has learned, and it is getting more and more involved in our daily work. Honestly, I use it a lot to help me develop. This fact actually highlights the advantage of understanding the basics. If AI is not good enough, people need you because you understand the underlying concepts and can operate the AI reliably. If AI is already good enough, why do they still need you? Because you are better than AI at understanding the things it does not know - the true &ldquo;why&rdquo; behind the &ldquo;what&rdquo;.</p>
<p>When you understand the basics, tools become just syntax. You can pick up new ones quickly. You can even build your own if you want to. Investors care about one thing: the value you created, not the tools you used. They want to know: Did you help the business make better decisions? Did you save money? Did you unlock insights faster? Those results do not come from stacking the flashiest tools, but from engineers who know what matters and make the right decisions. That is what turns you from someone who follows instructions into someone who builds solutions.</p>
<p>And in this field, that is the difference between being useful and being indispensable.</p>
]]></content:encoded></item><item><title>Introducing Dat⋂nalytics - My new home to share insights</title><link>https://note.datengineer.dev/posts/introducing-analytics-datengineer/</link><pubDate>Wed, 26 Feb 2025 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/introducing-analytics-datengineer/</guid><description>Discover Dat⋂nalytics, a new platform where I share interactive Power BI reports and data-driven insights.</description><content:encoded><![CDATA[<h2 id="a-new-chapter">A new chapter</h2>
<p>Last year, I launched <a href="https://note.datengineer.dev/">Dat⋂Engineer</a> as a place where I share my thoughts, experiences, and lessons learned as a data engineer. From designing to coding, I have written about the craft of data engineering with passion.</p>
<p>This year, I wanted to go beyond words. I wanted to showcase data - to let data and visualizations tell their own stories. That is why I created <a href="https://analytics.datengineer.dev/">Dat⋂nalytics</a>, a new site for sharing interactive Power BI reports and data-driven insights.</p>
<h2 id="the-motivation">The Motivation</h2>
<p>After years of working as an engineer, I built a personal IT infrastructure (websites, emails, networks,&hellip;) to experiment with, learn from, and benefit from. I maintain my own &ldquo;small&rdquo; data system. Because I am the developer, the user, and the investor at the same time, I have to keep it small and efficient. My &ldquo;small&rdquo; data system helps me answer my own questions. I strongly believe that decisions should be backed by data, whether you are an individual or an organization.</p>
<p>This year, during the Tet holiday in Vietnam, I had the opportunity to reconnect with friends. In one of our discussions, I showed a friend one of my reports. He was genuinely excited about it and asked me to share it with him so that he could use it for his own decisions. At that moment, I realized that my data could benefit others, not just myself.</p>
<p>That is the story that led to the creation of <a href="https://analytics.datengineer.dev/">Dat⋂nalytics</a> - a place where I can openly share interactive reports, making data more accessible and engaging for everyone.</p>
<h2 id="development">Development</h2>
<p>I am neither a designer nor a web developer. The last time I wrote some basic HTML and CSS was over ten years ago, so creating a stunning, visually rich website was beyond my capabilities. However, I still wanted a functional and elegant way to share my reports.</p>
<p>I decided to use Hugo, a fast and flexible static site generator. With my basic knowledge of HTML and CSS, I was still able to manage and customize it to fit my needs. I used the Quint theme as inspiration. However, Quint did not meet my needs out of the box, so I had to tweak it a lot.</p>
<p>My reports are hosted on Power BI, and I embed them directly on my site. Users still get a seamless and interactive experience, as if they were viewing the reports directly on the Power BI service.</p>
<p>The result? A clean, lightweight site that seamlessly integrates Power BI reports while keeping performance in check.</p>
<p><img alt="An overview of Dat⋂nalytics" loading="lazy" src="/posts/introducing-analytics-datengineer/images/datanalytics-screenshot.png"></p>
<h2 id="join-me">Join me</h2>
<p>I am excited to share this next step in my journey with you. Currently, there is only one report, but more are definitely coming. Check out <a href="https://analytics.datengineer.dev/">Dat⋂nalytics</a>. If you have feedback, ideas, or feature requests, don&rsquo;t hesitate to contact me. They will help shape this site into something even better.</p>
]]></content:encoded></item><item><title>PIVOT and Dynamic PIVOT in SQL - Advanced SQL for analytics</title><link>https://note.datengineer.dev/posts/pivot-and-dynamic-pivot-in-sql-advanced-sql-for-analytics/</link><pubDate>Sun, 19 Jan 2025 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/pivot-and-dynamic-pivot-in-sql-advanced-sql-for-analytics/</guid><description>Explore advanced SQL techniques, static and dynamic PIVOT, to transform and analyze data beyond the basic SELECT FROM WHERE queries. Learn how to apply them in real-world analytics.</description><content:encoded><![CDATA[<p>As a data engineer, a typical working day for me, apart from meetings, is full of <code>SELECT</code>, <code>FROM</code> and <code>WHERE</code>. But these basic statements are not enough, especially for the complex ad hoc analysis that is increasingly common nowadays.</p>
<p>SQL is a powerful language. It is a declarative language where we define what we want and the engine finds a way to achieve it. The language is evolving to adapt to the increasing variety of analysis needs. I wrote an article about an <a href="../recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data">advanced SQL feature to deal with hierarchical data</a>. And today, let&rsquo;s explore another beyond-the-basic feature: PIVOT.</p>
<h2 id="problem-statement">Problem Statement</h2>
<p>Imagine you are working as a data engineer for a retail company. The company wants to analyze product sales data to identify trends and opportunities for growth. The data is stored in a table called <code>Sales</code> with the following structure:</p>
<table>
  <thead>
      <tr>
          <th>ProductID</th>
          <th>Date</th>
          <th>Amount</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>101</td>
          <td>2024-01-10</td>
          <td>300</td>
      </tr>
      <tr>
          <td>101</td>
          <td>2024-12-15</td>
          <td>500</td>
      </tr>
      <tr>
          <td>101</td>
          <td>2025-01-15</td>
          <td>700</td>
      </tr>
      <tr>
          <td>101</td>
          <td>2025-02-01</td>
          <td>1100</td>
      </tr>
      <tr>
          <td>102</td>
          <td>2024-02-20</td>
          <td>800</td>
      </tr>
      <tr>
          <td>102</td>
          <td>2024-11-03</td>
          <td>400</td>
      </tr>
      <tr>
          <td>102</td>
          <td>2025-01-20</td>
          <td>900</td>
      </tr>
      <tr>
          <td>102</td>
          <td>2025-02-22</td>
          <td>650</td>
      </tr>
      <tr>
          <td>103</td>
          <td>2023-07-25</td>
          <td>1200</td>
      </tr>
      <tr>
          <td>103</td>
          <td>2024-08-15</td>
          <td>1500</td>
      </tr>
      <tr>
          <td>103</td>
          <td>2025-02-10</td>
          <td>1250</td>
      </tr>
      <tr>
          <td>104</td>
          <td>2023-12-05</td>
          <td>400</td>
      </tr>
      <tr>
          <td>104</td>
          <td>2024-06-30</td>
          <td>800</td>
      </tr>
      <tr>
          <td>104</td>
          <td>2025-01-30</td>
          <td>300</td>
      </tr>
      <tr>
          <td>104</td>
          <td>2025-02-25</td>
          <td>500</td>
      </tr>
  </tbody>
</table>
<p>This structure is not good for reports. The company wants this data served in a format where years are represented as columns for easier comparison across products.</p>
<h2 id="group-by---the-amateur-way">GROUP BY - The Amateur Way</h2>
<p>A very straightforward approach to this problem is to use the <code>GROUP BY</code> statement. We group by <code>ProductID</code> and compute a sum column for each year. Below is a SQL Server example; other SQL engines have similar syntax.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">select</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">ProductID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">sum</span><span class="p">(</span><span class="n">iif</span><span class="p">(</span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="o">=</span><span class="mi">2023</span><span class="p">,</span><span class="w"> </span><span class="n">Amount</span><span class="p">,</span><span class="w"> </span><span class="k">null</span><span class="p">))</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="p">[</span><span class="mi">2023</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">sum</span><span class="p">(</span><span class="n">iif</span><span class="p">(</span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="o">=</span><span class="mi">2024</span><span class="p">,</span><span class="w"> </span><span class="n">Amount</span><span class="p">,</span><span class="w"> </span><span class="k">null</span><span class="p">))</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">sum</span><span class="p">(</span><span class="n">iif</span><span class="p">(</span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="o">=</span><span class="mi">2025</span><span class="p">,</span><span class="w"> </span><span class="n">Amount</span><span class="p">,</span><span class="w"> </span><span class="k">null</span><span class="p">))</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="p">[</span><span class="mi">2025</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">from</span><span class="w"> </span><span class="n">Sales</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">group</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="n">ProductID</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>GROUP BY query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">ProductID</th>
          <th style="text-align: right">2023</th>
          <th style="text-align: right">2024</th>
          <th style="text-align: right">2025</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">101</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">1800</td>
      </tr>
      <tr>
          <td style="text-align: right">102</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1550</td>
      </tr>
      <tr>
          <td style="text-align: right">103</td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1500</td>
          <td style="text-align: right">1250</td>
      </tr>
      <tr>
          <td style="text-align: right">104</td>
          <td style="text-align: right">400</td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">800</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>At first glance, it&rsquo;s simple, and it works. Sometimes just working is enough.</p>
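<p>If you want to verify the result table above yourself, the standalone snippet below rebuilds the <code>Sales</code> table in SQLite (chosen only because it needs no setup) and runs the equivalent conditional aggregation. SQLite has no <code>year()</code> function, so <code>strftime</code> and <code>CASE WHEN</code> stand in for <code>year()</code> and <code>iif()</code>:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Sales (ProductID INTEGER, Date TEXT, Amount INTEGER)")
db.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    (101, "2024-01-10", 300), (101, "2024-12-15", 500),
    (101, "2025-01-15", 700), (101, "2025-02-01", 1100),
    (102, "2024-02-20", 800), (102, "2024-11-03", 400),
    (102, "2025-01-20", 900), (102, "2025-02-22", 650),
    (103, "2023-07-25", 1200), (103, "2024-08-15", 1500),
    (103, "2025-02-10", 1250), (104, "2023-12-05", 400),
    (104, "2024-06-30", 800), (104, "2025-01-30", 300),
    (104, "2025-02-25", 500),
])

# CASE WHEN + strftime replace SQL Server's iif() and year().
query = """
select
    ProductID
    ,sum(case when strftime('%Y', Date) = '2023' then Amount end) as "2023"
    ,sum(case when strftime('%Y', Date) = '2024' then Amount end) as "2024"
    ,sum(case when strftime('%Y', Date) = '2025' then Amount end) as "2025"
from Sales
group by ProductID
order by ProductID
"""
result = list(db.execute(query))
for row in result:
    print(row)
# (101, None, 800, 1800)
# (102, None, 1200, 1550)
# (103, 1200, 1500, 1250)
# (104, 400, 800, 800)
```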
<h2 id="sql-pivot---the-complex-way">SQL PIVOT - The Complex Way</h2>
<p><code>PIVOT</code> is an operator in SQL that transforms rows into columns. This transformation is particularly useful when summarizing data and creating a more interpretable format for analysis. The SQL Server query below achieves the same result:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">select</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">from</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">select</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">ProductID</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">Amount</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">from</span><span class="w"> </span><span class="n">Sales</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">ToPivotSales</span><span class="w"> </span><span class="n">pivot</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">sum</span><span class="p">(</span><span class="n">Amount</span><span class="p">)</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="k">Year</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="p">([</span><span class="mi">2023</span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="mi">2025</span><span class="p">])</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">PivotedSales</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>PIVOT query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">ProductID</th>
          <th style="text-align: right">2023</th>
          <th style="text-align: right">2024</th>
          <th style="text-align: right">2025</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">101</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">1800</td>
      </tr>
      <tr>
          <td style="text-align: right">102</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1550</td>
      </tr>
      <tr>
          <td style="text-align: right">103</td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1500</td>
          <td style="text-align: right">1250</td>
      </tr>
      <tr>
          <td style="text-align: right">104</td>
          <td style="text-align: right">400</td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">800</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>To use <code>PIVOT</code>, we first need a subquery that selects the columns involved: in this case <code>ProductID</code>, <code>Year</code>, and <code>Amount</code>. If you don&rsquo;t like subqueries, a CTE (Common Table Expression) works as well.</p>
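<p>If you prefer the CTE form, the same query can be sketched like this (a sketch; table and column names follow the examples above):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">with ToPivotSales as (
    select ProductID, year(Date) as Year, Amount
    from Sales
)
select *
from ToPivotSales pivot (
    sum(Amount) for Year in ([2023], [2024], [2025])
) as PivotedSales;
</code></pre></div>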
<p>In the <code>PIVOT</code> declaration:</p>
<ul>
<li><code>sum(Amount)</code>: specifies that the aggregation function to be applied is <code>sum</code>, which will sum the <code>Amount</code> values.</li>
<li><code>for Year in ([2023], [2024], [2025])</code>: defines how the pivoting will occur:
<ul>
<li><code>for Year</code>: indicates that the values in the <code>Year</code> column will be used to create new columns in the result set.</li>
<li><code>in ([2023], [2024], [2025])</code>: specifies the specific years that will become the new columns in the result. Each of these years will have a corresponding column that contains the summed Amount for that year.</li>
</ul>
</li>
</ul>
<p>To be honest, I don&rsquo;t like the syntax of <code>PIVOT</code>. It requires a subquery or CTE, adds levels of indentation, and is harder to scan. And if we look at the execution plan, it is not really faster than <code>GROUP BY</code>.</p>
<p>That said, <code>PIVOT</code> has one advantage over <code>GROUP BY</code>: it requires less boilerplate. In the examples above, adding a new year to the <code>GROUP BY</code> query means copying, pasting, and editing in two places. With <code>PIVOT</code>, all you have to do is add a new value to the list. This makes <code>PIVOT</code> shine when dealing with a long list of values.</p>
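<p>To make the comparison concrete, here is a sketch of the conditional-aggregation <code>GROUP BY</code> form, consistent with the result tables above. Adding 2026 means editing both the <code>case</code> condition and the column alias, while <code>PIVOT</code> only needs one new entry in its value list:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">select
    ProductID,
    -- each new year requires edits in two places: the condition and the alias
    sum(case when year(Date) = 2023 then Amount end) as [2023],
    sum(case when year(Date) = 2024 then Amount end) as [2024],
    sum(case when year(Date) = 2025 then Amount end) as [2025]
from Sales
group by ProductID;
</code></pre></div>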
<h2 id="dynamic-pivot---the-hacker-way">Dynamic PIVOT - The Hacker Way</h2>
<p>While both <code>GROUP BY</code> and <code>PIVOT</code> are useful, they share a limitation: you must explicitly list the pivot values. With small, stable data this is fine, but it becomes a problem with large, evolving data where you do not know, or do not want to maintain, the full list of values. Imagine building a report of annual sales; you clearly don&rsquo;t want to update the query every year.</p>
<p>Dynamic PIVOT is a technique that lets us pivot data without hard-coding the pivot column values. It is not a standard SQL operation, so the syntax varies between SQL engines. In Snowflake SQL, you can achieve dynamic pivoting with something as simple as this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">pivot</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">sum</span><span class="p">(</span><span class="n">Amount</span><span class="p">)</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="k">Year</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="p">(</span><span class="k">any</span><span class="w"> </span><span class="k">order</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="k">Year</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>Most other SQL engines don&rsquo;t offer this level of simplicity and require a bit more work. Below is a SQL Server example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">declare</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">@</span><span class="n">cols</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">nvarchar</span><span class="p">(</span><span class="k">max</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">@</span><span class="n">query</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">nvarchar</span><span class="p">(</span><span class="k">max</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">select</span><span class="w"> </span><span class="o">@</span><span class="n">cols</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">string_agg</span><span class="p">(</span><span class="n">quotename</span><span class="p">(</span><span class="k">Year</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;,&#39;</span><span class="p">)</span><span class="w"> </span><span class="n">within</span><span class="w"> </span><span class="k">group</span><span class="w"> </span><span class="p">(</span><span class="k">order</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="k">Year</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">from</span><span class="w"> </span><span class="p">(</span><span class="k">select</span><span class="w"> </span><span class="k">distinct</span><span class="w"> </span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="k">Year</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">Sales</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">YearList</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">set</span><span class="w"> </span><span class="o">@</span><span class="n">query</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s1">&#39;select *
</span></span></span><span class="line"><span class="cl"><span class="s1">    from (
</span></span></span><span class="line"><span class="cl"><span class="s1">        select ProductID, year(Date) as Year, Amount
</span></span></span><span class="line"><span class="cl"><span class="s1">        from Sales
</span></span></span><span class="line"><span class="cl"><span class="s1">    ) as ToPivotSales pivot (
</span></span></span><span class="line"><span class="cl"><span class="s1">        sum(Amount) for Year in (&#39;</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="o">@</span><span class="n">cols</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s1">&#39;)
</span></span></span><span class="line"><span class="cl"><span class="s1">    ) as PivotedSales;&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">exec</span><span class="w"> </span><span class="n">sp_executesql</span><span class="w"> </span><span class="o">@</span><span class="n">query</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>Dynamic PIVOT query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">ProductID</th>
          <th style="text-align: right">2023</th>
          <th style="text-align: right">2024</th>
          <th style="text-align: right">2025</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">101</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">1800</td>
      </tr>
      <tr>
          <td style="text-align: right">102</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1550</td>
      </tr>
      <tr>
          <td style="text-align: right">103</td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1500</td>
          <td style="text-align: right">1250</td>
      </tr>
      <tr>
          <td style="text-align: right">104</td>
          <td style="text-align: right">400</td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">800</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>Although the code is not pleasing to the eye, the concept is simple: an extra step collects all the unique values into a variable, we build the query string around that list, and then we execute it to get the expected result.</p>
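<p>For the sample data above, the intermediate steps would resolve roughly as follows (illustrative; the actual values depend on what is in the <code>Sales</code> table):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- @cols after the string_agg step:
--   [2023],[2024],[2025]

-- the statement handed to sp_executesql:
select *
from (
    select ProductID, year(Date) as Year, Amount
    from Sales
) as ToPivotSales pivot (
    sum(Amount) for Year in ([2023],[2024],[2025])
) as PivotedSales;
</code></pre></div>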
<h2 id="final-thought">Final Thought</h2>
<p>Mastering SQL is not easy. SQL is more than just a query language; used effectively, it can turn raw data into actionable intelligence. Advanced techniques such as <code>PIVOT</code> are not very common in day-to-day work, but we should know what is in our toolbox so we can quickly reach for the right tool when the job calls for it.</p>
<p><em>* You can find the executable versions of the SQL examples in this post at <a href="https://dbfiddle.uk/2jD1lHkL" rel="nofollow">https://dbfiddle.uk/2jD1lHkL</a>.</em></p>
]]></content:encoded></item><item><title>How to create Azure DevOps Pull Requests reporting with Power BI</title><link>https://note.datengineer.dev/posts/how-to-create-azure-devops-pull-requests-reporting-with-power-bi/</link><pubDate>Sun, 18 Aug 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/how-to-create-azure-devops-pull-requests-reporting-with-power-bi/</guid><description>Gain insights from your Azure DevOps data with this step-by-step guide to building a comprehensive pull request report using Power BI.</description><content:encoded><![CDATA[<p>As a developer, I have always emphasized the importance of code quality and efficient development processes. Modern Git workflows are typically about writing code, commits, pull requests, code reviews, and merges. To gain deeper insight into these processes, I decided to create a Power BI report to track them. My goal is to identify bottlenecks, areas for improvement, and opportunities to streamline our workflow.</p>
<h2 id="pre-requisites">Pre-requisites</h2>
<p>Before we dive into building the Power BI report, you need Power BI Desktop installed, as well as a Personal Access Token with sufficient access to the project repositories. You will use the token to authenticate the API calls from Power BI.</p>
<h2 id="parameters">Parameters</h2>
<p>To make the report work with different settings, we will use parameters. These parameters allow you to easily apply my code to your project. Just copy the code and edit the following parameters:</p>
<ul>
<li><code>_organization</code>: The Azure DevOps organization</li>
<li><code>_project</code>: Your project. The report will retrieve pull requests from all repositories in the project.</li>
<li><code>_top</code>: The number of most recent pull requests you want to analyze in the report.</li>
</ul>
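<p>If you would rather see them in code than in the Power Query UI, the parameters are simple text values. The values below are hypothetical placeholders, not real settings:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">// hypothetical placeholder values -- replace with your own
_organization = "my-org",
_project = "my-project",
_top = "200"
</code></pre></div>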
<h2 id="build-the-power-query">Build the Power Query</h2>
<h3 id="fetch-data-from-azure-devops">Fetch data from Azure DevOps</h3>
<p>Now that you have set up your Power BI report with parameters and prepared the necessary credentials, it&rsquo;s time to pull data from Azure DevOps. While Power BI has a built-in Azure DevOps connector, it only provides board data. To retrieve pull request information, we will need to access the <a href="https://learn.microsoft.com/en-us/rest/api/azure/devops/git/pull-requests?view=azure-devops-rest-7.1" rel="nofollow">Azure DevOps REST APIs</a>
.</p>
<p>See the following Power BI M query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">Source</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Json</span><span class="p">.</span><span class="n">Document</span><span class="p">(</span><span class="n">Web</span><span class="p">.</span><span class="n">Contents</span><span class="p">(</span><span class="s2">&#34;https://dev.azure.com/&#34;</span><span class="o">&amp;</span><span class="n">_organization</span><span class="o">&amp;</span><span class="s2">&#34;/&#34;</span><span class="o">&amp;</span><span class="n">_project</span><span class="o">&amp;</span><span class="s2">&#34;/_apis/git/pullrequests?searchCriteria.includeLinks=true&amp;searchCriteria.status=all&amp;$top=&#34;</span><span class="o">&amp;</span><span class="n">_top</span><span class="o">&amp;</span><span class="s2">&#34;&amp;api-version=7.1-preview.1&#34;</span><span class="p">)),</span><span class="w">
</span></span></span></code></pre></div><p>The <code>Web.Contents</code> function pulls data from the REST API and returns a <code>binary</code>. The <code>Json.Document</code> function parses that binary as JSON. After this step, <code>Source</code> is a <code>record</code> with two attributes:</p>
<ul>
<li><code>value</code>: a list of all pull request records.</li>
<li><code>count</code>: the length of the <code>value</code> list.</li>
</ul>
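<p>As a quick sanity check, you can reference these attributes with M record field access (field names as returned by the API):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">// field access on the Source record
prList  = Source[value],  // list of pull request records
prCount = Source[count]   // length of that list
</code></pre></div>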
<h3 id="convert-to-table">Convert to Table</h3>
<p>Our previous step resulted in a JSON record containing the pull request data. To make this data available for further analysis, we need to convert it to a table.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Converted to Table&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">FromRecords</span><span class="p">(</span><span class="err">{</span><span class="k">Source</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><p>The above query converts <code>value</code> to a table in Power BI. The returned table has only one column and one row, like below:</p>
<table>
  <thead>
      <tr>
          <th>value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>List</td>
      </tr>
  </tbody>
</table>
<p>To make the table usable, we need to transform it further. First, we explode the list into rows:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Expanded value&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandListColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Converted to Table&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value&#34;</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><p>Then, for each row, we expand the record into columns. We don&rsquo;t necessarily need all of them; the M query below extracts only the columns we need.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Expanded value1&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">#</span><span class="s2">&#34;Expanded value&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s2">&#34;value&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">{</span><span class="s2">&#34;repository&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;pullRequestId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;codeReviewId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;status&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;createdBy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;creationDate&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;closedDate&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;title&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;description&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;sourceRefName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;targetRefName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;mergeStatus&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;isDraft&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;mergeId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;reviewers&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;labels&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;url&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;completionOptions&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;supportsIterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;completionQueueTime&#34;</span><span class="err">}</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">{</span><span class="s2">&#34;value.repository&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.pullRequestId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.codeReviewId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.status&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.createdBy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.creationDate&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.closedDate&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.title&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.description&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.sourceRefName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.targetRefName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.mergeStatus&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.isDraft&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.mergeId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.reviewers&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.labels&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.url&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.completionOptions&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.supportsIterations&#34;</span><span class="p">,</span><span class="w"> </span><span 
class="s2">&#34;value.completionQueueTime&#34;</span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h3 id="continue-expanding-columns">Continue expanding columns</h3>
<p>Even though the previous steps gave us a solid starting point, some columns still have nested records full of useful data. We will perform additional expansions to access this data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="cm">/*
</span></span></span><span class="line"><span class="cl"><span class="cm">value.repository, value.createdBy, value.completionOptions are records, we can expand them into columns
</span></span></span><span class="line"><span class="cl"><span class="cm">*/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded value.repository&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value1&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.repository&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;name&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value.repository.name&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded value.createdBy&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value.repository&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.createdBy&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;displayName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;id&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;uniqueName&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value.createdBy.displayName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.createdBy.id&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.createdBy.uniqueName&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded value.completionOptions&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value.createdBy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.completionOptions&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;mergeCommitMessage&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;mergeStrategy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;transitionWorkItems&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value.completionOptions.mergeCommitMessage&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.completionOptions.mergeStrategy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.completionOptions.transitionWorkItems&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="cm">/*
</span></span></span><span class="line"><span class="cl"><span class="cm">value.reviewers, on the other hand, is a list of records. For each list, we concatenate the displayName of each record
</span></span></span><span class="line"><span class="cl"><span class="cm">*/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded value.reviewers&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">TransformColumns</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value.completionOptions&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{{</span><span class="s2">&#34;value.reviewers&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">Combiner</span><span class="p">.</span><span class="n">CombineTextByDelimiter</span><span class="p">(</span><span class="s2">&#34;, &#34;</span><span class="p">)(</span><span class="n">List</span><span class="p">.</span><span class="k">Transform</span><span class="p">(</span><span class="n">_</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="p">[</span><span class="n">displayName</span><span class="p">]))</span><span class="err">}}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h3 id="add-details-from-other-apis">Add details from other APIs</h3>
<p>While the pull request endpoint provides a lot of useful information, it might not be enough. We often need to supplement our data with information from other Azure DevOps APIs to gain deeper insights. The process is similar to what we have done so far: pull data from an API and expand the JSON objects.</p>
<h4 id="iterations">Iterations</h4>
<p>Iterations are created as a result of creating and pushing updates to a pull request; the number of iterations equals the number of updates made after the pull request is created. Below is the Power BI M query to get the number of iterations for each pull request:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added iterations&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value.reviewers&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">Json</span><span class="p">.</span><span class="n">Document</span><span class="p">(</span><span class="n">Web</span><span class="p">.</span><span class="n">Contents</span><span class="p">([</span><span class="n">value</span><span class="p">.</span><span class="n">url</span><span class="p">]</span><span class="o">&amp;</span><span class="s2">&#34;/iterations/&#34;</span><span class="p">))),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded iterations&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Added iterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;count&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;iterations.count&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h4 id="changes">Changes</h4>
<p>Another good metric to track is the number of files changed in each pull request. And we need to have the changes in all iterations, not just the initial pull request. Below is the code to retrieve the data from the API and extract the required information.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added iterations.changes&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded iterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations.changes&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">Json</span><span class="p">.</span><span class="n">Document</span><span class="p">(</span><span class="n">Web</span><span class="p">.</span><span class="n">Contents</span><span class="p">([</span><span class="n">value</span><span class="p">.</span><span class="n">url</span><span class="p">]</span><span class="o">&amp;</span><span class="s2">&#34;/iterations/&#34;</span><span class="o">&amp;</span><span class="nb">Number</span><span class="p">.</span><span class="n">ToText</span><span class="p">([</span><span class="n">iterations</span><span class="p">.</span><span class="k">count</span><span class="p">])</span><span class="o">&amp;</span><span class="s2">&#34;/changes?api-version=7.1-preview.1&#34;</span><span class="p">))),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded iterations.changes&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Added iterations.changes&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations.changes&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;changeEntries&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;iterations.changes.changeEntries&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Added iterations.changes.changeEntries.count&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded iterations.changes&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations.changes.changeEntries.count&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">List</span><span class="p">.</span><span class="k">Count</span><span class="p">([</span><span class="n">iterations</span><span class="p">.</span><span class="n">changes</span><span class="p">.</span><span class="n">changeEntries</span><span class="p">])),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Removed iterations.changes.changeEntries&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">RemoveColumns</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Added iterations.changes.changeEntries.count&#34;</span><span class="p">,</span><span class="err">{</span><span class="s2">&#34;iterations.changes.changeEntries&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h4 id="threads">Threads</h4>
<p>Threads are an Azure DevOps object for managing and organizing pull request discussions. Teams can discuss specific changes directly by adding one or more comments to each thread. Analyzing threads can give us many useful insights.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added threads&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Removed iterations.changes.changeEntries&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;threads&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">Json</span><span class="p">.</span><span class="n">Document</span><span class="p">(</span><span class="n">Web</span><span class="p">.</span><span class="n">Contents</span><span class="p">([</span><span class="n">value</span><span class="p">.</span><span class="n">url</span><span class="p">]</span><span class="o">&amp;</span><span class="s2">&#34;/threads?api-version=7.1-preview.1&#34;</span><span class="p">))),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded threads&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Added threads&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;threads&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;threads.value&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><p>For example, we can count the comment threads. A comment thread should have the <code>status</code> attribute (<code>Active</code>, <code>Resolved</code>, or <code>Closed</code>).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added threads.value.commentCount&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded threads&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;threads.value.commentCount&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">List</span><span class="p">.</span><span class="k">Sum</span><span class="p">(</span><span class="n">List</span><span class="p">.</span><span class="k">Transform</span><span class="p">([</span><span class="n">threads</span><span class="p">.</span><span class="n">value</span><span class="p">],</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="nb">Number</span><span class="p">.</span><span class="k">From</span><span class="p">(</span><span class="n">Record</span><span class="p">.</span><span class="n">HasFields</span><span class="p">(</span><span class="n">_</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;status&#34;</span><span class="p">))))),</span><span class="w">
</span></span></span></code></pre></div><p>Or we can get the approval or rejection information from the vote thread. A vote thread has a <code>CodeReviewThreadType</code> attribute with the value <code>VoteUpdate</code>. If the value of <code>CodeReviewVoteResult</code> is greater than 0, it is an approval; otherwise, it is a rejection. The M query below gets the first approval time of a pull request.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added threads.value.firstApprovalTime&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">#</span><span class="s2">&#34;Added threads.value.commentCount&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s2">&#34;threads.value.firstApprovalTime&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">each</span><span class="w"> </span><span class="n">List</span><span class="p">.</span><span class="k">Min</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">List</span><span class="p">.</span><span class="k">Transform</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">[</span><span class="n">threads</span><span class="p">.</span><span class="n">value</span><span class="p">],</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">each</span><span class="w"> </span><span class="k">if</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="n">Record</span><span class="p">.</span><span class="n">HasFields</span><span class="p">(</span><span class="n">_</span><span class="p">[</span><span class="n">properties</span><span class="p">],</span><span class="w"> </span><span class="s2">&#34;CodeReviewThreadType&#34;</span><span class="p">)</span><span class="w"> </span><span class="k">and</span><span class="w"> </span><span class="n">Record</span><span class="p">.</span><span class="n">Field</span><span class="p">(</span><span class="n">_</span><span class="p">[</span><span class="n">properties</span><span class="p">][</span><span class="n">CodeReviewThreadType</span><span class="p">],</span><span class="w"> </span><span class="s2">&#34;$value&#34;</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">&#34;VoteUpdate&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="k">and</span><span class="w"> </span><span class="n">Record</span><span class="p">.</span><span class="n">HasFields</span><span class="p">(</span><span class="n">_</span><span class="p">[</span><span class="n">properties</span><span class="p">],</span><span class="w"> </span><span class="s2">&#34;CodeReviewVoteResult&#34;</span><span class="p">)</span><span class="w"> </span><span class="k">and</span><span class="w"> </span><span class="nb">Number</span><span class="p">.</span><span class="n">FromText</span><span class="p">(</span><span class="n">Record</span><span class="p">.</span><span class="n">Field</span><span class="p">(</span><span class="n">_</span><span class="p">[</span><span class="n">properties</span><span class="p">][</span><span class="n">CodeReviewVoteResult</span><span class="p">],</span><span class="w"> </span><span class="s2">&#34;$value&#34;</span><span class="p">))</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">then</span><span class="w"> </span><span class="n">_</span><span class="p">[</span><span class="n">publishedDate</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">else</span><span class="w"> </span><span class="k">null</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h2 id="full-source-code">Full source code</h2>
<p>You can grab the source code, paste it into the Power BI Power Query advanced editor, and customize it to suit your needs.</p>
<p><a href="https://gist.github.com/ThaiDat/9aa1f08ea1a1339973566325b1cf9af9">Full Query</a></p>
<h2 id="visualize-insights">Visualize insights</h2>
<p>Now we have a rich dataset. Power BI offers a wide range of visual elements to help you uncover trends, patterns, and insights. It&rsquo;s time to bring our data to life with stunning visualizations.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Remember, this is just the beginning. As your project evolves and your data grows, you can expand your report to include additional metrics, refine visualizations, and explore new insights. Continuous improvement is essential to maximizing the value of your data.</p>
<p>By creating a comprehensive pull request report, you are taking the initial step toward establishing a culture of data-driven decision-making, first within your development team, then throughout your organization.</p>
]]></content:encoded></item><item><title>How to start a successful Data Warehouse project</title><link>https://note.datengineer.dev/posts/how-to-build-a-successful-data-warehouse-project/</link><pubDate>Sun, 11 Aug 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/how-to-build-a-successful-data-warehouse-project/</guid><description>In this article on the key factors for launching a successful Data Warehouse project, we will explore key considerations that can help ensure that your Data Warehouse achieves its intended goals and delivers value to the organization.</description><content:encoded><![CDATA[<p>Any organization aiming to leverage the power of data-driven decision-making stands to benefit greatly from a successful Data Warehouse project. A well-designed Data Warehouse not only centralizes your data but also guarantees that it is reliable, scalable, maintainable, and usable by stakeholders.</p>
<p>Over the past few months, my team and I have launched a new Data Warehouse project in production. The opportunity to start from scratch is always a valuable chance to gain new insights and expertise. I would like to share the experiences from this success story in the hope that they will be as beneficial to others as they have been to us.</p>
<h2 id="understand-business-requirements">Understand Business Requirements</h2>
<p>The first step in starting any project, not only a Data Warehouse, is to fully understand the business requirements. This is the difference between success and failure, not just a formality. If you skip this step, I can tell you with certainty that your project will be a waste of time, energy, and resources.</p>
<p>To really understand what the business wants to see and what your team needs to do, it&rsquo;s essential to spend time talking to the people who will be using the Data Warehouse. What do they hope to accomplish? How will it help them do their jobs better? How do they plan to use the data? Getting a clear picture of their goals is crucial to making sure your project is on the right track.</p>
<p><img alt="Importance of a clear requirements in Data Warehouse project success" loading="lazy" src="/posts/how-to-build-a-successful-data-warehouse-project/images/business-requirements.png"></p>
<p>However, this is where things often get complicated. People usually do not understand each other, especially people in different departments who have different perspectives, priorities, and terminologies. <strong>Sometimes people do not even understand what they themselves are saying.</strong> Business people are easily attracted to marketing buzzwords on the Internet, believing that these terms are the solutions to their problems. I have to say that the marketing departments of data companies do a really good job of reinventing new names for similar concepts. During this project, there were dozens of times stakeholders told me: let&rsquo;s use this tool, why not use this technology, money is not a problem (until they actually got the bill).</p>
<p>In one of my previous projects, a stakeholder told me that he wanted a visually stunning real-time dashboard that would make the numbers dance instantly whenever users did something in the web application. And I had to explain to him:</p>
<ul>
<li>Visually stunning: Yes, the data analysts team can always help you with that.</li>
<li>Real-time: There is no true real time. If the sun disappeared, we would know it only after 8 minutes. The same goes for data.</li>
<li>We do not really need it. Business is not going to sit still and watch the numbers dance every second.</li>
</ul>
<p>Patience is the key. They do not understand those technical buzzwords. Yes. But isn&rsquo;t that why you are here as a technical specialist? Your responsibility is to listen to them, understand them, empathize with them, and tell them what you will do to help them. Your job is to translate their requirements into a workable solution.</p>
<p>Remember that the business stakeholders are not only the end users, but also the investors. Without their buy-in, the project can&rsquo;t even get off the ground. They are funding the project, and they deserve the best service.</p>
<p>By starting with a clear understanding of business requirements, you set the stage for a Data Warehouse project that is aligned with the organization&rsquo;s goals, ensuring that the final product delivers real value.</p>
<h2 id="understand-system">Understand System</h2>
<p>A Data Warehouse is not an isolated island. It is more like a bustling city that relies on a network of interconnected systems. It receives supplies from surrounding farms and industrial areas. Since a Data Warehouse pulls data from other systems, you cannot build a successful one without understanding how those systems work.</p>
<p>Imagine stakeholders telling you they want the sales figure. Then you need to know exactly which systems hold the sales number. How is that number populated in each system? It may be manually entered by users, it may be automatically calculated, it may be synchronized from other sources, it may be read-only or editable&hellip; You need to know all the surrounding information to decide the source of truth for the number we desire. You may argue that all you need to do is copy the source database over and the business will know what to do with the data. Believe me, they don&rsquo;t. In fact, they have never seen the database a day in their lives. And you are the one who will tell them what they can do with your Data Warehouse.</p>
<p>Not knowing how the system works also puts your project design at risk. You certainly don&rsquo;t want to discover a surprise when you&rsquo;re almost done with the implementation, such as a scheduled job that archives data from the database daily. If you had known that from the beginning, your design would have been very different.</p>
<p>Understanding the entire system in detail can be time-consuming. You should have a good sense of how the interconnected systems work together, but don&rsquo;t expect to understand them in detail at the beginning of your project. Instead, I would suggest building strong relationships with the teams responsible for maintaining these systems. Meet with them, tell them what you are doing, and ask for their advice and insights. They are a goldmine of information. You can also experiment with sandbox environments and databases to uncover hidden patterns and processes.</p>
<h2 id="design-a-reliable-data-warehouse">Design a reliable Data Warehouse</h2>
<p><a href="../what-is-a-reliable-data-system">Reliability is the backbone of any Data Warehouse</a>. If your business can&rsquo;t rely on the data coming out of your Data Warehouse, your project is a complete failure.</p>
<p>Having a solid testing strategy will greatly help. Testing is not just about finding bugs, it&rsquo;s about building confidence. When you start designing the Data Warehouse, think less about the times when the system is running happily; there is nothing for us to do as long as it keeps running as it should. Think more about the times when the system is not working and what we are going to do then.</p>
<p><img alt="Bug is inevitable. The importance is how you deal with it." loading="lazy" src="/posts/how-to-build-a-successful-data-warehouse-project/images/there-will-be-no-bug-if-you-dont-write-any-code.png"></p>
<p>And even if you do your best, bugs and issues will still happen. Don&rsquo;t expect your system to be bug-free; instead, build processes to handle issues as soon as they arise. And most importantly, be transparent. If the business comes to you and asks about an issue they found, tell them what happened and what you are doing to help. Transparency is the key to trust. <strong>If you tell a lie, you are part of the problem; if you are transparent, you are part of the solution.</strong> A reliable Data Warehouse isn&rsquo;t just about technology. It&rsquo;s about building trust.</p>
<h2 id="choose-the-right-tool-for-the-right-job">Choose the right tool for the right job</h2>
<p>To build a Data Warehouse, you need a toolbox filled with different pieces to complete the picture: tools for copying data, transforming it, orchestrating jobs, and more. It is technically possible to create the tools yourself, especially if you are in a big corporation and want to control every aspect of the technology. However, in most cases, it is impractical. You do not have enough resources to own the technology. Thus, developing a Data Warehouse solution usually means picking the available tools and services and making them work together.</p>
<p>The real challenge is choosing the right tools. Beware of your enemies, the shiny marketing promises. The person who writes those buzzwords may not be the one who writes the code. Sometimes I don&rsquo;t understand what they wrote, and I think they don&rsquo;t understand what they wrote either. These tools are very expensive. It is important to avoid overkill. Focus on what your business really needs, not just what sounds cool. We are not going to use the most popular or the most expensive tools; we are going to find the right fit for our specific needs.</p>
<h2 id="start-small-grow-big">Start small, Grow big</h2>
<p>Your investors do not have infinite patience. They want to see progress and value. Building something small but functional is far better than promising a grand project that never finishes. By starting small, you can quickly deliver value and gather feedback from users.</p>
<p>With limited resources, we cannot get everything done at once. It is important to prioritize. What matters most to your business? What will have the biggest impact on your customers? Concentrate on delivering those core features first. You can break the project into phases, which is a good practice. Each phase focuses on specific business requirements, data sources, or user groups. And you can gradually expand the capabilities of the Data Warehouse.</p>
<h2 id="engage-users">Engage users</h2>
<p>A Data Warehouse is not just a technical marvel. It is a tool for your business. To ensure it delivers maximum value, you need to involve your users from the very beginning.</p>
<p>Imagine building a house without consulting the people who will live in it. People can still live in it, but they never feel it is their home. By involving them early and often, you will gain valuable insight into their needs, expectations, and challenges.</p>
<p>How can you engage your users?</p>
<ul>
<li>Involve them in the planning phase: Understand their data needs, pain points, and desired outcomes.</li>
<li>Provide regular updates: Keep them informed about project progress and involve them in decision-making.</li>
<li>Offer training and support: Equip users with the skills to effectively use the Data Warehouse.</li>
<li>Gather feedback: Encourage users to share their thoughts and suggestions for improvement.</li>
</ul>
<p>Remember that if you can not engage your users, any slightly higher number in their reports will quickly become <strong>your</strong> problem. <strong>If you can engage them and make them feel like they are part of the project, then any issue will become everyone&rsquo;s problem.</strong></p>
<h2 id="conclusion">Conclusion</h2>
<p>Building a successful Data Warehouse is a challenging journey that requires careful planning, execution, and continuous improvement. It all starts with a deep understanding of the business requirements to ensure that every decision is aligned with the organization&rsquo;s goals. Start small, iterate often, and always keep the user at the center of your efforts. A successful Data Warehouse is a collaboration between the engineering team and the business. By working together, you can create a solution that truly delivers value.</p>
]]></content:encoded></item><item><title>Understand Row-Oriented vs Column-Oriented Storage</title><link>https://note.datengineer.dev/posts/understand-row-oriented-vs-column-oriented-storage/</link><pubDate>Fri, 05 Apr 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/understand-row-oriented-vs-column-oriented-storage/</guid><description>Explore the basics and benefits of column-oriented storage, and learn about its advantages over row-oriented databases in processing OLAP queries.</description><content:encoded><![CDATA[<p>The way we access and analyze data has changed a lot lately. Row-oriented storage, which has been the standard for data storage for a long time, is having trouble keeping up with the demands of modern data analysis. In this article, I will introduce you to column-oriented storage and how it can help analytical queries run faster.</p>
<h2 id="olap">OLAP</h2>
<p>In my previous post, we discussed the <a href="../oltp-olap-why-we-need-data-warehouse">differences between Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP)</a>. As a reminder, analytical queries following the OLAP access pattern typically:</p>
<ul>
<li>Consume a large number of records</li>
<li>Focus on only a specific subset of columns from each record</li>
<li>Aggregate data to calculate statistics (e.g., averages, sums)</li>
</ul>
<h2 id="row-oriented-storage">Row-Oriented Storage</h2>
<p>Row-oriented storage, a type of storage engine optimized for OLTP, stores all values belonging to a single row near each other. The entire row is essentially stored as a sequence of bytes and is usually indexed for quick retrieval. When you provide a key, the database efficiently locates the physical location of the row on disk. It then goes to that address, loads the sequence of bytes into memory, and parses it to extract the specific values you need. Let&rsquo;s think of it like a CSV file. A row is stored as a string of characters. If you want to access the 10th row, you have to scan past the first nine line breaks and read all the characters until you reach the next line break. Then you parse the result by splitting it on commas to get the information you want.</p>
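<p>To make the analogy concrete, here is a minimal Python sketch of the CSV lookup described above. The file contents and column names are made up for illustration; real storage engines work on binary pages, not text, but the scan-then-parse pattern is the same.</p>

```python
import io

# Hypothetical miniature "table": each row is stored as one line of
# text with comma-separated values (a row-oriented layout).
csv_data = "".join(f"{i},product_{i},{i * 10}\n" for i in range(100))

def read_nth_row(f, n):
    """Return the parsed n-th row (0-based).

    Without an index we must scan past every earlier line break just
    to find the row, then split the whole line on commas to extract
    the individual values.
    """
    for i, line in enumerate(f):
        if i == n:
            return line.rstrip("\n").split(",")
    raise IndexError(n)

print(read_nth_row(io.StringIO(csv_data), 9))  # ['9', 'product_9', '90']
```

An index is essentially a shortcut past this scan: given a key, it hands back the byte offset of the row so the engine can seek there directly instead of reading everything before it.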
<p>While row-oriented storage is great for reading and writing individual records, it quickly becomes less suitable when faced with the demands of OLAP:</p>
<ul>
<li>Indexes, the data structures that allow most row-oriented storage engines to quickly locate data, don&rsquo;t work well with analytical queries. Analytical queries don&rsquo;t access data using a specific key or ID. Instead, they often use multiple conditions, such as a creation date within a given year or a product category among specific types. Any column in the table can be used in the <code>where</code> clause, and we can&rsquo;t just create a separate row-based index for each column.</li>
<li>Reading a single row in row-oriented storage requires loading the entire sequence of bytes from disk into memory. Thus, reading a huge number of rows with hundreds of columns (which is typical in OLAP) quickly becomes inefficient.</li>
</ul>
<h2 id="column-oriented-storage">Column-Oriented Storage</h2>
<p>Column-oriented storage is based on a simple idea: instead of storing all the values from one row together, just store all the values from each column together. Because the data is organized by column, a query only needs to access and process the columns that are relevant to its needs. This significantly reduces the amount of data that needs to be transferred and parsed, resulting in dramatic performance gains.</p>
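<p>As a rough Python sketch (with small made-up numbers, not a real storage engine), the columnar idea looks like this: each column lives in its own contiguous sequence, so an aggregate query reads one list and ignores everything else.</p>

```python
# Hypothetical sales data in a columnar layout: one list per column.
columns = {
    "DATE": ["2023/12/28", "2024/01/11", "2024/01/16", "2024/01/21", "2024/02/02"],
    "PRODUCT_KEY": [2, 2, 8, 6, 5],
    "QUANTITY": [3, 5, 101, 5, 2],
    "DISCOUNT": [0.00, 0.00, 15.00, 5.00, 0.00],
}

# A columnar aggregate touches only the QUANTITY values; DATE,
# DISCOUNT, and every other column stay untouched on disk.
total_quantity = sum(columns["QUANTITY"])

# A row-oriented engine would materialize whole rows first, loading
# and parsing every field of every row along the way.
rows = list(zip(*columns.values()))
total_from_rows = sum(row[2] for row in rows)  # index 2 = QUANTITY

assert total_quantity == total_from_rows == 116
```

Both computations produce the same answer; the difference is how many bytes each has to move and parse to get there, which is exactly where the performance gap comes from.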
<p>Let&rsquo;s look at the example below. A sales table stored in a row-oriented format looks like this.</p>
<table>
  <thead>
      <tr>
          <th>DATE</th>
          <th>PRODUCT_KEY</th>
          <th>CUSTOMER_KEY</th>
          <th>QUANTITY</th>
          <th>DISCOUNT</th>
          <th>PAYMENT_METHOD</th>
          <th>&hellip;</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2023/12/28</td>
          <td>2</td>
          <td>13</td>
          <td>3</td>
          <td>0.00</td>
          <td>card</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>2024/01/11</td>
          <td>2</td>
          <td>49</td>
          <td>5</td>
          <td>0.00</td>
          <td>bank</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>2024/01/16</td>
          <td>8</td>
          <td>49</td>
          <td>101</td>
          <td>15.00</td>
          <td>card</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>2024/01/21</td>
          <td>6</td>
          <td>55</td>
          <td>5</td>
          <td>5.00</td>
          <td>card</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>2024/02/02</td>
          <td>5</td>
          <td>26</td>
          <td>2</td>
          <td>0.00</td>
          <td>bank</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
      </tr>
  </tbody>
</table>
<p>Column-oriented storage serializes all the values in a column and stores them together as a sequence of bytes. For our example table, the data would be stored this way:</p>
<table>
  <thead>
      <tr>
          <th>Column</th>
          <th>Row 1</th>
          <th>Row 2</th>
          <th>Row 3</th>
          <th>Row 4</th>
          <th>Row 5</th>
          <th>&hellip;</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DATE</td>
          <td>2023/12/28</td>
          <td>2024/01/11</td>
          <td>2024/01/16</td>
          <td>2024/01/21</td>
          <td>2024/02/02</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>PRODUCT_KEY</td>
          <td>2</td>
          <td>2</td>
          <td>8</td>
          <td>6</td>
          <td>5</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>CUSTOMER_KEY</td>
          <td>13</td>
          <td>49</td>
          <td>49</td>
          <td>55</td>
          <td>26</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>QUANTITY</td>
          <td>3</td>
          <td>5</td>
          <td>101</td>
          <td>5</td>
          <td>2</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>DISCOUNT</td>
          <td>0.00</td>
          <td>0.00</td>
          <td>15.00</td>
          <td>5.00</td>
          <td>0.00</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>PAYMENT_METHOD</td>
          <td>card</td>
          <td>bank</td>
          <td>card</td>
          <td>card</td>
          <td>bank</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
      </tr>
  </tbody>
</table>
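<p>To make the layout concrete, here is a minimal in-memory Python sketch of the two representations (a toy model, not how a real storage engine lays out bytes on disk), using the first three rows of the sales table. Notice that an aggregate such as <code>SUM(QUANTITY)</code> only ever touches one of the columns.</p>

```python
# Toy sketch of row-oriented vs column-oriented layout (first 3 sample rows).
# Row-oriented: each record keeps all of its column values together.
rows = [
    {"DATE": "2023/12/28", "PRODUCT_KEY": 2, "CUSTOMER_KEY": 13, "QUANTITY": 3,   "DISCOUNT": 0.00,  "PAYMENT_METHOD": "card"},
    {"DATE": "2024/01/11", "PRODUCT_KEY": 2, "CUSTOMER_KEY": 49, "QUANTITY": 5,   "DISCOUNT": 0.00,  "PAYMENT_METHOD": "bank"},
    {"DATE": "2024/01/16", "PRODUCT_KEY": 8, "CUSTOMER_KEY": 49, "QUANTITY": 101, "DISCOUNT": 15.00, "PAYMENT_METHOD": "card"},
]

# Column-oriented: one sequence per column, holding that column's value
# for every row, in row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical query such as SUM(QUANTITY) reads a single column and
# can ignore the other five entirely.
total_quantity = sum(columns["QUANTITY"])
print(total_quantity)  # 109
```

<p>In the row layout, computing the same sum would force us to load every byte of every record; in the column layout we read one contiguous sequence of integers.</p>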
<h2 id="advantages-of-column-oriented-storage">Advantages of Column-Oriented Storage</h2>
<p>Reading data from column-oriented storage provides several key advantages over traditional row-oriented storage, especially for analytical workloads:</p>
<ul>
<li><strong>Column compression</strong>: Because data in a modern data warehouse is typically denormalized, values in a column tend to repeat. Many popular compression algorithms, such as LZW or run-length encoding, exploit the similarity of <strong>adjacent</strong> data to reduce data size. Look at the <code>PAYMENT_METHOD</code> column in our example. What if, instead of storing a full 4-byte string, we only needed 1 bit per value: 0 for <code>card</code> and 1 for <code>bank</code>? The whole column then becomes one long bitmap in which each row consumes only 1 bit on disk.</li>
<li><strong>Access time</strong>: Disk access is a real bottleneck. When working with data on disk, we need specialized data structures and algorithms (B-trees, for example) to minimize access time. By reading only the data needed to process the query and compressing it well, we can scan more rows in a single read. That means fewer reads to scan an entire table with trillions of rows, and therefore less disk access time.</li>
<li><strong>Throughput</strong>: Fetching only the necessary columns and compressing the data also lead to higher throughput, i.e. the amount of data processed in a given time. Throughput is extremely important when compute and storage are not co-located and data must be transferred over the network.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>Each database implementation varies in its specific optimizations. However, the fundamental principle - storing and processing data by column rather than by row - remains the same, and it leads to significant performance gains for analytical queries. Understanding how your database works behind the scenes makes you a better engineer: knowing what your tool does means knowing what you do.</p>
]]></content:encoded></item><item><title>OLTP &amp; OLAP - Why we need Data Warehouse</title><link>https://note.datengineer.dev/posts/oltp-olap-why-we-need-data-warehouse/</link><pubDate>Wed, 28 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/oltp-olap-why-we-need-data-warehouse/</guid><description>Understand the fundamental distinctions between OLTP and OLAP databases, and gain insights into the need for a separate database called a Data Warehouse</description><content:encoded><![CDATA[<p>Today, I was advising a team on building their data warehouse solution. I realized that even 40 years after the term &ldquo;data warehouse&rdquo; was first introduced, there are still questions, especially from executives, about why we need a data warehouse and why we can&rsquo;t simply pull all of the data from application databases. I write this post to answer these questions by clarifying the terms OLTP and OLAP, which come up frequently in discussions about data warehouse architecture. Then I will explain why OLTP databases are inefficient for OLAP queries and why you need a separate database known as a data warehouse.</p>
<h2 id="oltp">OLTP</h2>
<p>OLTP, or <strong>Online Transaction Processing</strong>, is a pattern by which we access and manipulate data in the database transaction by transaction. A transaction refers to a single unit of work, such as a money transfer, a booking, or a blog post. Typically, a user interacts with only one or a few transactions at a time, so most of the time applications look up a small number of records by some key. Application databases implement special indexing structures such as B-trees or LSM-trees to handle OLTP efficiently: given its indexed key, they can quickly access a particular transaction.</p>
<h2 id="olap">OLAP</h2>
<p>As businesses grow and accumulate data, they need to analyze it to gain valuable insights about their market and customers. Then they can make informed decisions and gain competitive advantage. When it comes to analytics, access patterns will be very different. Typically, analytic queries consume a large number of records, look for only a few specific columns of each record, and often aggregate data to calculate statistics (min, max, sum, average,&hellip;). This pattern of accessing data in the database is called <strong>Online Analytic Processing</strong> (OLAP).</p>
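<p>The two access patterns can be sketched with a couple of queries. This is an illustrative Python example against an in-memory SQLite database; the <code>orders</code> table and its columns are made up for the demo, not taken from any real schema.</p>

```python
import sqlite3

# A tiny orders table standing in for an application database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0), (4, 12, 60.0)],
)

# OLTP-style: fetch one record by its indexed key.
row = conn.execute("SELECT * FROM orders WHERE order_id = ?", (2,)).fetchone()
print(row)  # (2, 11, 40.0)

# OLAP-style: scan many records, touch few columns, aggregate.
total, avg = conn.execute("SELECT SUM(amount), AVG(amount) FROM orders").fetchone()
print(total, avg)  # 140.0 35.0
```

<p>The first query touches a single row through the primary-key index; the second must visit every row, but only cares about one column and returns a statistic rather than records.</p>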
<h2 id="difference-between-oltp-and-olap">Difference between OLTP and OLAP</h2>
<p>From the definitions above, we can already distinguish OLTP from OLAP. The following table summarizes the typical differences:</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>OLTP</th>
          <th>OLAP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Access</td>
          <td>Small number of records, using indexed keys</td>
          <td>Large number of records, often aggregated</td>
      </tr>
      <tr>
          <td>Purpose</td>
          <td>Application transactional consistency and speed</td>
          <td>Complex queries and analysis</td>
      </tr>
      <tr>
          <td>Users</td>
          <td>Application end users</td>
          <td>Analysts and business users</td>
      </tr>
      <tr>
          <td>Data volume</td>
          <td>Relatively small, frequently accessed</td>
          <td>Large datasets, accessed less frequently</td>
      </tr>
      <tr>
          <td>Data type</td>
          <td>Real-time, current data</td>
          <td>Historical, aggregated data</td>
      </tr>
  </tbody>
</table>
<p><em>* Differences between OLTP and OLAP</em></p>
<h2 id="problems-of-oltp-databases-with-olap-queries">Problems of OLTP Databases with OLAP queries</h2>
<p>When your business is still young, it is easy to run analysis directly on application databases. However, as the volume of data and the need for analysis grows along with the business, problems arise. Databases that were optimized for OLTP using indexing techniques such as LSM tree or B-tree now struggle to execute OLAP queries efficiently. As a result, running OLAP queries becomes costly and negatively impacts application performance, which is critical to business success.</p>
<p>As the business continues to grow, different business units tend to operate independently, with their own goals, priorities, concerns, and IT budgets. Each unit maintains its own applications running on separate databases. Performing analysis when data is scattered across different locations is difficult, and analysts often end up exporting data from different places, putting it into a single Excel file, and stitching it together with VLOOKUP.</p>
<h2 id="data-warehouse">Data Warehouse</h2>
<p>In response to the challenges of running OLAP queries on operational business databases, the concept of a data warehouse emerged as a solution.</p>
<ul>
<li>A data warehouse functions as a dedicated space for analytical purposes. It allows a business to store massive amounts of historical and current data without impacting operational databases.</li>
<li>Data warehouses are designed with a focus on analytical processing. Their storage engines use specialized techniques to speed up OLAP queries. We may explore these techniques in other posts.</li>
<li>Data warehouses serve as a centralized repository for data from various sources. Analysis becomes easier because all of the necessary data is in a single place.</li>
</ul>
<p><img alt="OLTP Databases to OLAP Data Warehouse" loading="lazy" src="/posts/oltp-olap-why-we-need-data-warehouse/images/data-from-oltp-databases-to-olap-data-warehouse.png"></p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we&rsquo;ve gone over the definitions and differences between OLTP and OLAP. We also looked at the role of the data warehouse in conducting business analysis. Understanding them should give you confidence the next time you tell your boss, &ldquo;We need a data warehouse.&rdquo;</p>
]]></content:encoded></item><item><title>Recursive CTEs and CONNECT BY in SQL to query Hierarchical data</title><link>https://note.datengineer.dev/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/</link><pubDate>Tue, 20 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/</guid><description>Discover the concept of hierarchical data in SQL and see real-life examples. Learn how to query hierarchical data and extract insights with advanced SQL features: Recursive CTEs and CONNECT BY.</description><content:encoded><![CDATA[<p>In database design, the idea of hierarchical data represents relationships between entities as a tree-like structure. This type of data model is widely used in many domains, such as file systems, organizational structure, etc. When dealing with hierarchical data, it is crucial to efficiently query and extract information about the relationships between entities. In this post, we will explore two powerful SQL tools for querying hierarchical data: recursive Common Table Expressions (CTEs) and the CONNECT BY clause.</p>
<h2 id="hierarchical-data">Hierarchical Data</h2>
<p>Hierarchical data represents a natural parent-child relationship that is often visualized in the form of a tree structure. Imagine a family tree: grandparents on top, parents in the middle, and you and your siblings at the bottom, all connected. That&rsquo;s hierarchical data! It organizes information in levels, making it easy to understand how things are related. The most popular real-life example of hierarchical data is the employee-manager relationship: every employee is managed by a manager; that manager is also an employee, who in turn is managed by another manager.</p>
<p><img alt="Hierarchical Data example Employee-Manager relationship" loading="lazy" src="/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/images/hierarchical-data-exampl-employee-manager-sql.png"></p>
<h2 id="hierarchical-data-representation-in-sql">Hierarchical Data representation in SQL</h2>
<p>Relational models work best with flat tables of rows and columns, not tree-like structures. However, techniques exist to represent hierarchical data in SQL. The most common approach is a self-referencing foreign key: in the example above, we add a <code>MANAGER_ID</code> column to the employee table that refers to the employee&rsquo;s manager.</p>
<table>
  <thead>
      <tr>
          <th>EMPLOYEE_ID</th>
          <th>NAME</th>
          <th>SALARY</th>
          <th>MANAGER_ID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Adam</td>
          <td>60000</td>
          <td>NULL</td>
      </tr>
      <tr>
          <td>2</td>
          <td>John</td>
          <td>30000</td>
          <td>1</td>
      </tr>
      <tr>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
      </tr>
  </tbody>
</table>
<p><em>*Example SQL table representing hierarchical data structure</em></p>
<h2 id="querying-hierarchical-data">Querying Hierarchical Data</h2>
<p>When querying hierarchical data, we often want to understand the relationship in both directions: who manages whom and who is managed by whom. However, querying hierarchical data is tricky because we don&rsquo;t know the depth of the tree, i.e. how many levels of hierarchy there are. Before we look at how to do this in SQL, let&rsquo;s prepare some data to work with. Note that all SQL code in this post is written for Oracle, as it natively supports CONNECT BY; the equivalent SQL for other RDBMSs should be similar.</p>


<p><details >
  <summary markdown="span"><em>Example data</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">EMPLOYEE_ID</th>
          <th>NAME</th>
          <th style="text-align: right">SALARY</th>
          <th style="text-align: right">MANAGER_ID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">1</td>
          <td>Adam</td>
          <td style="text-align: right">60000</td>
          <td style="text-align: right">null</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td>Sarah</td>
          <td style="text-align: right">70000</td>
          <td style="text-align: right">null</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td>David</td>
          <td style="text-align: right">50000</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td>Emily</td>
          <td style="text-align: right">55000</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td>Michael</td>
          <td style="text-align: right">45000</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td>Jessica</td>
          <td style="text-align: right">50000</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td>Ben</td>
          <td style="text-align: right">35000</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>Olivia</td>
          <td style="text-align: right">37000</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td>Charles</td>
          <td style="text-align: right">32000</td>
          <td style="text-align: right">5</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td>Sophia</td>
          <td style="text-align: right">33000</td>
          <td style="text-align: right">6</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td>Alex</td>
          <td style="text-align: right">37000</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td>Maya</td>
          <td style="text-align: right">38000</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td>Daniel</td>
          <td style="text-align: right">35000</td>
          <td style="text-align: right">5</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td>Isabella</td>
          <td style="text-align: right">36000</td>
          <td style="text-align: right">6</td>
      </tr>
      <tr>
          <td style="text-align: right">15</td>
          <td>Ryan</td>
          <td style="text-align: right">25000</td>
          <td style="text-align: right">7</td>
      </tr>
      <tr>
          <td style="text-align: right">16</td>
          <td>Chloe</td>
          <td style="text-align: right">26000</td>
          <td style="text-align: right">8</td>
      </tr>
      <tr>
          <td style="text-align: right">17</td>
          <td>Noah</td>
          <td style="text-align: right">24000</td>
          <td style="text-align: right">9</td>
      </tr>
      <tr>
          <td style="text-align: right">18</td>
          <td>Mia</td>
          <td style="text-align: right">25000</td>
          <td style="text-align: right">10</td>
      </tr>
      <tr>
          <td style="text-align: right">19</td>
          <td>Liam</td>
          <td style="text-align: right">26000</td>
          <td style="text-align: right">11</td>
      </tr>
      <tr>
          <td style="text-align: right">20</td>
          <td>Evelyn</td>
          <td style="text-align: right">27000</td>
          <td style="text-align: right">12</td>
      </tr>
      <tr>
          <td style="text-align: right">21</td>
          <td>William</td>
          <td style="text-align: right">25000</td>
          <td style="text-align: right">13</td>
      </tr>
      <tr>
          <td style="text-align: right">22</td>
          <td>Charlotte</td>
          <td style="text-align: right">26000</td>
          <td style="text-align: right">14</td>
      </tr>
      <tr>
          <td style="text-align: right">23</td>
          <td>Ethan</td>
          <td style="text-align: right">27000</td>
          <td style="text-align: right">7</td>
      </tr>
      <tr>
          <td style="text-align: right">24</td>
          <td>Ava</td>
          <td style="text-align: right">28000</td>
          <td style="text-align: right">8</td>
      </tr>
      <tr>
          <td style="text-align: right">25</td>
          <td>Lucas</td>
          <td style="text-align: right">26000</td>
          <td style="text-align: right">9</td>
      </tr>
      <tr>
          <td style="text-align: right">26</td>
          <td>Amelia</td>
          <td style="text-align: right">27000</td>
          <td style="text-align: right">10</td>
      </tr>
      <tr>
          <td style="text-align: right">27</td>
          <td>Mason</td>
          <td style="text-align: right">28000</td>
          <td style="text-align: right">11</td>
      </tr>
      <tr>
          <td style="text-align: right">28</td>
          <td>Harper</td>
          <td style="text-align: right">29000</td>
          <td style="text-align: right">12</td>
      </tr>
      <tr>
          <td style="text-align: right">29</td>
          <td>Logan</td>
          <td style="text-align: right">27000</td>
          <td style="text-align: right">13</td>
      </tr>
      <tr>
          <td style="text-align: right">30</td>
          <td>Sofia</td>
          <td style="text-align: right">28000</td>
          <td style="text-align: right">14</td>
      </tr>
  </tbody>
</table>

</details></p>

<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">EMPLOYEE_ID</span><span class="w"> </span><span class="nb">NUMBER</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">NAME</span><span class="w"> </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">SALARY</span><span class="w"> </span><span class="nb">NUMBER</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="nb">NUMBER</span><span class="w"> </span><span class="k">REFERENCES</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="p">(</span><span class="n">EMPLOYEE_ID</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="k">VALUES</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Adam&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">60000</span><span class="p">,</span><span class="w"> </span><span class="k">NULL</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Sarah&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">70000</span><span class="p">,</span><span class="w"> </span><span class="k">NULL</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;David&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">50000</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Emily&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">55000</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Michael&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">45000</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Jessica&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">50000</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">7</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Ben&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">35000</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Olivia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">37000</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">9</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Charles&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">32000</span><span class="p">,</span><span class="w"> </span><span class="mi">5</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Sophia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">33000</span><span class="p">,</span><span class="w"> </span><span class="mi">6</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">11</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Alex&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">37000</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Maya&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">38000</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">13</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Daniel&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">35000</span><span class="p">,</span><span class="w"> </span><span class="mi">5</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">14</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Isabella&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">36000</span><span class="p">,</span><span class="w"> </span><span class="mi">6</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Ryan&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">25000</span><span class="p">,</span><span class="w"> </span><span class="mi">7</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Chloe&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">26000</span><span class="p">,</span><span class="w"> </span><span class="mi">8</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">17</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Noah&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">24000</span><span class="p">,</span><span class="w"> </span><span class="mi">9</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">18</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Mia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">25000</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">19</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Liam&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">26000</span><span class="p">,</span><span class="w"> </span><span class="mi">11</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Evelyn&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">27000</span><span class="p">,</span><span class="w"> </span><span class="mi">12</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">21</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;William&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">25000</span><span class="p">,</span><span class="w"> </span><span class="mi">13</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">22</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Charlotte&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">26000</span><span class="p">,</span><span class="w"> </span><span class="mi">14</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">23</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Ethan&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">27000</span><span class="p">,</span><span class="w"> </span><span class="mi">7</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">24</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Ava&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">28000</span><span class="p">,</span><span class="w"> </span><span class="mi">8</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">25</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Lucas&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">26000</span><span class="p">,</span><span class="w"> </span><span class="mi">9</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">26</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Amelia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">27000</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">27</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Mason&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">28000</span><span class="p">,</span><span class="w"> </span><span class="mi">11</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Harper&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">29000</span><span class="p">,</span><span class="w"> </span><span class="mi">12</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">29</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Logan&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">27000</span><span class="p">,</span><span class="w"> </span><span class="mi">13</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">30</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Sofia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">28000</span><span class="p">,</span><span class="w"> </span><span class="mi">14</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><p><img alt="Hierarchical Data example in SQL query" loading="lazy" src="/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/images/hierarchical-data-examples-in-sql-query.jpg"></p>
<h3 id="problem-statement">Problem statement</h3>
<p>We want to look at it from two directions:</p>
<ul>
<li><strong>Problem 1</strong>: Start with individual employees and follow the ladder, revealing who manages them, their manager&rsquo;s manager, and so on, all the way to the top.</li>
<li><strong>Problem 2</strong>: Stand at the highest level and look down the hierarchy. For each employee, calculate the total salary of everyone under their direct or indirect management, like a salary pyramid.</li>
</ul>
<p>When looking at direct relationships (who manages whom), a simple join will do. But things get tricky once we need to traverse multiple levels of the hierarchy.</p>
<h3 id="recursive-cte">Recursive CTE</h3>
<p>A recursive CTE (Common Table Expression) is a valuable feature in SQL. By referencing itself, the CTE repeats a step level by level until it has reached every branch, making it an effective tool for querying and analyzing hierarchical data.</p>
<p>The recursive CTE syntax is not too different from the non-recursive one.</p>
<p>Non-recursive CTE:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">CTE_NAME</span><span class="w"> </span><span class="p">(</span><span class="n">column_1</span><span class="p">,</span><span class="w"> </span><span class="n">column2</span><span class="p">,</span><span class="w"> </span><span class="p">...)</span><span class="w"> </span><span class="k">AS</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- CTE_QUERY_DEFINITION (SELECT ... FROM ... WHERE)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>Recursive CTE:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">CTE_NAME</span><span class="w"> </span><span class="p">(</span><span class="n">column_1</span><span class="p">,</span><span class="w"> </span><span class="n">column2</span><span class="p">,</span><span class="w"> </span><span class="p">...)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- ANCHOR_MEMBER (SELECT ... FROM ... WHERE BASE_LEVEL)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">UNION</span><span class="w"> </span><span class="p">(</span><span class="k">ALL</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- RECURSIVE_MEMBER (SELECT ... FROM (reference to CTE_NAME) WHERE ...)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>The definition of a recursive CTE consists of two parts. The anchor member, or initial query, is executed once and defines the starting point of the recursion. The recursive member references the CTE itself and is executed repeatedly until it returns no rows. <code>UNION</code> or <code>UNION ALL</code> combines the results of the two parts. You will see how this works when we use it to solve real problems.</p>
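<p>To make the anchor/recursive split concrete, here is a minimal, runnable sketch (not from the original post) using Python&rsquo;s built-in <code>sqlite3</code>, since SQLite also supports recursive CTEs via <code>WITH RECURSIVE</code>. The CTE name <code>counter</code> and the bound of 5 are arbitrary choices for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE counter(n) AS (
        SELECT 1          -- anchor member: runs once and seeds the CTE
        UNION ALL
        SELECT n + 1      -- recursive member: references the CTE itself
        FROM counter
        WHERE n < 5       -- termination condition; without it, the recursion never stops
    )
    SELECT n FROM counter
""").fetchall()
print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
```

<p>The anchor produces the single row <code>1</code>; each recursive pass adds <code>n + 1</code> until the <code>WHERE</code> clause filters everything out, at which point the recursion terminates.</p>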
<h4 id="problem-1">Problem 1</h4>
<p>For each employee, we get the direct manager, the path from the highest-level manager, and the employee&rsquo;s current level.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">company_hierarchy</span><span class="w"> </span><span class="p">(</span><span class="n">EMPLOYEE_ID</span><span class="p">,</span><span class="w"> </span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="n">MANAGER</span><span class="p">,</span><span class="w"> </span><span class="n">MANAGER_PATH</span><span class="p">,</span><span class="w"> </span><span class="n">EMPLOYEE_LEVEL</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- Base query. Select ALL employees with no manager
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">NAME</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="k">NULL</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="s1">&#39;&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER_PATH</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="mi">1</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">EMPLOYEE_LEVEL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">WHERE</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">UNION</span><span class="w"> </span><span class="k">ALL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- Recursive query which refer to the CTE itself.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="c1">-- Query all employees managed directly by someone already in company_hierarchy
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">NAME</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">NAME</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">MANAGER_PATH</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">com</span><span class="p">.</span><span class="n">NAME</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39;/&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER_PATH</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">EMPLOYEE_LEVEL</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">EMPLOYEE_LEVEL</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">INNER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">company_hierarchy</span><span class="w"> </span><span class="n">com</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">ON</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">com</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">company_hierarchy</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>Recursive CTE query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">EMPLOYEE_ID</th>
          <th>NAME</th>
          <th>MANAGER</th>
          <th>MANAGER_PATH</th>
          <th style="text-align: right">EMPLOYEE_LEVEL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">1</td>
          <td>Adam</td>
          <td>null</td>
          <td>null</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td>Sarah</td>
          <td>null</td>
          <td>null</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td>David</td>
          <td>Adam</td>
          <td>Adam/</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td>Michael</td>
          <td>Adam</td>
          <td>Adam/</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td>Emily</td>
          <td>Sarah</td>
          <td>Sarah/</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td>Jessica</td>
          <td>Sarah</td>
          <td>Sarah/</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td>Ben</td>
          <td>David</td>
          <td>Adam/David/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td>Alex</td>
          <td>David</td>
          <td>Adam/David/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td>Charles</td>
          <td>Michael</td>
          <td>Adam/Michael/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td>Daniel</td>
          <td>Michael</td>
          <td>Adam/Michael/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>Olivia</td>
          <td>Emily</td>
          <td>Sarah/Emily/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td>Maya</td>
          <td>Emily</td>
          <td>Sarah/Emily/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td>Sophia</td>
          <td>Jessica</td>
          <td>Sarah/Jessica/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td>Isabella</td>
          <td>Jessica</td>
          <td>Sarah/Jessica/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">15</td>
          <td>Ryan</td>
          <td>Ben</td>
          <td>Adam/David/Ben/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">23</td>
          <td>Ethan</td>
          <td>Ben</td>
          <td>Adam/David/Ben/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">19</td>
          <td>Liam</td>
          <td>Alex</td>
          <td>Adam/David/Alex/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">27</td>
          <td>Mason</td>
          <td>Alex</td>
          <td>Adam/David/Alex/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">17</td>
          <td>Noah</td>
          <td>Charles</td>
          <td>Adam/Michael/Charles/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">25</td>
          <td>Lucas</td>
          <td>Charles</td>
          <td>Adam/Michael/Charles/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">21</td>
          <td>William</td>
          <td>Daniel</td>
          <td>Adam/Michael/Daniel/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">29</td>
          <td>Logan</td>
          <td>Daniel</td>
          <td>Adam/Michael/Daniel/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">16</td>
          <td>Chloe</td>
          <td>Olivia</td>
          <td>Sarah/Emily/Olivia/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">24</td>
          <td>Ava</td>
          <td>Olivia</td>
          <td>Sarah/Emily/Olivia/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">20</td>
          <td>Evelyn</td>
          <td>Maya</td>
          <td>Sarah/Emily/Maya/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">28</td>
          <td>Harper</td>
          <td>Maya</td>
          <td>Sarah/Emily/Maya/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">18</td>
          <td>Mia</td>
          <td>Sophia</td>
          <td>Sarah/Jessica/Sophia/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">26</td>
          <td>Amelia</td>
          <td>Sophia</td>
          <td>Sarah/Jessica/Sophia/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">22</td>
          <td>Charlotte</td>
          <td>Isabella</td>
          <td>Sarah/Jessica/Isabella/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">30</td>
          <td>Sofia</td>
          <td>Isabella</td>
          <td>Sarah/Jessica/Isabella/</td>
          <td style="text-align: right">4</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>The picture below illustrates how the recursive CTE works:</p>
<p><img alt="Recursive CTE in SQL Flowchart" loading="lazy" src="/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/images/recursive-cte-in-sql-how-it-work-flowchart.png"></p>
<p>First, the anchor member, or initial query, is executed. The <code>company_hierarchy</code> CTE now contains the two employees with <code>MANAGER_ID IS NULL</code>, each at <code>EMPLOYEE_LEVEL=1</code>.</p>
<p>Then the recursive member is executed, joining the <code>EMPLOYEES</code> table with <code>company_hierarchy</code> to find everyone managed directly by an employee already in <code>company_hierarchy</code>. The working set then becomes those newly returned employees (<code>EMPLOYEE_LEVEL=2</code>).</p>
<p>This repeats until the recursive member returns no rows.</p>
<p>At the end, <code>UNION ALL</code> combines all the intermediate results.</p>
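<p>The same iteration can be reproduced end to end with Python&rsquo;s <code>sqlite3</code> on a small three-row subset of the <code>EMPLOYEES</code> data. This is a sketch, not the article&rsquo;s full dataset; the query is the one above, written with SQLite&rsquo;s <code>WITH RECURSIVE</code> spelling.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEES (EMPLOYEE_ID INT, NAME TEXT, SALARY INT, MANAGER_ID INT)")
# Tiny chain for illustration: Adam manages David, David manages Ben
conn.executemany("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?)",
                 [(1, 'Adam', 50000, None), (3, 'David', 40000, 1), (7, 'Ben', 30000, 3)])
rows = conn.execute("""
    WITH RECURSIVE company_hierarchy(EMPLOYEE_ID, NAME, MANAGER, MANAGER_PATH, EMPLOYEE_LEVEL) AS (
        -- anchor: employees with no manager
        SELECT EMPLOYEE_ID, NAME, NULL, '', 1
        FROM EMPLOYEES WHERE MANAGER_ID IS NULL
        UNION ALL
        -- recursive: employees managed by someone already in the CTE
        SELECT emp.EMPLOYEE_ID, emp.NAME, com.NAME,
               com.MANAGER_PATH || com.NAME || '/',
               com.EMPLOYEE_LEVEL + 1
        FROM EMPLOYEES emp
        INNER JOIN company_hierarchy com ON emp.MANAGER_ID = com.EMPLOYEE_ID
    )
    SELECT NAME, MANAGER_PATH, EMPLOYEE_LEVEL
    FROM company_hierarchy
    ORDER BY EMPLOYEE_LEVEL
""").fetchall()
print(rows)  # [('Adam', '', 1), ('David', 'Adam/', 2), ('Ben', 'Adam/David/', 3)]
```

<p>Each pass extends the path and increments the level, exactly as in the flowchart above.</p>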
<h4 id="problem-2">Problem 2</h4>
<p>This is a bit more difficult. Let&rsquo;s reread the <a href="#problem-statement">problem statement</a> and try it yourself before reading my answer.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">company_explode</span><span class="w"> </span><span class="p">(</span><span class="n">MANAGER_ID</span><span class="p">,</span><span class="w"> </span><span class="n">EMPLOYEE_ID</span><span class="p">,</span><span class="w"> </span><span class="n">SALARY</span><span class="p">,</span><span class="w"> </span><span class="n">MANAGEMENT_DISTANCE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- Base query. Select ALL employees
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">SELECT</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">SALARY</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="mi">1</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGEMENT_DISTANCE</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">UNION</span><span class="w"> </span><span class="k">ALL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- Recursive query. Pair each employee with the next manager up the chain
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">mgr</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">SALARY</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">MANAGEMENT_DISTANCE</span><span class="o">+</span><span class="mi">1</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGEMENT_DISTANCE</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">mgr</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">INNER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">company_explode</span><span class="w"> </span><span class="n">com</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">ON</span><span class="w"> </span><span class="n">mgr</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">com</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">MANAGER_ID</span><span class="p">,</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">SALARY</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">TOTAL_SALARY_MANAGED</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">company_explode</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>Recursive CTE result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">MANAGER_ID</th>
          <th style="text-align: right">TOTAL_SALARY_MANAGED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">null</td>
          <td style="text-align: right">1037000</td>
      </tr>
      <tr>
          <td style="text-align: right">1</td>
          <td style="text-align: right">442000</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td style="text-align: right">465000</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td style="text-align: right">178000</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td style="text-align: right">185000</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td style="text-align: right">169000</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td style="text-align: right">175000</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td style="text-align: right">54000</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td style="text-align: right">50000</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td style="text-align: right">54000</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td style="text-align: right">56000</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td style="text-align: right">54000</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>The idea behind this solution is simple: first, explode the hierarchy into a flattened table of all direct and indirect manager-employee pairs; then compute the sum with a simple <code>GROUP BY</code> clause.</p>
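<p>Here is a sketch of the explode-then-aggregate idea, again with Python&rsquo;s <code>sqlite3</code> and an illustrative three-employee chain (Adam manages David, David manages Ben); the table and salary figures are made up for the demo, not the article&rsquo;s full dataset.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEES (EMPLOYEE_ID INT, NAME TEXT, SALARY INT, MANAGER_ID INT)")
conn.executemany("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?)",
                 [(1, 'Adam', 50000, None), (3, 'David', 40000, 1), (7, 'Ben', 30000, 3)])
totals = conn.execute("""
    WITH RECURSIVE company_explode(MANAGER_ID, EMPLOYEE_ID, SALARY) AS (
        -- every (direct manager, employee) pair
        SELECT MANAGER_ID, EMPLOYEE_ID, SALARY FROM EMPLOYEES
        UNION ALL
        -- re-attach each employee to the manager one level higher
        SELECT mgr.MANAGER_ID, com.EMPLOYEE_ID, com.SALARY
        FROM EMPLOYEES mgr
        INNER JOIN company_explode com ON mgr.EMPLOYEE_ID = com.MANAGER_ID
    )
    SELECT MANAGER_ID, SUM(SALARY) AS TOTAL_SALARY_MANAGED
    FROM company_explode
    WHERE MANAGER_ID IS NOT NULL
    GROUP BY MANAGER_ID
    ORDER BY MANAGER_ID
""").fetchall()
print(totals)  # [(1, 70000), (3, 30000)]
```

<p>After the explode step, Adam (id 1) is paired with both David and Ben, so his total is 40000 + 30000 = 70000, while David (id 3) manages only Ben&rsquo;s 30000.</p>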
<h3 id="start-with--connect-by-">START WITH &hellip; CONNECT BY &hellip;</h3>
<p>The <code>CONNECT BY</code> clause achieves functionality similar to a recursive CTE, but with a much shorter syntax. Let&rsquo;s take a look at it in action.</p>
<h4 id="problem-1-1">Problem 1</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">NAME</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="n">mgr</span><span class="p">.</span><span class="n">NAME</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">TRIM</span><span class="p">(</span><span class="k">LEADING</span><span class="w"> </span><span class="s1">&#39;/&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">SYS_CONNECT_BY_PATH</span><span class="p">(</span><span class="n">mgr</span><span class="p">.</span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;/&#39;</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER_PATH</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">LEVEL</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">EMPLOYEE_LEVEL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">mgr</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">ON</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mgr</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">START</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">CONNECT</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">PRIOR</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>CONNECT BY query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">EMPLOYEE_ID</th>
          <th>NAME</th>
          <th>MANAGER</th>
          <th>MANAGER_PATH</th>
          <th style="text-align: right">EMPLOYEE_LEVEL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">1</td>
          <td>Adam</td>
          <td>null</td>
          <td>null</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td>David</td>
          <td>Adam</td>
          <td>Adam</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td>Ben</td>
          <td>David</td>
          <td>Adam/David</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">15</td>
          <td>Ryan</td>
          <td>Ben</td>
          <td>Adam/David/Ben</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">23</td>
          <td>Ethan</td>
          <td>Ben</td>
          <td>Adam/David/Ben</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td>Alex</td>
          <td>David</td>
          <td>Adam/David</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">19</td>
          <td>Liam</td>
          <td>Alex</td>
          <td>Adam/David/Alex</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">27</td>
          <td>Mason</td>
          <td>Alex</td>
          <td>Adam/David/Alex</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td>Michael</td>
          <td>Adam</td>
          <td>Adam</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td>Charles</td>
          <td>Michael</td>
          <td>Adam/Michael</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">17</td>
          <td>Noah</td>
          <td>Charles</td>
          <td>Adam/Michael/Charles</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">25</td>
          <td>Lucas</td>
          <td>Charles</td>
          <td>Adam/Michael/Charles</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td>Daniel</td>
          <td>Michael</td>
          <td>Adam/Michael</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">21</td>
          <td>William</td>
          <td>Daniel</td>
          <td>Adam/Michael/Daniel</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">29</td>
          <td>Logan</td>
          <td>Daniel</td>
          <td>Adam/Michael/Daniel</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td>Sarah</td>
          <td>null</td>
          <td>null</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td>Emily</td>
          <td>Sarah</td>
          <td>Sarah</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>Olivia</td>
          <td>Emily</td>
          <td>Sarah/Emily</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">16</td>
          <td>Chloe</td>
          <td>Olivia</td>
          <td>Sarah/Emily/Olivia</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">24</td>
          <td>Ava</td>
          <td>Olivia</td>
          <td>Sarah/Emily/Olivia</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td>Maya</td>
          <td>Emily</td>
          <td>Sarah/Emily</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">20</td>
          <td>Evelyn</td>
          <td>Maya</td>
          <td>Sarah/Emily/Maya</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">28</td>
          <td>Harper</td>
          <td>Maya</td>
          <td>Sarah/Emily/Maya</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td>Jessica</td>
          <td>Sarah</td>
          <td>Sarah</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td>Sophia</td>
          <td>Jessica</td>
          <td>Sarah/Jessica</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">18</td>
          <td>Mia</td>
          <td>Sophia</td>
          <td>Sarah/Jessica/Sophia</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">26</td>
          <td>Amelia</td>
          <td>Sophia</td>
          <td>Sarah/Jessica/Sophia</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td>Isabella</td>
          <td>Jessica</td>
          <td>Sarah/Jessica</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">22</td>
          <td>Charlotte</td>
          <td>Isabella</td>
          <td>Sarah/Jessica/Isabella</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">30</td>
          <td>Sofia</td>
          <td>Isabella</td>
          <td>Sarah/Jessica/Isabella</td>
          <td style="text-align: right">4</td>
      </tr>
  </tbody>
</table>

</details></p>

<h4 id="problem-2-1">Problem 2</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">SUM</span><span class="p">(</span><span class="n">CONNECT_BY_ROOT</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">SALARY</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">TOTAL_SALARY_MANAGED</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">CONNECT</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">PRIOR</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>CONNECT BY query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">MANAGER_ID</th>
          <th style="text-align: right">TOTAL_SALARY_MANAGED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">null</td>
          <td style="text-align: right">1037000</td>
      </tr>
      <tr>
          <td style="text-align: right">1</td>
          <td style="text-align: right">442000</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td style="text-align: right">465000</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td style="text-align: right">178000</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td style="text-align: right">185000</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td style="text-align: right">169000</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td style="text-align: right">175000</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td style="text-align: right">54000</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td style="text-align: right">50000</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td style="text-align: right">54000</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td style="text-align: right">56000</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td style="text-align: right">54000</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>Like magic, isn&rsquo;t it? The implementation details (what goes on behind the scenes) may differ, but the way we reason about the query is the same as with recursive CTEs. The <code>START WITH</code> clause provides the initial filter. The query is first executed with this filter, just like the anchor member in a recursive CTE. With no <code>START WITH</code> clause, no filter is applied and every row is included in the first step. Then the <code>CONNECT BY</code> clause specifies how to connect between steps/levels in the hierarchical structure, just like recursive CTEs refer to themselves. Note that the <code>PRIOR</code> keyword means the value that follows it comes from the previous recursive step.</p>
<p>One of the main differences between the two approaches is how we select data. With <code>CONNECT BY</code>, we have to rely on built-in functions and operators to select the data we want. In the examples, we use the <code>SYS_CONNECT_BY_PATH</code> function to construct the path and the <code>CONNECT_BY_ROOT</code> operator to access the data in the first step.</p>
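<p>To make the comparison concrete, here is a sketch (untested, reusing the <code>EMPLOYEES</code> table from the examples above; drop the <code>RECURSIVE</code> keyword on Oracle) of how the Problem 2 query could be expressed as a recursive CTE. Note how <code>CONNECT_BY_ROOT emp.SALARY</code> becomes an ordinary column that we carry along from the anchor member:</p>

```sql
-- Anchor member: every employee starts a path carrying their own salary.
-- Recursive member: walk up to the manager, keeping the root salary.
WITH RECURSIVE salary_chain (MANAGER_ID, ROOT_SALARY) AS (
    SELECT MANAGER_ID, SALARY
    FROM EMPLOYEES
    UNION ALL
    SELECT mgr.MANAGER_ID, c.ROOT_SALARY
    FROM salary_chain c
    JOIN EMPLOYEES mgr
        ON c.MANAGER_ID = mgr.EMPLOYEE_ID
)
SELECT MANAGER_ID, SUM(ROOT_SALARY) AS TOTAL_SALARY_MANAGED
FROM salary_chain
GROUP BY MANAGER_ID;
```

<p>Each employee&rsquo;s salary is attributed once to every manager above them in the chain, which is exactly what <code>SUM(CONNECT_BY_ROOT emp.SALARY)</code> grouped by <code>MANAGER_ID</code> computes.</p>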
<h2 id="my-final-thoughts">My final thoughts</h2>
<p>Querying hierarchical data presents unique challenges that require specialized techniques to extract meaningful information. Recursive CTEs and the CONNECT BY clause offer powerful solutions for navigating and analyzing hierarchical data in SQL. One interesting fact is that CONNECT BY was actually around before Recursive CTEs.</p>
<p>While both techniques solve the same problem, you can only use one in a given query. Which one should you use? Well, it depends on you. If you hate subqueries and CTEs, and you like cool short magic queries, use CONNECT BY. However, being less verbose makes CONNECT BY harder to reason about: it&rsquo;s harder to write out that magic stuff, and harder to figure out why it doesn&rsquo;t work. Also, because you write the recursion yourself, recursive CTEs give you more control and flexibility. And note that not all RDBMSs (SQL Server, for example) support the CONNECT BY clause, even though it has been around for a long time.</p>
<p><em>* You can find the execution of the SQL in this post at <a href="https://dbfiddle.uk/hXEzBJkX" rel="nofollow">https://dbfiddle.uk/hXEzBJkX</a></em></p>
]]></content:encoded></item><item><title>What is a reliable Data System?</title><link>https://note.datengineer.dev/posts/what-is-a-reliable-data-system/</link><pubDate>Fri, 16 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/what-is-a-reliable-data-system/</guid><description>Learn the concepts of reliability, and how to define a reliable data system</description><content:encoded><![CDATA[<p>In today&rsquo;s data-driven world, information is gold, and the systems that store and manage it serve as crucial infrastructure. I have seen people talk a lot about terms like &ldquo;distributed computing&rdquo;, &ldquo;scalability&rdquo;&hellip; but one fundamental characteristic is often overlooked: reliability. Without it, scalability, maintainability, flexibility, anything-bility are meaningless, like a beautiful castle built on sand.</p>
<h2 id="what-is-reliability">What is Reliability?</h2>
<p>Everyone has their own intuition about what is reliable:</p>
<ul>
<li>A piggy bank is reliable because it consistently holds your money and accurately reflects what you&rsquo;ve deposited. You trust that when you put a coin in, it will be there later, and the total will reflect your savings. And when you want to make a withdrawal, you can get your money immediately.</li>
<li>A calculator is reliable because it consistently produces accurate results based on your input. You trust that regardless of who uses it, 2 + 2 will always equal 4. And the result should appear instantly on the screen.</li>
</ul>
<p>Different systems have different reliability requirements. In general, we can define reliability as follows:</p>
<blockquote>
<p><em>Reliability refers to the ability to always do the expected things in the expected way.</em></p></blockquote>
<p>For software, reliability means consistently performing the designed function at the expected level of performance. Consider a calculator: we expect it to <strong>immediately</strong> display <code>4</code> after typing in <code>2+2</code>. If it shows me <code>5</code>, I will give it 1 star and never use it again. If it takes me 5 minutes to do such a simple arithmetic addition, I will send an email to the United Nations to report it as crypto-mining malware. (actually I won&rsquo;t)</p>
<p>Wait a minute! There is one more important word in my definition above: &ldquo;always&rdquo;. What do I mean by &ldquo;always&rdquo;? A piggy bank wouldn&rsquo;t be very reliable if it held my money and suddenly became inaccessible for a week. Of course, there is no perfect &ldquo;always&rdquo; in the real world. There may be unforeseen situations that cause systems to stop working. But systems should be designed in such a way that the disruption doesn&rsquo;t hurt business operations. Reliability focuses on minimizing the occurrence of system failures and their impact on functionality.</p>
<h2 id="reliable-data-system">Reliable Data system</h2>
<p>Just like you trust your piggy bank to hold your coins securely, you need to trust your data systems to hold your information reliably. Your piggy bank wouldn&rsquo;t be very reliable if the coins sometimes disappeared, and a data system wouldn&rsquo;t be reliable if the information kept changing or disappearing. Reliability means you can trust the information it holds. This means the data is always available, accurate, and delivers consistent results when you need it. Common expectations for a data system:</p>
<ul>
<li>Integrity: This ensures the data is accurate, complete, and consistent. Imagine your piggy bank if someone took coins without putting them back, or if different amounts appeared out of nowhere. It wouldn&rsquo;t be reliable! Similarly, data integrity prevents missing, incorrect, or inconsistent information, thereby ensuring its reliability.</li>
<li>Availability: You wouldn&rsquo;t find your piggy bank locked when you need it most. Likewise, reliable data systems must be accessible when you need them. This means the data is readily available for authorized users, minimizing downtime and ensuring critical information is always at hand.</li>
<li>Performance: A sluggish piggy bank wouldn&rsquo;t be very useful. Similar to how you expect quick access to your coins, data systems should deliver reasonable performance. This translates to fast retrieval times, smooth operation, and responsiveness to your needs, enabling efficient decision-making.</li>
<li>Timeliness: Data freshness is crucial. Old coins are worth the same, but old data is not. In data systems, timeliness ensures that information is current and up to date. This reduces reliance on outdated data, resulting in more accurate insights and informed actions.</li>
<li>Safety: Just like keeping your piggy bank safe from theft, protecting your data is critical. Data safety ensures that information is protected from unauthorized access. If someone you don&rsquo;t trust knows where you keep your piggy bank, you won&rsquo;t put any coins in it.</li>
</ul>
<p><img alt="Reliable data systems hold your information securely" loading="lazy" src="/posts/what-is-a-reliable-data-system/images/reliable-piggy-bank-reliable-data-system.jpg"></p>
<h2 id="how-important-is-reliability">How important is Reliability</h2>
<p>Reliability is not limited to life-or-death situations such as nuclear power plants. It is fundamental to all software applications, large and small. Sure, bugs in a note taking app may not have catastrophic consequences, but they do cause frustration and erode user trust. Let&rsquo;s shift our focus from &ldquo;avoiding disaster&rdquo; to &ldquo;delivering value&rdquo;. Every software application has a purpose, whether it&rsquo;s to simplify tasks, improve communication, or entertain users. When an application crashes, malfunctions, or produces incorrect results, it fails to fulfill its purpose. Every software application has a responsibility to its users. Frustrated users abandon unreliable applications, businesses lose productivity, and trust erodes. Investing in reliability is about more than avoiding the negative consequences of failure. It&rsquo;s about building trust, delivering value, and ensuring that your software does what it&rsquo;s supposed to do.</p>
]]></content:encoded></item><item><title>PySpark UDFs: A comprehensive guide to unlock PySpark potential</title><link>https://note.datengineer.dev/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/</link><pubDate>Fri, 09 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/</guid><description>Discover the capabilities of User-Defined Functions (UDFs) in Apache Spark, allowing you to extend PySpark&amp;#39;s functionality and solve complex data processing tasks.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Apache Spark is a powerful open source distributed computing engine designed to handle large datasets across clusters. PySpark is the Python API for Spark, allowing data engineers and data scientists to easily use the framework in their preferred language.</p>
<p>This post is a continuation of the <a href="../a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/">previous tutorial</a>. It began as a Jupyter notebook I created while learning PySpark; I recently found it and decided to update it and publish it on my blog.</p>
<p>UDFs (user-defined functions) are an integral part of PySpark, allowing users to extend the capabilities of Spark by creating their own custom functions. This article will provide a comprehensive guide to PySpark UDFs with examples.</p>
<h2 id="understanding-pyspark-udfs">Understanding PySpark UDFs</h2>
<p>PySpark UDFs are user-defined functions written in Python code. We create functions in Python and register them with Spark as UDFs. They enable the execution of complicated custom logic on Spark DataFrames and SQL expressions.</p>
<p>However, note that UDFs are expensive. We should always prefer built-in functions whenever possible. PySpark comes with a number of predefined common functions, and many more new functions are added with each new release.</p>
<p>In summary, with PySpark UDFs, what goes in is a regular Python function, and what goes out is a function that works on the PySpark engine.</p>
<h2 id="creating-an-udf">Creating a UDF</h2>
<p>All of the following examples are a continuation of the <a href="../a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/">previous article</a>. You can find an executable notebook containing both articles <a href="https://gist.github.com/ThaiDat/81c3662801aa8410a65b94f3c993c377">here</a>.</p>
<p>Below is an example of a &ldquo;complicated&rdquo; decision tree function that classifies transactions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PySpark UDFs example</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">classify_tier</span><span class="p">(</span><span class="n">amount</span><span class="p">:</span><span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">10000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span>
</span></span></code></pre></div><p>It is a regular Python function that receives a <code>float</code> and returns an <code>int</code>. We have to make it a PySpark UDF before actually using it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">types</span> <span class="k">as</span> <span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># pyspark.sql.functions provides a udf() function to promote a regular function to be UDF.</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The function takes two parameters: the function you want to promote, and the return type of the generated UDF</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The function returns a UDF</span>
</span></span><span class="line"><span class="cl"><span class="n">classifyTier</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">udf</span><span class="p">(</span><span class="n">classify_tier</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span></code></pre></div><p>Then we can use it like any other PySpark function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="n">classifyTier</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;tier&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;tier&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+----+
|   nameOrig|tier|
+-----------+----+
|C1495608502|   4|
|C1321115948|   4|
| C476579021|   4|
|C1520267010|   4|
| C106297322|   4|
|C1464177809|   4|
| C355885103|   4|
|C1057507014|   4|
|C1419332030|   4|
|C2007599722|   4|
+-----------+----+
</code></pre><p>The <code>pyspark.sql.functions.udf()</code> function can also be used as a decorator which produce the same result.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># pyspark udf decorator example</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Note that classifyTier is a UDF, not a regular function anymore.</span>
</span></span><span class="line"><span class="cl"><span class="nd">@F.udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">classifyTier</span><span class="p">(</span><span class="n">amount</span><span class="p">:</span><span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">10000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span>
</span></span></code></pre></div><p>To use a UDF in a Spark SQL expression, we need to register it first.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Register the regular Python function with spark.udf.register</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">udf</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="s1">&#39;classifyTier&#39;</span><span class="p">,</span> <span class="n">classify_tier</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig, classifyTier(amount) tier
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY tier DESC 
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+----+
|   nameOrig|tier|
+-----------+----+
| C263860433|   4|
| C306269750|   4|
|C1611915976|   4|
|C1387188921|   4|
| C300262358|   4|
| C389879985|   4|
|C1907016309|   4|
|C1046638041|   4|
|C1543404166|   4|
|C1155108056|   4|
+-----------+----+
</code></pre><p>Simple enough? Write a Python function, make it a UDF, use it. But that is not the most interesting part.</p>
<h2 id="pandas-udf">Pandas UDF</h2>
<p>With Python UDFs, PySpark will unpack each value, perform the calculation, and then return the value for each record. A Pandas UDF is a user-defined function that works with data using Pandas for manipulation and Apache Arrow for data transfer. It is also called a vectorized UDF. Compared to row-at-a-time Python UDFs, pandas UDFs enable vectorized operations that can improve performance by up to 100x.</p>
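<p>The difference can be illustrated with plain pandas (a standalone sketch, not Spark itself; the amounts and the 500 threshold are invented for this example). A row-at-a-time UDF makes one Python-level call per value, while a vectorized operation processes the whole batch at once:</p>

```python
import pandas as pd

# Invented sample data for illustration
amounts = pd.Series([100.0, 5000.0, 50000.0, 500000.0, 2000000.0])

# Row-at-a-time, like a plain Python UDF: one Python-level call per value
row_at_a_time = amounts.apply(lambda a: int(a >= 500))

# Vectorized, like a pandas UDF: a single operation over the whole batch
vectorized = (amounts >= 500).astype(int)

# Both produce the same answer; the vectorized form avoids per-row overhead
print(vectorized.tolist())  # [0, 1, 1, 1, 1]
```

<p>The results are identical; only the execution model differs, and that per-row overhead is where the claimed speedups come from.</p>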
<h3 id="series-to-series-udf">Series to Series UDF</h3>
<p>These UDFs operate on Pandas Series and return a Pandas Series as output. When Spark runs a Pandas UDF, it divides the columns into batches, calls the function on a subset of the data for each batch, and then concatenates the output. It is preferable to use a Pandas Series-to-Series UDF if possible, instead of using a regular Python UDF. We use <code>pyspark.sql.functions.pandas_udf</code> to create a Pandas UDF.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># You can also promote the function to PySpark Pandas UDF as getUserType = F.pandas_udf(get_user_type, T.StringType())</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Each User ID starts with a letter representing its type</span>
</span></span><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getUserType</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span></code></pre></div><p>The only difference in syntax is that the Python function now takes a <code>pandas.Series</code> and returns a <code>pandas.Series</code>. We can then use it as a Spark function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">getUserType</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">nameDest</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;userTypeDest&#39;</span><span class="p">),</span> <span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;userTypeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;n&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+------------+------------------+-------+
|userTypeDest|         avgAmount|      n|
+------------+------------------+-------+
|           C| 265083.4571810173|4211125|
|           M|13057.604660187604|2151495|
+------------+------------------+-------+
</code></pre><h3 id="iterator-of-series-to-iterator-of-series">Iterator of Series to Iterator of Series</h3>
<p>Due to the distributed nature of Spark, the entire series is not fed into the UDF at once; instead, each executor calls the UDF on its own batches of data, and Spark then combines the results. PySpark Iterator of Series to Iterator of Series UDFs are very useful when you have a time-consuming cold-start operation (e.g. initializing a machine learning model, checking server statuses, &hellip;) that needs to run once at the beginning of the processing step rather than once per batch.</p>
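The cold-start pattern can be sketched in plain pandas before wiring it into Spark. Here `load_model` is a hypothetical stand-in for an expensive initialization, and the commented line only suggests how the function would be promoted to a pandas UDF.

```python
from typing import Iterator

import pandas as pd


def load_model() -> dict:
    # Hypothetical stand-in for an expensive cold start
    # (loading an ML model, opening a connection, ...).
    return {'threshold': 200_000.0}


def flag_large(amounts: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()  # runs once per executor process, not once per batch
    for batch in amounts:
        yield batch > model['threshold']


# In Spark (sketch): flagLarge = F.pandas_udf(flag_large, T.BooleanType())
```

Because the initialization sits outside the `for` loop, its cost is paid once no matter how many batches the iterator yields.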
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">sleep</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Iterator</span><span class="p">,</span> <span class="n">Tuple</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getNameIdLength</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Heavy task</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># sleep(5)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1"># name is an Iterator</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># name_batch is a pd.Series</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">name_batch</span> <span class="ow">in</span> <span class="n">name</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">name_len</span> <span class="o">=</span> <span class="n">name_batch</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">name_len</span><span class="p">[</span><span class="o">~</span><span class="n">name_batch</span><span class="o">.</span><span class="n">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">isnumeric</span><span class="p">()]</span> <span class="o">-=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># yield because we return an iterator</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield</span> <span class="n">name_len</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">getNameIdLength</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">nameOrig</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;idLen&#39;</span><span class="p">),</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;idLen&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----+------------------+
|idLen|         avgAmount|
+-----+------------------+
|    4|155070.73742857145|
|    7|177477.50726081585|
|   10| 179702.4408980949|
|    9|179898.05510125632|
|    8| 181572.2097899971|
|    6|197756.81529433408|
|    5|199594.79368029739|
+-----+------------------+
</code></pre><h3 id="iterator-of-multiple-series-to-iterator-of-series-udf">Iterator of multiple Series to Iterator of Series UDF</h3>
<p>The Iterator of Multiple Series to Iterator of Series UDF has the same characteristics as the Iterator of Series to Iterator of Series UDF. The difference is that the underlying Python function receives an iterator over tuples of Pandas Series.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">amount_mismatch</span><span class="p">(</span><span class="n">values</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">Tuple</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Heavy task</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">oldOrig</span><span class="p">,</span> <span class="n">newOrig</span><span class="p">,</span> <span class="n">oldDest</span><span class="p">,</span> <span class="n">newDest</span> <span class="ow">in</span> <span class="n">values</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield</span> <span class="nb">abs</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">newOrig</span> <span class="o">-</span> <span class="n">oldOrig</span><span class="p">)</span> <span class="o">-</span> <span class="nb">abs</span><span class="p">(</span><span class="n">newDest</span> <span class="o">-</span> <span class="n">oldDest</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Create a UDF. You can also use the decorator syntax.</span>
</span></span><span class="line"><span class="cl"><span class="n">amountMismatch</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pandas_udf</span><span class="p">(</span><span class="n">amount_mismatch</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">amountMismatch</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">oldBalanceOrig</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">newBalanceOrig</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">oldBalanceDest</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">newBalanceDest</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;mismatch&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;mismatch&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgMismatch&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgMismatch&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|       avgMismatch|
+--------+------------------+
|TRANSFER| 968056.4538892006|
|CASH_OUT|170539.39652580014|
| CASH_IN| 50038.95466155722|
|   DEBIT| 25567.53969902471|
| PAYMENT| 6378.936662041953|
+--------+------------------+
</code></pre><h3 id="group-aggregate-udf">Group aggregate UDF</h3>
<p>Group aggregate UDF, also known as the Series to Scalar UDF, reduces the input <code>pandas.Series</code> into a single value.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getStdDeviation</span><span class="p">(</span><span class="n">series</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Use built-in pandas.Series.std</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">series</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">getStdDeviation</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;var&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;var&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|               var|
+--------+------------------+
|TRANSFER|1879573.5289080725|
|CASH_OUT|175329.74448347004|
| CASH_IN|126508.25527180695|
|   DEBIT|13318.535518284714|
| PAYMENT|12556.450185716356|
+--------+------------------+
</code></pre><h3 id="group-map-udf">Group map UDF</h3>
<p>As with the Group Aggregate UDF, we use <code>groupBy()</code> to divide a Spark <code>DataFrame</code> into groups. The Group Map UDF maps over each group, producing a Pandas <code>DataFrame</code>, and the per-group results are then combined back into a single Spark <code>DataFrame</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">normalize_by_type</span><span class="p">(</span><span class="n">data</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">data</span><span class="p">[[</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">maxVal</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">minVal</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">maxVal</span> <span class="o">==</span> <span class="n">minVal</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.5</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span> <span class="o">-</span> <span class="n">minVal</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">maxVal</span> <span class="o">-</span> <span class="n">minVal</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">result</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># We can use the SQL string-based schema like below comment</span>
</span></span><span class="line"><span class="cl"><span class="c1"># schema = &#39;type string, amount double, amountNorm double&#39;</span>
</span></span><span class="line"><span class="cl"><span class="n">schema</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">StructType</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">applyInPandas</span><span class="p">(</span><span class="n">normalize_by_type</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+--------------------+
|    type|   amount|          amountNorm|
+--------+---------+--------------------+
|TRANSFER|    181.0|1.929785364412691...|
|TRANSFER| 215310.3| 0.00232902269229461|
|TRANSFER|311685.89|0.003371535041334062|
|TRANSFER|  62610.8|6.772443276469881E-4|
|TRANSFER| 42712.39|4.619995945019032E-4|
|TRANSFER| 77957.68|8.432543299642404E-4|
|TRANSFER| 17231.46|1.863677235062513...|
|TRANSFER| 78766.03|8.519983994671721E-4|
|TRANSFER|224606.64|0.002429582898990...|
|TRANSFER|125872.53|0.001361558008596...|
+--------+---------+--------------------+
only showing top 10 rows
</code></pre><p>You can see that in the example above, we don&rsquo;t need to explicitly create a UDF. This is because we use the <code>applyInPandas</code> function, introduced in PySpark 3.0.0. It takes a regular Python function and a result schema as parameters. If you want to create a Group Map UDF, you can refer to the following code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># It is preferred to use &#39;applyInPandas&#39; over this API (in Spark 3). </span>
</span></span><span class="line"><span class="cl"><span class="c1"># This API will be deprecated in the future releases.</span>
</span></span><span class="line"><span class="cl"><span class="c1"># As it will be deprecated soon, type hint inference is not supported. So, we have to specify PandasUDFType explicitly</span>
</span></span><span class="line"><span class="cl"><span class="n">NormalizeByType</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pandas_udf</span><span class="p">(</span><span class="n">normalize_by_type</span><span class="p">,</span> <span class="n">schema</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">PandasUDFType</span><span class="o">.</span><span class="n">GROUPED_MAP</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">NormalizeByType</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+--------------------+
|    type|   amount|          amountNorm|
+--------+---------+--------------------+
|TRANSFER|    181.0|1.929785364412691...|
|TRANSFER| 215310.3| 0.00232902269229461|
|TRANSFER|311685.89|0.003371535041334062|
|TRANSFER|  62610.8|6.772443276469881E-4|
|TRANSFER| 42712.39|4.619995945019032E-4|
|TRANSFER| 77957.68|8.432543299642404E-4|
|TRANSFER| 17231.46|1.863677235062513...|
|TRANSFER| 78766.03|8.519983994671721E-4|
|TRANSFER|224606.64|0.002429582898990...|
|TRANSFER|125872.53|0.001361558008596...|
+--------+---------+--------------------+
only showing top 10 rows
</code></pre><p>When executing a Group Map UDF, Spark will:</p>
<ul>
<li>Split the data into groups using <code>groupBy</code>.</li>
<li>Apply the function to each group.</li>
<li>Combine the results in a new PySpark <code>DataFrame</code>.</li>
</ul>
<p><img alt="Python Spark User Defined Function Group Map UDF workflow" loading="lazy" src="/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/images/pyspark-udf-spark-python-group-map-udf-workflow.png"></p>
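The same split-apply-combine flow can be mimicked locally in plain pandas, which is a handy way to unit-test Group Map logic before running it on a cluster. The data below is made up for illustration, and `normalize` mirrors the min-max logic of `normalize_by_type` above.

```python
import pandas as pd

# Toy data standing in for the transactions DataFrame
data = pd.DataFrame({
    'type': ['TRANSFER', 'TRANSFER', 'PAYMENT', 'PAYMENT'],
    'amount': [100.0, 300.0, 50.0, 50.0],
})


def normalize(group: pd.DataFrame) -> pd.DataFrame:
    # Min-max normalize amounts within one group
    out = group.copy()
    lo, hi = out['amount'].min(), out['amount'].max()
    out['amountNorm'] = 0.5 if lo == hi else (out['amount'] - lo) / (hi - lo)
    return out


# Split by key, apply the function per group, combine back into one DataFrame
combined = pd.concat(normalize(g) for _, g in data.groupby('type'))
```

The constant-valued PAYMENT group exercises the `lo == hi` branch, while the TRANSFER group spans the full 0-to-1 range.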
<h2 id="conclusion">Conclusion</h2>
<p>In summary, PySpark UDFs are an effective way to bring the power and flexibility of Python to Spark workloads. When used properly, they can help extend Spark&rsquo;s capabilities to solve complex data engineering challenges. Together with the previous tutorial, you can now cover most data manipulation and analysis tasks. Happy coding!</p>
]]></content:encoded></item><item><title>A Practical PySpark tutorial for beginners in Jupyter Notebook</title><link>https://note.datengineer.dev/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/</guid><description>A hands-on PySpark cheat sheet</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In today&rsquo;s world of data, the ability to efficiently process and analyze large amounts of data is crucial for businesses and organizations. This is where PySpark comes in: an open-source, distributed computing framework built on top of Apache Spark. With its seamless integration with Python, PySpark allows users to leverage the powerful data processing capabilities of Spark directly from Python scripts.</p>
<p>This post was originally a Jupyter Notebook I created when I started learning PySpark, intended as a personal cheat sheet. Once I started this blog (a place for my notes), I decided to update it and share it here as a complete hands-on tutorial for beginners.</p>
<p>If you are new to PySpark, this tutorial is for you. We will cover the most practical parts of PySpark&rsquo;s basic syntax. By the end of this tutorial, you will have a solid understanding of PySpark and be able to use Spark in Python to perform a wide range of data processing tasks.</p>
<h2 id="spark-vs-pyspark">Spark vs PySpark</h2>
<p>What is PySpark? How is it different from Apache Spark? Before looking at PySpark, it&rsquo;s essential to understand the relationship between Spark and PySpark.</p>
<p>Apache Spark is an open-source distributed computing system. It provides an interface for programming clusters with implicit data parallelism and fault tolerance. Spark offers APIs for several programming languages, including Python, Java, Scala, and R, making data processing tasks accessible to a wide range of developers.</p>
<p>PySpark, on the other hand, is the library that builds on these APIs to bring Spark to Python. It allows developers to use Python, the most popular programming language in the data community, to leverage the power of Spark without switching to another language. PySpark also integrates seamlessly with other Python libraries.</p>
<p>In short, Spark is the overarching framework, and PySpark serves as its Python API, providing a convenient bridge for Python enthusiasts to leverage Spark&rsquo;s capabilities.</p>
<p><img alt="Apache Spark vs Python PySpark different" loading="lazy" src="/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/images/apache-spark-python-pyspark-difference-architecture.png"></p>
<h2 id="lets-get-started">Let&rsquo;s get started</h2>
<p>From this point on, you will see Python code driving Spark. This hands-on tutorial will guide you through basic PySpark operations such as querying, filtering, merging, and grouping data. You can find an executable notebook on my <a href="https://gist.github.com/ThaiDat/81c3662801aa8410a65b94f3c993c377">GitHub</a>.</p>
<h3 id="installation">Installation</h3>
<p>There are several ways to install PySpark. The easiest way for Python users is to use <code>pip</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install pyspark
</span></span></code></pre></div><h3 id="sparksession">SparkSession</h3>
<p><code>SparkSession</code> is the entry point for working with Apache Spark. It provides a unified interface for interacting with Spark functionality, allowing you to create DataFrames, execute SQL queries, and manage Spark configurations. Think of it as the gateway to all Spark operations in your application.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Get the existing SparkSession or create a new one</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s1">&#39;Spark Demo&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">master</span><span class="p">(</span><span class="s1">&#39;local[*]&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span>
</span></span></code></pre></div><pre tabindex="0"><code>SparkSession - in-memory

SparkContext

Spark UI

Version    v3.2.1
Master     local[*]
AppName    Spark Demo
</code></pre><h3 id="load-data">Load data</h3>
<p>PySpark can load data from various types of data storage. In this tutorial we will use the <a href="https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data/data">Fraudulent Transactions Dataset</a>. This dataset provides a CSV file that is sufficient for demo purposes.</p>
<p>The SparkSession object provides <code>read</code> as a property that returns a <code>DataFrameReader</code>, which can be used to read data into a <code>DataFrame</code>. The following code reads a CSV file into a DataFrame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Load CSV file to DataFrame</span>
</span></span><span class="line"><span class="cl"><span class="n">data_path</span> <span class="o">=</span> <span class="s1">&#39;../input/fraudulent-transactions-data/Fraud.csv&#39;</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">inferSchema</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><p>The <code>inferSchema</code> parameter allows Spark to automatically infer the data type of each column based on the actual data in the file. This involves reading a sample of the data, which can be computationally expensive. The inferred types can also be incorrect, especially if the sample doesn&rsquo;t represent the entire dataset well.</p>
<p>Alternatively, to achieve better performance and ensure accurate data types, you can define the schema explicitly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">types</span> <span class="k">as</span> <span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Read CSV with pre-defined schema</span>
</span></span><span class="line"><span class="cl"><span class="n">predefined_schema</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">StructType</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;step&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;oldbalanceOrg&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;newbalanceOrig&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span> 
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;oldbalanceDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;newbalanceDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span> 
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;isFraud&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;isFlaggedFraud&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">predefined_schema</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><p>The dataset contains some inconsistently formatted column names. I will rename them all to camel case.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Rename columns</span>
</span></span><span class="line"><span class="cl"><span class="n">corrected_cols</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;oldbalanceOrg&#39;</span><span class="p">:</span> <span class="s1">&#39;oldBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;newbalanceOrig&#39;</span><span class="p">:</span> <span class="s1">&#39;newBalanceOrig&#39;</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">                  <span class="s1">&#39;oldbalanceDest&#39;</span><span class="p">:</span> <span class="s1">&#39;oldBalanceDest&#39;</span><span class="p">,</span> <span class="s1">&#39;newbalanceDest&#39;</span><span class="p">:</span> <span class="s1">&#39;newBalanceDest&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">old_col</span><span class="p">,</span> <span class="n">new_col</span> <span class="ow">in</span> <span class="n">corrected_cols</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">withColumnRenamed</span><span class="p">(</span><span class="n">old_col</span><span class="p">,</span> <span class="n">new_col</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldBalanceOrig: double (nullable = true)
 |-- newBalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldBalanceDest: double (nullable = true)
 |-- newBalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><h3 id="data-overview">Data Overview</h3>
<p>You can quickly look at the data with <code>DataFrame.show</code>, which prints the first n rows to the screen.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Prints top 10 rows of PySpark DataFrame to the screen</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|    type|  amount|   nameOrig|oldBalanceOrig|newBalanceOrig|   nameDest|oldBalanceDest|newBalanceDest|isFraud|isFlaggedFraud|
+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
|   1| PAYMENT| 9839.64|C1231006815|      170136.0|     160296.36|M1979787155|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 1864.28|C1666544295|       21249.0|      19384.72|M2044282225|           0.0|           0.0|      0|             0|
|   1|TRANSFER|   181.0|C1305486145|         181.0|           0.0| C553264065|           0.0|           0.0|      1|             0|
|   1|CASH_OUT|   181.0| C840083671|         181.0|           0.0|  C38997010|       21182.0|           0.0|      1|             0|
|   1| PAYMENT|11668.14|C2048537720|       41554.0|      29885.86|M1230701703|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7817.71|  C90045638|       53860.0|      46042.29| M573487274|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7107.77| C154988899|      183195.0|     176087.23| M408069119|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7861.64|C1912850431|     176087.23|     168225.59| M633326333|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 4024.36|C1265012928|        2671.0|           0.0|M1176932104|           0.0|           0.0|      0|             0|
|   1|   DEBIT| 5337.77| C712410124|       41720.0|      36382.23| C195600860|       41898.0|      40348.79|      0|             0|
+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
only showing top 10 rows
</code></pre><p>In many cases, the result does not fit on the screen and produces unreadable output.</p>
<p><img alt="PySpark load CSV show not fit screen" loading="lazy" src="/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/images/pyspark-load-csv-show-not-fit-screen.png"></p>
<p>This is where Python comes in. With PySpark, you can mix ordinary Python code with the Spark APIs to improve the output. The following function uses a Python loop to split the columns into subsets and display a sample of each.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Split columns into subsets and show it accordingly</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">show_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">split</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">n_cols</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">split</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">split</span> <span class="o">=</span> <span class="n">n_cols</span>
</span></span><span class="line"><span class="cl">    <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">split</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_cols</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="o">*</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">j</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">n_samples</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">i</span> <span class="o">=</span> <span class="n">j</span>
</span></span><span class="line"><span class="cl">        <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">split</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">show_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+----+--------+--------+-----------+
|step|    type|  amount|   nameOrig|
+----+--------+--------+-----------+
|   1| PAYMENT| 9839.64|C1231006815|
|   1| PAYMENT| 1864.28|C1666544295|
|   1|TRANSFER|   181.0|C1305486145|
|   1|CASH_OUT|   181.0| C840083671|
|   1| PAYMENT|11668.14|C2048537720|
|   1| PAYMENT| 7817.71|  C90045638|
|   1| PAYMENT| 7107.77| C154988899|
|   1| PAYMENT| 7861.64|C1912850431|
|   1| PAYMENT| 4024.36|C1265012928|
|   1|   DEBIT| 5337.77| C712410124|
+----+--------+--------+-----------+
only showing top 10 rows

+--------------+--------------+-----------+--------------+
|oldBalanceOrig|newBalanceOrig|   nameDest|oldBalanceDest|
+--------------+--------------+-----------+--------------+
|      170136.0|     160296.36|M1979787155|           0.0|
|       21249.0|      19384.72|M2044282225|           0.0|
|         181.0|           0.0| C553264065|           0.0|
|         181.0|           0.0|  C38997010|       21182.0|
|       41554.0|      29885.86|M1230701703|           0.0|
|       53860.0|      46042.29| M573487274|           0.0|
|      183195.0|     176087.23| M408069119|           0.0|
|     176087.23|     168225.59| M633326333|           0.0|
|        2671.0|           0.0|M1176932104|           0.0|
|       41720.0|      36382.23| C195600860|       41898.0|
+--------------+--------------+-----------+--------------+
only showing top 10 rows

+--------------+-------+--------------+
|newBalanceDest|isFraud|isFlaggedFraud|
+--------------+-------+--------------+
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      1|             0|
|           0.0|      1|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|      40348.79|      0|             0|
+--------------+-------+--------------+
only showing top 10 rows
</code></pre><p>When working with numerical data, it is not very useful to look at a long series of raw values. We are often more interested in a few key statistics, such as count, mean, standard deviation, minimum, and maximum. PySpark&rsquo;s <code>DataFrame</code> provides the <code>describe</code> and <code>summary</code> functions, with slightly different usage, to present these essential metrics.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DataFrame.describe takes columns as params</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">(</span><span class="s1">&#39;step&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-------+------------------+------------------+
|summary|              step|            amount|
+-------+------------------+------------------+
|  count|           6362620|           6362620|
|   mean|243.39724563151657|179861.90354913412|
| stddev|142.33197104912588| 603858.2314629498|
|    min|                 1|               0.0|
|    max|               743|     9.244551664E7|
+-------+------------------+------------------+
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DataFrame.summary takes statistics as params</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;oldBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;newBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;oldBalanceDest&#39;</span><span class="p">,</span> <span class="s1">&#39;newBalanceDest&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s1">&#39;count&#39;</span><span class="p">,</span> <span class="s1">&#39;min&#39;</span><span class="p">,</span> <span class="s1">&#39;max&#39;</span><span class="p">,</span> <span class="s1">&#39;mean&#39;</span><span class="p">,</span> <span class="s1">&#39;50%&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-------+-----------------+-----------------+------------------+------------------+
|summary|   oldBalanceOrig|   newBalanceOrig|    oldBalanceDest|    newBalanceDest|
+-------+-----------------+-----------------+------------------+------------------+
|  count|          6362620|          6362620|           6362620|           6362620|
|    min|              0.0|              0.0|               0.0|               0.0|
|    max|    5.958504037E7|    4.958504037E7|    3.5601588935E8|    3.5617927892E8|
|   mean|833883.1040744719|855113.6685785714|1100701.6665196654|1224996.3982019408|
|    50%|         14211.23|              0.0|         132612.49|         214605.81|
+-------+-----------------+-----------------+------------------+------------------+
</code></pre><h3 id="query-data">Query data</h3>
<h4 id="select-and-filter">Select and Filter</h4>
<p>PySpark borrows a lot of vocabulary from the SQL world, but it offers more flexibility: you do not need to follow the strict SQL clause order (select &hellip; from &hellip; where &hellip;). Each operation returns a <code>DataFrame</code> or <code>GroupedData</code> object that you can continue to work with.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># First .where() filter DataFrame and return another DataFrame</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Then .select() select from the returned DataFrame </span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+
|    type|   amount|
+--------+---------+
|CASH_OUT|    181.0|
|CASH_OUT|229133.94|
|CASH_OUT|110414.71|
|CASH_OUT|  56953.9|
|CASH_OUT|  5346.89|
|CASH_OUT|  23261.3|
|CASH_OUT| 82940.31|
|CASH_OUT| 47458.86|
|CASH_OUT|136872.92|
|CASH_OUT| 94253.33|
+--------+---------+
only showing top 10 rows
</code></pre><p>The above example shows three different ways to reference PySpark columns:</p>
<ul>
<li><code>df.type</code>: Access as an attribute.</li>
<li><code>df['type']</code>: Access as an item.</li>
<li><code>F.col('type')</code>: Explicitly specifies a column reference rather than a string literal.</li>
</ul>
<p>You can also filter on multiple conditions using the <code>&amp;</code>, <code>|</code>, and <code>~</code> operators.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PySpark example filter multiple conditions</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">500</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><p>For users who are more familiar with SQL syntax, Spark also lets you write SQL queries directly. Before doing so, you need to register your <code>DataFrame</code> as a temporary view, which allows you to reference it by name in your queries.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create or replace temp view named &#34;df&#34; from DataFrame df in PySpark</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s1">&#39;df&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Spark SQL query example. You can now reference df in your query</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT type, amount 
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    WHERE type = &#34;CASH_OUT&#34;    
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+
|    type|   amount|
+--------+---------+
|CASH_OUT|    181.0|
|CASH_OUT|229133.94|
|CASH_OUT|110414.71|
|CASH_OUT|  56953.9|
|CASH_OUT|  5346.89|
|CASH_OUT|  23261.3|
|CASH_OUT| 82940.31|
|CASH_OUT| 47458.86|
|CASH_OUT|136872.92|
|CASH_OUT| 94253.33|
+--------+---------+
only showing top 10 rows
</code></pre><h4 id="aggregating-with-groupby">Aggregating with <code>groupBy</code></h4>
<p>PySpark provides a similar syntax to Pandas for aggregating data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Example of PySpark groupBy</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Sometimes we can pass column names directly to PySpark functions</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The `Column.alias` method changes the name of the result column.</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT type, AVG(amount) avgAmount
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    GROUP BY type
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY 2
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|         avgAmount|
+--------+------------------+
|   DEBIT| 5483.665313767128|
| PAYMENT|13057.604660187604|
| CASH_IN| 168920.2420040954|
|CASH_OUT|176273.96434613998|
|TRANSFER| 910647.0096454868|
+--------+------------------+
</code></pre><p>To filter after <code>groupBy</code>, simply apply <code>where</code> or <code>filter</code> to the resulting <code>DataFrame</code>, or follow the SQL convention with the <code>HAVING</code> keyword.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;sumAmount&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;sumAmount&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">300000</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig, SUM(amount) sumAmount
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    WHERE type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">    GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s1">    HAVING sumAmount &gt; 300000
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+---------+
|   nameOrig|sumAmount|
+-----------+---------+
| C551314014|301050.58|
| C661668091|323789.56|
| C228994633|517946.01|
|C1591008292|558254.22|
|C2100435651|357988.09|
| C624052656|476735.47|
| C948681098|353759.28|
|  C50682517|386128.82|
|C1579521009|684561.18|
|C1871922377|394317.12|
+-----------+---------+
only showing top 10 rows
</code></pre><h4 id="union-and-intersection">Union and Intersection</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>12725240
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig from df
</span></span></span><span class="line"><span class="cl"><span class="s1">    UNION
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameDest from df
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>9073900
</code></pre><p>Notice the difference in the counts. The PySpark <code>union</code> function keeps duplicate rows from both sets, which is equivalent to <code>UNION ALL</code> in SQL. By default, PySpark does not remove duplicates because deduplication is an expensive operation; if you want to drop duplicates, you have to do it explicitly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Union and drop duplicates in PySpark</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>9073900
</code></pre><p>Unioning is useful when reading data from multiple files: read them one by one in a Python loop and union the results.</p>
<p>Intersection works similarly to union, but note the reversed convention: PySpark <code>intersect</code> is equivalent to SQL <code>INTERSECT</code> (duplicates removed), not <code>INTERSECT ALL</code>.</p>
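<p>These set semantics can be illustrated without Spark at all, using plain Python lists as stand-ins for single-column <code>DataFrame</code>s (a minimal sketch; the account names below are made up for illustration):</p>

```python
from collections import Counter

orig = ['C1', 'C2', 'C2', 'C3']  # stand-in for df.select('nameOrig')
dest = ['C2', 'C2', 'C3', 'C4']  # stand-in for df.select('nameDest')

# PySpark union == SQL UNION ALL: duplicates are kept
print(len(orig + dest))                     # 8

# SQL UNION: distinct values only (PySpark: .union(...).dropDuplicates())
print(len(set(orig) | set(dest)))           # 4

# PySpark intersect == SQL INTERSECT: distinct common values
print(sorted(set(orig) & set(dest)))        # ['C2', 'C3']

# SQL INTERSECT ALL would keep the matched multiplicities instead
print(sorted((Counter(orig) & Counter(dest)).elements()))  # ['C2', 'C2', 'C3']
```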
<h4 id="join">Join</h4>
<p>Very similar to Pandas, the <code>DataFrame.join</code> method joins one <code>DataFrame</code> with another using a given join expression.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;ABS(newBalanceOrig - oldBalanceOrig) changeOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;changeOrig&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgChangeOrig&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;occOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;avgChangeOrig &gt; 100000&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Join the above DataFrame with the one provided in parameter</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">join</span><span class="p">((</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">,</span> <span class="s1">&#39;ABS(newBalanceDest - oldBalanceDest) changeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;changeDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgChangeDest&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;occDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;avgChangeDest &gt; 100000&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span> <span class="n">on</span><span class="o">=</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">==</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">),</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># There are several join types: inner, cross, outer (full), left, right, left_outer, right_outer, left_semi, left_anti, ...</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameOrig name&#39;</span><span class="p">,</span> <span class="s1">&#39;occOrig + occDest occ&#39;</span><span class="p">,</span> <span class="s1">&#39;avgChangeOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;avgChangeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;occ&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig name, occOrig + occDest occ, avgChangeOrig, avgChangeDest
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM
</span></span></span><span class="line"><span class="cl"><span class="s1">    (
</span></span></span><span class="line"><span class="cl"><span class="s1">        SELECT nameOrig, AVG(ABS(newBalanceOrig - oldBalanceOrig)) avgChangeOrig, COUNT(*) occOrig
</span></span></span><span class="line"><span class="cl"><span class="s1">        FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">        WHERE type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">        GROUP BY nameOrig
</span></span></span><span class="line"><span class="cl"><span class="s1">        HAVING avgChangeOrig &gt; 100000
</span></span></span><span class="line"><span class="cl"><span class="s1">    )
</span></span></span><span class="line"><span class="cl"><span class="s1">    INNER JOIN
</span></span></span><span class="line"><span class="cl"><span class="s1">    (
</span></span></span><span class="line"><span class="cl"><span class="s1">        SELECT nameDest, AVG(ABS(newBalanceDest - oldBalanceDest)) avgChangeDest, COUNT(*) occDest
</span></span></span><span class="line"><span class="cl"><span class="s1">        FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">        WHERE type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">        GROUP BY nameDest
</span></span></span><span class="line"><span class="cl"><span class="s1">        HAVING avgChangeDest &gt; 100000
</span></span></span><span class="line"><span class="cl"><span class="s1">    )
</span></span></span><span class="line"><span class="cl"><span class="s1">    ON nameOrig = nameDest
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY occ DESC
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+---+------------------+------------------+
|       name|occ|     avgChangeOrig|     avgChangeDest|
+-----------+---+------------------+------------------+
|C1552859894| 43|193711.30000000005| 763241.1652380949|
|C1819271729| 37|         278937.79|283626.17805555544|
|C1692434834| 37|177369.73000000045| 438853.7616666666|
| C889762313| 32|         132731.31|211437.18741935486|
|C1868986147| 32|         120594.03|249840.37709677417|
|  C55305556| 28|319860.45999999903|225565.42111111112|
| C636092700| 26|217273.86000000004|201888.05279999998|
|C1713505653| 25| 278622.8400000003|186625.34916666665|
|C2029542508| 24| 235760.1200000001|231022.98217391354|
| C699906968| 23| 177813.3799999999| 183054.3072727272|
+-----------+---+------------------+------------------+
only showing top 10 rows
</code></pre><p>In the example above, I mixed PySpark and SQL syntax for cleaner code. Instead of the verbose expression:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_IN&#39;</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">))</span>
</span></span></code></pre></div><p>You can write:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>This style can be applied in various PySpark functions: <code>selectExpr</code>, <code>where</code>, <code>filter</code>, <code>expr</code>, &hellip; Choose your preferred coding style; PySpark offers the flexibility.</p>
<h2 id="endnote">Endnote</h2>
<p>This tutorial has covered basic Spark operations in both Python and SQL syntax, enough to perform the most common data transformation and analysis tasks. But your Spark journey doesn&rsquo;t end here! More advanced features that were not covered in this article (e.g., UDFs) will be discussed in <a href="../pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/">another post</a> soon.</p>
]]></content:encoded></item><item><title>Snowflake ID - Simplifying uniqueness in distributed systems</title><link>https://note.datengineer.dev/posts/snowflake-id-simplifying-uniqueness-in-distributed-systems/</link><pubDate>Sat, 03 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/snowflake-id-simplifying-uniqueness-in-distributed-systems/</guid><description>Guide to generating unique IDs in distributed data systems.</description><content:encoded><![CDATA[<h2 id="problem-description">Problem description</h2>
<p>In developing database systems, generating IDs is a crucial task. IDs ensure the uniqueness of data, facilitate queries, and establish relationship constraints in databases. Most modern database management systems (DBMS) can generate auto-increment IDs. We can delegate this task to the DBMS entirely and not worry about the uniqueness. However, there are several reasons why we shouldn&rsquo;t use auto-increment IDs, especially for distributed systems. The most important reason is that in distributed systems with independent servers, using per-server auto-increment IDs does not guarantee uniqueness and can lead to duplication problems.</p>
<p><a href="https://developer.twitter.com/en/docs/basics/twitter-ids">Snowflake ID</a> is the solution developed by Twitter engineers to address this problem. According to statistics, about 6,000 tweets are written and posted on Twitter every second. How can we generate 6,000 IDs per second independently on multiple servers without collision?</p>
<h2 id="hold-on-what-about-uuid">Hold on! What about UUID?</h2>
<p>UUID is another widely used ID generation technique with a long history in software.</p>
<p>The idea of this technique is to use a 128-bit number as an ID. With a 128-bit integer, if we consume 6,000 IDs every second, it will take over <code>2^128 / (6000 * 3600 * 24 * 365) ≈ 1.79838 * 10^27</code> years <em>(how is that pronounced, octillion?)</em> to exhaust them. And if we randomly generate 103 trillion version-4 UUIDs (each carrying 122 random bits), the chance of a collision is about one in a billion. Of course, these numbers are not generated by simple incremental counting like 1, 2, 3, &hellip; but follow a defined standard. When generated accordingly, UUIDs solve the problem of generating non-duplicate IDs in distributed systems.</p>
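<p>The back-of-the-envelope arithmetic above is easy to check with Python&rsquo;s arbitrary-precision integers and the standard <code>uuid</code> module:</p>

```python
import uuid

# Years to exhaust the 128-bit space at 6,000 IDs per second
ids_per_year = 6000 * 3600 * 24 * 365
years = 2**128 / ids_per_year
print(f"{years:.3e}")     # 1.798e+27

# Version-4 UUIDs are built from random bits, so no coordination is needed
u = uuid.uuid4()
print(u)                  # e.g. 8fbb69e1-2132-4c86-911b-4cc182a5513a
print(u.version)          # 4
```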
<p>But it introduces another problem: 128 bits is unnecessarily large. Most computers don&rsquo;t support working directly with a 128-bit integer type, so we usually have to process UUIDs as strings. In addition, a large ID hurts query performance because indexes grow larger and comparisons become more costly.</p>
<p><em>Example UUIDs:</em></p>
<pre tabindex="0"><code>8fbb69e1-2132-4c86-911b-4cc182a5513a
b1f352c3-2126-4cca-9eec-349cdb69b611
6c7fa5c6-1b70-4b47-8690-760f2871943d
df495b1b-cd86-4e08-a42a-9f73d2c5afd1
13e1b2c2-ed6e-46ee-94a3-efce635ef268
</code></pre><h2 id="snowflake-id">Snowflake ID</h2>
<p>To solve this problem, Twitter engineers introduced a system called Snowflake ID. The idea is to programmatically generate a 64-bit integer as the ID. But how can each server generate these IDs independently without collisions?</p>
<p>The proposed method is as follows:</p>
<ul>
<li>The first bit is unused (always 0) so that the ID fits into a signed 64-bit integer and is always positive.</li>
<li>The next 41 bits store the ID creation time, measured in milliseconds from a fixed starting point (epoch 1288834974657 in Unix time).</li>
<li>The next 10 bits identify the machine requesting the ID.</li>
<li>The last 12 bits are a sequence counter from 0 to 4095, which avoids duplicate IDs within the same millisecond on the same machine.</li>
</ul>
<p><img alt="Snowflake ID Generate ID for database" loading="lazy" src="/posts/snowflake-id-simplifying-uniqueness-in-distributed-systems/images/snowflake-id-format-id-generation-distributed-system-min.png"></p>
<p>The only scenario where collisions can occur is when a single machine requests more than 4096 IDs in a single millisecond (in Twitter&rsquo;s case, when a machine posts more than 4096 tweets in a millisecond). With Snowflake ID, we can generate non-duplicate IDs in distributed systems using only a 64-bit integer. Additionally, a Snowflake ID contains its own creation time, so we can read the creation time of an ID directly from its value, or sort by ID and get results sorted by time.</p>
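<p>The bit layout above can be sketched in a few lines of Python. This is an illustrative sketch only; a production generator must also track the current time, wait when the per-millisecond sequence overflows, and handle clock rollback:</p>

```python
TWITTER_EPOCH = 1288834974657  # custom epoch, in Unix milliseconds

def snowflake_id(timestamp_ms: int, machine_id: int, sequence: int) -> int:
    """Pack 41 bits of time, 10 bits of machine ID, and 12 bits of sequence."""
    assert 0 <= machine_id < 1024 and 0 <= sequence < 4096
    return ((timestamp_ms - TWITTER_EPOCH) << 22) | (machine_id << 12) | sequence

def decode(sid: int) -> tuple:
    """Recover creation time, machine ID, and sequence from an ID."""
    return ((sid >> 22) + TWITTER_EPOCH, (sid >> 12) & 0x3FF, sid & 0xFFF)

sid = snowflake_id(1707000000000, machine_id=7, sequence=42)
print(decode(sid))  # (1707000000000, 7, 42)

# IDs generated later are numerically larger, so sorting by ID sorts by time
assert snowflake_id(1707000000001, 0, 0) > snowflake_id(1707000000000, 1023, 4095)
```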
<h2 id="conclusion">Conclusion</h2>
<p>Is Snowflake ID the answer for every system? Absolutely not! <strong>Nothing is the answer for everything</strong>. Besides the ID generation strategies mentioned above, there are many other approaches (e.g., Flickr&rsquo;s centralized ticket server). There are always many ways to solve a problem, each with its own pros and cons. Don&rsquo;t limit yourself to existing methods; always look for new, context-appropriate solutions.</p>
]]></content:encoded></item></channel></rss>