<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Dat⋂Engineer</title><link>https://note.datengineer.dev/</link><description>Recent content on Dat⋂Engineer</description><image><title>Dat⋂Engineer</title><url>https://note.datengineer.dev/images/cover.png</url><link>https://note.datengineer.dev/images/cover.png</link></image><generator>Hugo -- 0.147.5</generator><language>en-us</language><lastBuildDate>Wed, 18 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://note.datengineer.dev/index.xml" rel="self" type="application/rss+xml"/><item><title>Handling Concurrent Inserts: From Single Database to Distributed</title><link>https://note.datengineer.dev/posts/handling-concurrent-inserts-from-single-database-to-distributed/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/handling-concurrent-inserts-from-single-database-to-distributed/</guid><description>Description</description><content:encoded><![CDATA[<p>During a recent technical meeting, one of my colleagues raised a concern about two concurrent inserts for the same object from different processes hitting the system at once. After explaining this to him, I realized that even among experienced engineers, the inner workings of the database are often treated as a black box. There is a persistent uncertainty about what the database guarantees and what it leaves up to developers.</p>
<h2 id="problem-statement">Problem Statement</h2>
<h3 id="user-registration-flow">User Registration Flow</h3>
<p>To understand the concurrent inserts problem, the most common example is the user registration flow. When a user reaches our system, they first select a username. The system must guarantee that this username belongs to one and only one person.</p>
<p><img alt="User Registration Flow" loading="lazy" src="/posts/handling-concurrent-inserts-from-single-database-to-distributed/images/database-concurrent-insert-demo-user-registration-flow.png"></p>
<h3 id="check-then-act">Check-Then-Act</h3>
<p>After you see the requirement, you sit down to code immediately. You simply follow a linear logical path. You first check if the username exists in the database. If it exists, you return an error to the user. Otherwise, you insert a new user record into the database. With that, you have just implemented what is known as the Check-Then-Act pattern. The code usually looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Pseudo Python code</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">register</span><span class="p">(</span><span class="n">username</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Step 1: The Check</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># We query the database to see if the username is already claimed.</span>
</span></span><span class="line"><span class="cl">    <span class="n">existing_user</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&#34;SELECT id FROM users WHERE username = </span><span class="si">%s</span><span class="s2">&#34;</span><span class="p">,</span> <span class="p">(</span><span class="n">username</span><span class="p">,))</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">existing_user</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 2: The Act</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># If the result is empty, we assume the coast is clear and perform the write.</span>
</span></span><span class="line"><span class="cl">        <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&#34;INSERT INTO users (username) VALUES (</span><span class="si">%s</span><span class="s2">)&#34;</span><span class="p">,</span> <span class="p">(</span><span class="n">username</span><span class="p">,))</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Registration successful.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Error: Username already exists.&#34;</span>
</span></span></code></pre></div><p>This pattern is incredibly common because it is intuitive. It is how we think and act in real life: we check that a parking spot is empty before we pull the car in. It is how we are taught to code at school. And it mirrors the requirements exactly as they were handed to us.</p>
<h3 id="the-race-condition">The Race Condition</h3>
<p>The fatal flaw in &ldquo;Check-Then-Act&rdquo; is the tiny window of time between the &ldquo;Check&rdquo; and the &ldquo;Act&rdquo;. Imagine that two users, Alice and Bob, both try to claim the username &ldquo;user1&rdquo; at the exact same millisecond.</p>
<ul>
<li>Thread A (Alice) checks the database and sees that &ldquo;user1&rdquo; is available.</li>
<li>Thread B (Bob) checks the database a fraction of a second later, but before Thread A has committed its write. Thread B also sees that &ldquo;user1&rdquo; is available.</li>
<li>Both threads then proceed to Insert.</li>
</ul>
<p>You now have two rows in your table with the username &ldquo;user1&rdquo;. Your authentication and authorization logic could be broken.</p>
<p><img alt="User Registration Race Condition Example" loading="lazy" src="/posts/handling-concurrent-inserts-from-single-database-to-distributed/images/database-concurrent-insert-conflict-race-condition-example.png"></p>
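<p>The interleaving above can be reproduced deterministically. Below is a minimal sketch (not from the original post) that replaces the database with a plain in-memory list and uses a barrier to force both threads past the &ldquo;Check&rdquo; before either one performs the &ldquo;Act&rdquo;:</p>

```python
import threading

usernames = []                    # stands in for the users table; no locking, by design
barrier = threading.Barrier(2)    # forces both threads past the Check before either Acts

def register(username):
    exists = username in usernames    # Step 1: the Check
    barrier.wait()                    # both threads reach this point before continuing
    if not exists:
        usernames.append(username)    # Step 2: the Act

threads = [threading.Thread(target=register, args=("user1",)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(usernames)  # ['user1', 'user1'] -- both checks passed before either write
```

<p>In production the window is microseconds rather than a barrier, but the outcome is the same: both checks pass, both acts proceed, and you end up with a duplicate.</p>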
<p>The &ldquo;Check-Then-Act&rdquo; pattern is even more fragile in a production environment because a single database can hardly handle massive real-world loads. For this reason, we often distribute the load across multiple databases. In the simplest setup, all writes are routed to a leader database, and all reads are distributed among replicas. It means that the &ldquo;Check&rdquo; may interact with one database while the &ldquo;Act&rdquo; is performed on another. And the replication lag between these databases makes the race condition more likely to happen.</p>
<h2 id="how-to-insert">How to <code>INSERT</code>?</h2>
<p>To find the solution, we first need to understand one important aspect of <code>INSERT</code>. When inserting data, we usually provide values for all columns except the primary key, relying on the database to generate an auto-increment value for us. Two update statements can block each other if they touch the same primary key. But two insert statements never do, because they always receive different primary keys. Therefore, two concurrent inserts always succeed, and the result is duplication.</p>
<p>The above behavior gives us an insightful clue. No matter how many concurrent inserts hit your database, it guarantees that every new row gets a distinct ID. And the fact that the database assigns IDs in auto-increment order (e.g., 1, 2, 3, 4, 5, etc.) tells us it has a mechanism to process these IDs sequentially, even when insert requests arrive at the same time.</p>
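<p>You can observe this guarantee with any database at hand. The sketch below uses SQLite purely for illustration (the post does not name a specific engine): five threads insert concurrently, and every row still receives a distinct, sequential ID.</p>

```python
import os
import sqlite3
import tempfile
import threading

# Shared on-disk database so every thread sees the same table.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, username TEXT)")
conn.commit()
conn.close()

def insert(n):
    # The timeout makes the connection wait for the write lock instead of failing.
    c = sqlite3.connect(path, timeout=10)
    c.execute("INSERT INTO users (username) VALUES (?)", (f"user{n}",))
    c.commit()
    c.close()

threads = [threading.Thread(target=insert, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

ids = [row[0] for row in sqlite3.connect(path).execute("SELECT id FROM users ORDER BY id")]
print(ids)  # [1, 2, 3, 4, 5] -- five concurrent inserts, five distinct sequential IDs
```

<p>Which username lands on which ID depends on scheduling, but the IDs themselves are always distinct: the database serializes ID assignment internally.</p>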
<h3 id="unique-constraint">Unique Constraint</h3>
<p>The problem is not that the database cannot handle simultaneous inserts. The problem is that we must tell the database that usernames require the same strict treatment as IDs. We can solve this by simply adding a UNIQUE constraint to the username column.</p>
<p>Thanks to the unique constraint, duplication is no longer a concern. The database will ensure that, at any given time, no two users share the same username.</p>
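<p>To see the constraint doing the work for us, here is a small illustration, again using SQLite only for convenience (any engine with UNIQUE behaves the same way). The second insert of the same username is rejected by the database itself:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE)")

db.execute("INSERT INTO users (username) VALUES (?)", ("user1",))      # Alice: accepted
try:
    db.execute("INSERT INTO users (username) VALUES (?)", ("user1",))  # Bob: rejected
    duplicate_allowed = True
except sqlite3.IntegrityError as err:
    duplicate_allowed = False
    print("Rejected:", err)

row_count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(row_count)  # 1 -- the constraint, not application code, enforced uniqueness
```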
<h3 id="just-insert">Just <code>INSERT</code></h3>
<p>Once the UNIQUE constraint has been set up, you can insert data without hesitation. We no longer need existence checks beforehand. We abandon the &ldquo;Check-Then-Act&rdquo; pattern entirely. We just insert.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Pseudo Python code</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">register</span><span class="p">(</span><span class="n">username</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Just try to insert. </span>
</span></span><span class="line"><span class="cl">        <span class="c1"># The database is the only one who knows the truth.</span>
</span></span><span class="line"><span class="cl">        <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&#34;INSERT INTO users (username) VALUES (</span><span class="si">%s</span><span class="s2">)&#34;</span><span class="p">,</span> <span class="p">(</span><span class="n">username</span><span class="p">,))</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Success&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="n">UniqueViolationError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># The database rejected us. </span>
</span></span><span class="line"><span class="cl">        <span class="c1"># This is a valid business logic outcome, not a system crash.</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Error: Username already exists.&#34;</span>
</span></span></code></pre></div><p>The database will handle our concurrent inserts as if they were executed sequentially. Suppose Alice and Bob both try to register the username &ldquo;user1&rdquo; and their requests arrive at the same time. One request will be executed first and succeed. The other will wait, then violate the unique constraint and fail. We catch that failure and prompt the user to choose another username.</p>
<h2 id="multi-leader">Multi-Leader</h2>
<p>The &ldquo;Just Insert&rdquo; approach is clean. It works perfectly in a single-leader setup. So, we ship it and life is good.</p>
<p>Then, the product grows.</p>
<h3 id="why-multi-leader">Why Multi-Leader?</h3>
<p>A single leader database can only handle a certain number of writes per second. When users register, post, and update their profiles all at once, the database becomes the bottleneck. You also start thinking about geography. A user in Tokyo shouldn&rsquo;t have to wait for a round trip to a server in London just to register. Latency adds up. Users leave.</p>
<p>So, you scale out. You promote multiple nodes to accept writes, giving each region its own leader database. Writes go to the nearest leader, and the leaders synchronize with each other in the background. This is called multi-leader replication.</p>
<p>Reads are fast. Writes are fast. Everything feels great. Until you think about usernames again.</p>
<h3 id="why-unique-not-work">Why Doesn&rsquo;t UNIQUE Work?</h3>
<p>The UNIQUE guarantee is only valid when there is a single database. In a multi-leader setup, Leader A in Tokyo and Leader B in London both accept writes independently. They do not talk to each other before committing. Alice registers &ldquo;user1&rdquo; on Leader A. Bob registers &ldquo;user1&rdquo; on Leader B. Both leaders check their own local data. Both see no conflict. Both succeed. Both return a success response to their respective users.</p>
<p>Now both leaders have a row for &ldquo;user1&rdquo;. The UNIQUE constraint on each node was never violated locally. The violation only becomes visible when the leaders later sync with each other.</p>
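<p>A toy simulation makes the failure mode concrete. Here each leader is modeled as a plain Python set with its own local uniqueness check, which is of course a heavy simplification of real replication:</p>

```python
# Two leaders, each enforcing UNIQUE against its own local data only.
leader_a, leader_b = set(), set()   # Tokyo, London

def register(leader, username):
    if username in leader:          # local check: this leader's data only
        return "Error: Username already exists."
    leader.add(username)
    return "Success"

print(register(leader_a, "user1"))  # Alice in Tokyo  -> Success
print(register(leader_b, "user1"))  # Bob in London   -> Success

# Background replication: only now does the duplicate become visible.
conflicts = leader_a & leader_b
print(conflicts)  # {'user1'}
```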
<h3 id="solutions">Solutions</h3>
<p>There is no magic answer here. Every solution is a trade-off between consistency, availability, and complexity. You have to pick the one that works best for you.</p>
<p>The simplest method is optimistic conflict resolution. You allow conflicts. No locks. No coordination. No waiting. Every INSERT succeeds instantly, and you adjust the business rules to resolve conflicts after the fact.</p>
<p>The most common strategy is Last-Write-Wins. For the user registration problem, you might prefer First-Write-Wins instead. The idea is simple: you keep one write and discard the others. You can also keep them all, but with an additional discriminator to enforce uniqueness. Discord used to take this approach: they appended a 4-digit numeric discriminator to every username, so that, in theory, 10,000 users could share the username &ldquo;user1&rdquo;, ranging from &ldquo;user1#0000&rdquo; to &ldquo;user1#9999&rdquo;.</p>
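<p>One possible discriminator-assignment scheme can be sketched as follows. The <code>assign_discriminator</code> helper is hypothetical, and Discord&rsquo;s actual assignment logic was not necessarily first-free; this only illustrates the shape of the idea:</p>

```python
def assign_discriminator(existing, username):
    """Return 'username#NNNN', picking the first free 4-digit discriminator."""
    taken = {d for (u, d) in existing if u == username}
    for d in range(10_000):
        if d not in taken:
            existing.add((username, d))
            return f"{username}#{d:04d}"
    raise ValueError("all 10,000 discriminators for this username are taken")

existing = set()
print(assign_discriminator(existing, "user1"))  # user1#0000
print(assign_discriminator(existing, "user1"))  # user1#0001
```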
<p>The fundamental problem with this approach is the user experience. In most cases, such as when a user updates their profile image from both their laptop and phone at the same time, it is safe to keep the latest write. It is not that easy with uniqueness-sensitive data such as usernames. You definitely do not want to tell a user that they have successfully claimed a username, and then, the next time they log in, apologize because someone else had taken that name first. The discriminator is a poor user experience too: no one remembers a random number attached to their name. Discord eventually decided to move away from that approach.</p>
<p>Another method is to stop fighting the multi-leader setup and route uniqueness-sensitive data to a dedicated service backed by a single-leader database. You keep everything else in the multi-leader system for performance, but send user registration to the single-leader database. Your main database stays distributed and fast. Only the username is centralized, because it is the only part that genuinely needs to be.</p>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>The moment you start treating the database as a partner rather than a black box, a lot of problems either disappear entirely or become much easier to reason about. You stop being surprised by behaviors that are actually well-documented. You start asking better questions. Understanding your tools in depth is especially important in the era of AI. AI can hand you code that looks correct, but deciding whether that code is actually correct is still your job.</p>
]]></content:encoded></item><item><title>Learn the basics in depth</title><link>https://note.datengineer.dev/posts/learn-the-basics-in-depth/</link><pubDate>Thu, 22 May 2025 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/learn-the-basics-in-depth/</guid><description>Going deep on the basics makes you a stronger, better engineer - and helps you stay relevant in the AI era.</description><content:encoded><![CDATA[<p>I often receive questions from aspiring data engineers. Some are fresh grads, others are switching from software or analytics roles. And the same question comes up again and again:</p>
<blockquote>
<p>What tool should I learn in depth?</p></blockquote>
<p>I understand why people keep asking this question. The tech world moves extremely fast. Every few months, there is a new framework, a new orchestration tool, a shiny feature in a cloud, or some article filled with buzzwords that makes you feel like you are already behind. The pressure to keep up is real. But here is something I have learned over the years, and I want to say it clearly: <em><strong>focus on learning the basics in depth.</strong></em> Tools will change; the zeros and ones will not.</p>
<p>Years ago, everyone was talking about HDFS, MapReduce, Pig, Hive. Fast-forward a few years, Spark took over. Then cloud-native pipelines. Now we have got real-time streaming, feature stores, vector databases, and AI-generated pipelines. If I had spent all my time chasing tools, I would be exhausted. Always playing catch-up - and probably still behind.</p>
<p>Instead, I focused on understanding how data actually works: how it moves; how it is stored and transformed; how to model it for clarity, flexibility, and performance; how SQL engines work under the hood; how replication and partitioning affect performance; how to keep a data schema clean and extensible. Those things have not changed. These lessons apply whether you are working with BigQuery or Snowflake, self-hosted DBT or Azure Data Factory in the cloud, or something that does not even exist yet.</p>
<p>Data Engineering is developing in a direction where engineers write less and less code. Unlike several years ago, a lot of tools are now publicly available. For a new project, you are handed a comprehensive toolbox, and your primary task becomes picking the right tools and making them work seamlessly together. Writing less code, however, makes it harder for engineers to understand the technology behind the scenes. I began my technology journey writing a lot of code in Pascal, my favorite programming language, and I still write some in my free time. That coding experience helps me a lot: when I work with a tool today, I can often imagine the actual code running on the machine.</p>
<p>Yes, you are not required to be able to write the tools from scratch. But you are required to understand the core technology behind them. Because even if you are just picking items from your toolbox, you still need to pick the right ones. And picking the right tools for the right jobs remains a highly advanced skill in our industry today.</p>
<p>Here is another undeniable truth we cannot ignore: AI is becoming incredibly good at repeating what it has learned, and it is getting more and more involved in our daily work. Honestly, I use it a lot to help me develop. This fact actually highlights the advantage of understanding the basics. If AI is not good enough, people need you because you understand the underlying concepts and can operate the AI reliably. If AI is already good enough, why do they still need you? Because you are better than AI at understanding the things it does not know - the true &ldquo;why&rdquo; behind the &ldquo;what&rdquo;.</p>
<p>When you understand the basics, tools become just syntax. You can pick up new ones quickly. You can even build your own if you want to. Investors care about one thing: the value you created, not the tools you used. They want to know: Did you help the business make better decisions? Did you save money? Did you unlock insights faster? Those results do not come from stacking the flashiest tools, but from engineers who know what matters and make the right decisions. That is what turns you from someone who follows instructions into someone who builds solutions.</p>
<p>And in this field, that is the difference between being useful and being indispensable.</p>
]]></content:encoded></item><item><title>Introducing Dat⋂nalytics - My new home to share insights</title><link>https://note.datengineer.dev/posts/introducing-analytics-datengineer/</link><pubDate>Wed, 26 Feb 2025 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/introducing-analytics-datengineer/</guid><description>Discover Dat⋂nalytics, a new platform where I share interactive Power BI reports and data-driven insights.</description><content:encoded><![CDATA[<h2 id="a-new-chapter">A new chapter</h2>
<p>Last year, I launched <a href="https://note.datengineer.dev/">Dat⋂Engineer</a> as a place where I share my thoughts, experiences, and lessons learned as a data engineer. From designing to coding, I have written about the craft of data engineering with passion.</p>
<p>This year, I wanted to go beyond words. I wanted to showcase data - to let data and visualizations tell their own stories. That is why I created <a href="https://analytics.datengineer.dev/">Dat⋂nalytics</a>, a new site for sharing interactive Power BI reports and data-driven insights.</p>
<h2 id="the-motivation">The Motivation</h2>
<p>After years of working as an engineer, I built a personal IT infrastructure (websites, emails, networks,&hellip;) to experiment with, learn from, and benefit from. I maintain my own &ldquo;small&rdquo; data system. Because I am the developer, the user, and the investor at the same time, I have to keep it small and efficient. My &ldquo;small&rdquo; data system helps me answer my own questions. I strongly believe that decisions should be backed by data, whether you are an individual or an organization.</p>
<p>This year, during the Tet holiday in Vietnam, I had the opportunity to reconnect with friends. In one of our discussions, I showed a friend one of my reports. He was genuinely excited about it and asked me to share it with him so that he could use it for his own decisions. At that moment, I realized that my data could benefit others, not just myself.</p>
<p>That is the story that led to the creation of <a href="https://analytics.datengineer.dev/">Dat⋂nalytics</a> - a place where I can openly share interactive reports, making data more accessible and engaging for everyone.</p>
<h2 id="development">Development</h2>
<p>I am neither a designer nor a web developer. The last time I wrote some basic HTML and CSS was over ten years ago, so creating a stunning, visually rich website was beyond my capabilities. However, I still wanted a functional and elegant way to share my reports.</p>
<p>I decided to use Hugo, a fast and flexible static site generator. With my basic knowledge of HTML and CSS, I was still able to manage and customize it to fit my needs. I used the Quint theme as inspiration. However, Quint did not meet my needs out of the box, so I had to tweak it a lot.</p>
<p>My reports are hosted on Power BI, and I embed them directly on my site. Users still get a seamless and interactive experience, as if they were viewing the reports directly on the Power BI service.</p>
<p>The result? A clean, lightweight site that seamlessly integrates Power BI reports while keeping performance in check.</p>
<p><img alt="An overview of Dat⋂nalytics" loading="lazy" src="/posts/introducing-analytics-datengineer/images/datanalytics-screenshot.png"></p>
<h2 id="join-me">Join me</h2>
<p>I am excited to share this next step in my journey with you. Currently, there is only one report, but more are definitely coming. Check out <a href="https://analytics.datengineer.dev/">Dat⋂nalytics</a>. If you have feedback, ideas, or feature requests, don&rsquo;t hesitate to contact me. They will help shape this site into something even better.</p>
]]></content:encoded></item><item><title>PIVOT and Dynamic PIVOT in SQL - Advanced SQL for analytics</title><link>https://note.datengineer.dev/posts/pivot-and-dynamic-pivot-in-sql-advanced-sql-for-analytics/</link><pubDate>Sun, 19 Jan 2025 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/pivot-and-dynamic-pivot-in-sql-advanced-sql-for-analytics/</guid><description>Explore advanced SQL techniques, static and dynamic PIVOT, to transform and analyze data beyond the basic SELECT FROM WHERE queries. Learn how to apply them in real-world analytics.</description><content:encoded><![CDATA[<p>As a data engineer, a typical working day for me, apart from meetings, is full of <code>SELECT</code>, <code>FROM</code> and <code>WHERE</code>. But these basic statements are not enough, especially for the complex ad hoc analysis that is increasingly common nowadays.</p>
<p>SQL is a powerful language. It is a declarative language where we define what we want and the engine finds a way to achieve it. The language is evolving to adapt to the increasing variety of analysis needs. I wrote an article about an <a href="../recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data">advanced SQL feature to deal with hierarchical data</a>. And today, let&rsquo;s explore another beyond-the-basic feature: PIVOT.</p>
<h2 id="problem-statement">Problem Statement</h2>
<p>Imagine you are working as a data engineer for a retail company. The company wants to analyze product sales data to identify trends and opportunities for growth. The data is stored in a table called <code>Sales</code> with the following structure:</p>
<table>
  <thead>
      <tr>
          <th>ProductID</th>
          <th>Date</th>
          <th>Amount</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>101</td>
          <td>2024-01-10</td>
          <td>300</td>
      </tr>
      <tr>
          <td>101</td>
          <td>2024-12-15</td>
          <td>500</td>
      </tr>
      <tr>
          <td>101</td>
          <td>2025-01-15</td>
          <td>700</td>
      </tr>
      <tr>
          <td>101</td>
          <td>2025-02-01</td>
          <td>1100</td>
      </tr>
      <tr>
          <td>102</td>
          <td>2024-02-20</td>
          <td>800</td>
      </tr>
      <tr>
          <td>102</td>
          <td>2024-11-03</td>
          <td>400</td>
      </tr>
      <tr>
          <td>102</td>
          <td>2025-01-20</td>
          <td>900</td>
      </tr>
      <tr>
          <td>102</td>
          <td>2025-02-22</td>
          <td>650</td>
      </tr>
      <tr>
          <td>103</td>
          <td>2023-07-25</td>
          <td>1200</td>
      </tr>
      <tr>
          <td>103</td>
          <td>2024-08-15</td>
          <td>1500</td>
      </tr>
      <tr>
          <td>103</td>
          <td>2025-02-10</td>
          <td>1250</td>
      </tr>
      <tr>
          <td>104</td>
          <td>2023-12-05</td>
          <td>400</td>
      </tr>
      <tr>
          <td>104</td>
          <td>2024-06-30</td>
          <td>800</td>
      </tr>
      <tr>
          <td>104</td>
          <td>2025-01-30</td>
          <td>300</td>
      </tr>
      <tr>
          <td>104</td>
          <td>2025-02-25</td>
          <td>500</td>
      </tr>
  </tbody>
</table>
<p>This structure is not good for reports. The company wants this data served in a format where years are represented as columns for easier comparison across products.</p>
<h2 id="group-by---the-amateur-way">GROUP BY - The Amateur Way</h2>
<p>A very straightforward approach to this problem is to use the <code>GROUP BY</code> statement. We group by <code>ProductID</code> and compute a sum column for each year. Below is a SQL Server example; other SQL engines have similar syntax.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">select</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">ProductID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">sum</span><span class="p">(</span><span class="n">iif</span><span class="p">(</span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="o">=</span><span class="mi">2023</span><span class="p">,</span><span class="w"> </span><span class="n">Amount</span><span class="p">,</span><span class="w"> </span><span class="k">null</span><span class="p">))</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="p">[</span><span class="mi">2023</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">sum</span><span class="p">(</span><span class="n">iif</span><span class="p">(</span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="o">=</span><span class="mi">2024</span><span class="p">,</span><span class="w"> </span><span class="n">Amount</span><span class="p">,</span><span class="w"> </span><span class="k">null</span><span class="p">))</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">sum</span><span class="p">(</span><span class="n">iif</span><span class="p">(</span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="o">=</span><span class="mi">2025</span><span class="p">,</span><span class="w"> </span><span class="n">Amount</span><span class="p">,</span><span class="w"> </span><span class="k">null</span><span class="p">))</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="p">[</span><span class="mi">2025</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">from</span><span class="w"> </span><span class="n">Sales</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">group</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="n">ProductID</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>GROUP BY query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">ProductID</th>
          <th style="text-align: right">2023</th>
          <th style="text-align: right">2024</th>
          <th style="text-align: right">2025</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">101</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">1800</td>
      </tr>
      <tr>
          <td style="text-align: right">102</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1550</td>
      </tr>
      <tr>
          <td style="text-align: right">103</td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1500</td>
          <td style="text-align: right">1250</td>
      </tr>
      <tr>
          <td style="text-align: right">104</td>
          <td style="text-align: right">400</td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">800</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>At first glance, it&rsquo;s simple, and it works. Sometimes just working is enough.</p>
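<p>If you want to verify the result table above yourself, the standalone snippet below rebuilds the <code>Sales</code> table in SQLite (chosen only because it needs no setup) and runs the equivalent conditional aggregation. SQLite has no <code>year()</code> function, so <code>strftime</code> and <code>CASE WHEN</code> stand in for <code>year()</code> and <code>iif()</code>:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Sales (ProductID INTEGER, Date TEXT, Amount INTEGER)")
db.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    (101, "2024-01-10", 300), (101, "2024-12-15", 500),
    (101, "2025-01-15", 700), (101, "2025-02-01", 1100),
    (102, "2024-02-20", 800), (102, "2024-11-03", 400),
    (102, "2025-01-20", 900), (102, "2025-02-22", 650),
    (103, "2023-07-25", 1200), (103, "2024-08-15", 1500),
    (103, "2025-02-10", 1250), (104, "2023-12-05", 400),
    (104, "2024-06-30", 800), (104, "2025-01-30", 300),
    (104, "2025-02-25", 500),
])

# CASE WHEN + strftime replace SQL Server's iif() and year().
query = """
select
    ProductID
    ,sum(case when strftime('%Y', Date) = '2023' then Amount end) as "2023"
    ,sum(case when strftime('%Y', Date) = '2024' then Amount end) as "2024"
    ,sum(case when strftime('%Y', Date) = '2025' then Amount end) as "2025"
from Sales
group by ProductID
order by ProductID
"""
result = list(db.execute(query))
for row in result:
    print(row)
# (101, None, 800, 1800)
# (102, None, 1200, 1550)
# (103, 1200, 1500, 1250)
# (104, 400, 800, 800)
```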
<h2 id="sql-pivot---the-complex-way">SQL PIVOT - The Complex Way</h2>
<p><code>PIVOT</code> is an operator in SQL that transforms rows into columns. This transformation is particularly useful when summarizing data and creating a more interpretable format for analysis. The SQL Server query below achieves the same result:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">select</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">from</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">select</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">ProductID</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">Amount</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">from</span><span class="w"> </span><span class="n">Sales</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">ToPivotSales</span><span class="w"> </span><span class="n">pivot</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">sum</span><span class="p">(</span><span class="n">Amount</span><span class="p">)</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="k">Year</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="p">([</span><span class="mi">2023</span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="mi">2024</span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="mi">2025</span><span class="p">])</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">PivotedSales</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>PIVOT query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">ProductID</th>
          <th style="text-align: right">2023</th>
          <th style="text-align: right">2024</th>
          <th style="text-align: right">2025</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">101</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">1800</td>
      </tr>
      <tr>
          <td style="text-align: right">102</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1550</td>
      </tr>
      <tr>
          <td style="text-align: right">103</td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1500</td>
          <td style="text-align: right">1250</td>
      </tr>
      <tr>
          <td style="text-align: right">104</td>
          <td style="text-align: right">400</td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">800</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>To use <code>PIVOT</code>, we first need a subquery that selects the columns involved: in this case <code>ProductID</code>, <code>Year</code>, and <code>Amount</code>. If you don&rsquo;t like subqueries, a CTE (Common Table Expression) works as well.</p>
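<p>If you prefer the CTE form, the same query can be sketched like this (a sketch; table and column names follow the examples above):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">with ToPivotSales as (
    select ProductID, year(Date) as Year, Amount
    from Sales
)
select *
from ToPivotSales pivot (
    sum(Amount) for Year in ([2023], [2024], [2025])
) as PivotedSales;
</code></pre></div>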
<p>In the <code>PIVOT</code> declaration:</p>
<ul>
<li><code>sum(Amount)</code>: specifies that the aggregation function to be applied is <code>sum</code>, which will sum the <code>Amount</code> values.</li>
<li><code>for Year in ([2023], [2024], [2025])</code>: defines how the pivoting will occur:
<ul>
<li><code>for Year</code>: indicates that the values in the <code>Year</code> column will be used to create new columns in the result set.</li>
<li><code>in ([2023], [2024], [2025])</code>: specifies the specific years that will become the new columns in the result. Each of these years will have a corresponding column that contains the summed Amount for that year.</li>
</ul>
</li>
</ul>
<p>To be honest, I don&rsquo;t like the syntax of <code>PIVOT</code>. It requires a subquery or CTE, adds levels of indentation, and is harder to scan. And if we look at the execution plan, it is not really faster than <code>GROUP BY</code>.</p>
<p>That said, <code>PIVOT</code> has one advantage over <code>GROUP BY</code>: it requires less boilerplate. In the examples above, adding a new year to the <code>GROUP BY</code> query means copying, pasting, and editing in two places. With <code>PIVOT</code>, all you have to do is add a new value to the list. This makes <code>PIVOT</code> shine when dealing with a long list of values.</p>
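<p>To make the comparison concrete, here is a sketch of the conditional-aggregation <code>GROUP BY</code> form, consistent with the result tables above. Adding 2026 means editing both the <code>case</code> condition and the column alias, while <code>PIVOT</code> only needs one new entry in its value list:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">select
    ProductID,
    -- each new year requires edits in two places: the condition and the alias
    sum(case when year(Date) = 2023 then Amount end) as [2023],
    sum(case when year(Date) = 2024 then Amount end) as [2024],
    sum(case when year(Date) = 2025 then Amount end) as [2025]
from Sales
group by ProductID;
</code></pre></div>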
<h2 id="dynamic-pivot---the-hacker-way">Dynamic PIVOT - The Hacker Way</h2>
<p>While both <code>GROUP BY</code> and <code>PIVOT</code> are useful, they share a limitation: you must explicitly list the pivot values. With small, stable data this is fine, but it becomes a problem with large, evolving data where you do not know, or do not want to maintain, the full list of values. Imagine building a report of annual sales; you clearly don&rsquo;t want to update the query every year.</p>
<p>Dynamic PIVOT is a technique that lets us pivot data without hard-coding the pivot column values. It is not a standard SQL operation, so the syntax varies between SQL engines. In Snowflake SQL, you can achieve dynamic pivoting with something as simple as this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">pivot</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">sum</span><span class="p">(</span><span class="n">Amount</span><span class="p">)</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="k">Year</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="p">(</span><span class="k">any</span><span class="w"> </span><span class="k">order</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="k">Year</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>Most other SQL engines don&rsquo;t offer this level of simplicity and require a bit more work. Below is a SQL Server example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">declare</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">@</span><span class="n">cols</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">nvarchar</span><span class="p">(</span><span class="k">max</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">@</span><span class="n">query</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">nvarchar</span><span class="p">(</span><span class="k">max</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">select</span><span class="w"> </span><span class="o">@</span><span class="n">cols</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">string_agg</span><span class="p">(</span><span class="n">quotename</span><span class="p">(</span><span class="k">Year</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;,&#39;</span><span class="p">)</span><span class="w"> </span><span class="n">within</span><span class="w"> </span><span class="k">group</span><span class="w"> </span><span class="p">(</span><span class="k">order</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="k">Year</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">from</span><span class="w"> </span><span class="p">(</span><span class="k">select</span><span class="w"> </span><span class="k">distinct</span><span class="w"> </span><span class="k">year</span><span class="p">(</span><span class="nb">Date</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="k">Year</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">Sales</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">YearList</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">set</span><span class="w"> </span><span class="o">@</span><span class="n">query</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s1">&#39;select *
</span></span></span><span class="line"><span class="cl"><span class="s1">    from (
</span></span></span><span class="line"><span class="cl"><span class="s1">        select ProductID, year(Date) as Year, Amount
</span></span></span><span class="line"><span class="cl"><span class="s1">        from Sales
</span></span></span><span class="line"><span class="cl"><span class="s1">    ) as ToPivotSales pivot (
</span></span></span><span class="line"><span class="cl"><span class="s1">        sum(Amount) for Year in (&#39;</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="o">@</span><span class="n">cols</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s1">&#39;)
</span></span></span><span class="line"><span class="cl"><span class="s1">    ) as PivotedSales;&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">exec</span><span class="w"> </span><span class="n">sp_executesql</span><span class="w"> </span><span class="o">@</span><span class="n">query</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>Dynamic PIVOT query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">ProductID</th>
          <th style="text-align: right">2023</th>
          <th style="text-align: right">2024</th>
          <th style="text-align: right">2025</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">101</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">1800</td>
      </tr>
      <tr>
          <td style="text-align: right">102</td>
          <td style="text-align: right"><em>null</em></td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1550</td>
      </tr>
      <tr>
          <td style="text-align: right">103</td>
          <td style="text-align: right">1200</td>
          <td style="text-align: right">1500</td>
          <td style="text-align: right">1250</td>
      </tr>
      <tr>
          <td style="text-align: right">104</td>
          <td style="text-align: right">400</td>
          <td style="text-align: right">800</td>
          <td style="text-align: right">800</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>Although the code is not pleasing to the eye, the concept is simple: an extra step collects all the unique values into a variable, we build the query string around that list, and then we execute it to get the expected result.</p>
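<p>For the sample data above, the intermediate steps would resolve roughly as follows (illustrative; the actual values depend on what is in the <code>Sales</code> table):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- @cols after the string_agg step:
--   [2023],[2024],[2025]

-- the statement handed to sp_executesql:
select *
from (
    select ProductID, year(Date) as Year, Amount
    from Sales
) as ToPivotSales pivot (
    sum(Amount) for Year in ([2023],[2024],[2025])
) as PivotedSales;
</code></pre></div>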
<h2 id="final-thought">Final Thought</h2>
<p>Mastering SQL is not easy. SQL is more than just a query language; used effectively, it can turn raw data into actionable intelligence. Advanced techniques such as <code>PIVOT</code> are not very common in day-to-day work, but we should know what is in our toolbox so we can quickly reach for the right tool when the job calls for it.</p>
<p><em>* You can find the executable versions of the SQL examples in this post at <a href="https://dbfiddle.uk/2jD1lHkL" rel="nofollow">https://dbfiddle.uk/2jD1lHkL</a>.</em></p>
]]></content:encoded></item><item><title>How to create Azure DevOps Pull Requests reporting with Power BI</title><link>https://note.datengineer.dev/posts/how-to-create-azure-devops-pull-requests-reporting-with-power-bi/</link><pubDate>Sun, 18 Aug 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/how-to-create-azure-devops-pull-requests-reporting-with-power-bi/</guid><description>Gain insights from your Azure DevOps data with this step-by-step guide to building a comprehensive pull request report using Power BI.</description><content:encoded><![CDATA[<p>As a developer, I have always emphasized the importance of code quality and efficient development processes. Modern Git workflows are typically about writing code, commits, pull requests, code reviews, and merges. To gain deeper insight into these processes, I decided to create a Power BI report to track them. My goal is to identify bottlenecks, areas for improvement, and opportunities to streamline our workflow.</p>
<h2 id="pre-requisites">Pre-requisites</h2>
<p>Before we dive into building the Power BI report, you need Power BI Desktop installed, as well as a Personal Access Token with sufficient access to the project repositories. You will use the token to authenticate the API calls from Power BI.</p>
<h2 id="parameters">Parameters</h2>
<p>To make the report work with different settings, we will use parameters. These parameters allow you to easily apply my code to your project. Just copy the code and edit the following parameters:</p>
<ul>
<li><code>_organization</code>: The Azure DevOps organization</li>
<li><code>_project</code>: Your project. The report will retrieve pull requests from all repositories in the project.</li>
<li><code>_top</code>: The number of most recent pull requests you want to analyze in the report.</li>
</ul>
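<p>If you would rather see them in code than in the Power Query UI, the parameters are simple text values. The values below are hypothetical placeholders, not real settings:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">// hypothetical placeholder values -- replace with your own
_organization = "my-org",
_project = "my-project",
_top = "200"
</code></pre></div>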
<h2 id="build-the-power-query">Build the Power Query</h2>
<h3 id="fetch-data-from-azure-devops">Fetch data from Azure DevOps</h3>
<p>Now that you have set up your Power BI report with parameters and prepared the necessary credentials, it&rsquo;s time to pull data from Azure DevOps. While Power BI has a built-in Azure DevOps connector, it only provides board data. To retrieve pull request information, we will need to access the <a href="https://learn.microsoft.com/en-us/rest/api/azure/devops/git/pull-requests?view=azure-devops-rest-7.1" rel="nofollow">Azure DevOps REST APIs</a>
.</p>
<p>See the following Power BI M query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">Source</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Json</span><span class="p">.</span><span class="n">Document</span><span class="p">(</span><span class="n">Web</span><span class="p">.</span><span class="n">Contents</span><span class="p">(</span><span class="s2">&#34;https://dev.azure.com/&#34;</span><span class="o">&amp;</span><span class="n">_organization</span><span class="o">&amp;</span><span class="s2">&#34;/&#34;</span><span class="o">&amp;</span><span class="n">_project</span><span class="o">&amp;</span><span class="s2">&#34;/_apis/git/pullrequests?searchCriteria.includeLinks=true&amp;searchCriteria.status=all&amp;$top=&#34;</span><span class="o">&amp;</span><span class="n">_top</span><span class="o">&amp;</span><span class="s2">&#34;&amp;api-version=7.1-preview.1&#34;</span><span class="p">)),</span><span class="w">
</span></span></span></code></pre></div><p>The <code>Web.Contents</code> function pulls data from the REST API and returns a <code>binary</code>. The <code>Json.Document</code> function parses that binary as JSON. After this step, <code>Source</code> is a <code>record</code> with two attributes:</p>
<ul>
<li><code>value</code>: a list of all pull request records.</li>
<li><code>count</code>: the length of the <code>value</code> list.</li>
</ul>
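<p>As a quick sanity check, you can reference these attributes with M record field access (field names as returned by the API):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">// field access on the Source record
prList  = Source[value],  // list of pull request records
prCount = Source[count]   // length of that list
</code></pre></div>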
<h3 id="convert-to-table">Convert to Table</h3>
<p>Our previous step resulted in a JSON record containing the pull request data. To make this data available for further analysis, we need to convert it to a table.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Converted to Table&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">FromRecords</span><span class="p">(</span><span class="err">{</span><span class="k">Source</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><p>The above query converts <code>value</code> to a table in Power BI. The returned table has only one column and one row, like below:</p>
<table>
  <thead>
      <tr>
          <th>value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>List</td>
      </tr>
  </tbody>
</table>
<p>To make the table usable, we need to transform it further. First, we explode the list into rows:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Expanded value&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandListColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Converted to Table&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value&#34;</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><p>Then, for each row, we expand the record into columns. We don&rsquo;t necessarily need all of them; the M query below extracts only the columns we need.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Expanded value1&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">#</span><span class="s2">&#34;Expanded value&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s2">&#34;value&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">{</span><span class="s2">&#34;repository&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;pullRequestId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;codeReviewId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;status&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;createdBy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;creationDate&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;closedDate&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;title&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;description&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;sourceRefName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;targetRefName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;mergeStatus&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;isDraft&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;mergeId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;reviewers&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;labels&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;url&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;completionOptions&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;supportsIterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;completionQueueTime&#34;</span><span class="err">}</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">{</span><span class="s2">&#34;value.repository&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.pullRequestId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.codeReviewId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.status&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.createdBy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.creationDate&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.closedDate&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.title&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.description&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.sourceRefName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.targetRefName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.mergeStatus&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.isDraft&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.mergeId&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.reviewers&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.labels&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.url&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.completionOptions&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.supportsIterations&#34;</span><span class="p">,</span><span class="w"> </span><span 
class="s2">&#34;value.completionQueueTime&#34;</span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h3 id="continue-expanding-columns">Continue expanding columns</h3>
<p>Even though the previous steps gave us a solid starting point, some columns still have nested records full of useful data. We will perform additional expansions to access this data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="cm">/*
</span></span></span><span class="line"><span class="cl"><span class="cm">value.repository, value.createdBy, value.completionOptions are records, we can expand them into columns
</span></span></span><span class="line"><span class="cl"><span class="cm">*/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded value.repository&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value1&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.repository&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;name&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value.repository.name&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded value.createdBy&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value.repository&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.createdBy&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;displayName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;id&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;uniqueName&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value.createdBy.displayName&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.createdBy.id&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.createdBy.uniqueName&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded value.completionOptions&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value.createdBy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.completionOptions&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;mergeCommitMessage&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;mergeStrategy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;transitionWorkItems&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value.completionOptions.mergeCommitMessage&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.completionOptions.mergeStrategy&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;value.completionOptions.transitionWorkItems&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="cm">/*
</span></span></span><span class="line"><span class="cl"><span class="cm">value.reviewers, on the other hand, is a list of records. For each list, we concatenate the displayName of each record
</span></span></span><span class="line"><span class="cl"><span class="cm">*/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded value.reviewers&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">TransformColumns</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value.completionOptions&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{{</span><span class="s2">&#34;value.reviewers&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">Combiner</span><span class="p">.</span><span class="n">CombineTextByDelimiter</span><span class="p">(</span><span class="s2">&#34;, &#34;</span><span class="p">)(</span><span class="n">List</span><span class="p">.</span><span class="k">Transform</span><span class="p">(</span><span class="n">_</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="p">[</span><span class="n">displayName</span><span class="p">]))</span><span class="err">}}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h3 id="add-details-from-other-apis">Add details from other APIs</h3>
<p>While the pull request endpoint provides a lot of useful information, it might not be enough. We often need to supplement our data with information from other Azure DevOps APIs to gain deeper insights. The process is similar to what we have done so far: pull data from an API and expand the JSON objects.</p>
<h4 id="iterations">Iterations</h4>
<p>Iterations are created as a result of creating and pushing updates to a pull request; the number of iterations equals the number of updates made after the pull request is created. Below is the Power BI M query to get the number of iterations for each pull request:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added iterations&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded value.reviewers&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">Json</span><span class="p">.</span><span class="n">Document</span><span class="p">(</span><span class="n">Web</span><span class="p">.</span><span class="n">Contents</span><span class="p">([</span><span class="n">value</span><span class="p">.</span><span class="n">url</span><span class="p">]</span><span class="o">&amp;</span><span class="s2">&#34;/iterations/&#34;</span><span class="p">))),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded iterations&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Added iterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;count&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;iterations.count&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h4 id="changes">Changes</h4>
<p>Another good metric to track is the number of files changed in each pull request. And we need to have the changes in all iterations, not just the initial pull request. Below is the code to retrieve the data from the API and extract the required information.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added iterations.changes&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded iterations&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations.changes&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">Json</span><span class="p">.</span><span class="n">Document</span><span class="p">(</span><span class="n">Web</span><span class="p">.</span><span class="n">Contents</span><span class="p">([</span><span class="n">value</span><span class="p">.</span><span class="n">url</span><span class="p">]</span><span class="o">&amp;</span><span class="s2">&#34;/iterations/&#34;</span><span class="o">&amp;</span><span class="nb">Number</span><span class="p">.</span><span class="n">ToText</span><span class="p">([</span><span class="n">iterations</span><span class="p">.</span><span class="k">count</span><span class="p">])</span><span class="o">&amp;</span><span class="s2">&#34;/changes?api-version=7.1-preview.1&#34;</span><span class="p">))),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded iterations.changes&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Added iterations.changes&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations.changes&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;changeEntries&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;iterations.changes.changeEntries&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Added iterations.changes.changeEntries.count&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded iterations.changes&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;iterations.changes.changeEntries.count&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">List</span><span class="p">.</span><span class="k">Count</span><span class="p">([</span><span class="n">iterations</span><span class="p">.</span><span class="n">changes</span><span class="p">.</span><span class="n">changeEntries</span><span class="p">])),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Removed iterations.changes.changeEntries&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">RemoveColumns</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Added iterations.changes.changeEntries.count&#34;</span><span class="p">,</span><span class="err">{</span><span class="s2">&#34;iterations.changes.changeEntries&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h4 id="threads">Threads</h4>
<p>Threads are an Azure DevOps object for managing and organizing pull request discussions. Teams can discuss specific changes directly by adding one or more comments to each thread. Analyzing threads can give us many useful insights.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added threads&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Removed iterations.changes.changeEntries&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;threads&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">Json</span><span class="p">.</span><span class="n">Document</span><span class="p">(</span><span class="n">Web</span><span class="p">.</span><span class="n">Contents</span><span class="p">([</span><span class="n">value</span><span class="p">.</span><span class="n">url</span><span class="p">]</span><span class="o">&amp;</span><span class="s2">&#34;/threads?api-version=7.1-preview.1&#34;</span><span class="p">))),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">#</span><span class="s2">&#34;Expanded threads&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">ExpandRecordColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Added threads&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;threads&#34;</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;value&#34;</span><span class="err">}</span><span class="p">,</span><span class="w"> </span><span class="err">{</span><span class="s2">&#34;threads.value&#34;</span><span class="err">}</span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><p>For example, we can count the comment threads. A comment thread should have the <code>status</code> attribute (<code>Active</code>, <code>Resolved</code>, or <code>Closed</code>).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added threads.value.commentCount&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="o">#</span><span class="s2">&#34;Expanded threads&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;threads.value.commentCount&#34;</span><span class="p">,</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="n">List</span><span class="p">.</span><span class="k">Sum</span><span class="p">(</span><span class="n">List</span><span class="p">.</span><span class="k">Transform</span><span class="p">([</span><span class="n">threads</span><span class="p">.</span><span class="n">value</span><span class="p">],</span><span class="w"> </span><span class="k">each</span><span class="w"> </span><span class="nb">Number</span><span class="p">.</span><span class="k">From</span><span class="p">(</span><span class="n">Record</span><span class="p">.</span><span class="n">HasFields</span><span class="p">(</span><span class="n">_</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;status&#34;</span><span class="p">))))),</span><span class="w">
</span></span></span></code></pre></div><p>Or we can get the approval or rejection information from the vote thread. A vote thread has a <code>CodeReviewThreadType</code> attribute with the value <code>VoteUpdate</code>. If the value of <code>CodeReviewVoteResult</code> is greater than 0, it is an approval; otherwise, it is a rejection. The M query below gets the first approval time of a pull request.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="s2">&#34;Added threads.value.firstApprovalTime&#34;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">Table</span><span class="p">.</span><span class="n">AddColumn</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">#</span><span class="s2">&#34;Added threads.value.commentCount&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s2">&#34;threads.value.firstApprovalTime&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">each</span><span class="w"> </span><span class="n">List</span><span class="p">.</span><span class="k">Min</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">List</span><span class="p">.</span><span class="k">Transform</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">[</span><span class="n">threads</span><span class="p">.</span><span class="n">value</span><span class="p">],</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">each</span><span class="w"> </span><span class="k">if</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="n">Record</span><span class="p">.</span><span class="n">HasFields</span><span class="p">(</span><span class="n">_</span><span class="p">[</span><span class="n">properties</span><span class="p">],</span><span class="w"> </span><span class="s2">&#34;CodeReviewThreadType&#34;</span><span class="p">)</span><span class="w"> </span><span class="k">and</span><span class="w"> </span><span class="n">Record</span><span class="p">.</span><span class="n">Field</span><span class="p">(</span><span class="n">_</span><span class="p">[</span><span class="n">properties</span><span class="p">][</span><span class="n">CodeReviewThreadType</span><span class="p">],</span><span class="w"> </span><span class="s2">&#34;$value&#34;</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">&#34;VoteUpdate&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="k">and</span><span class="w"> </span><span class="n">Record</span><span class="p">.</span><span class="n">HasFields</span><span class="p">(</span><span class="n">_</span><span class="p">[</span><span class="n">properties</span><span class="p">],</span><span class="w"> </span><span class="s2">&#34;CodeReviewVoteResult&#34;</span><span class="p">)</span><span class="w"> </span><span class="k">and</span><span class="w"> </span><span class="nb">Number</span><span class="p">.</span><span class="n">FromText</span><span class="p">(</span><span class="n">Record</span><span class="p">.</span><span class="n">Field</span><span class="p">(</span><span class="n">_</span><span class="p">[</span><span class="n">properties</span><span class="p">][</span><span class="n">CodeReviewVoteResult</span><span class="p">],</span><span class="w"> </span><span class="s2">&#34;$value&#34;</span><span class="p">))</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">then</span><span class="w"> </span><span class="n">_</span><span class="p">[</span><span class="n">publishedDate</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">else</span><span class="w"> </span><span class="k">null</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">),</span><span class="w">
</span></span></span></code></pre></div><h2 id="full-source-code">Full source code</h2>
<p>You can grab the source code, paste it into the Power BI Power Query advanced editor, and customize it to suit your needs.</p>
<p><a href="https://gist.github.com/ThaiDat/9aa1f08ea1a1339973566325b1cf9af9">Full Query</a></p>
<h2 id="visualize-insights">Visualize insights</h2>
<p>Now we have a rich dataset. Power BI offers a wide range of visual elements to help you uncover trends, patterns, and insights. It&rsquo;s time to bring our data to life with stunning visualizations.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Remember, this is just the beginning. As your project evolves and your data grows, you can expand your report to include additional metrics, refine visualizations, and explore new insights. Continuous improvement is essential to maximizing the value of your data.</p>
<p>By creating a comprehensive pull request report, you are taking the initial step toward establishing a culture of data-driven decision-making, first within your development team, then throughout your organization.</p>
]]></content:encoded></item><item><title>How to start a successful Data Warehouse project</title><link>https://note.datengineer.dev/posts/how-to-build-a-successful-data-warehouse-project/</link><pubDate>Sun, 11 Aug 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/how-to-build-a-successful-data-warehouse-project/</guid><description>In this article on the key factors for launching a successful Data Warehouse project, we will explore key considerations that can help ensure that your Data Warehouse achieves its intended goals and delivers value to the organization.</description><content:encoded><![CDATA[<p>Any organization aiming to leverage the power of data-driven decision-making stands to benefit greatly from a successful Data Warehouse project. A well-designed Data Warehouse not only centralizes your data but also guarantees that it is reliable, scalable, maintainable, and usable by stakeholders.</p>
<p>Over the past few months, my team and I have launched a new Data Warehouse project in production. The opportunity to start from scratch is always a valuable chance to gain new insights and expertise. I would like to share the experiences from this success story in the hope that they will be as beneficial to others as they have been to us.</p>
<h2 id="understand-business-requirements">Understand Business Requirements</h2>
<p>The first step in starting any project, not only a Data Warehouse, is to fully understand the business requirements. This is the difference between success and failure, not just a formality. If you skip this step, I can tell you with certainty that your project will be a waste of time, energy, and resources.</p>
<p>To really understand what the business wants to see and what your team needs to do, it&rsquo;s essential to spend time talking to the people who will be using the Data Warehouse. What do they hope to accomplish? How will it help them do their jobs better? How do they plan to use the data? Getting a clear picture of their goals is crucial to making sure your project is on the right track.</p>
<p><img alt="Importance of a clear requirements in Data Warehouse project success" loading="lazy" src="/posts/how-to-build-a-successful-data-warehouse-project/images/business-requirements.png"></p>
<p>However, this is where things often get complicated. People usually do not understand each other, especially people in different departments who have different perspectives, priorities, and terminologies. <strong>Sometimes people do not even understand what they themselves are saying.</strong> Business people are easily attracted to marketing buzzwords on the Internet, believing that these terms are the solutions to their problems. I have to say that the marketing departments of data companies do a really good job of reinventing new names for similar concepts. During this project, there were dozens of times stakeholders told me: let&rsquo;s use this tool, why not use this technology, money is not a problem (until they actually got the bill).</p>
<p>In one of my previous projects, a stakeholder told me that he wanted a visually stunning real-time dashboard that would make the numbers dance instantly whenever users did something in the web application. And I had to explain to him:</p>
<ul>
<li>Visually stunning: Yes, the data analysts team can always help you with that.</li>
<li>Real-time: There is no true real time. If the sun disappeared, we would know it only after 8 minutes. The same goes for data.</li>
<li>We do not really need it. Business is not going to sit still and watch the numbers dance every second.</li>
</ul>
<p>Patience is the key. They do not understand those technical buzzwords. Yes. But isn&rsquo;t that why you are here as a technical specialist? Your responsibility is to listen to them, understand them, empathize with them, and tell them what you will do to help them. Your job is to translate their requirements into a workable solution.</p>
<p>Remember that the business stakeholders are not only the end users, but also the investors. Without their buy-in, the project can&rsquo;t even get off the ground. They are funding the project, and they deserve the best service.</p>
<p>By starting with a clear understanding of business requirements, you set the stage for a Data Warehouse project that is aligned with the organization&rsquo;s goals, ensuring that the final product delivers real value.</p>
<h2 id="understand-system">Understand System</h2>
<p>A Data Warehouse is not an isolated island. It is more like a bustling city that relies on a network of interconnected systems. It receives supplies from surrounding farms and industrial areas. Since a Data Warehouse pulls data from other systems, you cannot build a successful one without understanding how those systems work.</p>
<p>Imagine stakeholders telling you they want the sales figure. Then you need to know exactly which systems hold the sales number. How is that number populated in each system? It may be manually entered by users, it may be automatically calculated, it may be synchronized from other sources, it may be read-only or editable&hellip; You need to know all the surrounding information to decide the source of truth for the number we desire. You may argue that all you need to do is copy the source database over and the business will know what to do with the data. Believe me, they don&rsquo;t. In fact, they have never seen the database a day in their lives. And you are the one who will tell them what they can do with your Data Warehouse.</p>
<p>Not knowing how the system works also puts your project design at risk. You certainly don&rsquo;t want to discover a surprise when you&rsquo;re almost done with the implementation, such as a scheduled job that archives data from the database daily. If you had known that from the beginning, your design would have been very different.</p>
<p>Understanding the entire system in detail can be time-consuming. You should have a good sense of how the interconnected systems work together, but don&rsquo;t expect to understand them in detail at the beginning of your project. Instead, I would suggest building strong relationships with the teams responsible for maintaining these systems. Meet with them, tell them what you are doing, and ask for their advice and insights. They are a goldmine of information. You can also experiment with sandbox environments and databases to uncover hidden patterns and processes.</p>
<h2 id="design-a-reliable-data-warehouse">Design a reliable Data Warehouse</h2>
<p><a href="../what-is-a-reliable-data-system">Reliability is the backbone of any Data Warehouse</a>. If your business can&rsquo;t rely on the data coming out of your Data Warehouse, your project is a complete failure.</p>
<p>Having a solid testing strategy will greatly help. Testing is not just about finding bugs, it&rsquo;s about building confidence. When you start designing the Data Warehouse, think less about the times when the system is running happily; there is nothing for us to do as long as it keeps running as it should. Think more about the times when the system is not working and what we are going to do then.</p>
<p><img alt="Bug is inevitable. The importance is how you deal with it." loading="lazy" src="/posts/how-to-build-a-successful-data-warehouse-project/images/there-will-be-no-bug-if-you-dont-write-any-code.png"></p>
<p>And even if you do your best, bugs and issues will still happen. Don&rsquo;t expect your system to be bug-free; instead, build processes to handle issues as soon as they arise. And most importantly, be transparent. If the business comes to you and asks about an issue they found, tell them what happened and what you are doing to help. Transparency is the key to trust. <strong>If you tell a lie, you are part of the problem; if you are transparent, you are part of the solution.</strong> A reliable Data Warehouse isn&rsquo;t just about technology. It&rsquo;s about building trust.</p>
<h2 id="choose-the-right-tool-for-the-right-job">Choose the right tool for the right job</h2>
<p>To build a Data Warehouse, you need a toolbox filled with different pieces to complete the picture: tools for copying data, transforming it, orchestrating jobs, and more. It is technically possible to create the tools yourself, especially if you are in a big corporation and want to control every aspect of the technology. However, in most cases, it is impractical. You do not have enough resources to own the technology. Thus, developing a Data Warehouse solution usually means picking the available tools and services and making them work together.</p>
<p>The real challenge is choosing the right tools. Beware of your enemies, the shiny marketing promises. The person who writes those buzzwords may not be the one who writes the code. Sometimes I don&rsquo;t understand what they wrote, and I think they don&rsquo;t understand what they wrote either. These tools are very expensive. It is important to avoid overkill. Focus on what your business really needs, not just what sounds cool. We are not going to use the most popular or the most expensive tools; we are going to find the right fit for our specific needs.</p>
<h2 id="start-small-grow-big">Start small, Grow big</h2>
<p>Your investors do not have infinite patience. They want to see progress and value. Building something small but functional is far better than promising a grand project that never finishes. By starting small, you can quickly deliver value and gather feedback from users.</p>
<p>With limited resources, we cannot get everything done at once. It is important to prioritize. What matters most to your business? What will have the biggest impact on your customers? Concentrate on delivering those core features first. You can break the project into phases, which is a good practice. Each phase focuses on specific business requirements, data sources, or user groups. And you can gradually expand the capabilities of the Data Warehouse.</p>
<h2 id="engage-users">Engage users</h2>
<p>A Data Warehouse is not just a technical marvel. It is a tool for your business. To ensure it delivers maximum value, you need to involve your users from the very beginning.</p>
<p>Imagine building a house without consulting the people who will live in it. People can still live in it, but they never feel it is their home. By involving them early and often, you will gain valuable insight into their needs, expectations, and challenges.</p>
<p>How can you engage your users?</p>
<ul>
<li>Involve them in the planning phase: Understand their data needs, pain points, and desired outcomes.</li>
<li>Provide regular updates: Keep them informed about project progress and involve them in decision-making.</li>
<li>Offer training and support: Equip users with the skills to effectively use the Data Warehouse.</li>
<li>Gather feedback: Encourage users to share their thoughts and suggestions for improvement.</li>
</ul>
<p>Remember that if you can not engage your users, any slightly higher number in their reports will quickly become <strong>your</strong> problem. <strong>If you can engage them and make them feel like they are part of the project, then any issue will become everyone&rsquo;s problem.</strong></p>
<h2 id="conclusion">Conclusion</h2>
<p>Building a successful Data Warehouse is a challenging journey that requires careful planning, execution, and continuous improvement. It all starts with a deep understanding of the business requirements to ensure that every decision is aligned with the organization&rsquo;s goals. Start small, iterate often, and always keep the user at the center of your efforts. A successful Data Warehouse is a collaboration between the engineering team and the business. By working together, you can create a solution that truly delivers value.</p>
]]></content:encoded></item><item><title>Understand Row-Oriented vs Column-Oriented Storage</title><link>https://note.datengineer.dev/posts/understand-row-oriented-vs-column-oriented-storage/</link><pubDate>Fri, 05 Apr 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/understand-row-oriented-vs-column-oriented-storage/</guid><description>Explore the basics and benefits of column-oriented storage, and learn about its advantages over row-oriented databases in processing OLAP queries.</description><content:encoded><![CDATA[<p>The way we access and analyze data has changed a lot lately. Row-oriented storage, which has been the standard for data storage for a long time, is having trouble keeping up with the demands of modern data analysis. In this article, I will introduce you to column-oriented storage and how it can help analytical queries run faster.</p>
<h2 id="olap">OLAP</h2>
<p>In my previous post, we discussed the <a href="../oltp-olap-why-we-need-data-warehouse">differences between Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP)</a>. As a reminder, analytical queries following the OLAP access pattern typically:</p>
<ul>
<li>Consume a large number of records</li>
<li>Focus on only a specific subset of columns from each record</li>
<li>Aggregate data to calculate statistics (e.g., averages, sums)</li>
</ul>
<h2 id="row-oriented-storage">Row-Oriented Storage</h2>
<p>Row-oriented storage, a type of storage engine optimized for OLTP, stores all values belonging to a single row near each other. The entire row is essentially stored as a sequence of bytes and is usually indexed for quick retrieval. When you provide a key, the database efficiently locates the physical location of the row on disk. It then goes to that address, loads the sequence of bytes into memory, and parses it to extract the specific values you need. Let&rsquo;s think of it like a CSV file. A row is stored as a string of characters. If you want to access the 10th row, you have to scan past the first nine line breaks and read all the characters until you reach the next line break. Then you parse the result by splitting it on commas to get the information you want.</p>
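<p>To make the analogy concrete, here is a minimal Python sketch of the CSV lookup described above. The file contents and column names are made up for illustration; real storage engines work on binary pages, not text, but the scan-then-parse pattern is the same.</p>

```python
import io

# Hypothetical miniature "table": each row is stored as one line of
# text with comma-separated values (a row-oriented layout).
csv_data = "".join(f"{i},product_{i},{i * 10}\n" for i in range(100))

def read_nth_row(f, n):
    """Return the parsed n-th row (0-based).

    Without an index we must scan past every earlier line break just
    to find the row, then split the whole line on commas to extract
    the individual values.
    """
    for i, line in enumerate(f):
        if i == n:
            return line.rstrip("\n").split(",")
    raise IndexError(n)

print(read_nth_row(io.StringIO(csv_data), 9))  # ['9', 'product_9', '90']
```

An index is essentially a shortcut past this scan: given a key, it hands back the byte offset of the row so the engine can seek there directly instead of reading everything before it.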
<p>While row-oriented storage is great for reading and writing individual records, it quickly becomes less suitable when faced with the demands of OLAP:</p>
<ul>
<li>Indexes, the data structures that allow most row-oriented storage engines to quickly locate data, don&rsquo;t work well with analytical queries. Analytical queries don&rsquo;t access data using a specific key or ID. Instead, they often use multiple conditions, such as a creation date within a given year or a product category among specific types. Any column in the table can be used in the <code>where</code> clause, and we can&rsquo;t just create a separate row-based index for each column.</li>
<li>Reading a single row in row-oriented storage requires loading the entire sequence of bytes from disk into memory. Thus, reading a huge number of rows with hundreds of columns (which is typical in OLAP) quickly becomes inefficient.</li>
</ul>
<h2 id="column-oriented-storage">Column-Oriented Storage</h2>
<p>Column-oriented storage is based on a simple idea: instead of storing all the values from one row together, just store all the values from each column together. Because the data is organized by column, a query only needs to access and process the columns that are relevant to its needs. This significantly reduces the amount of data that needs to be transferred and parsed, resulting in dramatic performance gains.</p>
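<p>As a rough Python sketch (with small made-up numbers, not a real storage engine), the columnar idea looks like this: each column lives in its own contiguous sequence, so an aggregate query reads one list and ignores everything else.</p>

```python
# Hypothetical sales data in a columnar layout: one list per column.
columns = {
    "DATE": ["2023/12/28", "2024/01/11", "2024/01/16", "2024/01/21", "2024/02/02"],
    "PRODUCT_KEY": [2, 2, 8, 6, 5],
    "QUANTITY": [3, 5, 101, 5, 2],
    "DISCOUNT": [0.00, 0.00, 15.00, 5.00, 0.00],
}

# A columnar aggregate touches only the QUANTITY values; DATE,
# DISCOUNT, and every other column stay untouched on disk.
total_quantity = sum(columns["QUANTITY"])

# A row-oriented engine would materialize whole rows first, loading
# and parsing every field of every row along the way.
rows = list(zip(*columns.values()))
total_from_rows = sum(row[2] for row in rows)  # index 2 = QUANTITY

assert total_quantity == total_from_rows == 116
```

Both computations produce the same answer; the difference is how many bytes each has to move and parse to get there, which is exactly where the performance gap comes from.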
<p>Let&rsquo;s look at the example below. A sales table stored in a row-oriented format looks like this.</p>
<table>
  <thead>
      <tr>
          <th>DATE</th>
          <th>PRODUCT_KEY</th>
          <th>CUSTOMER_KEY</th>
          <th>QUANTITY</th>
          <th>DISCOUNT</th>
          <th>PAYMENT_METHOD</th>
          <th>&hellip;</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2023/12/28</td>
          <td>2</td>
          <td>13</td>
          <td>3</td>
          <td>0.00</td>
          <td>card</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>2024/01/11</td>
          <td>2</td>
          <td>49</td>
          <td>5</td>
          <td>0.00</td>
          <td>bank</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>2024/01/16</td>
          <td>8</td>
          <td>49</td>
          <td>101</td>
          <td>15.00</td>
          <td>card</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>2024/01/21</td>
          <td>6</td>
          <td>55</td>
          <td>5</td>
          <td>5.00</td>
          <td>card</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>2024/02/02</td>
          <td>5</td>
          <td>26</td>
          <td>2</td>
          <td>0.00</td>
          <td>bank</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
      </tr>
  </tbody>
</table>
<p>Column-oriented storage serializes all the values in a column and stores them together as a sequence of bytes. For our example table, the data would be stored this way:</p>
<table>
  <thead>
      <tr>
          <th>Column</th>
          <th>Row 1</th>
          <th>Row 2</th>
          <th>Row 3</th>
          <th>Row 4</th>
          <th>Row 5</th>
          <th>&hellip;</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DATE</td>
          <td>2023/12/28</td>
          <td>2024/01/11</td>
          <td>2024/01/16</td>
          <td>2024/01/21</td>
          <td>2024/02/02</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>PRODUCT_KEY</td>
          <td>2</td>
          <td>2</td>
          <td>8</td>
          <td>6</td>
          <td>5</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>CUSTOMER_KEY</td>
          <td>13</td>
          <td>49</td>
          <td>49</td>
          <td>55</td>
          <td>26</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>QUANTITY</td>
          <td>3</td>
          <td>5</td>
          <td>101</td>
          <td>5</td>
          <td>2</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>DISCOUNT</td>
          <td>0.00</td>
          <td>0.00</td>
          <td>15.00</td>
          <td>5.00</td>
          <td>0.00</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>PAYMENT_METHOD</td>
          <td>card</td>
          <td>bank</td>
          <td>card</td>
          <td>card</td>
          <td>bank</td>
          <td>&hellip;</td>
      </tr>
      <tr>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
      </tr>
  </tbody>
</table>
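<p>To make the layout concrete, here is a minimal in-memory Python sketch of the two representations (a toy model, not how a real storage engine lays out bytes on disk), using the first three rows of the sales table. Notice that an aggregate such as <code>SUM(QUANTITY)</code> only ever touches one of the columns.</p>

```python
# Toy sketch of row-oriented vs column-oriented layout (first 3 sample rows).
# Row-oriented: each record keeps all of its column values together.
rows = [
    {"DATE": "2023/12/28", "PRODUCT_KEY": 2, "CUSTOMER_KEY": 13, "QUANTITY": 3,   "DISCOUNT": 0.00,  "PAYMENT_METHOD": "card"},
    {"DATE": "2024/01/11", "PRODUCT_KEY": 2, "CUSTOMER_KEY": 49, "QUANTITY": 5,   "DISCOUNT": 0.00,  "PAYMENT_METHOD": "bank"},
    {"DATE": "2024/01/16", "PRODUCT_KEY": 8, "CUSTOMER_KEY": 49, "QUANTITY": 101, "DISCOUNT": 15.00, "PAYMENT_METHOD": "card"},
]

# Column-oriented: one sequence per column, holding that column's value
# for every row, in row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical query such as SUM(QUANTITY) reads a single column and
# can ignore the other five entirely.
total_quantity = sum(columns["QUANTITY"])
print(total_quantity)  # 109
```

<p>In the row layout, computing the same sum would force us to load every byte of every record; in the column layout we read one contiguous sequence of integers.</p>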
<h2 id="advantages-of-column-oriented-storage">Advantages of Column-Oriented Storage</h2>
<p>Reading data from column-oriented storage provides several key advantages over traditional row-oriented storage, especially for analytical workloads:</p>
<ul>
<li><strong>Column compression</strong>: Because data in a modern data warehouse is typically denormalized, values in a column tend to repeat. Many popular compression algorithms, such as LZW or run-length encoding, exploit the similarity of <strong>adjacent</strong> data to reduce data size. Look at the <code>PAYMENT_METHOD</code> column in our example. What if, instead of storing a full 4-byte string, we only needed 1 bit per value: 0 for <code>card</code> and 1 for <code>bank</code>? The whole column then becomes one long bitmap in which each row consumes only 1 bit on disk.</li>
<li><strong>Access time</strong>: Disk access is a real bottleneck. When working with data on disk, we need specialized data structures and algorithms (B-trees, for example) to minimize access time. By reading only the data needed to process the query and compressing it well, we can scan more rows in a single read. That means fewer reads to scan an entire table with trillions of rows, and therefore less disk access time.</li>
<li><strong>Throughput</strong>: Fetching only the necessary columns and compressing the data also lead to higher throughput, i.e. the amount of data processed in a given time. Throughput is extremely important when compute and storage are not co-located and data must be transferred over the network.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>Each database implementation varies in its specific optimizations. However, the fundamental principle - storing and processing data by column rather than by row - remains the same, and it leads to significant performance gains for analytical queries. Understanding how your database works behind the scenes makes you a better engineer: knowing what your tool does means knowing what you do.</p>
]]></content:encoded></item><item><title>OLTP &amp; OLAP - Why we need Data Warehouse</title><link>https://note.datengineer.dev/posts/oltp-olap-why-we-need-data-warehouse/</link><pubDate>Wed, 28 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/oltp-olap-why-we-need-data-warehouse/</guid><description>Understand the fundamental distinctions between OLTP and OLAP databases, and gain insights into the need for a separate database called a Data Warehouse</description><content:encoded><![CDATA[<p>Today, I was advising a team on building their data warehouse solution. I realized that even 40 years after the term &ldquo;data warehouse&rdquo; was first introduced, there are still questions, especially from executives, about why we need a data warehouse and why we can&rsquo;t simply pull all of the data from application databases. I write this post to answer these questions by clarifying the terms OLTP and OLAP, which come up frequently in discussions about data warehouse architecture. Then I will explain why OLTP databases are inefficient for OLAP queries and why you need a separate database known as a data warehouse.</p>
<h2 id="oltp">OLTP</h2>
<p>OLTP, or <strong>Online Transaction Processing</strong>, is a pattern by which we access and manipulate data in the database transaction by transaction. A transaction refers to a single unit of work, such as a money transfer, a booking, or a blog post. Typically, a user interacts with only one or a few transactions at a time, so most of the time applications look up a small number of records by some key. Application databases implement special indexing structures such as B-trees or LSM-trees to handle OLTP efficiently: given its indexed key, they can quickly access a particular transaction.</p>
<h2 id="olap">OLAP</h2>
<p>As businesses grow and accumulate data, they need to analyze it to gain valuable insights about their market and customers. Then they can make informed decisions and gain competitive advantage. When it comes to analytics, access patterns will be very different. Typically, analytic queries consume a large number of records, look for only a few specific columns of each record, and often aggregate data to calculate statistics (min, max, sum, average,&hellip;). This pattern of accessing data in the database is called <strong>Online Analytic Processing</strong> (OLAP).</p>
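<p>The two access patterns can be sketched with a couple of queries. This is an illustrative Python example against an in-memory SQLite database; the <code>orders</code> table and its columns are made up for the demo, not taken from any real schema.</p>

```python
import sqlite3

# A tiny orders table standing in for an application database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0), (4, 12, 60.0)],
)

# OLTP-style: fetch one record by its indexed key.
row = conn.execute("SELECT * FROM orders WHERE order_id = ?", (2,)).fetchone()
print(row)  # (2, 11, 40.0)

# OLAP-style: scan many records, touch few columns, aggregate.
total, avg = conn.execute("SELECT SUM(amount), AVG(amount) FROM orders").fetchone()
print(total, avg)  # 140.0 35.0
```

<p>The first query touches a single row through the primary-key index; the second must visit every row, but only cares about one column and returns a statistic rather than records.</p>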
<h2 id="difference-between-oltp-and-olap">Difference between OLTP and OLAP</h2>
<p>From the definitions above, we can already distinguish OLTP from OLAP. The following table summarizes the typical differences:</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>OLTP</th>
          <th>OLAP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Access</td>
          <td>Small number of records, using indexed keys</td>
          <td>Large number of records, often aggregated</td>
      </tr>
      <tr>
          <td>Purpose</td>
          <td>Application transactional consistency and speed</td>
          <td>Complex queries and analysis</td>
      </tr>
      <tr>
          <td>Users</td>
          <td>Application end users</td>
          <td>Analysts and business users</td>
      </tr>
      <tr>
          <td>Data volume</td>
          <td>Relatively small, frequently accessed</td>
          <td>Large datasets, accessed less frequently</td>
      </tr>
      <tr>
          <td>Data type</td>
          <td>Real-time, current data</td>
          <td>Historical, aggregated data</td>
      </tr>
  </tbody>
</table>
<p><em>* Differences between OLTP and OLAP</em></p>
<h2 id="problems-of-oltp-databases-with-olap-queries">Problems of OLTP Databases with OLAP queries</h2>
<p>When your business is still young, it is easy to run analysis directly on application databases. However, as the volume of data and the need for analysis grows along with the business, problems arise. Databases that were optimized for OLTP using indexing techniques such as LSM tree or B-tree now struggle to execute OLAP queries efficiently. As a result, running OLAP queries becomes costly and negatively impacts application performance, which is critical to business success.</p>
<p>As the business continues to grow, different business units tend to operate independently, with their own goals, priorities, concerns, and IT budgets. Each unit maintains its own applications running on separate databases. Performing analysis when data is scattered across different locations is difficult, and analysts often end up exporting data from different places, putting it into a single Excel file, and stitching it together with VLOOKUP.</p>
<h2 id="data-warehouse">Data Warehouse</h2>
<p>In response to the challenges of running OLAP queries on operational business databases, the concept of a data warehouse emerged as a solution.</p>
<ul>
<li>A data warehouse functions as a dedicated space for analytical purposes. It allows a business to store massive amounts of historical and current data without impacting operational databases.</li>
<li>Data warehouses are designed with a focus on analytical processing. Their storage engines use specialized techniques to speed up OLAP queries. We may explore these techniques in other posts.</li>
<li>Data warehouses serve as a centralized repository for data from various sources. Analysis becomes easier because all of the necessary data is in a single place.</li>
</ul>
<p><img alt="OLTP Databases to OLAP Data Warehouse" loading="lazy" src="/posts/oltp-olap-why-we-need-data-warehouse/images/data-from-oltp-databases-to-olap-data-warehouse.png"></p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we&rsquo;ve gone over the definitions and differences between OLTP and OLAP. We also looked at the role of the data warehouse in conducting business analysis. Understanding them should give you confidence the next time you tell your boss, &ldquo;We need a data warehouse.&rdquo;</p>
]]></content:encoded></item><item><title>Recursive CTEs and CONNECT BY in SQL to query Hierarchical data</title><link>https://note.datengineer.dev/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/</link><pubDate>Tue, 20 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/</guid><description>Discover the concept of hierarchical data in SQL and see real-life examples. Learn how to query hierarchical data and extract insights with advanced SQL features: Recursive CTEs and CONNECT BY.</description><content:encoded><![CDATA[<p>In database design, the idea of hierarchical data represents relationships between entities as a tree-like structure. This type of data model is widely used in many domains, such as file systems, organizational structure, etc. When dealing with hierarchical data, it is crucial to efficiently query and extract information about the relationships between entities. In this post, we will explore two powerful SQL tools for querying hierarchical data: recursive Common Table Expressions (CTEs) and the CONNECT BY clause.</p>
<h2 id="hierarchical-data">Hierarchical Data</h2>
<p>Hierarchical data represents a natural parent-child relationship that is often visualized in the form of a tree structure. Imagine a family tree: grandparents on top, parents in the middle, and you and your siblings at the bottom, all connected. That&rsquo;s hierarchical data! It organizes information in levels, making it easy to understand how things are related. The most popular real-life example of hierarchical data is the employee-manager relationship: every employee is managed by a manager; that manager is also an employee, who in turn is managed by another manager.</p>
<p><img alt="Hierarchical Data example Employee-Manager relationship" loading="lazy" src="/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/images/hierarchical-data-exampl-employee-manager-sql.png"></p>
<h2 id="hierarchical-data-representation-in-sql">Hierarchical Data representation in SQL</h2>
<p>Relational models work best with flat tables of rows and columns, not tree-like structures. However, techniques exist to represent hierarchical data in SQL. The most common approach is a self-referencing foreign key: in the example above, we add a <code>MANAGER_ID</code> column to the employee table that refers to the employee&rsquo;s manager.</p>
<table>
  <thead>
      <tr>
          <th>EMPLOYEE_ID</th>
          <th>NAME</th>
          <th>SALARY</th>
          <th>MANAGER_ID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Adam</td>
          <td>60000</td>
          <td>NULL</td>
      </tr>
      <tr>
          <td>2</td>
          <td>John</td>
          <td>30000</td>
          <td>1</td>
      </tr>
      <tr>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
          <td>&hellip;</td>
      </tr>
  </tbody>
</table>
<p><em>*Example SQL table representing hierarchical data structure</em></p>
<h2 id="querying-hierarchical-data">Querying Hierarchical Data</h2>
<p>When querying hierarchical data, we often want to understand the relationship in both directions: who manages whom and who is managed by whom. However, querying hierarchical data is tricky because we don&rsquo;t know the depth of the tree, i.e. how many levels of hierarchy there are. Before we look at how to do this in SQL, let&rsquo;s prepare some data to work with. Note that all SQL code in this post is written for Oracle, as it natively supports CONNECT BY; the equivalent SQL for other RDBMSs should be similar.</p>


<p><details >
  <summary markdown="span"><em>Example data</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">EMPLOYEE_ID</th>
          <th>NAME</th>
          <th style="text-align: right">SALARY</th>
          <th style="text-align: right">MANAGER_ID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">1</td>
          <td>Adam</td>
          <td style="text-align: right">60000</td>
          <td style="text-align: right">null</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td>Sarah</td>
          <td style="text-align: right">70000</td>
          <td style="text-align: right">null</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td>David</td>
          <td style="text-align: right">50000</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td>Emily</td>
          <td style="text-align: right">55000</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td>Michael</td>
          <td style="text-align: right">45000</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td>Jessica</td>
          <td style="text-align: right">50000</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td>Ben</td>
          <td style="text-align: right">35000</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>Olivia</td>
          <td style="text-align: right">37000</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td>Charles</td>
          <td style="text-align: right">32000</td>
          <td style="text-align: right">5</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td>Sophia</td>
          <td style="text-align: right">33000</td>
          <td style="text-align: right">6</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td>Alex</td>
          <td style="text-align: right">37000</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td>Maya</td>
          <td style="text-align: right">38000</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td>Daniel</td>
          <td style="text-align: right">35000</td>
          <td style="text-align: right">5</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td>Isabella</td>
          <td style="text-align: right">36000</td>
          <td style="text-align: right">6</td>
      </tr>
      <tr>
          <td style="text-align: right">15</td>
          <td>Ryan</td>
          <td style="text-align: right">25000</td>
          <td style="text-align: right">7</td>
      </tr>
      <tr>
          <td style="text-align: right">16</td>
          <td>Chloe</td>
          <td style="text-align: right">26000</td>
          <td style="text-align: right">8</td>
      </tr>
      <tr>
          <td style="text-align: right">17</td>
          <td>Noah</td>
          <td style="text-align: right">24000</td>
          <td style="text-align: right">9</td>
      </tr>
      <tr>
          <td style="text-align: right">18</td>
          <td>Mia</td>
          <td style="text-align: right">25000</td>
          <td style="text-align: right">10</td>
      </tr>
      <tr>
          <td style="text-align: right">19</td>
          <td>Liam</td>
          <td style="text-align: right">26000</td>
          <td style="text-align: right">11</td>
      </tr>
      <tr>
          <td style="text-align: right">20</td>
          <td>Evelyn</td>
          <td style="text-align: right">27000</td>
          <td style="text-align: right">12</td>
      </tr>
      <tr>
          <td style="text-align: right">21</td>
          <td>William</td>
          <td style="text-align: right">25000</td>
          <td style="text-align: right">13</td>
      </tr>
      <tr>
          <td style="text-align: right">22</td>
          <td>Charlotte</td>
          <td style="text-align: right">26000</td>
          <td style="text-align: right">14</td>
      </tr>
      <tr>
          <td style="text-align: right">23</td>
          <td>Ethan</td>
          <td style="text-align: right">27000</td>
          <td style="text-align: right">7</td>
      </tr>
      <tr>
          <td style="text-align: right">24</td>
          <td>Ava</td>
          <td style="text-align: right">28000</td>
          <td style="text-align: right">8</td>
      </tr>
      <tr>
          <td style="text-align: right">25</td>
          <td>Lucas</td>
          <td style="text-align: right">26000</td>
          <td style="text-align: right">9</td>
      </tr>
      <tr>
          <td style="text-align: right">26</td>
          <td>Amelia</td>
          <td style="text-align: right">27000</td>
          <td style="text-align: right">10</td>
      </tr>
      <tr>
          <td style="text-align: right">27</td>
          <td>Mason</td>
          <td style="text-align: right">28000</td>
          <td style="text-align: right">11</td>
      </tr>
      <tr>
          <td style="text-align: right">28</td>
          <td>Harper</td>
          <td style="text-align: right">29000</td>
          <td style="text-align: right">12</td>
      </tr>
      <tr>
          <td style="text-align: right">29</td>
          <td>Logan</td>
          <td style="text-align: right">27000</td>
          <td style="text-align: right">13</td>
      </tr>
      <tr>
          <td style="text-align: right">30</td>
          <td>Sofia</td>
          <td style="text-align: right">28000</td>
          <td style="text-align: right">14</td>
      </tr>
  </tbody>
</table>

</details></p>

<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">EMPLOYEE_ID</span><span class="w"> </span><span class="nb">NUMBER</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">NAME</span><span class="w"> </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">SALARY</span><span class="w"> </span><span class="nb">NUMBER</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="nb">NUMBER</span><span class="w"> </span><span class="k">REFERENCES</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="p">(</span><span class="n">EMPLOYEE_ID</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="k">VALUES</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Adam&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">60000</span><span class="p">,</span><span class="w"> </span><span class="k">NULL</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Sarah&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">70000</span><span class="p">,</span><span class="w"> </span><span class="k">NULL</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;David&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">50000</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Emily&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">55000</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Michael&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">45000</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Jessica&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">50000</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">7</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Ben&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">35000</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Olivia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">37000</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">9</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Charles&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">32000</span><span class="p">,</span><span class="w"> </span><span class="mi">5</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Sophia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">33000</span><span class="p">,</span><span class="w"> </span><span class="mi">6</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">11</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Alex&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">37000</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Maya&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">38000</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">13</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Daniel&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">35000</span><span class="p">,</span><span class="w"> </span><span class="mi">5</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">14</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Isabella&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">36000</span><span class="p">,</span><span class="w"> </span><span class="mi">6</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Ryan&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">25000</span><span class="p">,</span><span class="w"> </span><span class="mi">7</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Chloe&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">26000</span><span class="p">,</span><span class="w"> </span><span class="mi">8</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">17</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Noah&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">24000</span><span class="p">,</span><span class="w"> </span><span class="mi">9</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">18</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Mia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">25000</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">19</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Liam&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">26000</span><span class="p">,</span><span class="w"> </span><span class="mi">11</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Evelyn&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">27000</span><span class="p">,</span><span class="w"> </span><span class="mi">12</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">21</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;William&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">25000</span><span class="p">,</span><span class="w"> </span><span class="mi">13</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">22</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Charlotte&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">26000</span><span class="p">,</span><span class="w"> </span><span class="mi">14</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">23</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Ethan&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">27000</span><span class="p">,</span><span class="w"> </span><span class="mi">7</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">24</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Ava&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">28000</span><span class="p">,</span><span class="w"> </span><span class="mi">8</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">25</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Lucas&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">26000</span><span class="p">,</span><span class="w"> </span><span class="mi">9</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">26</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Amelia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">27000</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">27</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Mason&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">28000</span><span class="p">,</span><span class="w"> </span><span class="mi">11</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Harper&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">29000</span><span class="p">,</span><span class="w"> </span><span class="mi">12</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">29</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Logan&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">27000</span><span class="p">,</span><span class="w"> </span><span class="mi">13</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="mi">30</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Sofia&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">28000</span><span class="p">,</span><span class="w"> </span><span class="mi">14</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><p><img alt="Hierarchical Data example in SQL query" loading="lazy" src="/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/images/hierarchical-data-examples-in-sql-query.jpg"></p>
<h3 id="problem-statement">Problem statement</h3>
<p>We want to look at it from two directions:</p>
<ul>
<li><strong>Problem 1</strong>: Start with individual employees and follow the ladder, revealing who manages them, their manager&rsquo;s manager, and so on, all the way to the top.</li>
<li><strong>Problem 2</strong>: Stand at the highest level and look down the hierarchy. For each employee, calculate the total salary of everyone under their direct or indirect management, like a salary pyramid.</li>
</ul>
<p>When looking at direct relationships (who manages whom), a simple join will do. But things get tricky once we need to traverse multiple levels of the hierarchy.</p>
<h3 id="recursive-cte">Recursive CTE</h3>
<p>A recursive CTE (Common Table Expression) is a valuable feature in SQL. By referencing itself, the CTE repeats a step level by level until it has reached every branch, making it an effective tool for querying and analyzing hierarchical data.</p>
<p>The recursive CTE syntax is not too different from the non-recursive one.</p>
<p>Non-recursive CTE:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">CTE_NAME</span><span class="w"> </span><span class="p">(</span><span class="n">column_1</span><span class="p">,</span><span class="w"> </span><span class="n">column2</span><span class="p">,</span><span class="w"> </span><span class="p">...)</span><span class="w"> </span><span class="k">AS</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- CTE_QUERY_DEFINITION (SELECT ... FROM ... WHERE)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>Recursive CTE:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">CTE_NAME</span><span class="w"> </span><span class="p">(</span><span class="n">column_1</span><span class="p">,</span><span class="w"> </span><span class="n">column2</span><span class="p">,</span><span class="w"> </span><span class="p">...)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- ANCHOR_MEMBER (SELECT ... FROM ... WHERE BASE_LEVEL)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">UNION</span><span class="w"> </span><span class="p">(</span><span class="k">ALL</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- RECURSIVE_MEMBER (SELECT ... FROM (reference to CTE_NAME) WHERE ...)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>The definition of a recursive CTE consists of two parts. The anchor member, or initial query, is executed once and defines the starting point of the recursion. The recursive member references the CTE itself and is executed repeatedly until it returns no rows. <code>UNION</code> or <code>UNION ALL</code> combines the results of the two parts. You will see how this works when we use it to solve real problems.</p>
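<p>To make the anchor/recursive split concrete, here is a minimal, runnable sketch (not from the original post) using Python&rsquo;s built-in <code>sqlite3</code>, since SQLite also supports recursive CTEs via <code>WITH RECURSIVE</code>. The CTE name <code>counter</code> and the bound of 5 are arbitrary choices for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE counter(n) AS (
        SELECT 1          -- anchor member: runs once and seeds the CTE
        UNION ALL
        SELECT n + 1      -- recursive member: references the CTE itself
        FROM counter
        WHERE n < 5       -- termination condition; without it, the recursion never stops
    )
    SELECT n FROM counter
""").fetchall()
print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
```

<p>The anchor produces the single row <code>1</code>; each recursive pass adds <code>n + 1</code> until the <code>WHERE</code> clause filters everything out, at which point the recursion terminates.</p>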
<h4 id="problem-1">Problem 1</h4>
<p>For each employee, we get the direct manager, the path from the highest-level manager, and the employee&rsquo;s current level.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">company_hierarchy</span><span class="w"> </span><span class="p">(</span><span class="n">EMPLOYEE_ID</span><span class="p">,</span><span class="w"> </span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="n">MANAGER</span><span class="p">,</span><span class="w"> </span><span class="n">MANAGER_PATH</span><span class="p">,</span><span class="w"> </span><span class="n">EMPLOYEE_LEVEL</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- Base query. Select ALL employees with no manager
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">NAME</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="k">NULL</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="s1">&#39;&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER_PATH</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="mi">1</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">EMPLOYEE_LEVEL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">WHERE</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">UNION</span><span class="w"> </span><span class="k">ALL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- Recursive query which refer to the CTE itself.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="c1">-- Query all employees managed directly by someone already in company_hierarchy
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">NAME</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">NAME</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">MANAGER_PATH</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">com</span><span class="p">.</span><span class="n">NAME</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39;/&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER_PATH</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">EMPLOYEE_LEVEL</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">EMPLOYEE_LEVEL</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">INNER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">company_hierarchy</span><span class="w"> </span><span class="n">com</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">ON</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">com</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">company_hierarchy</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>Recursive CTE query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">EMPLOYEE_ID</th>
          <th>NAME</th>
          <th>MANAGER</th>
          <th>MANAGER_PATH</th>
          <th style="text-align: right">EMPLOYEE_LEVEL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">1</td>
          <td>Adam</td>
          <td>null</td>
          <td>null</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td>Sarah</td>
          <td>null</td>
          <td>null</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td>David</td>
          <td>Adam</td>
          <td>Adam/</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td>Michael</td>
          <td>Adam</td>
          <td>Adam/</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td>Emily</td>
          <td>Sarah</td>
          <td>Sarah/</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td>Jessica</td>
          <td>Sarah</td>
          <td>Sarah/</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td>Ben</td>
          <td>David</td>
          <td>Adam/David/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td>Alex</td>
          <td>David</td>
          <td>Adam/David/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td>Charles</td>
          <td>Michael</td>
          <td>Adam/Michael/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td>Daniel</td>
          <td>Michael</td>
          <td>Adam/Michael/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>Olivia</td>
          <td>Emily</td>
          <td>Sarah/Emily/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td>Maya</td>
          <td>Emily</td>
          <td>Sarah/Emily/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td>Sophia</td>
          <td>Jessica</td>
          <td>Sarah/Jessica/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td>Isabella</td>
          <td>Jessica</td>
          <td>Sarah/Jessica/</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">15</td>
          <td>Ryan</td>
          <td>Ben</td>
          <td>Adam/David/Ben/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">23</td>
          <td>Ethan</td>
          <td>Ben</td>
          <td>Adam/David/Ben/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">19</td>
          <td>Liam</td>
          <td>Alex</td>
          <td>Adam/David/Alex/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">27</td>
          <td>Mason</td>
          <td>Alex</td>
          <td>Adam/David/Alex/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">17</td>
          <td>Noah</td>
          <td>Charles</td>
          <td>Adam/Michael/Charles/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">25</td>
          <td>Lucas</td>
          <td>Charles</td>
          <td>Adam/Michael/Charles/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">21</td>
          <td>William</td>
          <td>Daniel</td>
          <td>Adam/Michael/Daniel/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">29</td>
          <td>Logan</td>
          <td>Daniel</td>
          <td>Adam/Michael/Daniel/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">16</td>
          <td>Chloe</td>
          <td>Olivia</td>
          <td>Sarah/Emily/Olivia/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">24</td>
          <td>Ava</td>
          <td>Olivia</td>
          <td>Sarah/Emily/Olivia/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">20</td>
          <td>Evelyn</td>
          <td>Maya</td>
          <td>Sarah/Emily/Maya/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">28</td>
          <td>Harper</td>
          <td>Maya</td>
          <td>Sarah/Emily/Maya/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">18</td>
          <td>Mia</td>
          <td>Sophia</td>
          <td>Sarah/Jessica/Sophia/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">26</td>
          <td>Amelia</td>
          <td>Sophia</td>
          <td>Sarah/Jessica/Sophia/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">22</td>
          <td>Charlotte</td>
          <td>Isabella</td>
          <td>Sarah/Jessica/Isabella/</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">30</td>
          <td>Sofia</td>
          <td>Isabella</td>
          <td>Sarah/Jessica/Isabella/</td>
          <td style="text-align: right">4</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>The picture below illustrates how the recursive CTE works:</p>
<p><img alt="Recursive CTE in SQL Flowchart" loading="lazy" src="/posts/recursive-ctes-and-connect-by-in-sql-to-query-hierarchical-data/images/recursive-cte-in-sql-how-it-work-flowchart.png"></p>
<p>First, the anchor member, or initial query, is executed. The <code>company_hierarchy</code> CTE now contains the two employees with <code>MANAGER_ID IS NULL</code>, each at <code>EMPLOYEE_LEVEL=1</code>.</p>
<p>Then the recursive member is executed, joining the <code>EMPLOYEES</code> table with <code>company_hierarchy</code> to find everyone managed directly by an employee already in <code>company_hierarchy</code>. The working set then becomes those newly returned employees (<code>EMPLOYEE_LEVEL=2</code>).</p>
<p>This repeats until the recursive member returns no rows.</p>
<p>At the end, <code>UNION ALL</code> combines all the intermediate results.</p>
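<p>The same iteration can be reproduced end to end with Python&rsquo;s <code>sqlite3</code> on a small three-row subset of the <code>EMPLOYEES</code> data. This is a sketch, not the article&rsquo;s full dataset; the query is the one above, written with SQLite&rsquo;s <code>WITH RECURSIVE</code> spelling.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEES (EMPLOYEE_ID INT, NAME TEXT, SALARY INT, MANAGER_ID INT)")
# Tiny chain for illustration: Adam manages David, David manages Ben
conn.executemany("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?)",
                 [(1, 'Adam', 50000, None), (3, 'David', 40000, 1), (7, 'Ben', 30000, 3)])
rows = conn.execute("""
    WITH RECURSIVE company_hierarchy(EMPLOYEE_ID, NAME, MANAGER, MANAGER_PATH, EMPLOYEE_LEVEL) AS (
        -- anchor: employees with no manager
        SELECT EMPLOYEE_ID, NAME, NULL, '', 1
        FROM EMPLOYEES WHERE MANAGER_ID IS NULL
        UNION ALL
        -- recursive: employees managed by someone already in the CTE
        SELECT emp.EMPLOYEE_ID, emp.NAME, com.NAME,
               com.MANAGER_PATH || com.NAME || '/',
               com.EMPLOYEE_LEVEL + 1
        FROM EMPLOYEES emp
        INNER JOIN company_hierarchy com ON emp.MANAGER_ID = com.EMPLOYEE_ID
    )
    SELECT NAME, MANAGER_PATH, EMPLOYEE_LEVEL
    FROM company_hierarchy
    ORDER BY EMPLOYEE_LEVEL
""").fetchall()
print(rows)  # [('Adam', '', 1), ('David', 'Adam/', 2), ('Ben', 'Adam/David/', 3)]
```

<p>Each pass extends the path and increments the level, exactly as in the flowchart above.</p>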
<h4 id="problem-2">Problem 2</h4>
<p>This is a bit more difficult. Let&rsquo;s reread the <a href="#problem-statement">problem statement</a> and try it yourself before reading my answer.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">company_explode</span><span class="w"> </span><span class="p">(</span><span class="n">MANAGER_ID</span><span class="p">,</span><span class="w"> </span><span class="n">EMPLOYEE_ID</span><span class="p">,</span><span class="w"> </span><span class="n">SALARY</span><span class="p">,</span><span class="w"> </span><span class="n">MANAGEMENT_DISTANCE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- Base query. Select ALL employees
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">SELECT</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">SALARY</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="mi">1</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGEMENT_DISTANCE</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">UNION</span><span class="w"> </span><span class="k">ALL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">-- Recursive query. Pair each employee with the next manager up the chain
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">mgr</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">SALARY</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="n">com</span><span class="p">.</span><span class="n">MANAGEMENT_DISTANCE</span><span class="o">+</span><span class="mi">1</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGEMENT_DISTANCE</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">mgr</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">INNER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">company_explode</span><span class="w"> </span><span class="n">com</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">ON</span><span class="w"> </span><span class="n">mgr</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">com</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">MANAGER_ID</span><span class="p">,</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">SALARY</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">TOTAL_SALARY_MANAGED</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">company_explode</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>Recursive CTE result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">MANAGER_ID</th>
          <th style="text-align: right">TOTAL_SALARY_MANAGED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">null</td>
          <td style="text-align: right">1037000</td>
      </tr>
      <tr>
          <td style="text-align: right">1</td>
          <td style="text-align: right">442000</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td style="text-align: right">465000</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td style="text-align: right">178000</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td style="text-align: right">185000</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td style="text-align: right">169000</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td style="text-align: right">175000</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td style="text-align: right">54000</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td style="text-align: right">50000</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td style="text-align: right">54000</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td style="text-align: right">56000</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td style="text-align: right">54000</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>The idea behind this solution is simple: first, explode the hierarchy into a flattened table of all direct and indirect manager-employee pairs; then compute the sum with a simple <code>GROUP BY</code> clause.</p>
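<p>Here is a sketch of the explode-then-aggregate idea, again with Python&rsquo;s <code>sqlite3</code> and an illustrative three-employee chain (Adam manages David, David manages Ben); the table and salary figures are made up for the demo, not the article&rsquo;s full dataset.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEES (EMPLOYEE_ID INT, NAME TEXT, SALARY INT, MANAGER_ID INT)")
conn.executemany("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?)",
                 [(1, 'Adam', 50000, None), (3, 'David', 40000, 1), (7, 'Ben', 30000, 3)])
totals = conn.execute("""
    WITH RECURSIVE company_explode(MANAGER_ID, EMPLOYEE_ID, SALARY) AS (
        -- every (direct manager, employee) pair
        SELECT MANAGER_ID, EMPLOYEE_ID, SALARY FROM EMPLOYEES
        UNION ALL
        -- re-attach each employee to the manager one level higher
        SELECT mgr.MANAGER_ID, com.EMPLOYEE_ID, com.SALARY
        FROM EMPLOYEES mgr
        INNER JOIN company_explode com ON mgr.EMPLOYEE_ID = com.MANAGER_ID
    )
    SELECT MANAGER_ID, SUM(SALARY) AS TOTAL_SALARY_MANAGED
    FROM company_explode
    WHERE MANAGER_ID IS NOT NULL
    GROUP BY MANAGER_ID
    ORDER BY MANAGER_ID
""").fetchall()
print(totals)  # [(1, 70000), (3, 30000)]
```

<p>After the explode step, Adam (id 1) is paired with both David and Ben, so his total is 40000 + 30000 = 70000, while David (id 3) manages only Ben&rsquo;s 30000.</p>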
<h3 id="start-with--connect-by-">START WITH &hellip; CONNECT BY &hellip;</h3>
<p>The <code>CONNECT BY</code> clause achieves functionality similar to a recursive CTE, but with a much shorter syntax. Let&rsquo;s take a look at it in action.</p>
<h4 id="problem-1-1">Problem 1</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="n">emp</span><span class="p">.</span><span class="n">NAME</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="n">mgr</span><span class="p">.</span><span class="n">NAME</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">TRIM</span><span class="p">(</span><span class="k">LEADING</span><span class="w"> </span><span class="s1">&#39;/&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">SYS_CONNECT_BY_PATH</span><span class="p">(</span><span class="n">mgr</span><span class="p">.</span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;/&#39;</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">MANAGER_PATH</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">LEVEL</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">EMPLOYEE_LEVEL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">mgr</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">ON</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mgr</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">START</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">CONNECT</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">PRIOR</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>CONNECT BY query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">EMPLOYEE_ID</th>
          <th>NAME</th>
          <th>MANAGER</th>
          <th>MANAGER_PATH</th>
          <th style="text-align: right">EMPLOYEE_LEVEL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">1</td>
          <td>Adam</td>
          <td>null</td>
          <td>null</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td>David</td>
          <td>Adam</td>
          <td>Adam</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td>Ben</td>
          <td>David</td>
          <td>Adam/David</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">15</td>
          <td>Ryan</td>
          <td>Ben</td>
          <td>Adam/David/Ben</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">23</td>
          <td>Ethan</td>
          <td>Ben</td>
          <td>Adam/David/Ben</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td>Alex</td>
          <td>David</td>
          <td>Adam/David</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">19</td>
          <td>Liam</td>
          <td>Alex</td>
          <td>Adam/David/Alex</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">27</td>
          <td>Mason</td>
          <td>Alex</td>
          <td>Adam/David/Alex</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td>Michael</td>
          <td>Adam</td>
          <td>Adam</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td>Charles</td>
          <td>Michael</td>
          <td>Adam/Michael</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">17</td>
          <td>Noah</td>
          <td>Charles</td>
          <td>Adam/Michael/Charles</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">25</td>
          <td>Lucas</td>
          <td>Charles</td>
          <td>Adam/Michael/Charles</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td>Daniel</td>
          <td>Michael</td>
          <td>Adam/Michael</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">21</td>
          <td>William</td>
          <td>Daniel</td>
          <td>Adam/Michael/Daniel</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">29</td>
          <td>Logan</td>
          <td>Daniel</td>
          <td>Adam/Michael/Daniel</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td>Sarah</td>
          <td>null</td>
          <td>null</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td>Emily</td>
          <td>Sarah</td>
          <td>Sarah</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>Olivia</td>
          <td>Emily</td>
          <td>Sarah/Emily</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">16</td>
          <td>Chloe</td>
          <td>Olivia</td>
          <td>Sarah/Emily/Olivia</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">24</td>
          <td>Ava</td>
          <td>Olivia</td>
          <td>Sarah/Emily/Olivia</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td>Maya</td>
          <td>Emily</td>
          <td>Sarah/Emily</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">20</td>
          <td>Evelyn</td>
          <td>Maya</td>
          <td>Sarah/Emily/Maya</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">28</td>
          <td>Harper</td>
          <td>Maya</td>
          <td>Sarah/Emily/Maya</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td>Jessica</td>
          <td>Sarah</td>
          <td>Sarah</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td>Sophia</td>
          <td>Jessica</td>
          <td>Sarah/Jessica</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">18</td>
          <td>Mia</td>
          <td>Sophia</td>
          <td>Sarah/Jessica/Sophia</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">26</td>
          <td>Amelia</td>
          <td>Sophia</td>
          <td>Sarah/Jessica/Sophia</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td>Isabella</td>
          <td>Jessica</td>
          <td>Sarah/Jessica</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td style="text-align: right">22</td>
          <td>Charlotte</td>
          <td>Isabella</td>
          <td>Sarah/Jessica/Isabella</td>
          <td style="text-align: right">4</td>
      </tr>
      <tr>
          <td style="text-align: right">30</td>
          <td>Sofia</td>
          <td>Isabella</td>
          <td>Sarah/Jessica/Isabella</td>
          <td style="text-align: right">4</td>
      </tr>
  </tbody>
</table>

</details></p>

<h4 id="problem-2-1">Problem 2</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="k">SUM</span><span class="p">(</span><span class="n">CONNECT_BY_ROOT</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">SALARY</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">TOTAL_SALARY_MANAGED</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">EMPLOYEES</span><span class="w"> </span><span class="n">emp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">CONNECT</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">PRIOR</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">EMPLOYEE_ID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">emp</span><span class="p">.</span><span class="n">MANAGER_ID</span><span class="w">
</span></span></span></code></pre></div>

<p><details >
  <summary markdown="span"><em>CONNECT BY query result</em></summary>
  <table>
  <thead>
      <tr>
          <th style="text-align: right">MANAGER_ID</th>
          <th style="text-align: right">TOTAL_SALARY_MANAGED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">null</td>
          <td style="text-align: right">1037000</td>
      </tr>
      <tr>
          <td style="text-align: right">1</td>
          <td style="text-align: right">442000</td>
      </tr>
      <tr>
          <td style="text-align: right">2</td>
          <td style="text-align: right">465000</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td style="text-align: right">178000</td>
      </tr>
      <tr>
          <td style="text-align: right">4</td>
          <td style="text-align: right">185000</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td style="text-align: right">169000</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td style="text-align: right">175000</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td style="text-align: right">54000</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td style="text-align: right">50000</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td style="text-align: right">54000</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td style="text-align: right">56000</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td style="text-align: right">52000</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td style="text-align: right">54000</td>
      </tr>
  </tbody>
</table>

</details></p>

<p>Like magic, isn&rsquo;t it? The implementation details (what goes on behind the scenes) may differ, but the way we reason about the query is the same as with recursive CTEs. The <code>START WITH</code> clause provides the initial filter. The query is first executed with this filter, just like the anchor member in a recursive CTE. With no <code>START WITH</code> clause, no filter is applied and every row is included in the first step. Then the <code>CONNECT BY</code> clause specifies how to connect between steps/levels in the hierarchical structure, just like recursive CTEs refer to themselves. Note that the <code>PRIOR</code> keyword means the value that follows it comes from the previous recursive step.</p>
<p>One of the main differences between the two approaches is how we select data. With <code>CONNECT BY</code>, we have to rely on built-in functions and operators to select the data we want. In the examples, we use the <code>SYS_CONNECT_BY_PATH</code> function to construct the path and the <code>CONNECT_BY_ROOT</code> operator to access the data in the first step.</p>
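<p>To make the comparison concrete, here is a sketch (untested, reusing the <code>EMPLOYEES</code> table from the examples above; drop the <code>RECURSIVE</code> keyword on Oracle) of how the Problem 2 query could be expressed as a recursive CTE. Note how <code>CONNECT_BY_ROOT emp.SALARY</code> becomes an ordinary column that we carry along from the anchor member:</p>

```sql
-- Anchor member: every employee starts a path carrying their own salary.
-- Recursive member: walk up to the manager, keeping the root salary.
WITH RECURSIVE salary_chain (MANAGER_ID, ROOT_SALARY) AS (
    SELECT MANAGER_ID, SALARY
    FROM EMPLOYEES
    UNION ALL
    SELECT mgr.MANAGER_ID, c.ROOT_SALARY
    FROM salary_chain c
    JOIN EMPLOYEES mgr
        ON c.MANAGER_ID = mgr.EMPLOYEE_ID
)
SELECT MANAGER_ID, SUM(ROOT_SALARY) AS TOTAL_SALARY_MANAGED
FROM salary_chain
GROUP BY MANAGER_ID;
```

<p>Each employee&rsquo;s salary is attributed once to every manager above them in the chain, which is exactly what <code>SUM(CONNECT_BY_ROOT emp.SALARY)</code> grouped by <code>MANAGER_ID</code> computes.</p>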
<h2 id="my-final-thoughts">My final thoughts</h2>
<p>Querying hierarchical data presents unique challenges that require specialized techniques to extract meaningful information. Recursive CTEs and the CONNECT BY clause offer powerful solutions for navigating and analyzing hierarchical data in SQL. One interesting fact is that CONNECT BY was actually around before Recursive CTEs.</p>
<p>While both techniques solve the same problem, you can only use one in a given query. Which one should you use? Well, it depends on you. If you hate subqueries and CTEs, and you like cool short magic queries, use CONNECT BY. However, being less verbose makes CONNECT BY harder to reason about: it&rsquo;s harder to write out that magic stuff, and harder to figure out why it doesn&rsquo;t work. Also, because you write the recursion yourself, recursive CTEs give you more control and flexibility. And note that not all RDBMSs (SQL Server, for example) support the CONNECT BY clause, even though it has been around for a long time.</p>
<p><em>* You can find the execution of the SQL in this post at <a href="https://dbfiddle.uk/hXEzBJkX" rel="nofollow">https://dbfiddle.uk/hXEzBJkX</a></em></p>
]]></content:encoded></item><item><title>What is a reliable Data System?</title><link>https://note.datengineer.dev/posts/what-is-a-reliable-data-system/</link><pubDate>Fri, 16 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/what-is-a-reliable-data-system/</guid><description>Learn the concepts of reliability, and how to define a reliable data system</description><content:encoded><![CDATA[<p>In today&rsquo;s data-driven world, information is gold, and the systems that store and manage it serve as crucial infrastructure. I have seen people talk a lot about terms like &ldquo;distributed computing&rdquo;, &ldquo;scalability&rdquo;&hellip; but one fundamental characteristic is often overlooked: reliability. Without it, scalability, maintainability, flexibility, anything-bility are meaningless, like a beautiful castle built on sand.</p>
<h2 id="what-is-reliability">What is Reliability?</h2>
<p>Everyone has their own intuition about what is reliable:</p>
<ul>
<li>A piggy bank is reliable because it consistently holds your money and accurately reflects what you&rsquo;ve deposited. You trust that when you put a coin in, it will be there later, and the total will reflect your savings. And when you want to make a withdrawal, you can get your money immediately.</li>
<li>A calculator is reliable because it consistently produces accurate results based on your input. You trust that regardless of who uses it, 2 + 2 will always equal 4. And the result should appear instantly on the screen.</li>
</ul>
<p>Different systems have different reliability requirements. In general, we can define reliability as follows:</p>
<blockquote>
<p><em>Reliability refers to the ability to always do the expected things in the expected way.</em></p></blockquote>
<p>For software, reliability means consistently performing the designed function at the expected level of performance. Consider a calculator: we expect it to <strong>immediately</strong> display <code>4</code> after typing in <code>2+2</code>. If it shows me <code>5</code>, I will give it 1 star and never use it again. If it takes me 5 minutes to do such a simple arithmetic addition, I will send an email to the United Nations to report it as crypto-mining malware. (actually I won&rsquo;t)</p>
<p>Wait a minute! There is one more important word in my definition above: &ldquo;always&rdquo;. What do I mean by &ldquo;always&rdquo;? A piggy bank wouldn&rsquo;t be very reliable if it held my money and suddenly became inaccessible for a week. Of course, there is no perfect &ldquo;always&rdquo; in the real world. There may be unforeseen situations that cause systems to stop working. But systems should be designed in such a way that the disruption doesn&rsquo;t hurt business operations. Reliability focuses on minimizing the occurrence of system failures and their impact on functionality.</p>
<h2 id="reliable-data-system">Reliable Data system</h2>
<p>Just like you trust your piggy bank to hold your coins securely, you need to trust your data systems to hold your information reliably. Your piggy bank wouldn&rsquo;t be very reliable if the coins sometimes disappeared, and a data system wouldn&rsquo;t be reliable if the information kept changing or disappearing. Reliability means you can trust the information it holds. This means the data is always available, accurate, and delivers consistent results when you need it. Common expectations for a data system:</p>
<ul>
<li>Integrity: This ensures the data is accurate, complete, and consistent. Imagine your piggy bank if someone took coins without putting them back, or if different amounts appeared out of nowhere. It wouldn&rsquo;t be reliable! Similarly, data integrity prevents missing, incorrect, or inconsistent information, thereby ensuring its reliability.</li>
<li>Availability: You wouldn&rsquo;t find your piggy bank locked when you need it most. Likewise, reliable data systems must be accessible when you need them. This means the data is readily available for authorized users, minimizing downtime and ensuring critical information is always at hand.</li>
<li>Performance: A sluggish piggy bank wouldn&rsquo;t be very useful. Similar to how you expect quick access to your coins, data systems should deliver reasonable performance. This translates to fast retrieval times, smooth operation, and responsiveness to your needs, enabling efficient decision-making.</li>
<li>Timeliness: Data freshness is crucial. Old coins are worth the same, but old data is not. In data systems, timeliness ensures that information is current and up to date. This reduces reliance on outdated data, resulting in more accurate insights and informed actions.</li>
<li>Safety: Just like keeping your piggy bank safe from theft, protecting your data is critical. Data safety ensures that information is protected from unauthorized access. If someone you don&rsquo;t trust knows where you keep your piggy bank, you won&rsquo;t put any coins in it.</li>
</ul>
<p><img alt="Reliable data systems hold your information securely" loading="lazy" src="/posts/what-is-a-reliable-data-system/images/reliable-piggy-bank-reliable-data-system.jpg"></p>
<h2 id="how-important-is-reliability">How important is Reliability</h2>
<p>Reliability is not limited to life-or-death situations such as nuclear power plants. It is fundamental to all software applications, large and small. Sure, bugs in a note taking app may not have catastrophic consequences, but they do cause frustration and erode user trust. Let&rsquo;s shift our focus from &ldquo;avoiding disaster&rdquo; to &ldquo;delivering value&rdquo;. Every software application has a purpose, whether it&rsquo;s to simplify tasks, improve communication, or entertain users. When an application crashes, malfunctions, or produces incorrect results, it fails to fulfill its purpose. Every software application has a responsibility to its users. Frustrated users abandon unreliable applications, businesses lose productivity, and trust erodes. Investing in reliability is about more than avoiding the negative consequences of failure. It&rsquo;s about building trust, delivering value, and ensuring that your software does what it&rsquo;s supposed to do.</p>
]]></content:encoded></item><item><title>PySpark UDFs: A comprehensive guide to unlock PySpark potential</title><link>https://note.datengineer.dev/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/</link><pubDate>Fri, 09 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/</guid><description>Discover the capabilities of User-Defined Functions (UDFs) in Apache Spark, allowing you to extend PySpark&amp;#39;s functionality and solve complex data processing tasks.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Apache Spark is a powerful open source distributed computing engine designed to handle large datasets across clusters. PySpark is the Python API for Spark, allowing data engineers and data scientists to easily use the framework in their preferred language.</p>
<p>This post is a continuation of the <a href="../a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/">previous tutorial</a>. It began as a Jupyter notebook I created while learning PySpark; I recently found it and decided to update it and publish it on my blog.</p>
<p>UDFs (user-defined functions) are an integral part of PySpark, allowing users to extend the capabilities of Spark by creating their own custom functions. This article will provide a comprehensive guide to PySpark UDFs with examples.</p>
<h2 id="understanding-pyspark-udfs">Understanding PySpark UDFs</h2>
<p>PySpark UDFs are user-defined functions written in Python code. We create functions in Python and register them with Spark as UDFs. They enable the execution of complicated custom logic on Spark DataFrames and SQL expressions.</p>
<p>However, note that UDFs are expensive. We should always prefer built-in functions whenever possible. PySpark comes with a number of predefined common functions, and many more new functions are added with each new release.</p>
<p>In summary, with PySpark UDFs, what goes in is a regular Python function, and what goes out is a function that works on the PySpark engine.</p>
<h2 id="creating-an-udf">Creating a UDF</h2>
<p>All of the following examples are a continuation of the <a href="../a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/">previous article</a>. You can find an executable notebook containing both articles <a href="https://gist.github.com/ThaiDat/81c3662801aa8410a65b94f3c993c377">here</a>.</p>
<p>Below is an example of a &ldquo;complicated&rdquo; decision tree function that classifies transactions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PySpark UDFs example</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">classify_tier</span><span class="p">(</span><span class="n">amount</span><span class="p">:</span><span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">10000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span>
</span></span></code></pre></div><p>It is a regular Python function that receives a <code>float</code> and returns an <code>int</code>. We have to make it a PySpark UDF before actually using it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">types</span> <span class="k">as</span> <span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># pyspark.sql.functions provides a udf() function to promote a regular function to be UDF.</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The function takes two parameters: the function you want to promote, and the return type of the generated UDF</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The function returns a UDF</span>
</span></span><span class="line"><span class="cl"><span class="n">classifyTier</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">udf</span><span class="p">(</span><span class="n">classify_tier</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span></code></pre></div><p>Then we can use it like any other PySpark function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="n">classifyTier</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;tier&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;tier&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+----+
|   nameOrig|tier|
+-----------+----+
|C1495608502|   4|
|C1321115948|   4|
| C476579021|   4|
|C1520267010|   4|
| C106297322|   4|
|C1464177809|   4|
| C355885103|   4|
|C1057507014|   4|
|C1419332030|   4|
|C2007599722|   4|
+-----------+----+
</code></pre><p>The <code>pyspark.sql.functions.udf()</code> function can also be used as a decorator which produce the same result.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># pyspark udf decorator example</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Note that classifyTier is a UDF, not a regular function anymore.</span>
</span></span><span class="line"><span class="cl"><span class="nd">@F.udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">classifyTier</span><span class="p">(</span><span class="n">amount</span><span class="p">:</span><span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">10000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span>
</span></span></code></pre></div><p>To use a UDF in a Spark SQL expression, we need to register it first.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Register the regular Python function with spark.udf.register</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">udf</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="s1">&#39;classifyTier&#39;</span><span class="p">,</span> <span class="n">classify_tier</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig, classifyTier(amount) tier
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY tier DESC 
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+----+
|   nameOrig|tier|
+-----------+----+
| C263860433|   4|
| C306269750|   4|
|C1611915976|   4|
|C1387188921|   4|
| C300262358|   4|
| C389879985|   4|
|C1907016309|   4|
|C1046638041|   4|
|C1543404166|   4|
|C1155108056|   4|
+-----------+----+
</code></pre><p>Simple enough? Write a Python function, make it a UDF, use it. But that is not the most interesting part.</p>
<h2 id="pandas-udf">Pandas UDF</h2>
<p>With Python UDFs, PySpark will unpack each value, perform the calculation, and then return the value for each record. A Pandas UDF is a user-defined function that works with data using Pandas for manipulation and Apache Arrow for data transfer. It is also called a vectorized UDF. Compared to row-at-a-time Python UDFs, pandas UDFs enable vectorized operations that can improve performance by up to 100x.</p>
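<p>The difference can be illustrated with plain pandas (a standalone sketch, not Spark itself; the amounts and the 500 threshold are invented for this example). A row-at-a-time UDF makes one Python-level call per value, while a vectorized operation processes the whole batch at once:</p>

```python
import pandas as pd

# Invented sample data for illustration
amounts = pd.Series([100.0, 5000.0, 50000.0, 500000.0, 2000000.0])

# Row-at-a-time, like a plain Python UDF: one Python-level call per value
row_at_a_time = amounts.apply(lambda a: int(a >= 500))

# Vectorized, like a pandas UDF: a single operation over the whole batch
vectorized = (amounts >= 500).astype(int)

# Both produce the same answer; the vectorized form avoids per-row overhead
print(vectorized.tolist())  # [0, 1, 1, 1, 1]
```

<p>The results are identical; only the execution model differs, and that per-row overhead is where the claimed speedups come from.</p>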
<h3 id="series-to-series-udf">Series to Series UDF</h3>
<p>These UDFs operate on Pandas Series and return a Pandas Series as output. When Spark runs a Pandas UDF, it divides the columns into batches, calls the function on a subset of the data for each batch, and then concatenates the output. It is preferable to use a Pandas Series-to-Series UDF if possible, instead of using a regular Python UDF. We use <code>pyspark.sql.functions.pandas_udf</code> to create a Pandas UDF.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># You can also promote the function to PySpark Pandas UDF as getUserType = F.pandas_udf(get_user_type, T.StringType())</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Each User ID starts with a letter representing its type</span>
</span></span><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getUserType</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span></code></pre></div><p>The only difference in syntax is that the Python function now takes a <code>pandas.Series</code> and returns a <code>pandas.Series</code>. We can then use it as a Spark function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">getUserType</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">nameDest</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;userTypeDest&#39;</span><span class="p">),</span> <span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;userTypeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;n&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+------------+------------------+-------+
|userTypeDest|         avgAmount|      n|
+------------+------------------+-------+
|           C| 265083.4571810173|4211125|
|           M|13057.604660187604|2151495|
+------------+------------------+-------+
</code></pre><h3 id="iterator-of-series-to-iterator-of-series">Iterator of Series to Iterator of Series</h3>
<p>Due to the distributed nature of Spark, the entire series is not fed into the UDF at once; instead, each executor calls the UDF on its own batches of data, and Spark then combines the results. PySpark Iterator of Series to Iterator of Series UDFs are very useful when you have a time-consuming cold-start operation (e.g. initializing a machine learning model, checking server statuses, &hellip;) that needs to run once at the beginning of the processing step rather than once per batch.</p>
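The cold-start pattern can be sketched in plain pandas before wiring it into Spark. Here `load_model` is a hypothetical stand-in for an expensive initialization, and the commented line only suggests how the function would be promoted to a pandas UDF.

```python
from typing import Iterator

import pandas as pd


def load_model() -> dict:
    # Hypothetical stand-in for an expensive cold start
    # (loading an ML model, opening a connection, ...).
    return {'threshold': 200_000.0}


def flag_large(amounts: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()  # runs once per executor process, not once per batch
    for batch in amounts:
        yield batch > model['threshold']


# In Spark (sketch): flagLarge = F.pandas_udf(flag_large, T.BooleanType())
```

Because the initialization sits outside the `for` loop, its cost is paid once no matter how many batches the iterator yields.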
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">sleep</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Iterator</span><span class="p">,</span> <span class="n">Tuple</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getNameIdLength</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Heavy task</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># sleep(5)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1"># name is an Iterator</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># name_batch is a pd.Series</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">name_batch</span> <span class="ow">in</span> <span class="n">name</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">name_len</span> <span class="o">=</span> <span class="n">name_batch</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">name_len</span><span class="p">[</span><span class="o">~</span><span class="n">name_batch</span><span class="o">.</span><span class="n">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">isnumeric</span><span class="p">()]</span> <span class="o">-=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># yield because we return an iterator</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield</span> <span class="n">name_len</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">getNameIdLength</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">nameOrig</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;idLen&#39;</span><span class="p">),</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;idLen&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----+------------------+
|idLen|         avgAmount|
+-----+------------------+
|    4|155070.73742857145|
|    7|177477.50726081585|
|   10| 179702.4408980949|
|    9|179898.05510125632|
|    8| 181572.2097899971|
|    6|197756.81529433408|
|    5|199594.79368029739|
+-----+------------------+
</code></pre><h3 id="iterator-of-multiple-series-to-iterator-of-series-udf">Iterator of multiple Series to Iterator of Series UDF</h3>
<p>The Iterator of Multiple Series to Iterator of Series UDF has the same characteristics as the Iterator of Series to Iterator of Series UDF. The difference is that the underlying Python function receives an iterator over tuples of Pandas Series.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">amount_mismatch</span><span class="p">(</span><span class="n">values</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">Tuple</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Heavy task</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">oldOrig</span><span class="p">,</span> <span class="n">newOrig</span><span class="p">,</span> <span class="n">oldDest</span><span class="p">,</span> <span class="n">newDest</span> <span class="ow">in</span> <span class="n">values</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield</span> <span class="nb">abs</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">newOrig</span> <span class="o">-</span> <span class="n">oldOrig</span><span class="p">)</span> <span class="o">-</span> <span class="nb">abs</span><span class="p">(</span><span class="n">newDest</span> <span class="o">-</span> <span class="n">oldDest</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Create a UDF. You can also use the decorator syntax.</span>
</span></span><span class="line"><span class="cl"><span class="n">amountMismatch</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pandas_udf</span><span class="p">(</span><span class="n">amount_mismatch</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">amountMismatch</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">oldBalanceOrig</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">newBalanceOrig</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">oldBalanceDest</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">newBalanceDest</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;mismatch&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;mismatch&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgMismatch&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgMismatch&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|       avgMismatch|
+--------+------------------+
|TRANSFER| 968056.4538892006|
|CASH_OUT|170539.39652580014|
| CASH_IN| 50038.95466155722|
|   DEBIT| 25567.53969902471|
| PAYMENT| 6378.936662041953|
+--------+------------------+
</code></pre><h3 id="group-aggregate-udf">Group aggregate UDF</h3>
<p>Group aggregate UDF, also known as the Series to Scalar UDF, reduces the input <code>pandas.Series</code> into a single value.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getStdDeviation</span><span class="p">(</span><span class="n">series</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Use built-in pandas.Series.std</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">series</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">getStdDeviation</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;var&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;var&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|               var|
+--------+------------------+
|TRANSFER|1879573.5289080725|
|CASH_OUT|175329.74448347004|
| CASH_IN|126508.25527180695|
|   DEBIT|13318.535518284714|
| PAYMENT|12556.450185716356|
+--------+------------------+
</code></pre><h3 id="group-map-udf">Group map UDF</h3>
<p>As with the Group Aggregate UDF, we use <code>groupBy()</code> to divide a Spark <code>DataFrame</code> into groups. The Group Map UDF maps over each group, producing a Pandas <code>DataFrame</code>, and the per-group results are then combined back into a single Spark <code>DataFrame</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">normalize_by_type</span><span class="p">(</span><span class="n">data</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">data</span><span class="p">[[</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">maxVal</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">minVal</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">maxVal</span> <span class="o">==</span> <span class="n">minVal</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.5</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span> <span class="o">-</span> <span class="n">minVal</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">maxVal</span> <span class="o">-</span> <span class="n">minVal</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">result</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># We can use the SQL string-based schema like below comment</span>
</span></span><span class="line"><span class="cl"><span class="c1"># schema = &#39;type string, amount double, amountNorm double&#39;</span>
</span></span><span class="line"><span class="cl"><span class="n">schema</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">StructType</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">applyInPandas</span><span class="p">(</span><span class="n">normalize_by_type</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+--------------------+
|    type|   amount|          amountNorm|
+--------+---------+--------------------+
|TRANSFER|    181.0|1.929785364412691...|
|TRANSFER| 215310.3| 0.00232902269229461|
|TRANSFER|311685.89|0.003371535041334062|
|TRANSFER|  62610.8|6.772443276469881E-4|
|TRANSFER| 42712.39|4.619995945019032E-4|
|TRANSFER| 77957.68|8.432543299642404E-4|
|TRANSFER| 17231.46|1.863677235062513...|
|TRANSFER| 78766.03|8.519983994671721E-4|
|TRANSFER|224606.64|0.002429582898990...|
|TRANSFER|125872.53|0.001361558008596...|
+--------+---------+--------------------+
only showing top 10 rows
</code></pre><p>You can see that in the example above, we don&rsquo;t need to explicitly create a UDF. This is because we use the <code>applyInPandas</code> function, introduced in PySpark 3.0.0. It takes a regular Python function and a result schema as parameters. If you want to create a Group Map UDF, you can refer to the following code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># It is preferred to use &#39;applyInPandas&#39; over this API (in Spark 3). </span>
</span></span><span class="line"><span class="cl"><span class="c1"># This API will be deprecated in the future releases.</span>
</span></span><span class="line"><span class="cl"><span class="c1"># As it will be deprecated soon, type hint inference is not supported. So, we have to specify PandasUDFType explicitly</span>
</span></span><span class="line"><span class="cl"><span class="n">NormalizeByType</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pandas_udf</span><span class="p">(</span><span class="n">normalize_by_type</span><span class="p">,</span> <span class="n">schema</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">PandasUDFType</span><span class="o">.</span><span class="n">GROUPED_MAP</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">NormalizeByType</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+--------------------+
|    type|   amount|          amountNorm|
+--------+---------+--------------------+
|TRANSFER|    181.0|1.929785364412691...|
|TRANSFER| 215310.3| 0.00232902269229461|
|TRANSFER|311685.89|0.003371535041334062|
|TRANSFER|  62610.8|6.772443276469881E-4|
|TRANSFER| 42712.39|4.619995945019032E-4|
|TRANSFER| 77957.68|8.432543299642404E-4|
|TRANSFER| 17231.46|1.863677235062513...|
|TRANSFER| 78766.03|8.519983994671721E-4|
|TRANSFER|224606.64|0.002429582898990...|
|TRANSFER|125872.53|0.001361558008596...|
+--------+---------+--------------------+
only showing top 10 rows
</code></pre><p>When executing a Group Map UDF, Spark will:</p>
<ul>
<li>Split the data into groups using <code>groupBy</code>.</li>
<li>Apply the function to each group.</li>
<li>Combine the results in a new PySpark <code>DataFrame</code>.</li>
</ul>
<p><img alt="Python Spark User Defined Function Group Map UDF workflow" loading="lazy" src="/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/images/pyspark-udf-spark-python-group-map-udf-workflow.png"></p>
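The same split-apply-combine flow can be mimicked locally in plain pandas, which is a handy way to unit-test Group Map logic before running it on a cluster. The data below is made up for illustration, and `normalize` mirrors the min-max logic of `normalize_by_type` above.

```python
import pandas as pd

# Toy data standing in for the transactions DataFrame
data = pd.DataFrame({
    'type': ['TRANSFER', 'TRANSFER', 'PAYMENT', 'PAYMENT'],
    'amount': [100.0, 300.0, 50.0, 50.0],
})


def normalize(group: pd.DataFrame) -> pd.DataFrame:
    # Min-max normalize amounts within one group
    out = group.copy()
    lo, hi = out['amount'].min(), out['amount'].max()
    out['amountNorm'] = 0.5 if lo == hi else (out['amount'] - lo) / (hi - lo)
    return out


# Split by key, apply the function per group, combine back into one DataFrame
combined = pd.concat(normalize(g) for _, g in data.groupby('type'))
```

The constant-valued PAYMENT group exercises the `lo == hi` branch, while the TRANSFER group spans the full 0-to-1 range.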
<h2 id="conclusion">Conclusion</h2>
<p>In summary, PySpark UDFs are an effective way to bring the power and flexibility of Python to Spark workloads. When used properly, they can help extend Spark&rsquo;s capabilities to solve complex data engineering challenges. Together with the previous tutorial, you can now cover most data manipulation and analysis tasks. Happy coding!</p>
]]></content:encoded></item><item><title>A Practical PySpark tutorial for beginners in Jupyter Notebook</title><link>https://note.datengineer.dev/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/</guid><description>A hands-on PySpark cheat sheet</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In today&rsquo;s world of data, the ability to efficiently process and analyze large amounts of data is crucial for businesses and organizations. This is where PySpark comes in: an open-source, distributed computing framework built on top of Apache Spark. With its seamless integration with Python, PySpark allows users to leverage the powerful data processing capabilities of Spark directly from Python scripts.</p>
<p>This post was originally a Jupyter Notebook I created when I started learning PySpark, intended as a personal cheat sheet. Once I started this blog (a place for my notes), I decided to update it and share it here as a complete hands-on tutorial for beginners.</p>
<p>If you are new to PySpark, this tutorial is for you. We will cover the most practical parts of PySpark&rsquo;s basic syntax. By the end of this tutorial, you will have a solid understanding of PySpark and be able to use Spark in Python to perform a wide range of data processing tasks.</p>
<h2 id="spark-vs-pyspark">Spark vs PySpark</h2>
<p>What is PySpark? How is it different from Apache Spark? Before looking at PySpark, it&rsquo;s essential to understand the relationship between Spark and PySpark.</p>
<p>Apache Spark is an open-source distributed computing system. It provides an interface for programming clusters with implicit data parallelism and fault tolerance. Spark offers APIs for several programming languages, including Python, Java, Scala, and R, making data processing tasks accessible to a wide range of developers.</p>
<p>PySpark, on the other hand, is the library that builds on these APIs to bring Spark to Python. It allows developers to use Python, the most popular programming language in the data community, to leverage the power of Spark without switching to another language. PySpark also integrates seamlessly with other Python libraries.</p>
<p>In short, Spark is the overarching framework, and PySpark serves as its Python API, providing a convenient bridge for Python enthusiasts to leverage Spark&rsquo;s capabilities.</p>
<p><img alt="Apache Spark vs Python PySpark different" loading="lazy" src="/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/images/apache-spark-python-pyspark-difference-architecture.png"></p>
<h2 id="lets-get-started">Let&rsquo;s get started</h2>
<p>From this point on, you will see Python code driving Spark. This hands-on tutorial will guide you through basic PySpark operations such as querying, filtering, merging, and grouping data. You can find an executable notebook on my <a href="https://gist.github.com/ThaiDat/81c3662801aa8410a65b94f3c993c377">GitHub</a>.</p>
<h3 id="installation">Installation</h3>
<p>There are several ways to install PySpark. The easiest way for Python users is to use <code>pip</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install pyspark
</span></span></code></pre></div><h3 id="sparksession">SparkSession</h3>
<p><code>SparkSession</code> is the entry point for working with Apache Spark. It provides a unified interface for interacting with Spark functionality, allowing you to create DataFrames, execute SQL queries, and manage Spark configurations. Think of it as the gateway to all Spark operations in your application.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Get the existing SparkSession or create a new one</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s1">&#39;Spark Demo&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">master</span><span class="p">(</span><span class="s1">&#39;local[*]&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span>
</span></span></code></pre></div><pre tabindex="0"><code>SparkSession - in-memory

SparkContext

Spark UI

Version    v3.2.1
Master     local[*]
AppName    Spark Demo
</code></pre><h3 id="load-data">Load data</h3>
<p>PySpark can load data from various types of data storage. In this tutorial we will use the <a href="https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data/data">Fraudulent Transactions Dataset</a>. This dataset provides a CSV file that is sufficient for demo purposes.</p>
<p>The SparkSession object provides <code>read</code> as a property that returns a <code>DataFrameReader</code>, which can be used to read data into a <code>DataFrame</code>. The following code reads a CSV file into a DataFrame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Load CSV file to DataFrame</span>
</span></span><span class="line"><span class="cl"><span class="n">data_path</span> <span class="o">=</span> <span class="s1">&#39;../input/fraudulent-transactions-data/Fraud.csv&#39;</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">inferSchema</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><p>The <code>inferSchema</code> parameter allows Spark to automatically infer the data type of each column based on the actual data in the file. This involves reading a sample of the data, which can be computationally expensive. The inferred types can also be incorrect, especially if the sample doesn&rsquo;t represent the entire dataset well.</p>
<p>Alternatively, to achieve better performance and ensure accurate data types, you can define the schema explicitly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">types</span> <span class="k">as</span> <span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Read CSV with pre-defined schema</span>
</span></span><span class="line"><span class="cl"><span class="n">predefined_schema</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">StructType</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;step&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;oldbalanceOrg&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;newbalanceOrig&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span> 
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;oldbalanceDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;newbalanceDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span> 
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;isFraud&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;isFlaggedFraud&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">predefined_schema</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><p>The dataset contains some inconsistently formatted column names. I will rename them all to camel case.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Rename columns</span>
</span></span><span class="line"><span class="cl"><span class="n">corrected_cols</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;oldbalanceOrg&#39;</span><span class="p">:</span> <span class="s1">&#39;oldBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;newbalanceOrig&#39;</span><span class="p">:</span> <span class="s1">&#39;newBalanceOrig&#39;</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">                  <span class="s1">&#39;oldbalanceDest&#39;</span><span class="p">:</span> <span class="s1">&#39;oldBalanceDest&#39;</span><span class="p">,</span> <span class="s1">&#39;newbalanceDest&#39;</span><span class="p">:</span> <span class="s1">&#39;newBalanceDest&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">old_col</span><span class="p">,</span> <span class="n">new_col</span> <span class="ow">in</span> <span class="n">corrected_cols</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">withColumnRenamed</span><span class="p">(</span><span class="n">old_col</span><span class="p">,</span> <span class="n">new_col</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldBalanceOrig: double (nullable = true)
 |-- newBalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldBalanceDest: double (nullable = true)
 |-- newBalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><h3 id="data-overview">Data Overview</h3>
<p>You can quickly look at the data with <code>DataFrame.show</code>, which prints the first n rows to the screen.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Prints top 10 rows of PySpark DataFrame to the screen</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|    type|  amount|   nameOrig|oldBalanceOrig|newBalanceOrig|   nameDest|oldBalanceDest|newBalanceDest|isFraud|isFlaggedFraud|
+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
|   1| PAYMENT| 9839.64|C1231006815|      170136.0|     160296.36|M1979787155|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 1864.28|C1666544295|       21249.0|      19384.72|M2044282225|           0.0|           0.0|      0|             0|
|   1|TRANSFER|   181.0|C1305486145|         181.0|           0.0| C553264065|           0.0|           0.0|      1|             0|
|   1|CASH_OUT|   181.0| C840083671|         181.0|           0.0|  C38997010|       21182.0|           0.0|      1|             0|
|   1| PAYMENT|11668.14|C2048537720|       41554.0|      29885.86|M1230701703|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7817.71|  C90045638|       53860.0|      46042.29| M573487274|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7107.77| C154988899|      183195.0|     176087.23| M408069119|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7861.64|C1912850431|     176087.23|     168225.59| M633326333|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 4024.36|C1265012928|        2671.0|           0.0|M1176932104|           0.0|           0.0|      0|             0|
|   1|   DEBIT| 5337.77| C712410124|       41720.0|      36382.23| C195600860|       41898.0|      40348.79|      0|             0|
+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
only showing top 10 rows
</code></pre><p>In many cases, the result does not fit on the screen and produces unreadable output.</p>
<p><img alt="PySpark load CSV show not fit screen" loading="lazy" src="/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/images/pyspark-load-csv-show-not-fit-screen.png"></p>
<p>This is where Python comes in. With PySpark, you can mix ordinary Python code with the Spark APIs to improve the output. The following function uses a Python loop to split the columns into subsets and display a sample of each.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Split columns into subsets and show it accordingly</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">show_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">split</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">n_cols</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">split</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">split</span> <span class="o">=</span> <span class="n">n_cols</span>
</span></span><span class="line"><span class="cl">    <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">split</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_cols</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="o">*</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">j</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">n_samples</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">i</span> <span class="o">=</span> <span class="n">j</span>
</span></span><span class="line"><span class="cl">        <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">split</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">show_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+----+--------+--------+-----------+
|step|    type|  amount|   nameOrig|
+----+--------+--------+-----------+
|   1| PAYMENT| 9839.64|C1231006815|
|   1| PAYMENT| 1864.28|C1666544295|
|   1|TRANSFER|   181.0|C1305486145|
|   1|CASH_OUT|   181.0| C840083671|
|   1| PAYMENT|11668.14|C2048537720|
|   1| PAYMENT| 7817.71|  C90045638|
|   1| PAYMENT| 7107.77| C154988899|
|   1| PAYMENT| 7861.64|C1912850431|
|   1| PAYMENT| 4024.36|C1265012928|
|   1|   DEBIT| 5337.77| C712410124|
+----+--------+--------+-----------+
only showing top 10 rows

+--------------+--------------+-----------+--------------+
|oldBalanceOrig|newBalanceOrig|   nameDest|oldBalanceDest|
+--------------+--------------+-----------+--------------+
|      170136.0|     160296.36|M1979787155|           0.0|
|       21249.0|      19384.72|M2044282225|           0.0|
|         181.0|           0.0| C553264065|           0.0|
|         181.0|           0.0|  C38997010|       21182.0|
|       41554.0|      29885.86|M1230701703|           0.0|
|       53860.0|      46042.29| M573487274|           0.0|
|      183195.0|     176087.23| M408069119|           0.0|
|     176087.23|     168225.59| M633326333|           0.0|
|        2671.0|           0.0|M1176932104|           0.0|
|       41720.0|      36382.23| C195600860|       41898.0|
+--------------+--------------+-----------+--------------+
only showing top 10 rows

+--------------+-------+--------------+
|newBalanceDest|isFraud|isFlaggedFraud|
+--------------+-------+--------------+
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      1|             0|
|           0.0|      1|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|      40348.79|      0|             0|
+--------------+-------+--------------+
only showing top 10 rows
</code></pre><p>When working with numerical data, it is not very useful to look at a long series of raw values. We are often more interested in a few key statistics, such as count, mean, standard deviation, minimum, and maximum. PySpark&rsquo;s <code>DataFrame</code> provides the <code>describe</code> and <code>summary</code> functions, with slightly different usage, to present these essential metrics.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DataFrame.describe takes columns as params</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">(</span><span class="s1">&#39;step&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-------+------------------+------------------+
|summary|              step|            amount|
+-------+------------------+------------------+
|  count|           6362620|           6362620|
|   mean|243.39724563151657|179861.90354913412|
| stddev|142.33197104912588| 603858.2314629498|
|    min|                 1|               0.0|
|    max|               743|     9.244551664E7|
+-------+------------------+------------------+
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DataFrame.summary takes statistics as params</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;oldBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;newBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;oldBalanceDest&#39;</span><span class="p">,</span> <span class="s1">&#39;newBalanceDest&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s1">&#39;count&#39;</span><span class="p">,</span> <span class="s1">&#39;min&#39;</span><span class="p">,</span> <span class="s1">&#39;max&#39;</span><span class="p">,</span> <span class="s1">&#39;mean&#39;</span><span class="p">,</span> <span class="s1">&#39;50%&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-------+-----------------+-----------------+------------------+------------------+
|summary|   oldBalanceOrig|   newBalanceOrig|    oldBalanceDest|    newBalanceDest|
+-------+-----------------+-----------------+------------------+------------------+
|  count|          6362620|          6362620|           6362620|           6362620|
|    min|              0.0|              0.0|               0.0|               0.0|
|    max|    5.958504037E7|    4.958504037E7|    3.5601588935E8|    3.5617927892E8|
|   mean|833883.1040744719|855113.6685785714|1100701.6665196654|1224996.3982019408|
|    50%|         14211.23|              0.0|         132612.49|         214605.81|
+-------+-----------------+-----------------+------------------+------------------+
</code></pre><h3 id="query-data">Query data</h3>
<h4 id="select-and-filter">Select and Filter</h4>
<p>PySpark borrows a lot of vocabulary from the SQL world, but it offers more flexibility: you do not need to follow the strict SQL clause order (select &hellip; from &hellip; where &hellip;). Each operation returns a <code>DataFrame</code> or <code>GroupedData</code> object that you can continue to work with.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># First .where() filter DataFrame and return another DataFrame</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Then .select() select from the returned DataFrame </span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+
|    type|   amount|
+--------+---------+
|CASH_OUT|    181.0|
|CASH_OUT|229133.94|
|CASH_OUT|110414.71|
|CASH_OUT|  56953.9|
|CASH_OUT|  5346.89|
|CASH_OUT|  23261.3|
|CASH_OUT| 82940.31|
|CASH_OUT| 47458.86|
|CASH_OUT|136872.92|
|CASH_OUT| 94253.33|
+--------+---------+
only showing top 10 rows
</code></pre><p>The above example shows three different ways to reference PySpark columns:</p>
<ul>
<li><code>df.type</code>: Access as an attribute.</li>
<li><code>df['type']</code>: Access as an item.</li>
<li><code>F.col('type')</code>: Explicitly specifies a column reference rather than a string literal.</li>
</ul>
<p>You can also filter on multiple conditions using the <code>&amp;</code>, <code>|</code>, and <code>~</code> operators.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PySpark example filter multiple conditions</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">500</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><p>For users who are more familiar with SQL syntax, Spark also lets you write SQL queries directly. Before doing so, you need to register your <code>DataFrame</code> as a temporary view, which allows you to reference it by name in your queries.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create or replace temp view named &#34;df&#34; from DataFrame df in PySpark</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s1">&#39;df&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Spark SQL query example. You can now reference df in your query</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT type, amount 
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    WHERE type = &#34;CASH_OUT&#34;    
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+
|    type|   amount|
+--------+---------+
|CASH_OUT|    181.0|
|CASH_OUT|229133.94|
|CASH_OUT|110414.71|
|CASH_OUT|  56953.9|
|CASH_OUT|  5346.89|
|CASH_OUT|  23261.3|
|CASH_OUT| 82940.31|
|CASH_OUT| 47458.86|
|CASH_OUT|136872.92|
|CASH_OUT| 94253.33|
+--------+---------+
only showing top 10 rows
</code></pre><h4 id="aggregating-with-groupby">Aggregating with <code>groupBy</code></h4>
<p>PySpark provides a similar syntax to Pandas for aggregating data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Example of PySpark groupBy</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Sometimes we can pass column names directly to PySpark functions</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The `Column.alias` method changes the name of the result column.</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT type, AVG(amount) avgAmount
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    GROUP BY type
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY 2
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|         avgAmount|
+--------+------------------+
|   DEBIT| 5483.665313767128|
| PAYMENT|13057.604660187604|
| CASH_IN| 168920.2420040954|
|CASH_OUT|176273.96434613998|
|TRANSFER| 910647.0096454868|
+--------+------------------+
</code></pre><p>To filter after <code>groupBy</code>, simply apply <code>where</code> or <code>filter</code> to the resulting <code>DataFrame</code>, or follow the SQL convention with the <code>HAVING</code> keyword.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;sumAmount&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;sumAmount&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">300000</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig, SUM(amount) sumAmount
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    WHERE type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">    GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s1">    HAVING sumAmount &gt; 300000
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+---------+
|   nameOrig|sumAmount|
+-----------+---------+
| C551314014|301050.58|
| C661668091|323789.56|
| C228994633|517946.01|
|C1591008292|558254.22|
|C2100435651|357988.09|
| C624052656|476735.47|
| C948681098|353759.28|
|  C50682517|386128.82|
|C1579521009|684561.18|
|C1871922377|394317.12|
+-----------+---------+
only showing top 10 rows
</code></pre><h4 id="union-and-intersection">Union and Intersection</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>12725240
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig from df
</span></span></span><span class="line"><span class="cl"><span class="s1">    UNION
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameDest from df
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>9073900
</code></pre><p>Notice the difference in the counts. The PySpark <code>union</code> function keeps duplicate rows from both sets, which is equivalent to <code>UNION ALL</code> in SQL. By default, PySpark does not remove duplicates because deduplication is an expensive operation; if you want to drop duplicates, you have to do it explicitly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Union and drop duplicates in PySpark</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>9073900
</code></pre><p>Unioning is useful when reading data from multiple files: read them one by one in a Python loop and union the results.</p>
<p>Intersection works similarly to union, but note the reversed convention: PySpark <code>intersect</code> is equivalent to SQL <code>INTERSECT</code> (duplicates removed), not <code>INTERSECT ALL</code>.</p>
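<p>These set semantics can be illustrated without Spark at all, using plain Python lists as stand-ins for single-column <code>DataFrame</code>s (a minimal sketch; the account names below are made up for illustration):</p>

```python
from collections import Counter

orig = ['C1', 'C2', 'C2', 'C3']  # stand-in for df.select('nameOrig')
dest = ['C2', 'C2', 'C3', 'C4']  # stand-in for df.select('nameDest')

# PySpark union == SQL UNION ALL: duplicates are kept
print(len(orig + dest))                     # 8

# SQL UNION: distinct values only (PySpark: .union(...).dropDuplicates())
print(len(set(orig) | set(dest)))           # 4

# PySpark intersect == SQL INTERSECT: distinct common values
print(sorted(set(orig) & set(dest)))        # ['C2', 'C3']

# SQL INTERSECT ALL would keep the matched multiplicities instead
print(sorted((Counter(orig) & Counter(dest)).elements()))  # ['C2', 'C2', 'C3']
```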
<h4 id="join">Join</h4>
<p>Very similar to Pandas, the <code>DataFrame.join</code> method joins one <code>DataFrame</code> with another using a given join expression.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;ABS(newBalanceOrig - oldBalanceOrig) changeOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;changeOrig&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgChangeOrig&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;occOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;avgChangeOrig &gt; 100000&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Join the above DataFrame with the one provided in parameter</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">join</span><span class="p">((</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">,</span> <span class="s1">&#39;ABS(newBalanceDest - oldBalanceDest) changeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;changeDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgChangeDest&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;occDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;avgChangeDest &gt; 100000&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span> <span class="n">on</span><span class="o">=</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">==</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">),</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># There are several join types: inner, cross, outer (full), left, right, left_outer, right_outer, left_semi, left_anti, ...</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameOrig name&#39;</span><span class="p">,</span> <span class="s1">&#39;occOrig + occDest occ&#39;</span><span class="p">,</span> <span class="s1">&#39;avgChangeOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;avgChangeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;occ&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig name, occOrig + occDest occ, avgChangeOrig, avgChangeDest
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM
</span></span></span><span class="line"><span class="cl"><span class="s1">    (
</span></span></span><span class="line"><span class="cl"><span class="s1">        SELECT nameOrig, AVG(ABS(newBalanceOrig - oldBalanceOrig)) avgChangeOrig, COUNT(*) occOrig
</span></span></span><span class="line"><span class="cl"><span class="s1">        FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">        WHERE type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">        GROUP BY nameOrig
</span></span></span><span class="line"><span class="cl"><span class="s1">        HAVING avgChangeOrig &gt; 100000
</span></span></span><span class="line"><span class="cl"><span class="s1">    )
</span></span></span><span class="line"><span class="cl"><span class="s1">    INNER JOIN
</span></span></span><span class="line"><span class="cl"><span class="s1">    (
</span></span></span><span class="line"><span class="cl"><span class="s1">        SELECT nameDest, AVG(ABS(newBalanceDest - oldBalanceDest)) avgChangeDest, COUNT(*) occDest
</span></span></span><span class="line"><span class="cl"><span class="s1">        FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">        WHERE type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">        GROUP BY nameDest
</span></span></span><span class="line"><span class="cl"><span class="s1">        HAVING avgChangeDest &gt; 100000
</span></span></span><span class="line"><span class="cl"><span class="s1">    )
</span></span></span><span class="line"><span class="cl"><span class="s1">    ON nameOrig = nameDest
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY occ DESC
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+---+------------------+------------------+
|       name|occ|     avgChangeOrig|     avgChangeDest|
+-----------+---+------------------+------------------+
|C1552859894| 43|193711.30000000005| 763241.1652380949|
|C1819271729| 37|         278937.79|283626.17805555544|
|C1692434834| 37|177369.73000000045| 438853.7616666666|
| C889762313| 32|         132731.31|211437.18741935486|
|C1868986147| 32|         120594.03|249840.37709677417|
|  C55305556| 28|319860.45999999903|225565.42111111112|
| C636092700| 26|217273.86000000004|201888.05279999998|
|C1713505653| 25| 278622.8400000003|186625.34916666665|
|C2029542508| 24| 235760.1200000001|231022.98217391354|
| C699906968| 23| 177813.3799999999| 183054.3072727272|
+-----------+---+------------------+------------------+
only showing top 10 rows
</code></pre><p>In the example above, I mixed PySpark and SQL syntax for cleaner code. Instead of the verbose expression:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_IN&#39;</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">))</span>
</span></span></code></pre></div><p>You can write:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>This style can be applied in various PySpark functions: <code>selectExpr</code>, <code>where</code>, <code>filter</code>, <code>expr</code>, &hellip; Choose your preferred coding style; PySpark offers the flexibility.</p>
<h2 id="endnote">Endnote</h2>
<p>This tutorial has covered basic Spark operations in both Python and SQL syntax, enough to perform the most common data transformation and analysis tasks. But your Spark journey doesn&rsquo;t end here! More advanced features that were not covered in this article (e.g., UDFs) will be discussed in <a href="../pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/">another post</a> soon.</p>
]]></content:encoded></item><item><title>Snowflake ID - Simplifying uniqueness in distributed systems</title><link>https://note.datengineer.dev/posts/snowflake-id-simplifying-uniqueness-in-distributed-systems/</link><pubDate>Sat, 03 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/snowflake-id-simplifying-uniqueness-in-distributed-systems/</guid><description>Guide to generating unique IDs in distributed data systems.</description><content:encoded><![CDATA[<h2 id="problem-description">Problem description</h2>
<p>In developing database systems, generating IDs is a crucial task. IDs ensure the uniqueness of data, facilitate queries, and establish relationship constraints in databases. Most modern database management systems (DBMS) can generate auto-increment IDs. We can delegate this task to the DBMS entirely and not worry about the uniqueness. However, there are several reasons why we shouldn&rsquo;t use auto-increment IDs, especially for distributed systems. The most important reason is that in distributed systems with independent servers, using per-server auto-increment IDs does not guarantee uniqueness and can lead to duplication problems.</p>
<p><a href="https://developer.twitter.com/en/docs/basics/twitter-ids">Snowflake ID</a> is the solution developed by Twitter engineers to address this problem. According to statistics, about 6,000 tweets are written and posted on Twitter every second. How can we generate 6,000 IDs per second independently on multiple servers without collision?</p>
<h2 id="hold-on-what-about-uuid">Hold on! What about UUID?</h2>
<p>UUID is another widely used ID generation technique with a long history in software.</p>
<p>The idea of this technique is to use a 128-bit number as an ID. With a 128-bit integer, if we consume 6,000 IDs every second, it will take over <code>2^128 / (6000 * 3600 * 24 * 365) ≈ 1.79838 * 10^27</code> years <em>(how is that pronounced, octillion?)</em> to exhaust them. And if we randomly generate 103 trillion version-4 UUIDs (each carrying 122 random bits), the chance of a collision is about one in a billion. Of course, these numbers are not generated by simple incremental counting like 1, 2, 3, &hellip; but follow a defined standard. When generated accordingly, UUIDs solve the problem of generating non-duplicate IDs in distributed systems.</p>
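<p>The back-of-the-envelope arithmetic above is easy to check with Python&rsquo;s arbitrary-precision integers and the standard <code>uuid</code> module:</p>

```python
import uuid

# Years to exhaust the 128-bit space at 6,000 IDs per second
ids_per_year = 6000 * 3600 * 24 * 365
years = 2**128 / ids_per_year
print(f"{years:.3e}")     # 1.798e+27

# Version-4 UUIDs are built from random bits, so no coordination is needed
u = uuid.uuid4()
print(u)                  # e.g. 8fbb69e1-2132-4c86-911b-4cc182a5513a
print(u.version)          # 4
```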
<p>But it introduces another problem: 128 bits is unnecessarily large. Most computers don&rsquo;t support working directly with a 128-bit integer type, so we usually have to process UUIDs as strings. In addition, a large ID hurts query performance because indexes grow larger and comparisons become more costly.</p>
<p><em>Example UUIDs:</em></p>
<pre tabindex="0"><code>8fbb69e1-2132-4c86-911b-4cc182a5513a
b1f352c3-2126-4cca-9eec-349cdb69b611
6c7fa5c6-1b70-4b47-8690-760f2871943d
df495b1b-cd86-4e08-a42a-9f73d2c5afd1
13e1b2c2-ed6e-46ee-94a3-efce635ef268
</code></pre><h2 id="snowflake-id">Snowflake ID</h2>
<p>To solve this problem, Twitter engineers introduced a system called Snowflake ID. The idea is to programmatically generate a 64-bit integer as the ID. But how can each server generate these IDs independently without collisions?</p>
<p>The proposed method is as follows:</p>
<ul>
<li>The first bit is unused (always 0) so that the ID fits into a signed 64-bit integer and is always positive.</li>
<li>The next 41 bits store the ID creation time, measured in milliseconds from a fixed starting point (epoch 1288834974657 in Unix time).</li>
<li>The next 10 bits identify the machine requesting the ID.</li>
<li>The last 12 bits are a sequence counter from 0 to 4095, which avoids duplicate IDs within the same millisecond on the same machine.</li>
</ul>
<p><img alt="Snowflake ID Generate ID for database" loading="lazy" src="/posts/snowflake-id-simplifying-uniqueness-in-distributed-systems/images/snowflake-id-format-id-generation-distributed-system-min.png"></p>
<p>The only scenario where collisions can occur is when a single machine requests more than 4096 IDs in a single millisecond (in Twitter&rsquo;s case, when a machine posts more than 4096 tweets in a millisecond). With Snowflake ID, we can generate non-duplicate IDs in distributed systems using only a 64-bit integer. Additionally, a Snowflake ID contains its own creation time, so we can read the creation time of an ID directly from its value, or sort by ID and get results sorted by time.</p>
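<p>The bit layout above can be sketched in a few lines of Python. This is an illustrative sketch only; a production generator must also track the current time, wait when the per-millisecond sequence overflows, and handle clock rollback:</p>

```python
TWITTER_EPOCH = 1288834974657  # custom epoch, in Unix milliseconds

def snowflake_id(timestamp_ms: int, machine_id: int, sequence: int) -> int:
    """Pack 41 bits of time, 10 bits of machine ID, and 12 bits of sequence."""
    assert 0 <= machine_id < 1024 and 0 <= sequence < 4096
    return ((timestamp_ms - TWITTER_EPOCH) << 22) | (machine_id << 12) | sequence

def decode(sid: int) -> tuple:
    """Recover creation time, machine ID, and sequence from an ID."""
    return ((sid >> 22) + TWITTER_EPOCH, (sid >> 12) & 0x3FF, sid & 0xFFF)

sid = snowflake_id(1707000000000, machine_id=7, sequence=42)
print(decode(sid))  # (1707000000000, 7, 42)

# IDs generated later are numerically larger, so sorting by ID sorts by time
assert snowflake_id(1707000000001, 0, 0) > snowflake_id(1707000000000, 1023, 4095)
```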
<h2 id="conclusion">Conclusion</h2>
<p>Is Snowflake ID the answer for every system? Absolutely not! <strong>Nothing is the answer for everything</strong>. Besides the ID generation strategies mentioned above, there are many other approaches (e.g., Flickr&rsquo;s centralized ticket server). There are always many ways to solve a problem, each with its own pros and cons. Don&rsquo;t limit yourself to existing methods; always look for new, context-appropriate solutions.</p>
]]></content:encoded></item></channel></rss>