Recursive CTEs and CONNECT BY in SQL to Query Hierarchical Data

In database design, hierarchical data represents relationships between entities as a tree-like structure. This data model is widely used across many domains, such as file systems and organizational charts. When dealing with hierarchical data, it is crucial to query and extract information about the relationships between entities efficiently. In this post, we will explore two powerful SQL tools for querying hierarchical data: recursive Common Table Expressions (CTEs) and the CONNECT BY clause. ...
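As a quick taste of the post's topic, here is a minimal sketch of a recursive CTE walking an org-chart hierarchy, run through Python's built-in sqlite3 module. The `employees` table and its columns are illustrative examples, not taken from the post:

```python
import sqlite3

# Illustrative org chart: each row points to its manager (NULL for the root).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "CEO", None), (2, "VP Eng", 1), (3, "Engineer", 2)],
)

# The anchor member selects the root; the recursive member repeatedly
# joins back to the CTE to pull in each node's direct reports.
rows = conn.execute("""
    WITH RECURSIVE org(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, org.depth + 1
        FROM employees e JOIN org ON e.manager_id = org.id
    )
    SELECT name, depth FROM org ORDER BY depth
""").fetchall()
print(rows)  # [('CEO', 0), ('VP Eng', 1), ('Engineer', 2)]
```

The same traversal can be written in Oracle with `CONNECT BY PRIOR id = manager_id`; the CTE form above is the portable ANSI-SQL equivalent.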

Feb 20, 2024 · 10 min · Dat Thai

What Is a Reliable Data System?

In today’s data-driven world, information is gold, and the systems that store and manage it serve as crucial infrastructure. I have seen people talk a lot about terms like “distributed computing” and “scalability”, but one fundamental characteristic is often overlooked: reliability. Without it, scalability, maintainability, flexibility, anything-bility are meaningless, like a beautiful castle built on sand. What is Reliability? Everyone has their own intuition about what is reliable: A piggy bank is reliable because it consistently holds your money and accurately reflects what you’ve deposited. You trust that when you put a coin in, it will be there later, and the total will reflect your savings. And when you want to make a withdrawal, you can get your money immediately. A calculator is reliable because it consistently produces accurate results for your input: regardless of who uses it, 2 + 2 will always equal 4, and the result appears instantly on the screen. Different systems have different reliability requirements. In general, we can define reliability as follows: ...

Feb 16, 2024 · 4 min · Dat Thai

PySpark UDFs: A comprehensive guide to unlocking PySpark's potential

Introduction Apache Spark is a powerful open-source distributed computing engine designed to handle large datasets across clusters. PySpark is the Python API for Spark, allowing data engineers and data scientists to easily use the framework in their preferred language. This post is a continuation of the previous tutorial. It began as a Jupyter notebook I created while learning PySpark; I recently found it and decided to update and publish it on my blog. ...

Feb 9, 2024 · 7 min · Dat Thai

A Practical PySpark tutorial for beginners in Jupyter Notebook

Introduction In today’s world of data, the ability to efficiently process and analyze large amounts of data is crucial for businesses and organizations. This is where PySpark comes in: an open-source, distributed computing framework built on top of Apache Spark. With its seamless integration with Python, PySpark allows users to leverage the powerful data processing capabilities of Spark directly from Python scripts. This post was originally a Jupyter notebook I created when I started learning PySpark, intended as a personal cheat sheet. Now that I have a blog (a place for my notes), I’ve decided to update and share it here as a complete hands-on tutorial for beginners. ...

Feb 8, 2024 · 11 min · Dat Thai

Snowflake ID - Simplifying uniqueness in distributed systems

Problem description In database systems, generating IDs is a crucial task: IDs ensure the uniqueness of data, facilitate queries, and establish relationship constraints. Most modern database management systems (DBMS) can generate auto-increment IDs, so we could delegate this task to the DBMS entirely and not worry about uniqueness. However, there are several reasons not to use auto-increment IDs, especially in distributed systems. The most important is that with independent servers, per-server auto-increment IDs do not guarantee global uniqueness and can lead to duplication problems. ...
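To illustrate the idea the post covers, here is a minimal Python sketch of a Snowflake-style generator, assuming Twitter's original bit layout (41-bit timestamp, 10-bit machine ID, 12-bit per-millisecond sequence). Names like `SnowflakeGenerator` are my own, not from the post:

```python
import threading
import time

class SnowflakeGenerator:
    """Sketch of a Snowflake-style ID generator (not production-hardened)."""

    def __init__(self, machine_id, epoch_ms=1288834974657):  # Twitter's epoch
        assert 0 <= machine_id < 1024          # machine ID fits in 10 bits
        self.machine_id = machine_id
        self.epoch_ms = epoch_ms
        self.last_ts = -1
        self.sequence = 0
        self.lock = threading.Lock()           # one generator per process/thread group

    def next_id(self):
        with self.lock:
            ts = int(time.time() * 1000) - self.epoch_ms
            if ts == self.last_ts:
                # Same millisecond: bump the 12-bit sequence counter.
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # Sequence exhausted; spin until the next millisecond.
                    while ts <= self.last_ts:
                        ts = int(time.time() * 1000) - self.epoch_ms
            else:
                self.sequence = 0
            self.last_ts = ts
            # 41 bits timestamp | 10 bits machine | 12 bits sequence
            return (ts << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=1)
ids = [gen.next_id() for _ in range(5)]
print(len(set(ids)) == 5 and ids == sorted(ids))  # True: unique and increasing
```

Because the timestamp occupies the high bits, IDs generated this way sort roughly by creation time, which also keeps B-tree index inserts mostly append-only.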

Feb 3, 2024 · 3 min · Dat Thai