Understand Row-Oriented vs Column-Oriented Storage

The way we access and analyze data has changed dramatically in recent years. Row-oriented storage, long the standard for data storage, is struggling to keep up with the demands of modern data analysis. In this article, I will introduce you to column-oriented storage and how it can help analytical queries run faster. In my previous post, we discussed the differences between Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP). As a reminder, OLAP, which is the access pattern of analytical queries, typically: ...
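To give a taste of the idea the post develops, here is a minimal sketch (not taken from the post itself; the table and values are illustrative) of the same data laid out row-wise and column-wise, and why an analytical query like summing one column favors the columnar layout:

```python
# Minimal illustration: one table, two physical layouts.
# An analytical query that aggregates a single column only needs to
# touch that column's values in the column-oriented layout, instead of
# scanning every full record as in the row-oriented layout.

rows = [  # row-oriented: each record stored together
    {"id": 1, "city": "Hanoi", "amount": 120},
    {"id": 2, "city": "Tokyo", "amount": 250},
    {"id": 3, "city": "Hanoi", "amount": 80},
]

columns = {  # column-oriented: each column stored contiguously
    "id": [1, 2, 3],
    "city": ["Hanoi", "Tokyo", "Hanoi"],
    "amount": [120, 250, 80],
}

# Row layout: visit every record, picking one field out of each.
total_row = sum(r["amount"] for r in rows)

# Column layout: read a single contiguous array.
total_col = sum(columns["amount"])

print(total_row, total_col)  # 450 450
```

Real columnar engines add compression and vectorized execution on top of this layout, but the access-pattern difference above is the core of it.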

Apr 5, 2024 · 5 min

OLTP & OLAP - Why we need Data Warehouse

Today, I was advising a team on building their data warehouse solution. I realized that even 40 years after the term “data warehouse” was first introduced, there are still questions, especially from executives, about why we need a data warehouse and why we don’t just pull all of the data from application databases. I write this post to answer these questions by clarifying the terms OLTP and OLAP, which come up frequently in discussions about data warehouse architecture. Then I will explain why OLTP databases are inefficient for OLAP queries and why you need a separate database known as a data warehouse. ...

Feb 28, 2024 · 4 min

Recursive CTEs and CONNECT BY in SQL to query Hierarchical data

In database design, hierarchical data represents relationships between entities as a tree-like structure. This type of data model is widely used in many domains, such as file systems and organizational structures. When dealing with hierarchical data, it is crucial to query and extract information about the relationships between entities efficiently. In this post, we will explore two powerful SQL tools for querying hierarchical data: recursive Common Table Expressions (CTEs) and the CONNECT BY clause. ...
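As a quick preview of the first of those tools, here is a minimal sketch of a recursive CTE walking an org chart, using SQLite's `WITH RECURSIVE` via Python's standard `sqlite3` module (the `employees` table and names are hypothetical, not from the post):

```python
import sqlite3

# Hypothetical org chart: manager_id is NULL for the root of the tree.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'Alice', NULL),
  (2, 'Bob',   1),
  (3, 'Carol', 1),
  (4, 'Dave',  2);
""")

# Recursive CTE: the anchor member selects the root, then the recursive
# member repeatedly joins children to rows found so far, tracking depth.
query = """
WITH RECURSIVE org(id, name, depth) AS (
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, org.depth + 1
    FROM employees e JOIN org ON e.manager_id = org.id
)
SELECT name, depth FROM org ORDER BY depth, name;
"""
result = conn.execute(query).fetchall()
print(result)  # [('Alice', 0), ('Bob', 1), ('Carol', 1), ('Dave', 2)]
```

`CONNECT BY` (covered in the post) expresses the same traversal in Oracle's syntax, without the explicit anchor/recursive split.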

Feb 20, 2024 · 10 min

What is a reliable Data System?

In today’s data-driven world, information is gold, and the systems that store and manage it serve as crucial infrastructure. I have seen people talk a lot about terms like “distributed computing”, “scalability”… but one fundamental characteristic is often overlooked: reliability. Without it, scalability, maintainability, flexibility, anything-bility are meaningless, like a beautiful castle built on sand. What is Reliability? Everyone has their own intuition about what is reliable: A piggy bank is reliable because it consistently holds your money and accurately reflects what you’ve deposited. You trust that when you put a coin in, it will be there later, and the total will reflect your savings. And when you want to make a withdrawal, you can get your money immediately. A calculator is reliable because it consistently produces accurate results based on your input. You trust that regardless of who uses it, 2 + 2 will always equal 4. And the result should appear instantly on the screen. Different systems have different reliability requirements. In general, we can define reliability as follows: ...

Feb 16, 2024 · 4 min

PySpark UDFs: A comprehensive guide to unlock PySpark potential

Introduction Apache Spark is a powerful open-source distributed computing engine designed to handle large datasets across clusters. PySpark is the Python programming API for Spark. It allows data engineers and data scientists to easily utilize the framework in their preferred language. This post is a continuation of the previous tutorial. Originally a Jupyter notebook I created while learning PySpark, I recently found it and decided to update it and publish it on my blog. ...

Feb 9, 2024 · 7 min

A Practical PySpark tutorial for beginners in Jupyter Notebook

Introduction In today’s world of data, the ability to efficiently process and analyze large amounts of data is crucial for businesses and organizations. This is where PySpark comes in - an open-source, distributed computing framework built on top of Apache Spark. With its seamless integration with Python, PySpark allows users to leverage the powerful data processing capabilities of Spark directly from Python scripts. This post was originally a Jupyter Notebook I created when I started learning PySpark, intended as a cheat sheet for when I work with it. Now that I have a blog (a place for my notes), I decided to update it and share it here as a complete hands-on tutorial for beginners. ...

Feb 8, 2024 · 11 min