What is a reliable Data System?

In today’s data-driven world, information is gold, and the systems that store and manage it serve as crucial infrastructure. I have seen people talk a lot about terms like “distributed computing”, “scalability”… but one fundamental characteristic is often overlooked: reliability. Without it, scalability, maintainability, flexibility, anything-bility are meaningless, like a beautiful castle built on sand. What is Reliability? Everyone has their own intuition about what is reliable: A piggy bank is reliable because it consistently holds your money and accurately reflects what you’ve deposited....

Feb 16, 2024 · 4 min

PySpark UDFs: A comprehensive guide to unlock PySpark potential

Introduction Apache Spark is a powerful open source distributed computing engine designed to handle large datasets across clusters. PySpark is the Python programming API for Spark. It allows data engineers and data scientists to easily use the framework in their preferred language. This post is a continuation of the previous tutorial. Originally a Jupyter notebook I created while learning PySpark, I recently found it and decided to update it and publish it on my blog....

Feb 9, 2024 · 7 min

A Practical PySpark tutorial for beginners in Jupyter Notebook

Introduction In today’s world of data, the ability to efficiently process and analyze large amounts of data is crucial for businesses and organizations. This is where PySpark comes in: the Python API for Apache Spark, an open-source distributed computing framework. With its seamless integration with Python, PySpark allows users to leverage the powerful data processing capabilities of Spark directly from Python scripts. This post was originally a Jupyter Notebook I created when I started learning PySpark, intended as a cheat sheet for me when working with it....

Feb 8, 2024 · 11 min

Snowflake ID - Simplifying uniqueness in distributed systems

Problem description In developing database systems, generating IDs is a crucial task. IDs ensure the uniqueness of data, facilitate queries, and establish relationship constraints in databases. Most modern database management systems (DBMS) can generate auto-increment IDs. We can delegate this task to the DBMS entirely and not worry about the uniqueness. However, there are several reasons why we shouldn’t use auto-increment IDs, especially for distributed systems. The most important reason is that in distributed systems with independent servers, using per-server auto-increment IDs does not guarantee uniqueness and can lead to duplication problems....

Feb 3, 2024 · 3 min