PySpark UDFs: A comprehensive guide to unlocking PySpark's potential

Introduction Apache Spark is a powerful open-source distributed computing engine designed to handle large datasets across clusters. PySpark is the Python API for Spark, allowing data engineers and data scientists to easily use the framework in their preferred language. This post is a continuation of the previous tutorial. It began as a Jupyter notebook I created while learning PySpark; I recently found it and decided to update it and publish it on my blog....
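The post's central topic, user-defined functions, boils down to wrapping a plain Python function so Spark can apply it to DataFrame columns. A minimal sketch of that idea follows; the app name, sample rows, and column names are illustrative assumptions, not taken from the post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Hypothetical session and data, purely for illustration.
spark = SparkSession.builder.appName("udf-example").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Wrap a plain Python function as a UDF so Spark can apply it to a column.
capitalize = udf(lambda s: s.capitalize() if s is not None else None, StringType())

df.withColumn("name_cap", capitalize(col("name"))).show()
```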

Feb 9, 2024 · 7 min
PySpark tutorial for beginners in Jupyter Notebook

A practical PySpark tutorial for beginners in Jupyter Notebook

Introduction In today’s world of data, the ability to efficiently process and analyze large amounts of data is crucial for businesses and organizations. This is where PySpark comes in - an open-source, distributed computing framework built on top of Apache Spark. With its seamless integration with Python, PySpark allows users to leverage the powerful data processing capabilities of Spark directly from Python scripts. This post was originally a Jupyter Notebook I created when I started learning PySpark, intended as a cheat sheet for my own work with it....
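As a flavour of what such a notebook-based cheat sheet covers, here is a minimal getting-started sketch; the app name, sample rows, and column names are hypothetical and not taken from the post:

```python
from pyspark.sql import SparkSession

# In a Jupyter notebook, the SparkSession is the entry point to the DataFrame API.
spark = SparkSession.builder.appName("pyspark-tutorial").getOrCreate()

# Hypothetical sample data, purely for illustration.
df = spark.createDataFrame(
    [(1, "2024-02-08", 9.99), (2, "2024-02-08", 19.99)],
    ["order_id", "order_date", "amount"],
)

df.printSchema()
df.filter(df.amount > 10).show()
```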

Feb 8, 2024 · 11 min