<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Spark on Dat a Engineer</title><link>https://note.datengineer.dev/tags/spark/</link><description>Recent content in Spark on Dat a Engineer</description><image><title>Dat a Engineer</title><url>https://note.datengineer.dev/images/cover.png</url><link>https://note.datengineer.dev/images/cover.png</link></image><generator>Hugo -- 0.147.5</generator><language>en-us</language><lastBuildDate>Fri, 09 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://note.datengineer.dev/tags/spark/index.xml" rel="self" type="application/rss+xml"/><item><title>PySpark UDFs: A comprehensive guide to unlock PySpark potential</title><link>https://note.datengineer.dev/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/</link><pubDate>Fri, 09 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/</guid><description>Discover the capabilities of User-Defined Functions (UDFs) in Apache Spark, allowing you to extend PySpark&amp;#39;s functionality and solve complex data processing tasks.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Apache Spark is a powerful open-source distributed computing engine designed to handle large datasets across clusters. PySpark is the Python API for Spark. It allows data engineers and data scientists to easily utilize the framework in their preferred language.</p>
<p>This post is a continuation of the <a href="../a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/">previous tutorial</a>. It started as a Jupyter notebook I created while learning PySpark; I recently found it and decided to update it and publish it on my blog.</p>
<p>UDFs (user-defined functions) are an integral part of PySpark, allowing users to extend the capabilities of Spark by creating their own custom functions. This article will provide a comprehensive guide to PySpark UDFs with examples.</p>
<h2 id="understanding-pyspark-udfs">Understanding PySpark UDFs</h2>
<p>PySpark UDFs are user-defined functions written in Python. We create functions in Python and register them with Spark as UDFs. They enable the execution of complex custom logic on Spark DataFrames and in SQL expressions.</p>
<p>However, note that UDFs are expensive: data must be moved between the JVM and the Python interpreter for every invocation. We should always prefer built-in functions whenever possible. PySpark ships with a large number of predefined functions, and more are added with each release.</p>
<p>In summary, with PySpark UDFs, what goes in is a regular Python function, and what comes out is a function that runs on the Spark engine.</p>
<h2 id="creating-an-udf">Creating an UDF</h2>
<p>All of the following examples are a continuation of the <a href="../a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/">previous article</a>. You can find an executable notebook containing both articles <a href="https://gist.github.com/ThaiDat/81c3662801aa8410a65b94f3c993c377">here</a>.</p>
<p>Below is an example of a &ldquo;complicated&rdquo; decision tree function that classifies transactions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PySpark UDFs example</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">classify_tier</span><span class="p">(</span><span class="n">amount</span><span class="p">:</span><span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">10000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span>
</span></span></code></pre></div><p>It is a regular Python function that receives a <code>float</code> and returns an <code>int</code>. We have to turn it into a PySpark UDF before we can use it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># pyspark.sql.functions provides a udf() function to promote a regular function to be UDF.</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The function takes two parameters: the function you want to promote, and the return type of the generated UDF</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The function return a UDF</span>
</span></span><span class="line"><span class="cl"><span class="n">classifyTier</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">udf</span><span class="p">(</span><span class="n">classify_tier</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span></code></pre></div><p>Then we can use it like any other PySpark function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="n">classifyTier</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;tier&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;tier&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+----+
|   nameOrig|tier|
+-----------+----+
|C1495608502|   4|
|C1321115948|   4|
| C476579021|   4|
|C1520267010|   4|
| C106297322|   4|
|C1464177809|   4|
| C355885103|   4|
|C1057507014|   4|
|C1419332030|   4|
|C2007599722|   4|
+-----------+----+
</code></pre><p>The <code>pyspark.sql.functions.udf()</code> function can also be used as a decorator, which produces the same result.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># pyspark udf decorator example</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Note that classifyTier is a UDF, not a regular function anymore.</span>
</span></span><span class="line"><span class="cl"><span class="nd">@F.udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">classifyTier</span><span class="p">(</span><span class="n">amount</span><span class="p">:</span><span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">10000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span>
</span></span></code></pre></div><p>If you want to use a UDF in a Spark SQL expression, you need to register it first.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Register the regular Python function with spark.udf.register</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">udf</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="s1">&#39;classifyTier&#39;</span><span class="p">,</span> <span class="n">classify_tier</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig, classifyTier(amount) tier
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY tier DESC 
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+----+
|   nameOrig|tier|
+-----------+----+
| C263860433|   4|
| C306269750|   4|
|C1611915976|   4|
|C1387188921|   4|
| C300262358|   4|
| C389879985|   4|
|C1907016309|   4|
|C1046638041|   4|
|C1543404166|   4|
|C1155108056|   4|
+-----------+----+
</code></pre><p>Simple enough? Write a Python function, make it a UDF, use it. But this is not the most interesting part.</p>
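<p>As a side note, this particular UDF is avoidable. Here is a minimal sketch (not from the original notebook) of the same tier logic expressed with built-in functions only, which keeps the computation inside the JVM:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># The same tier classification with built-in functions instead of a UDF
tier = (
    F.when(F.col(&#39;amount&#39;) &lt; 500, 0)
     .when(F.col(&#39;amount&#39;) &lt; 10000, 1)
     .when(F.col(&#39;amount&#39;) &lt; 100000, 2)
     .when(F.col(&#39;amount&#39;) &lt; 1000000, 3)
     .otherwise(4)
)

df.select(&#39;nameOrig&#39;, tier.alias(&#39;tier&#39;)).orderBy(&#39;tier&#39;, ascending=False).show(10)
</code></pre></div>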
<h2 id="pandas-udf">Pandas UDF</h2>
<p>With plain Python UDFs, Spark unpacks each value, performs the calculation, and returns a value for every record. A Pandas UDF is a user-defined function that uses Pandas for data manipulation and Apache Arrow for data transfer. It is also called a vectorized UDF. Compared to row-at-a-time Python UDFs, Pandas UDFs enable vectorized operations that can improve performance by up to 100x.</p>
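<p>To make the contrast concrete, below is a minimal sketch (not part of the original notebook) of the same computation written both ways. The Pandas variant is invoked once per Arrow batch rather than once per row:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

# Row-at-a-time: Python is called once per record
@F.udf(T.DoubleType())
def add_fee_row(amount: float) -&gt; float:
    return amount * 1.01

# Vectorized: Python is called once per batch with a whole pandas Series
@F.pandas_udf(T.DoubleType())
def add_fee_vec(amount: pd.Series) -&gt; pd.Series:
    return amount * 1.01

df.select(add_fee_row(&#39;amount&#39;), add_fee_vec(&#39;amount&#39;)).show(3)
</code></pre></div>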
<h3 id="series-to-series-udf">Series to Series UDF</h3>
<p>These UDFs operate on Pandas Series and return a Pandas Series as output. When Spark runs a Pandas UDF, it divides the columns into batches, calls the function once per batch, and then concatenates the results. Prefer a Pandas Series-to-Series UDF over a regular Python UDF whenever possible. We use <code>pyspark.sql.functions.pandas_udf</code> to create a Pandas UDF.</p>
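<p>The batch size is controlled by the <code>spark.sql.execution.arrow.maxRecordsPerBatch</code> configuration (10,000 records per batch by default), which you can lower for wide rows or memory-constrained executors:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Limit each Arrow batch handed to a Pandas UDF to 5,000 records
spark.conf.set(&#39;spark.sql.execution.arrow.maxRecordsPerBatch&#39;, &#39;5000&#39;)
</code></pre></div>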
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># You can also promote the function to PySpark Pandas UDF as getUserType = F.pandas_udf(get_user_type, T.StringType())</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Each User ID starts with a letter represent its type</span>
</span></span><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getUserType</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span></code></pre></div><p>The only difference in syntax is that the Python function now takes a <code>pandas.Series</code> and returns a <code>pandas.Series</code>. And then we can use it as a Spark function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">getUserType</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">nameDest</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;userTypeDest&#39;</span><span class="p">),</span> <span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;userTypeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;n&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+------------+------------------+-------+
|userTypeDest|         avgAmount|      n|
+------------+------------------+-------+
|           C| 265083.4571810173|4211125|
|           M|13057.604660187604|2151495|
+------------+------------------+-------+
</code></pre><h3 id="iterator-of-series-to-iterator-of-series">Iterator of Series to Iterator of Series</h3>
<p>Due to the distributed nature of Spark, the entire series is not fed into the UDF at once; instead, each executor calls the UDF on its own batches of data, and Spark then assembles the results. Iterator of Series to Iterator of Series UDFs are very useful when we have a time-consuming cold-start operation (e.g. initializing a machine learning model, checking server statuses, &hellip;) that needs to run once at the beginning of the processing step rather than once per batch.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">sleep</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Iterator</span><span class="p">,</span> <span class="n">Tuple</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getNameIdLength</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Heavy task</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># sleep(5)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1"># name is a Iterator</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># name_batch is a pd.Series</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">name_batch</span> <span class="ow">in</span> <span class="n">name</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">name_len</span> <span class="o">=</span> <span class="n">name_batch</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">name_len</span><span class="p">[</span><span class="o">~</span><span class="n">name_batch</span><span class="o">.</span><span class="n">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">isnumeric</span><span class="p">()]</span> <span class="o">-=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># yield because we return an iterator</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield</span> <span class="n">name_len</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">getNameIdLength</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">nameOrig</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;idLen&#39;</span><span class="p">),</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;idLen&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----+------------------+
|idLen|         avgAmount|
+-----+------------------+
|    4|155070.73742857145|
|    7|177477.50726081585|
|   10| 179702.4408980949|
|    9|179898.05510125632|
|    8| 181572.2097899971|
|    6|197756.81529433408|
|    5|199594.79368029739|
+-----+------------------+
</code></pre><h3 id="iterator-of-multiple-series-to-iterator-of-series-udf">Iterator of multiple Series to Iterator of Series UDF</h3>
<p>The Iterator of Multiple Series to Iterator of Series UDF has the same characteristics as the Iterator of Series to Iterator of Series UDF. The difference is that the underlying Python function receives an iterator of tuples of Pandas Series.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">amount_mismatch</span><span class="p">(</span><span class="n">values</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">Tuple</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Heavy task</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">oldOrig</span><span class="p">,</span> <span class="n">newOrig</span><span class="p">,</span> <span class="n">oldDest</span><span class="p">,</span> <span class="n">newDest</span> <span class="ow">in</span> <span class="n">values</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield</span> <span class="nb">abs</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">newOrig</span> <span class="o">-</span> <span class="n">oldOrig</span><span class="p">)</span> <span class="o">-</span> <span class="nb">abs</span><span class="p">(</span><span class="n">newDest</span> <span class="o">-</span> <span class="n">oldDest</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Create an UDF. You can also use decorator.</span>
</span></span><span class="line"><span class="cl"><span class="n">amountMismatch</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pandas_udf</span><span class="p">(</span><span class="n">amount_mismatch</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">amountMismatch</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">oldBalanceOrig</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">newBalanceOrig</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">oldBalanceDest</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">newBalanceDest</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;mismatch&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;mismatch&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgMismatch&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgMismatch&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|       avgMismatch|
+--------+------------------+
|TRANSFER| 968056.4538892006|
|CASH_OUT|170539.39652580014|
| CASH_IN| 50038.95466155722|
|   DEBIT| 25567.53969902471|
| PAYMENT| 6378.936662041953|
+--------+------------------+
</code></pre><h3 id="group-aggregate-udf">Group aggregate UDF</h3>
<p>The Group Aggregate UDF, also known as a Series-to-Scalar UDF, reduces the input <code>pandas.Series</code> to a single value.</p>
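<p>Beyond <code>groupBy().agg()</code>, a Series-to-Scalar UDF can also be used as a window function over an unbounded window. A small sketch (not from the original notebook):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pyspark.sql import Window

# Average amount per transaction type, attached to every row via a window
@F.pandas_udf(T.DoubleType())
def pdMean(s: pd.Series) -&gt; float:
    return s.mean()

w = Window.partitionBy(&#39;type&#39;)
df.withColumn(&#39;avgAmountByType&#39;, pdMean(df.amount).over(w)).show(5)
</code></pre></div>
<p>The example below computes the standard deviation of <code>amount</code> per transaction type with <code>groupBy().agg()</code>:</p>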
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getStdDeviation</span><span class="p">(</span><span class="n">series</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Use built-in pandas.Series.std</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">series</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">getStdDeviation</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;var&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;var&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|               std|
+--------+------------------+
|TRANSFER|1879573.5289080725|
|CASH_OUT|175329.74448347004|
| CASH_IN|126508.25527180695|
|   DEBIT|13318.535518284714|
| PAYMENT|12556.450185716356|
+--------+------------------+
</code></pre><h3 id="group-map-udf">Group map UDF</h3>
<p>As with the Group Aggregate UDF, we use <code>groupBy()</code> to divide a Spark <code>DataFrame</code> into groups. The Group Map UDF maps over each group, producing a Pandas <code>DataFrame</code> per group; the results are then combined back into a single Spark <code>DataFrame</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">normalize_by_type</span><span class="p">(</span><span class="n">data</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">data</span><span class="p">[[</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">maxVal</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">minVal</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">maxVal</span> <span class="o">==</span> <span class="n">minVal</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.5</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span> <span class="o">-</span> <span class="n">minVal</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">maxVal</span> <span class="o">-</span> <span class="n">minVal</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">result</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># We can use the SQL string-based schema like below comment</span>
</span></span><span class="line"><span class="cl"><span class="c1"># schema = &#39;type string, amount double, amountNorm double&#39;</span>
</span></span><span class="line"><span class="cl"><span class="n">schema</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">StructType</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">applyInPandas</span><span class="p">(</span><span class="n">normalize_by_type</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+--------------------+
|    type|   amount|          amountNorm|
+--------+---------+--------------------+
|TRANSFER|    181.0|1.929785364412691...|
|TRANSFER| 215310.3| 0.00232902269229461|
|TRANSFER|311685.89|0.003371535041334062|
|TRANSFER|  62610.8|6.772443276469881E-4|
|TRANSFER| 42712.39|4.619995945019032E-4|
|TRANSFER| 77957.68|8.432543299642404E-4|
|TRANSFER| 17231.46|1.863677235062513...|
|TRANSFER| 78766.03|8.519983994671721E-4|
|TRANSFER|224606.64|0.002429582898990...|
|TRANSFER|125872.53|0.001361558008596...|
+--------+---------+--------------------+
only showing top 10 rows
</code></pre><p>You can see that in the example above, we don&rsquo;t need to explicitly create a UDF. This is thanks to the <code>applyInPandas</code> function, introduced in PySpark 3.0.0, which takes a regular Python function and a result schema as parameters. If you still want to create a Group Map UDF explicitly, you can refer to the following code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># It is preferred to use &#39;applyInPandas&#39; over this API (in Spark 3). </span>
</span></span><span class="line"><span class="cl"><span class="c1"># This API will be deprecated in the future releases.</span>
</span></span><span class="line"><span class="cl"><span class="c1"># As it will be deprecated soon, type hint inference is not supported. So, we have to specify PandasUDFType explicitly</span>
</span></span><span class="line"><span class="cl"><span class="n">NormalizeByType</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pandas_udf</span><span class="p">(</span><span class="n">normalize_by_type</span><span class="p">,</span> <span class="n">schema</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">PandasUDFType</span><span class="o">.</span><span class="n">GROUPED_MAP</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">NormalizeByType</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+--------------------+
|    type|   amount|          amountNorm|
+--------+---------+--------------------+
|TRANSFER|    181.0|1.929785364412691...|
|TRANSFER| 215310.3| 0.00232902269229461|
|TRANSFER|311685.89|0.003371535041334062|
|TRANSFER|  62610.8|6.772443276469881E-4|
|TRANSFER| 42712.39|4.619995945019032E-4|
|TRANSFER| 77957.68|8.432543299642404E-4|
|TRANSFER| 17231.46|1.863677235062513...|
|TRANSFER| 78766.03|8.519983994671721E-4|
|TRANSFER|224606.64|0.002429582898990...|
|TRANSFER|125872.53|0.001361558008596...|
+--------+---------+--------------------+
only showing top 10 rows
</code></pre><p>When executing a Group Map UDF, Spark will:</p>
<ul>
<li>Split the data into groups using <code>groupBy</code>.</li>
<li>Apply the function to each group.</li>
<li>Combine the results in a new PySpark <code>DataFrame</code>.</li>
</ul>
<p><img alt="Python Spark User Defined Function Group Map UDF workflow" loading="lazy" src="/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/images/pyspark-udf-spark-python-group-map-udf-workflow.png"></p>
<h2 id="conclusion">Conclusion</h2>
<p>In summary, PySpark UDFs are an effective way to bring the power and flexibility of Python to Spark workloads. When used properly, they can help extend Spark&rsquo;s capabilities to solve complex data engineering challenges. Together with the previous tutorial, this should equip you to handle most data manipulation and analysis tasks. Happy coding!</p>
]]></content:encoded></item><item><title>A Practical PySpark tutorial for beginners in Jupyter Notebook</title><link>https://note.datengineer.dev/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/</guid><description>A hands-on PySpark cheat sheet</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In today&rsquo;s world of data, the ability to efficiently process and analyze large amounts of data is crucial for businesses and organizations. This is where PySpark comes in: an open-source, distributed computing framework built on top of Apache Spark. With its seamless integration with Python, PySpark allows users to leverage the powerful data processing capabilities of Spark directly from Python scripts.</p>
<p>This post was originally a Jupyter Notebook I created when I started learning PySpark, intended as a personal cheat sheet. Now that I have a blog (a place for my notes), I decided to update it and share it here as a complete hands-on tutorial for beginners.</p>
<p>If you are new to PySpark, this tutorial is for you. We will cover the most basic and practical PySpark syntax. By the end of this tutorial, you will have a solid understanding of PySpark and be able to use Spark in Python to perform a wide range of data processing tasks.</p>
<h2 id="spark-vs-pyspark">Spark vs PySpark</h2>
<p>What is PySpark? How is it different from Apache Spark? Before looking at PySpark, it&rsquo;s essential to understand the relationship between Spark and PySpark.</p>
<p>Apache Spark is an open-source distributed computing system. It provides an interface for programming clusters with implicit data parallelism and fault tolerance. Apache Spark offers APIs for several programming languages, including Python, Java, Scala, and R, making it accessible to a wide audience for data processing tasks.</p>
<p>PySpark, on the other hand, is the library that builds on these APIs to bring Python support to Spark. It allows developers to use Python, the most popular programming language in the data community, to leverage the power of Spark without having to switch to another language. PySpark also offers seamless integration with other Python libraries.</p>
<p>In short, Spark is the overarching framework, and PySpark serves as its Python API, providing a convenient bridge for Python enthusiasts to leverage Spark&rsquo;s capabilities.</p>
<p><img alt="Apache Spark vs Python PySpark different" loading="lazy" src="/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/images/apache-spark-python-pyspark-difference-architecture.png"></p>
<h2 id="lets-get-started">Let&rsquo;s get started</h2>
<p>From this point on, you will see Python code driving Spark. This hands-on tutorial will guide you through basic PySpark operations such as querying, filtering, merging, and grouping data. You can find an executable notebook on my <a href="https://gist.github.com/ThaiDat/81c3662801aa8410a65b94f3c993c377">Github</a>.</p>
<h3 id="installation">Installation</h3>
<p>There are several ways to install PySpark. The easiest way for Python users is to use <code>pip</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install pyspark
</span></span></code></pre></div><h3 id="sparksession">SparkSession</h3>
<p><code>SparkSession</code> is the entry point for working with Apache Spark. It provides a unified interface for interacting with Spark functionality, allowing you to create DataFrames, execute SQL queries, and manage Spark configurations. Think of it as the gateway to all Spark operations in your application.</p>
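<p>As a small illustration (a sketch; the option shown is just one example of many), the builder also accepts arbitrary Spark configuration before the session is created. The session actually used in this tutorial is created in the next cell.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pyspark.sql import SparkSession

# Build a session with an explicit configuration value
spark = (
    SparkSession.builder
    .appName(&#39;Spark Demo&#39;)
    .master(&#39;local[*]&#39;)
    .config(&#39;spark.sql.shuffle.partitions&#39;, &#39;8&#39;)
    .getOrCreate()
)
</code></pre></div>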
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Get existed or Create new SparkSession</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s1">&#39;Spark Demo&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">master</span><span class="p">(</span><span class="s1">&#39;local[*]&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span>
</span></span></code></pre></div><pre tabindex="0"><code>SparkSession - in-memory

SparkContext

Spark UI

Version    v3.2.1
Master     local[*]
AppName    Spark Demo
</code></pre><h3 id="load-data">Load data</h3>
<p>PySpark can load data from various types of data storage. In this tutorial we will use the <a href="https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data/data">Fraudulent Transactions Dataset</a>. This dataset provides a CSV file that is sufficient for demo purposes.</p>
<p>The SparkSession object exposes a <code>read</code> property that returns a <code>DataFrameReader</code>, which can be used to read data into a <code>DataFrame</code>. The following code reads a CSV file into a DataFrame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Load CSV file to DataFrame</span>
</span></span><span class="line"><span class="cl"><span class="n">data_path</span> <span class="o">=</span> <span class="s1">&#39;../input/fraudulent-transactions-data/Fraud.csv&#39;</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">inferSchema</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><p>The <code>inferSchema</code> parameter allows Spark to automatically infer the data type of each column based on the actual data in the file. This involves reading a sample of the data, which can be computationally expensive. The inferred schema can also be incorrect, especially if the sampled data doesn&rsquo;t represent the entire dataset well.</p>
<p>Alternatively, to achieve better performance and ensure accurate data types, you can define the schema explicitly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">types</span> <span class="k">as</span> <span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Read CSV with pre-defined schema</span>
</span></span><span class="line"><span class="cl"><span class="n">predefined_schema</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">StructType</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;step&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;oldbalanceOrg&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;newbalanceOrig&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span> 
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;oldbalanceDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;newbalanceDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span> 
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;isFraud&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;isFlaggedFraud&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">predefined_schema</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><p>The dataset contains some inconsistently formatted column names. I will rename them all to camel case.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Rename columns</span>
</span></span><span class="line"><span class="cl"><span class="n">corrected_cols</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;oldbalanceOrg&#39;</span><span class="p">:</span> <span class="s1">&#39;oldBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;newbalanceOrig&#39;</span><span class="p">:</span> <span class="s1">&#39;newBalanceOrig&#39;</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">                  <span class="s1">&#39;oldbalanceDest&#39;</span><span class="p">:</span> <span class="s1">&#39;oldBalanceDest&#39;</span><span class="p">,</span> <span class="s1">&#39;newbalanceDest&#39;</span><span class="p">:</span> <span class="s1">&#39;newBalanceDest&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">old_col</span><span class="p">,</span> <span class="n">new_col</span> <span class="ow">in</span> <span class="n">corrected_cols</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">withColumnRenamed</span><span class="p">(</span><span class="n">old_col</span><span class="p">,</span> <span class="n">new_col</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldBalanceOrig: double (nullable = true)
 |-- newBalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldBalanceDest: double (nullable = true)
 |-- newBalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><h3 id="data-overview">Data Overview</h3>
<p>You can quickly look at the data with <code>DataFrame.show</code>, which prints the first n rows to the screen.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Prints top 10 rows of PySpark DataFrame to the screen</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|    type|  amount|   nameOrig|oldBalanceOrig|newBalanceOrig|   nameDest|oldBalanceDest|newBalanceDest|isFraud|isFlaggedFraud|
+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
|   1| PAYMENT| 9839.64|C1231006815|      170136.0|     160296.36|M1979787155|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 1864.28|C1666544295|       21249.0|      19384.72|M2044282225|           0.0|           0.0|      0|             0|
|   1|TRANSFER|   181.0|C1305486145|         181.0|           0.0| C553264065|           0.0|           0.0|      1|             0|
|   1|CASH_OUT|   181.0| C840083671|         181.0|           0.0|  C38997010|       21182.0|           0.0|      1|             0|
|   1| PAYMENT|11668.14|C2048537720|       41554.0|      29885.86|M1230701703|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7817.71|  C90045638|       53860.0|      46042.29| M573487274|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7107.77| C154988899|      183195.0|     176087.23| M408069119|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7861.64|C1912850431|     176087.23|     168225.59| M633326333|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 4024.36|C1265012928|        2671.0|           0.0|M1176932104|           0.0|           0.0|      0|             0|
|   1|   DEBIT| 5337.77| C712410124|       41720.0|      36382.23| C195600860|       41898.0|      40348.79|      0|             0|
+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
only showing top 10 rows
</code></pre><p>In many cases, the result does not fit on the screen and produces unreadable output.</p>
<p><img alt="PySpark load CSV show not fit screen" loading="lazy" src="/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/images/pyspark-load-csv-show-not-fit-screen.png"></p>
<p>This is where Python comes in. With PySpark, you can mix plain Python code with Spark APIs to improve the output. The following function uses a Python loop to split the columns into subsets and display a sample of each.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Split columns into subsets and show it accordingly</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">show_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">split</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">n_cols</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">split</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">split</span> <span class="o">=</span> <span class="n">n_cols</span>
</span></span><span class="line"><span class="cl">    <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">split</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_cols</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="o">*</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">j</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">n_samples</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">i</span> <span class="o">=</span> <span class="n">j</span>
</span></span><span class="line"><span class="cl">        <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">split</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">show_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+----+--------+--------+-----------+
|step|    type|  amount|   nameOrig|
+----+--------+--------+-----------+
|   1| PAYMENT| 9839.64|C1231006815|
|   1| PAYMENT| 1864.28|C1666544295|
|   1|TRANSFER|   181.0|C1305486145|
|   1|CASH_OUT|   181.0| C840083671|
|   1| PAYMENT|11668.14|C2048537720|
|   1| PAYMENT| 7817.71|  C90045638|
|   1| PAYMENT| 7107.77| C154988899|
|   1| PAYMENT| 7861.64|C1912850431|
|   1| PAYMENT| 4024.36|C1265012928|
|   1|   DEBIT| 5337.77| C712410124|
+----+--------+--------+-----------+
only showing top 10 rows

+--------------+--------------+-----------+--------------+
|oldBalanceOrig|newBalanceOrig|   nameDest|oldBalanceDest|
+--------------+--------------+-----------+--------------+
|      170136.0|     160296.36|M1979787155|           0.0|
|       21249.0|      19384.72|M2044282225|           0.0|
|         181.0|           0.0| C553264065|           0.0|
|         181.0|           0.0|  C38997010|       21182.0|
|       41554.0|      29885.86|M1230701703|           0.0|
|       53860.0|      46042.29| M573487274|           0.0|
|      183195.0|     176087.23| M408069119|           0.0|
|     176087.23|     168225.59| M633326333|           0.0|
|        2671.0|           0.0|M1176932104|           0.0|
|       41720.0|      36382.23| C195600860|       41898.0|
+--------------+--------------+-----------+--------------+
only showing top 10 rows

+--------------+-------+--------------+
|newBalanceDest|isFraud|isFlaggedFraud|
+--------------+-------+--------------+
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      1|             0|
|           0.0|      1|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|      40348.79|      0|             0|
+--------------+-------+--------------+
only showing top 10 rows
</code></pre><p>When working with numerical data, it is not very useful to look at a long series of raw values. We are often more interested in a few key statistics, such as count, mean, standard deviation, minimum, and maximum. PySpark&rsquo;s <code>DataFrame</code> provides the <code>describe</code> and <code>summary</code> functions, with slightly different usage, to present these essential metrics.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DataFrame.describe take columns as params</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">(</span><span class="s1">&#39;step&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-------+------------------+------------------+
|summary|              step|            amount|
+-------+------------------+------------------+
|  count|           6362620|           6362620|
|   mean|243.39724563151657|179861.90354913412|
| stddev|142.33197104912588| 603858.2314629498|
|    min|                 1|               0.0|
|    max|               743|     9.244551664E7|
+-------+------------------+------------------+
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DataFrame.summary take statistics as params</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;oldBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;newBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;oldBalanceDest&#39;</span><span class="p">,</span> <span class="s1">&#39;newBalanceDest&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s1">&#39;count&#39;</span><span class="p">,</span> <span class="s1">&#39;min&#39;</span><span class="p">,</span> <span class="s1">&#39;max&#39;</span><span class="p">,</span> <span class="s1">&#39;mean&#39;</span><span class="p">,</span> <span class="s1">&#39;50%&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-------+-----------------+-----------------+------------------+------------------+
|summary|   oldBalanceOrig|   newBalanceOrig|    oldBalanceDest|    newBalanceDest|
+-------+-----------------+-----------------+------------------+------------------+
|  count|          6362620|          6362620|           6362620|           6362620|
|    min|              0.0|              0.0|               0.0|               0.0|
|    max|    5.958504037E7|    4.958504037E7|    3.5601588935E8|    3.5617927892E8|
|   mean|833883.1040744719|855113.6685785714|1100701.6665196654|1224996.3982019408|
|    50%|         14211.23|              0.0|         132612.49|         214605.81|
+-------+-----------------+-----------------+------------------+------------------+
</code></pre><h3 id="query-data">Query data</h3>
<h4 id="select-and-filter">Select and Filter</h4>
<p>PySpark borrows a lot of vocabulary from the SQL world, but it does not force you to follow the rigid SQL statement structure (select &hellip; from &hellip; where &hellip;). Each operation returns a <code>DataFrame</code> or <code>GroupedData</code> that you can keep working with.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># First .where() filter DataFrame and return another DataFrame</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Then .select() select from the returned DataFrame </span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+
|    type|   amount|
+--------+---------+
|CASH_OUT|    181.0|
|CASH_OUT|229133.94|
|CASH_OUT|110414.71|
|CASH_OUT|  56953.9|
|CASH_OUT|  5346.89|
|CASH_OUT|  23261.3|
|CASH_OUT| 82940.31|
|CASH_OUT| 47458.86|
|CASH_OUT|136872.92|
|CASH_OUT| 94253.33|
+--------+---------+
only showing top 10 rows
</code></pre><p>The above example shows three different ways to access PySpark columns (a short example follows the list):</p>
<ul>
<li><code>df.type</code>: Access as an attribute.</li>
<li><code>df['type']</code>: Access as an item.</li>
<li><code>F.col('type')</code>: Explicitly specify that we need a column, not a string literal.</li>
</ul>
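<p>As a quick sketch, the three styles are interchangeable and can even be mixed in a single expression:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># All three column-access styles work anywhere a column is expected
df.select(df.type, df['amount'], F.col('nameOrig')).show(3)
</code></pre></div>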
<p>You can also combine multiple filter conditions using the <code>&amp;</code>, <code>|</code>, and <code>~</code> operators.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PySpark example filter multiple conditions</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">500</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><p>For users who are more familiar with SQL syntax, Spark also lets you write SQL queries directly. Before doing so, you need to register your <code>DataFrame</code> as a temporary view so that you can reference it in your queries.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create or replace temp view named &#34;df&#34; from DataFrame df in PySpark</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s1">&#39;df&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Spark SQL query example. You can now reference df in your query</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT type, amount 
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    WHERE type = &#34;CASH_OUT&#34;    
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+
|    type|   amount|
+--------+---------+
|CASH_OUT|    181.0|
|CASH_OUT|229133.94|
|CASH_OUT|110414.71|
|CASH_OUT|  56953.9|
|CASH_OUT|  5346.89|
|CASH_OUT|  23261.3|
|CASH_OUT| 82940.31|
|CASH_OUT| 47458.86|
|CASH_OUT|136872.92|
|CASH_OUT| 94253.33|
+--------+---------+
only showing top 10 rows
</code></pre><h4 id="aggregating-with-groupby">Aggregating with <code>groupBy</code></h4>
<p>PySpark provides a similar syntax to Pandas for aggregating data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Example to PySpark groupBy</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Sometimes we can pass column name directly to pyspark functions</span>
</span></span><span class="line"><span class="cl"><span class="c1"># `Column.alias` method change the name of the result column.</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT type, AVG(amount) avgAmount
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    GROUP BY type
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY 2
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|         avgAmount|
+--------+------------------+
|   DEBIT| 5483.665313767128|
| PAYMENT|13057.604660187604|
| CASH_IN| 168920.2420040954|
|CASH_OUT|176273.96434613998|
|TRANSFER| 910647.0096454868|
+--------+------------------+
</code></pre><p>To filter after a <code>groupBy</code>, simply apply <code>where</code> or <code>filter</code> to the resulting <code>DataFrame</code>, or use the SQL <code>HAVING</code> keyword.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;sumAmount&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;sumAmount&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">300000</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig, SUM(amount) sumAmount
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    WHERE type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">    GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s1">    HAVING sumAmount &gt; 300000
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+---------+
|   nameOrig|sumAmount|
+-----------+---------+
| C551314014|301050.58|
| C661668091|323789.56|
| C228994633|517946.01|
|C1591008292|558254.22|
|C2100435651|357988.09|
| C624052656|476735.47|
| C948681098|353759.28|
|  C50682517|386128.82|
|C1579521009|684561.18|
|C1871922377|394317.12|
+-----------+---------+
only showing top 10 rows
</code></pre><h4 id="union-and-intersection">Union and Intersection</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>12725240
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig from df
</span></span></span><span class="line"><span class="cl"><span class="s1">    UNION
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameDest from df
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>9073900
</code></pre><p>Note the difference in the counts. The reason is that the PySpark <code>union</code> function keeps duplicate rows from the two sets, which is equivalent to <code>UNION ALL</code> in SQL. By default, PySpark does not remove duplicates, as that is an expensive operation. If you want to drop duplicates, you have to do it explicitly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Union and drop duplicates in PySpark</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>9073900
</code></pre><p>Unioning can be useful when reading data from multiple files: read them one by one in a Python loop and union the results, as in the sketch below.</p>
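<p>A minimal sketch, assuming the files share the schema defined earlier (the paths are hypothetical placeholders):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Read the files one by one and union them into a single DataFrame
from functools import reduce

paths = ['data/part1.csv', 'data/part2.csv', 'data/part3.csv']  # hypothetical paths
dfs = [spark.read.csv(p, schema=predefined_schema, header=True) for p in paths]
combined = reduce(lambda left, right: left.union(right), dfs)
</code></pre></div>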
<p>Intersection works similarly. Keep in mind, however, that PySpark <code>intersect</code> is equivalent to SQL <code>INTERSECT</code> (duplicates removed), not <code>INTERSECT ALL</code>.</p>
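<p>For example, to count the accounts that appear as both an origin and a destination:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># intersect keeps only rows present in both sets, with duplicates removed
df.select('nameOrig').intersect(df.select('nameDest')).count()
</code></pre></div>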
<h4 id="join">Join</h4>
<p>Very similar to Pandas, the <code>DataFrame.join</code> method joins one <code>DataFrame</code> with another using the given join expression.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;ABS(newBalanceOrig - oldBalanceOrig) changeOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;changeOrig&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgChangeOrig&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;occOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;avgChangeOrig &gt; 100000&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Join the above DataFrame with the one provided in parameter</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">join</span><span class="p">((</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">,</span> <span class="s1">&#39;ABS(newBalanceDest - oldBalanceDest) changeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;changeDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgChangeDest&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;occDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;avgChangeDest &gt; 100000&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span> <span class="n">on</span><span class="o">=</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">==</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">),</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># There are several join method: inner, left, right, cross, outer, left_outer, right_outer, left_semi, left_anti, right_semi, right_anti, ...</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameOrig name&#39;</span><span class="p">,</span> <span class="s1">&#39;occOrig + occDest occ&#39;</span><span class="p">,</span> <span class="s1">&#39;avgChangeOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;avgChangeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;occ&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig name, occOrig + occDest occ, avgChangeOrig, avgChangeDest
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM
</span></span></span><span class="line"><span class="cl"><span class="s1">    (
</span></span></span><span class="line"><span class="cl"><span class="s1">        SELECT nameOrig, AVG(ABS(newBalanceOrig - oldBalanceOrig)) avgChangeOrig, COUNT(*) occOrig
</span></span></span><span class="line"><span class="cl"><span class="s1">        FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">        WHERE type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">        GROUP BY nameOrig
</span></span></span><span class="line"><span class="cl"><span class="s1">        HAVING avgChangeOrig &gt; 100000
</span></span></span><span class="line"><span class="cl"><span class="s1">    )
</span></span></span><span class="line"><span class="cl"><span class="s1">    INNER JOIN
</span></span></span><span class="line"><span class="cl"><span class="s1">    (
</span></span></span><span class="line"><span class="cl"><span class="s1">        SELECT nameDest, AVG(ABS(newBalanceDest - oldBalanceDest)) avgChangeDest, COUNT(*) occDest
</span></span></span><span class="line"><span class="cl"><span class="s1">        FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">        WHERE type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">        GROUP BY nameDest
</span></span></span><span class="line"><span class="cl"><span class="s1">        HAVING avgChangeDest &gt; 100000
</span></span></span><span class="line"><span class="cl"><span class="s1">    )
</span></span></span><span class="line"><span class="cl"><span class="s1">    ON nameOrig = nameDest
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY occ DESC
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+---+------------------+------------------+
|       name|occ|     avgChangeOrig|     avgChangeDest|
+-----------+---+------------------+------------------+
|C1552859894| 43|193711.30000000005| 763241.1652380949|
|C1819271729| 37|         278937.79|283626.17805555544|
|C1692434834| 37|177369.73000000045| 438853.7616666666|
| C889762313| 32|         132731.31|211437.18741935486|
|C1868986147| 32|         120594.03|249840.37709677417|
|  C55305556| 28|319860.45999999903|225565.42111111112|
| C636092700| 26|217273.86000000004|201888.05279999998|
|C1713505653| 25| 278622.8400000003|186625.34916666665|
|C2029542508| 24| 235760.1200000001|231022.98217391354|
| C699906968| 23| 177813.3799999999| 183054.3072727272|
+-----------+---+------------------+------------------+
only showing top 10 rows
</code></pre><p>In the above example, I demonstrated mixing PySpark and SQL syntax for cleaner code. Instead of the verbose expression:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_IN&#39;</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">))</span>
</span></span></code></pre></div><p>You can write:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>This style can be applied in various PySpark functions: <code>selectExpr</code>, <code>where</code>, <code>filter</code>, <code>expr</code>, and more. Choose whichever coding style you prefer; PySpark offers the flexibility.</p>
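<p>For instance, <code>F.expr</code> turns a SQL expression string into a column that can be used anywhere a <code>Column</code> is expected (a small sketch on the same <code>df</code>):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># F.expr parses a SQL expression string into a Column object
df.select(F.expr('ABS(newBalanceOrig - oldBalanceOrig) AS changeOrig')).show(3)
</code></pre></div>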
<h2 id="endnote">Endnote</h2>
<p>This tutorial has covered basic Spark operations in both Python and SQL syntax, enough to perform the most common data transformation and analysis tasks. But your Spark journey doesn&rsquo;t end here! More advanced features that were not covered in this article (e.g., UDFs) are discussed in <a href="../pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/">another post</a>.</p>
]]></content:encoded></item></channel></rss>