<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Spark on Dat a Engineer</title><link>https://note.datengineer.dev/tags/spark/</link><description>Recent content in Spark on Dat a Engineer</description><image><title>Dat a Engineer</title><url>https://note.datengineer.dev/images/cover.png</url><link>https://note.datengineer.dev/images/cover.png</link></image><generator>Hugo -- 0.147.5</generator><language>en-us</language><lastBuildDate>Fri, 09 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://note.datengineer.dev/tags/spark/index.xml" rel="self" type="application/rss+xml"/><item><title>PySpark UDFs: A comprehensive guide to unlock PySpark potential</title><link>https://note.datengineer.dev/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/</link><pubDate>Fri, 09 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/</guid><description>Discover the capabilities of User-Defined Functions (UDFs) in Apache Spark, allowing you to extend PySpark&amp;#39;s functionality and solve complex data processing tasks.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Apache Spark is a powerful open-source distributed computing engine designed to handle large datasets across clusters. PySpark is the Python API for Spark. It allows data engineers and data scientists to easily utilize the framework in their preferred language.</p>
<p>This post is a continuation of the <a href="../a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/">previous tutorial</a>. It started as a Jupyter notebook I created while learning PySpark; I recently found it and decided to update it and publish it on my blog.</p>
<p>UDFs (user-defined functions) are an integral part of PySpark, allowing users to extend the capabilities of Spark by creating their own custom functions. This article will provide a comprehensive guide to PySpark UDFs with examples.</p>
<h2 id="understanding-pyspark-udfs">Understanding PySpark UDFs</h2>
<p>PySpark UDFs are user-defined functions written in Python. We create functions in Python and register them with Spark as UDFs. They enable the execution of complex custom logic on Spark DataFrames and in SQL expressions.</p>
<p>However, note that UDFs are expensive: data must be moved between the JVM and the Python interpreter for every invocation. We should always prefer built-in functions whenever possible. PySpark ships with a large number of predefined functions, and more are added with each release.</p>
<p>In summary, with PySpark UDFs, what goes in is a regular Python function, and what comes out is a function that runs on the Spark engine.</p>
<h2 id="creating-an-udf">Creating an UDF</h2>
<p>All of the following examples are a continuation of the <a href="../a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/">previous article</a>. You can find an executable notebook containing both articles <a href="https://gist.github.com/ThaiDat/81c3662801aa8410a65b94f3c993c377">here</a>.</p>
<p>Below is an example of a &ldquo;complicated&rdquo; decision tree function that classifies transactions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PySpark UDFs example</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">classify_tier</span><span class="p">(</span><span class="n">amount</span><span class="p">:</span><span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">10000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span>
</span></span></code></pre></div><p>It is a regular Python function that receives a <code>float</code> and returns an <code>int</code>. We have to turn it into a PySpark UDF before we can use it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># pyspark.sql.functions provides a udf() function to promote a regular function to be UDF.</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The function takes two parameters: the function you want to promote, and the return type of the generated UDF</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The function return a UDF</span>
</span></span><span class="line"><span class="cl"><span class="n">classifyTier</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">udf</span><span class="p">(</span><span class="n">classify_tier</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span></code></pre></div><p>Then we can use it like any other PySpark function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="n">classifyTier</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;tier&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;tier&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+----+
|   nameOrig|tier|
+-----------+----+
|C1495608502|   4|
|C1321115948|   4|
| C476579021|   4|
|C1520267010|   4|
| C106297322|   4|
|C1464177809|   4|
| C355885103|   4|
|C1057507014|   4|
|C1419332030|   4|
|C2007599722|   4|
+-----------+----+
</code></pre><p>The <code>pyspark.sql.functions.udf()</code> function can also be used as a decorator, which produces the same result.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># pyspark udf decorator example</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Note that classifyTier is a UDF, not a regular function anymore.</span>
</span></span><span class="line"><span class="cl"><span class="nd">@F.udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">classifyTier</span><span class="p">(</span><span class="n">amount</span><span class="p">:</span><span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">500</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">10000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span>
</span></span></code></pre></div><p>If you want to use a UDF in a Spark SQL expression, you need to register it first.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Register the regular Python function with spark.udf.register</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">udf</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="s1">&#39;classifyTier&#39;</span><span class="p">,</span> <span class="n">classify_tier</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig, classifyTier(amount) tier
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY tier DESC 
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+----+
|   nameOrig|tier|
+-----------+----+
| C263860433|   4|
| C306269750|   4|
|C1611915976|   4|
|C1387188921|   4|
| C300262358|   4|
| C389879985|   4|
|C1907016309|   4|
|C1046638041|   4|
|C1543404166|   4|
|C1155108056|   4|
+-----------+----+
</code></pre><p>Simple enough? Write a Python function, make it a UDF, use it. But this is not the most interesting part.</p>
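<p>As a side note, this particular UDF is avoidable. Here is a minimal sketch (not from the original notebook) of the same tier logic expressed with built-in functions only, which keeps the computation inside the JVM:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># The same tier classification with built-in functions instead of a UDF
tier = (
    F.when(F.col(&#39;amount&#39;) &lt; 500, 0)
     .when(F.col(&#39;amount&#39;) &lt; 10000, 1)
     .when(F.col(&#39;amount&#39;) &lt; 100000, 2)
     .when(F.col(&#39;amount&#39;) &lt; 1000000, 3)
     .otherwise(4)
)

df.select(&#39;nameOrig&#39;, tier.alias(&#39;tier&#39;)).orderBy(&#39;tier&#39;, ascending=False).show(10)
</code></pre></div>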
<h2 id="pandas-udf">Pandas UDF</h2>
<p>With plain Python UDFs, Spark unpacks each value, performs the calculation, and returns a value for every record. A Pandas UDF is a user-defined function that uses Pandas for data manipulation and Apache Arrow for data transfer. It is also called a vectorized UDF. Compared to row-at-a-time Python UDFs, Pandas UDFs enable vectorized operations that can improve performance by up to 100x.</p>
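<p>To make the contrast concrete, below is a minimal sketch (not part of the original notebook) of the same computation written both ways. The Pandas variant is invoked once per Arrow batch rather than once per row:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

# Row-at-a-time: Python is called once per record
@F.udf(T.DoubleType())
def add_fee_row(amount: float) -&gt; float:
    return amount * 1.01

# Vectorized: Python is called once per batch with a whole pandas Series
@F.pandas_udf(T.DoubleType())
def add_fee_vec(amount: pd.Series) -&gt; pd.Series:
    return amount * 1.01

df.select(add_fee_row(&#39;amount&#39;), add_fee_vec(&#39;amount&#39;)).show(3)
</code></pre></div>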
<h3 id="series-to-series-udf">Series to Series UDF</h3>
<p>These UDFs operate on Pandas Series and return a Pandas Series as output. When Spark runs a Pandas UDF, it divides the columns into batches, calls the function once per batch, and then concatenates the results. Prefer a Pandas Series-to-Series UDF over a regular Python UDF whenever possible. We use <code>pyspark.sql.functions.pandas_udf</code> to create a Pandas UDF.</p>
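<p>The batch size is controlled by the <code>spark.sql.execution.arrow.maxRecordsPerBatch</code> configuration (10,000 records per batch by default), which you can lower for wide rows or memory-constrained executors:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Limit each Arrow batch handed to a Pandas UDF to 5,000 records
spark.conf.set(&#39;spark.sql.execution.arrow.maxRecordsPerBatch&#39;, &#39;5000&#39;)
</code></pre></div>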
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># You can also promote the function to PySpark Pandas UDF as getUserType = F.pandas_udf(get_user_type, T.StringType())</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Each User ID starts with a letter represent its type</span>
</span></span><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getUserType</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span></code></pre></div><p>The only difference in syntax is that the Python function now takes a <code>pandas.Series</code> and returns a <code>pandas.Series</code>. And then we can use it as a Spark function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">getUserType</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">nameDest</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;userTypeDest&#39;</span><span class="p">),</span> <span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;userTypeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;n&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+------------+------------------+-------+
|userTypeDest|         avgAmount|      n|
+------------+------------------+-------+
|           C| 265083.4571810173|4211125|
|           M|13057.604660187604|2151495|
+------------+------------------+-------+
</code></pre><h3 id="iterator-of-series-to-iterator-of-series">Iterator of Series to Iterator of Series</h3>
<p>Due to the distributed nature of Spark, the entire series is not fed into the UDF at once; instead, each executor calls the UDF on its own batches of data, and Spark then assembles the results. Iterator of Series to Iterator of Series UDFs are very useful when we have a time-consuming cold-start operation (e.g. initializing a machine learning model, checking server statuses, &hellip;) that needs to run once at the beginning of the processing step rather than once per batch.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">sleep</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Iterator</span><span class="p">,</span> <span class="n">Tuple</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">ByteType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getNameIdLength</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Heavy task</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># sleep(5)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1"># name is a Iterator</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># name_batch is a pd.Series</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">name_batch</span> <span class="ow">in</span> <span class="n">name</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">name_len</span> <span class="o">=</span> <span class="n">name_batch</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">len</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">name_len</span><span class="p">[</span><span class="o">~</span><span class="n">name_batch</span><span class="o">.</span><span class="n">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">isnumeric</span><span class="p">()]</span> <span class="o">-=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># yield because we return an iterator</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield</span> <span class="n">name_len</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">getNameIdLength</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">nameOrig</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;idLen&#39;</span><span class="p">),</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;idLen&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----+------------------+
|idLen|         avgAmount|
+-----+------------------+
|    4|155070.73742857145|
|    7|177477.50726081585|
|   10| 179702.4408980949|
|    9|179898.05510125632|
|    8| 181572.2097899971|
|    6|197756.81529433408|
|    5|199594.79368029739|
+-----+------------------+
</code></pre><h3 id="iterator-of-multiple-series-to-iterator-of-series-udf">Iterator of multiple Series to Iterator of Series UDF</h3>
<p>The Iterator of Multiple Series to Iterator of Series UDF has the same characteristics as the Iterator of Series to Iterator of Series UDF. The difference is that the underlying Python function receives an iterator of tuples of Pandas Series.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">amount_mismatch</span><span class="p">(</span><span class="n">values</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">Tuple</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Heavy task</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">oldOrig</span><span class="p">,</span> <span class="n">newOrig</span><span class="p">,</span> <span class="n">oldDest</span><span class="p">,</span> <span class="n">newDest</span> <span class="ow">in</span> <span class="n">values</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield</span> <span class="nb">abs</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">newOrig</span> <span class="o">-</span> <span class="n">oldOrig</span><span class="p">)</span> <span class="o">-</span> <span class="nb">abs</span><span class="p">(</span><span class="n">newDest</span> <span class="o">-</span> <span class="n">oldDest</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Create an UDF. You can also use decorator.</span>
</span></span><span class="line"><span class="cl"><span class="n">amountMismatch</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pandas_udf</span><span class="p">(</span><span class="n">amount_mismatch</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">amountMismatch</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">oldBalanceOrig</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">newBalanceOrig</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">oldBalanceDest</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">newBalanceDest</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;mismatch&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;mismatch&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgMismatch&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgMismatch&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|       avgMismatch|
+--------+------------------+
|TRANSFER| 968056.4538892006|
|CASH_OUT|170539.39652580014|
| CASH_IN| 50038.95466155722|
|   DEBIT| 25567.53969902471|
| PAYMENT| 6378.936662041953|
+--------+------------------+
</code></pre><h3 id="group-aggregate-udf">Group aggregate UDF</h3>
<p>The Group Aggregate UDF, also known as a Series-to-Scalar UDF, reduces the input <code>pandas.Series</code> to a single value.</p>
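<p>Beyond <code>groupBy().agg()</code>, a Series-to-Scalar UDF can also be used as a window function over an unbounded window. A small sketch (not from the original notebook):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pyspark.sql import Window

# Average amount per transaction type, attached to every row via a window
@F.pandas_udf(T.DoubleType())
def pdMean(s: pd.Series) -&gt; float:
    return s.mean()

w = Window.partitionBy(&#39;type&#39;)
df.withColumn(&#39;avgAmountByType&#39;, pdMean(df.amount).over(w)).show(5)
</code></pre></div>
<p>The example below computes the standard deviation of <code>amount</code> per transaction type with <code>groupBy().agg()</code>:</p>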
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@F.pandas_udf</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">getStdDeviation</span><span class="p">(</span><span class="n">series</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Use built-in pandas.Series.std</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">series</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">getStdDeviation</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">amount</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;var&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;var&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|               std|
+--------+------------------+
|TRANSFER|1879573.5289080725|
|CASH_OUT|175329.74448347004|
| CASH_IN|126508.25527180695|
|   DEBIT|13318.535518284714|
| PAYMENT|12556.450185716356|
+--------+------------------+
</code></pre><h3 id="group-map-udf">Group map UDF</h3>
<p>As with the Group Aggregate UDF, we use <code>groupBy()</code> to divide a Spark <code>DataFrame</code> into groups. The Group Map UDF maps over each group, producing a Pandas <code>DataFrame</code> per group; the results are then combined back into a single Spark <code>DataFrame</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">normalize_by_type</span><span class="p">(</span><span class="n">data</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">data</span><span class="p">[[</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">maxVal</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">minVal</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">maxVal</span> <span class="o">==</span> <span class="n">minVal</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.5</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s1">&#39;amount&#39;</span><span class="p">]</span> <span class="o">-</span> <span class="n">minVal</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">maxVal</span> <span class="o">-</span> <span class="n">minVal</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">result</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># We can use the SQL string-based schema like below comment</span>
</span></span><span class="line"><span class="cl"><span class="c1"># schema = &#39;type string, amount double, amountNorm double&#39;</span>
</span></span><span class="line"><span class="cl"><span class="n">schema</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">StructType</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amountNorm&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">applyInPandas</span><span class="p">(</span><span class="n">normalize_by_type</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+--------------------+
|    type|   amount|          amountNorm|
+--------+---------+--------------------+
|TRANSFER|    181.0|1.929785364412691...|
|TRANSFER| 215310.3| 0.00232902269229461|
|TRANSFER|311685.89|0.003371535041334062|
|TRANSFER|  62610.8|6.772443276469881E-4|
|TRANSFER| 42712.39|4.619995945019032E-4|
|TRANSFER| 77957.68|8.432543299642404E-4|
|TRANSFER| 17231.46|1.863677235062513...|
|TRANSFER| 78766.03|8.519983994671721E-4|
|TRANSFER|224606.64|0.002429582898990...|
|TRANSFER|125872.53|0.001361558008596...|
+--------+---------+--------------------+
only showing top 10 rows
</code></pre><p>You can see that in the example above, we don&rsquo;t need to explicitly create a UDF. This is thanks to the <code>applyInPandas</code> function, introduced in PySpark 3.0.0, which takes a regular Python function and a result schema as parameters. If you still want to create a Group Map UDF explicitly, you can refer to the following code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># It is preferred to use &#39;applyInPandas&#39; over this API (in Spark 3). </span>
</span></span><span class="line"><span class="cl"><span class="c1"># This API will be deprecated in the future releases.</span>
</span></span><span class="line"><span class="cl"><span class="c1"># As it will be deprecated soon, type hint inference is not supported. So, we have to specify PandasUDFType explicitly</span>
</span></span><span class="line"><span class="cl"><span class="n">NormalizeByType</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pandas_udf</span><span class="p">(</span><span class="n">normalize_by_type</span><span class="p">,</span> <span class="n">schema</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">PandasUDFType</span><span class="o">.</span><span class="n">GROUPED_MAP</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">NormalizeByType</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+--------------------+
|    type|   amount|          amountNorm|
+--------+---------+--------------------+
|TRANSFER|    181.0|1.929785364412691...|
|TRANSFER| 215310.3| 0.00232902269229461|
|TRANSFER|311685.89|0.003371535041334062|
|TRANSFER|  62610.8|6.772443276469881E-4|
|TRANSFER| 42712.39|4.619995945019032E-4|
|TRANSFER| 77957.68|8.432543299642404E-4|
|TRANSFER| 17231.46|1.863677235062513...|
|TRANSFER| 78766.03|8.519983994671721E-4|
|TRANSFER|224606.64|0.002429582898990...|
|TRANSFER|125872.53|0.001361558008596...|
+--------+---------+--------------------+
only showing top 10 rows
</code></pre><p>When executing a Group Map UDF, Spark will:</p>
<ul>
<li>Split the data into groups using <code>groupBy</code>.</li>
<li>Apply the function to each group.</li>
<li>Combine the results in a new PySpark <code>DataFrame</code>.</li>
</ul>
<p><img alt="Python Spark User Defined Function Group Map UDF workflow" loading="lazy" src="/posts/pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/images/pyspark-udf-spark-python-group-map-udf-workflow.png"></p>
<h2 id="conclusion">Conclusion</h2>
<p>In summary, PySpark UDFs are an effective way to bring the power and flexibility of Python to Spark workloads. When used properly, they can help extend Spark&rsquo;s capabilities to solve complex data engineering challenges. Together with the previous tutorial, this should equip you to handle most data manipulation and analysis tasks. Happy coding!</p>
]]></content:encoded></item><item><title>A Practical PySpark tutorial for beginners in Jupyter Notebook</title><link>https://note.datengineer.dev/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://note.datengineer.dev/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/</guid><description>A hands-on PySpark cheat sheet</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In today&rsquo;s world of data, the ability to efficiently process and analyze large amounts of data is crucial for businesses and organizations. This is where PySpark comes in: an open-source, distributed computing framework built on top of Apache Spark. With its seamless integration with Python, PySpark allows users to leverage the powerful data processing capabilities of Spark directly from Python scripts.</p>
<p>This post was originally a Jupyter Notebook I created when I started learning PySpark, intended as a personal cheat sheet. Now that I have a blog (a place for my notes), I decided to update it and share it here as a complete hands-on tutorial for beginners.</p>
<p>If you are new to PySpark, this tutorial is for you. We will cover the most basic and practical PySpark syntax. By the end of this tutorial, you will have a solid understanding of PySpark and be able to use Spark in Python to perform a wide range of data processing tasks.</p>
<h2 id="spark-vs-pyspark">Spark vs PySpark</h2>
<p>What is PySpark? How is it different from Apache Spark? Before looking at PySpark, it&rsquo;s essential to understand the relationship between Spark and PySpark.</p>
<p>Apache Spark is an open-source distributed computing system. It provides an interface for programming clusters with implicit data parallelism and fault tolerance. Apache Spark offers APIs for several programming languages, including Python, Java, Scala, and R, making it accessible to a wide audience for data processing tasks.</p>
<p>PySpark, on the other hand, is the library that builds on these APIs to bring Python support to Spark. It allows developers to use Python, the most popular programming language in the data community, to leverage the power of Spark without having to switch to another language. PySpark also offers seamless integration with other Python libraries.</p>
<p>In short, Spark is the overarching framework, and PySpark serves as its Python API, providing a convenient bridge for Python enthusiasts to leverage Spark&rsquo;s capabilities.</p>
<p><img alt="Apache Spark vs Python PySpark different" loading="lazy" src="/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/images/apache-spark-python-pyspark-difference-architecture.png"></p>
<h2 id="lets-get-started">Let&rsquo;s get started</h2>
<p>From this point on, you will see Python code driving Spark. This hands-on tutorial will guide you through basic PySpark operations such as querying, filtering, merging, and grouping data. You can find an executable notebook on my <a href="https://gist.github.com/ThaiDat/81c3662801aa8410a65b94f3c993c377">Github</a>.</p>
<h3 id="installation">Installation</h3>
<p>There are several ways to install PySpark. The easiest way for Python users is to use <code>pip</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install pyspark
</span></span></code></pre></div><h3 id="sparksession">SparkSession</h3>
<p><code>SparkSession</code> is the entry point for working with Apache Spark. It provides a unified interface for interacting with Spark functionality, allowing you to create DataFrames, execute SQL queries, and manage Spark configurations. Think of it as the gateway to all Spark operations in your application.</p>
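<p>As a small illustration (a sketch; the option shown is just one example of many), the builder also accepts arbitrary Spark configuration before the session is created. The session actually used in this tutorial is created in the next cell.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pyspark.sql import SparkSession

# Build a session with an explicit configuration value
spark = (
    SparkSession.builder
    .appName(&#39;Spark Demo&#39;)
    .master(&#39;local[*]&#39;)
    .config(&#39;spark.sql.shuffle.partitions&#39;, &#39;8&#39;)
    .getOrCreate()
)
</code></pre></div>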
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Get existed or Create new SparkSession</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s1">&#39;Spark Demo&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">master</span><span class="p">(</span><span class="s1">&#39;local[*]&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span>
</span></span></code></pre></div><pre tabindex="0"><code>SparkSession - in-memory

SparkContext

Spark UI

Version    v3.2.1
Master     local[*]
AppName    Spark Demo
</code></pre><h3 id="load-data">Load data</h3>
<p>PySpark can load data from various types of data storage. In this tutorial we will use the <a href="https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data/data">Fraudulent Transactions Dataset</a>. This dataset provides a CSV file that is sufficient for demo purposes.</p>
<p>The SparkSession object exposes a <code>read</code> property that returns a <code>DataFrameReader</code>, which can be used to read data into a <code>DataFrame</code>. The following code reads a CSV file into a DataFrame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Load CSV file to DataFrame</span>
</span></span><span class="line"><span class="cl"><span class="n">data_path</span> <span class="o">=</span> <span class="s1">&#39;../input/fraudulent-transactions-data/Fraud.csv&#39;</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">inferSchema</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><p>The <code>inferSchema</code> parameter allows Spark to automatically infer the data type of each column based on the actual data in the file. This involves reading a sample of the data, which can be computationally expensive. The inferred schema can also be incorrect, especially if the sampled data doesn&rsquo;t represent the entire dataset well.</p>
<p>Alternatively, to achieve better performance and ensure accurate data types, you can define the schema explicitly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">types</span> <span class="k">as</span> <span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Read CSV with pre-defined schema</span>
</span></span><span class="line"><span class="cl"><span class="n">predefined_schema</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">StructType</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;step&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;oldbalanceOrg&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;newbalanceOrig&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span> 
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">StringType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;oldbalanceDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;newbalanceDest&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">DoubleType</span><span class="p">()),</span> 
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;isFraud&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl">    <span class="n">T</span><span class="o">.</span><span class="n">StructField</span><span class="p">(</span><span class="s1">&#39;isFlaggedFraud&#39;</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">IntegerType</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">predefined_schema</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><p>The dataset contains some inconsistently formatted column names. I will rename them all to camel case.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Rename columns</span>
</span></span><span class="line"><span class="cl"><span class="n">corrected_cols</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;oldbalanceOrg&#39;</span><span class="p">:</span> <span class="s1">&#39;oldBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;newbalanceOrig&#39;</span><span class="p">:</span> <span class="s1">&#39;newBalanceOrig&#39;</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">                  <span class="s1">&#39;oldbalanceDest&#39;</span><span class="p">:</span> <span class="s1">&#39;oldBalanceDest&#39;</span><span class="p">,</span> <span class="s1">&#39;newbalanceDest&#39;</span><span class="p">:</span> <span class="s1">&#39;newBalanceDest&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">old_col</span><span class="p">,</span> <span class="n">new_col</span> <span class="ow">in</span> <span class="n">corrected_cols</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">withColumnRenamed</span><span class="p">(</span><span class="n">old_col</span><span class="p">,</span> <span class="n">new_col</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldBalanceOrig: double (nullable = true)
 |-- newBalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldBalanceDest: double (nullable = true)
 |-- newBalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)
</code></pre><h3 id="data-overview">Data Overview</h3>
<p>You can quickly look at the data with <code>DataFrame.show</code>, which prints the first n rows to the screen.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Prints top 10 rows of PySpark DataFrame to the screen</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|    type|  amount|   nameOrig|oldBalanceOrig|newBalanceOrig|   nameDest|oldBalanceDest|newBalanceDest|isFraud|isFlaggedFraud|
+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
|   1| PAYMENT| 9839.64|C1231006815|      170136.0|     160296.36|M1979787155|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 1864.28|C1666544295|       21249.0|      19384.72|M2044282225|           0.0|           0.0|      0|             0|
|   1|TRANSFER|   181.0|C1305486145|         181.0|           0.0| C553264065|           0.0|           0.0|      1|             0|
|   1|CASH_OUT|   181.0| C840083671|         181.0|           0.0|  C38997010|       21182.0|           0.0|      1|             0|
|   1| PAYMENT|11668.14|C2048537720|       41554.0|      29885.86|M1230701703|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7817.71|  C90045638|       53860.0|      46042.29| M573487274|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7107.77| C154988899|      183195.0|     176087.23| M408069119|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 7861.64|C1912850431|     176087.23|     168225.59| M633326333|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 4024.36|C1265012928|        2671.0|           0.0|M1176932104|           0.0|           0.0|      0|             0|
|   1|   DEBIT| 5337.77| C712410124|       41720.0|      36382.23| C195600860|       41898.0|      40348.79|      0|             0|
+----+--------+--------+-----------+--------------+--------------+-----------+--------------+--------------+-------+--------------+
only showing top 10 rows
</code></pre><p>In many cases, the result does not fit on the screen and produces unreadable output.</p>
<p><img alt="PySpark load CSV show not fit screen" loading="lazy" src="/posts/a-practical-pyspark-tutorial-for-beginners-in-jupyter-notebook/images/pyspark-load-csv-show-not-fit-screen.png"></p>
<p>This is where Python comes in. With PySpark, you can mix plain Python code with Spark APIs to improve the output. The following function uses a Python loop to split the columns into subsets and display a sample of each.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Split columns into subsets and show it accordingly</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">show_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">split</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">n_cols</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">split</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">split</span> <span class="o">=</span> <span class="n">n_cols</span>
</span></span><span class="line"><span class="cl">    <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">split</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_cols</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="o">*</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">j</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">n_samples</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">i</span> <span class="o">=</span> <span class="n">j</span>
</span></span><span class="line"><span class="cl">        <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">split</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">show_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+----+--------+--------+-----------+
|step|    type|  amount|   nameOrig|
+----+--------+--------+-----------+
|   1| PAYMENT| 9839.64|C1231006815|
|   1| PAYMENT| 1864.28|C1666544295|
|   1|TRANSFER|   181.0|C1305486145|
|   1|CASH_OUT|   181.0| C840083671|
|   1| PAYMENT|11668.14|C2048537720|
|   1| PAYMENT| 7817.71|  C90045638|
|   1| PAYMENT| 7107.77| C154988899|
|   1| PAYMENT| 7861.64|C1912850431|
|   1| PAYMENT| 4024.36|C1265012928|
|   1|   DEBIT| 5337.77| C712410124|
+----+--------+--------+-----------+
only showing top 10 rows

+--------------+--------------+-----------+--------------+
|oldBalanceOrig|newBalanceOrig|   nameDest|oldBalanceDest|
+--------------+--------------+-----------+--------------+
|      170136.0|     160296.36|M1979787155|           0.0|
|       21249.0|      19384.72|M2044282225|           0.0|
|         181.0|           0.0| C553264065|           0.0|
|         181.0|           0.0|  C38997010|       21182.0|
|       41554.0|      29885.86|M1230701703|           0.0|
|       53860.0|      46042.29| M573487274|           0.0|
|      183195.0|     176087.23| M408069119|           0.0|
|     176087.23|     168225.59| M633326333|           0.0|
|        2671.0|           0.0|M1176932104|           0.0|
|       41720.0|      36382.23| C195600860|       41898.0|
+--------------+--------------+-----------+--------------+
only showing top 10 rows

+--------------+-------+--------------+
|newBalanceDest|isFraud|isFlaggedFraud|
+--------------+-------+--------------+
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      1|             0|
|           0.0|      1|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|           0.0|      0|             0|
|      40348.79|      0|             0|
+--------------+-------+--------------+
only showing top 10 rows
</code></pre><p>When working with numerical data, it is not very useful to look at a long series of raw values. We are often more interested in a few key statistics, such as count, mean, standard deviation, minimum, and maximum. PySpark&rsquo;s <code>DataFrame</code> provides the <code>describe</code> and <code>summary</code> functions, with slightly different usage, to present these essential metrics.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DataFrame.describe take columns as params</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">(</span><span class="s1">&#39;step&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-------+------------------+------------------+
|summary|              step|            amount|
+-------+------------------+------------------+
|  count|           6362620|           6362620|
|   mean|243.39724563151657|179861.90354913412|
| stddev|142.33197104912588| 603858.2314629498|
|    min|                 1|               0.0|
|    max|               743|     9.244551664E7|
+-------+------------------+------------------+
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DataFrame.summary take statistics as params</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;oldBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;newBalanceOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;oldBalanceDest&#39;</span><span class="p">,</span> <span class="s1">&#39;newBalanceDest&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s1">&#39;count&#39;</span><span class="p">,</span> <span class="s1">&#39;min&#39;</span><span class="p">,</span> <span class="s1">&#39;max&#39;</span><span class="p">,</span> <span class="s1">&#39;mean&#39;</span><span class="p">,</span> <span class="s1">&#39;50%&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-------+-----------------+-----------------+------------------+------------------+
|summary|   oldBalanceOrig|   newBalanceOrig|    oldBalanceDest|    newBalanceDest|
+-------+-----------------+-----------------+------------------+------------------+
|  count|          6362620|          6362620|           6362620|           6362620|
|    min|              0.0|              0.0|               0.0|               0.0|
|    max|    5.958504037E7|    4.958504037E7|    3.5601588935E8|    3.5617927892E8|
|   mean|833883.1040744719|855113.6685785714|1100701.6665196654|1224996.3982019408|
|    50%|         14211.23|              0.0|         132612.49|         214605.81|
+-------+-----------------+-----------------+------------------+------------------+
</code></pre><h3 id="query-data">Query data</h3>
<h4 id="select-and-filter">Select and Filter</h4>
<p>PySpark borrows a lot of vocabulary from the SQL world, but it does not force you to follow the rigid SQL statement structure (select &hellip; from &hellip; where &hellip;). Each operation returns a <code>DataFrame</code> or <code>GroupedData</code> that you can keep working with.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># First .where() filter DataFrame and return another DataFrame</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Then .select() select from the returned DataFrame </span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">type</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+
|    type|   amount|
+--------+---------+
|CASH_OUT|    181.0|
|CASH_OUT|229133.94|
|CASH_OUT|110414.71|
|CASH_OUT|  56953.9|
|CASH_OUT|  5346.89|
|CASH_OUT|  23261.3|
|CASH_OUT| 82940.31|
|CASH_OUT| 47458.86|
|CASH_OUT|136872.92|
|CASH_OUT| 94253.33|
+--------+---------+
only showing top 10 rows
</code></pre><p>The above example shows three different ways to access PySpark columns (a short example follows the list):</p>
<ul>
<li><code>df.type</code>: Access as an attribute.</li>
<li><code>df['type']</code>: Access as an item.</li>
<li><code>F.col('type')</code>: Explicitly specify that we need a column, not a string literal.</li>
</ul>
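<p>As a quick sketch, the three styles are interchangeable and can even be mixed in a single expression:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># All three column-access styles work anywhere a column is expected
df.select(df.type, df['amount'], F.col('nameOrig')).show(3)
</code></pre></div>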
<p>You can also combine multiple filter conditions using the <code>&amp;</code>, <code>|</code>, and <code>~</code> operators.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PySpark example filter multiple conditions</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">500</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><p>For users who are more familiar with SQL syntax, Spark also lets you write SQL queries directly. Before doing so, you need to register your <code>DataFrame</code> as a temporary view so that you can reference it in your queries.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create or replace temp view named &#34;df&#34; from DataFrame df in PySpark</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s1">&#39;df&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Spark SQL query example. You can now reference df in your query</span>
</span></span><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT type, amount 
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    WHERE type = &#34;CASH_OUT&#34;    
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+---------+
|    type|   amount|
+--------+---------+
|CASH_OUT|    181.0|
|CASH_OUT|229133.94|
|CASH_OUT|110414.71|
|CASH_OUT|  56953.9|
|CASH_OUT|  5346.89|
|CASH_OUT|  23261.3|
|CASH_OUT| 82940.31|
|CASH_OUT| 47458.86|
|CASH_OUT|136872.92|
|CASH_OUT| 94253.33|
+--------+---------+
only showing top 10 rows
</code></pre><h4 id="aggregating-with-groupby">Aggregating with <code>groupBy</code></h4>
<p>PySpark provides a similar syntax to Pandas for aggregating data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Example to PySpark groupBy</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Sometimes we can pass column name directly to pyspark functions</span>
</span></span><span class="line"><span class="cl"><span class="c1"># `Column.alias` method change the name of the result column.</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">,</span> <span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;avgAmount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT type, AVG(amount) avgAmount
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    GROUP BY type
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY 2
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+--------+------------------+
|    type|         avgAmount|
+--------+------------------+
|   DEBIT| 5483.665313767128|
| PAYMENT|13057.604660187604|
| CASH_IN| 168920.2420040954|
|CASH_OUT|176273.96434613998|
|TRANSFER| 910647.0096454868|
+--------+------------------+
</code></pre><p>To filter after a <code>groupBy</code>, simply apply <code>where</code> or <code>filter</code> to the resulting <code>DataFrame</code>, or use the SQL <code>HAVING</code> keyword.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;type&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s1">&#39;amount&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;sumAmount&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;sumAmount&#39;</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">300000</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig, SUM(amount) sumAmount
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">    WHERE type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">    GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s1">    HAVING sumAmount &gt; 300000
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+---------+
|   nameOrig|sumAmount|
+-----------+---------+
| C551314014|301050.58|
| C661668091|323789.56|
| C228994633|517946.01|
|C1591008292|558254.22|
|C2100435651|357988.09|
| C624052656|476735.47|
| C948681098|353759.28|
|  C50682517|386128.82|
|C1579521009|684561.18|
|C1871922377|394317.12|
+-----------+---------+
only showing top 10 rows
</code></pre><h4 id="union-and-intersection">Union and Intersection</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>12725240
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig from df
</span></span></span><span class="line"><span class="cl"><span class="s1">    UNION
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameDest from df
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>9073900
</code></pre><p>Note the difference in the counts. The reason is that the PySpark <code>union</code> function keeps duplicate rows from the two sets, which is equivalent to <code>UNION ALL</code> in SQL. By default, PySpark does not remove duplicates, as that is an expensive operation. If you want to drop duplicates, you have to do it explicitly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Union and drop duplicates in PySpark</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</span></span></code></pre></div><pre tabindex="0"><code>9073900
</code></pre><p>Unioning can be useful when reading data from multiple files: read them one by one in a Python loop and union the results, as in the sketch below.</p>
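<p>A minimal sketch, assuming the files share the schema defined earlier (the paths are hypothetical placeholders):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Read the files one by one and union them into a single DataFrame
from functools import reduce

paths = ['data/part1.csv', 'data/part2.csv', 'data/part3.csv']  # hypothetical paths
dfs = [spark.read.csv(p, schema=predefined_schema, header=True) for p in paths]
combined = reduce(lambda left, right: left.union(right), dfs)
</code></pre></div>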
<p>Intersection works similarly. Keep in mind, however, that PySpark <code>intersect</code> is equivalent to SQL <code>INTERSECT</code> (duplicates removed), not <code>INTERSECT ALL</code>.</p>
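<p>For example, to count the accounts that appear as both an origin and a destination:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># intersect keeps only rows present in both sets, with duplicates removed
df.select('nameOrig').intersect(df.select('nameDest')).count()
</code></pre></div>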
<h4 id="join">Join</h4>
<p>Very similar to Pandas, the <code>DataFrame.join</code> method joins one <code>DataFrame</code> with another using the given join expression.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;ABS(newBalanceOrig - oldBalanceOrig) changeOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;changeOrig&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgChangeOrig&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;occOrig&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;avgChangeOrig &gt; 100000&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Join the above DataFrame with the one provided in parameter</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">join</span><span class="p">((</span>
</span></span><span class="line"><span class="cl">        <span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">,</span> <span class="s1">&#39;ABS(newBalanceDest - oldBalanceDest) changeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">F</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;changeDest&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;avgChangeDest&#39;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">F</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;occDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;avgChangeDest &gt; 100000&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span> <span class="n">on</span><span class="o">=</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;nameOrig&#39;</span><span class="p">)</span><span class="o">==</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;nameDest&#39;</span><span class="p">),</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;inner&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># There are several join method: inner, left, right, cross, outer, left_outer, right_outer, left_semi, left_anti, right_semi, right_anti, ...</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s1">&#39;nameOrig name&#39;</span><span class="p">,</span> <span class="s1">&#39;occOrig + occDest occ&#39;</span><span class="p">,</span> <span class="s1">&#39;avgChangeOrig&#39;</span><span class="p">,</span> <span class="s1">&#39;avgChangeDest&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s1">&#39;occ&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">    SELECT nameOrig name, occOrig + occDest occ, avgChangeOrig, avgChangeDest
</span></span></span><span class="line"><span class="cl"><span class="s1">    FROM
</span></span></span><span class="line"><span class="cl"><span class="s1">    (
</span></span></span><span class="line"><span class="cl"><span class="s1">        SELECT nameOrig, AVG(ABS(newBalanceOrig - oldBalanceOrig)) avgChangeOrig, COUNT(*) occOrig
</span></span></span><span class="line"><span class="cl"><span class="s1">        FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">        WHERE type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">        GROUP BY nameOrig
</span></span></span><span class="line"><span class="cl"><span class="s1">        HAVING avgChangeOrig &gt; 100000
</span></span></span><span class="line"><span class="cl"><span class="s1">    )
</span></span></span><span class="line"><span class="cl"><span class="s1">    INNER JOIN
</span></span></span><span class="line"><span class="cl"><span class="s1">    (
</span></span></span><span class="line"><span class="cl"><span class="s1">        SELECT nameDest, AVG(ABS(newBalanceDest - oldBalanceDest)) avgChangeDest, COUNT(*) occDest
</span></span></span><span class="line"><span class="cl"><span class="s1">        FROM df
</span></span></span><span class="line"><span class="cl"><span class="s1">        WHERE type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;
</span></span></span><span class="line"><span class="cl"><span class="s1">        GROUP BY nameDest
</span></span></span><span class="line"><span class="cl"><span class="s1">        HAVING avgChangeDest &gt; 100000
</span></span></span><span class="line"><span class="cl"><span class="s1">    )
</span></span></span><span class="line"><span class="cl"><span class="s1">    ON nameOrig = nameDest
</span></span></span><span class="line"><span class="cl"><span class="s1">    ORDER BY occ DESC
</span></span></span><span class="line"><span class="cl"><span class="s1">&#39;&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>+-----------+---+------------------+------------------+
|       name|occ|     avgChangeOrig|     avgChangeDest|
+-----------+---+------------------+------------------+
|C1552859894| 43|193711.30000000005| 763241.1652380949|
|C1819271729| 37|         278937.79|283626.17805555544|
|C1692434834| 37|177369.73000000045| 438853.7616666666|
| C889762313| 32|         132731.31|211437.18741935486|
|C1868986147| 32|         120594.03|249840.37709677417|
|  C55305556| 28|319860.45999999903|225565.42111111112|
| C636092700| 26|217273.86000000004|201888.05279999998|
|C1713505653| 25| 278622.8400000003|186625.34916666665|
|C2029542508| 24| 235760.1200000001|231022.98217391354|
| C699906968| 23| 177813.3799999999| 183054.3072727272|
+-----------+---+------------------+------------------+
only showing top 10 rows
</code></pre><p>In the above example, I demonstrated mixing PySpark and SQL syntax for cleaner code. Instead of the verbose expression:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_IN&#39;</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s1">&#39;type&#39;</span><span class="p">)</span><span class="o">==</span><span class="s1">&#39;CASH_OUT&#39;</span><span class="p">))</span>
</span></span></code></pre></div><p>You can write:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;type = &#34;CASH_IN&#34; OR type = &#34;CASH_OUT&#34;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>This style can be applied in various PySpark functions: <code>selectExpr</code>, <code>where</code>, <code>filter</code>, <code>expr</code>, and more. Choose whichever coding style you prefer; PySpark offers the flexibility.</p>
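<p>For instance, <code>F.expr</code> turns a SQL expression string into a column that can be used anywhere a <code>Column</code> is expected (a small sketch on the same <code>df</code>):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># F.expr parses a SQL expression string into a Column object
df.select(F.expr('ABS(newBalanceOrig - oldBalanceOrig) AS changeOrig')).show(3)
</code></pre></div>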
<h2 id="endnote">Endnote</h2>
<p>This tutorial has covered basic Spark operations in both Python and SQL syntax, enough to perform the most common data transformation and analysis tasks. But your Spark journey doesn&rsquo;t end here! More advanced features that were not covered in this article (e.g., UDFs) are discussed in <a href="../pyspark-udfs-a-comprehensive-guide-to-unlock-pyspark-potential/">another post</a>.</p>
]]></content:encoded></item></channel></rss>