<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Azure on Denis Gontcharov</title>
    <link>https://gontcharov.eu/tags/azure/</link>
    <description>Recent content in Azure on Denis Gontcharov</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Tue, 22 Jul 2025 11:43:12 +0200</lastBuildDate><atom:link href="https://gontcharov.eu/tags/azure/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>🎥 Deploying a Databricks Asset Bundle with Azure DevOps Pipelines</title>
      <link>https://gontcharov.eu/posts/youtube/databricks-dab-azure-devops-pipelines/</link>
      <pubDate>Tue, 22 Jul 2025 11:43:12 +0200</pubDate>
      
      <guid>https://gontcharov.eu/posts/youtube/databricks-dab-azure-devops-pipelines/</guid>
      <description>&lt;h1 id=&#34;video&#34;&gt;Video&lt;/h1&gt;
&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/jVxip1rm3SA?autoplay=0&amp;amp;controls=1&amp;amp;end=0&amp;amp;loop=0&amp;amp;mute=0&amp;amp;start=0&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; title=&#34;YouTube video&#34;&gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;h1 id=&#34;objectives&#34;&gt;Objectives&lt;/h1&gt;
&lt;p&gt;In this post we will deploy a Databricks Asset Bundle or DAB from a Git repository hosted on Azure DevOps using Azure DevOps pipelines. In summary, we will learn how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Grant Databricks access to your Azure DevOps Git repository.&lt;/li&gt;
&lt;li&gt;Define a simple DAB that deploys a Databricks notebook.&lt;/li&gt;
&lt;li&gt;Use the Databricks CLI to validate and deploy DABs.&lt;/li&gt;
&lt;li&gt;Write an Azure DevOps pipeline to deploy this DAB.&lt;/li&gt;
&lt;li&gt;Pass parameters from the DAB into the Databricks notebook.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Concerning the last point, it&amp;rsquo;s not uncommon that your code differs slightly in each Databricks environment (dev, test, prod). For example, you may have an Azure key vault &lt;code&gt;my_key_vault_dev&lt;/code&gt; for the development workspace and &lt;code&gt;my_key_vault_prod&lt;/code&gt; for the production workspace. We will see how to pass this workspace-dependent data from the DAB to Databricks Notebooks via widgets.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="video">Video</h1>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/jVxip1rm3SA?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<h1 id="objectives">Objectives</h1>
<p>In this post we will deploy a Databricks Asset Bundle or DAB from a Git repository hosted on Azure DevOps using Azure DevOps pipelines. In summary, we will learn how to:</p>
<ul>
<li>Grant Databricks access to your Azure DevOps Git repository.</li>
<li>Define a simple DAB that deploys a Databricks notebook.</li>
<li>Use the Databricks CLI to validate and deploy DABs.</li>
<li>Write an Azure DevOps pipeline to deploy this DAB.</li>
<li>Pass parameters from the DAB into the Databricks notebook.</li>
</ul>
<p>Concerning the last point, it&rsquo;s not uncommon that your code differs slightly in each Databricks environment (dev, test, prod). For example, you may have an Azure key vault <code>my_key_vault_dev</code> for the development workspace and <code>my_key_vault_prod</code> for the production workspace. We will see how to pass this workspace-dependent data from the DAB to Databricks Notebooks via widgets.</p>
<h1 id="project-overview">Project Overview</h1>
<p>The project directory in the Git repository consists of just three files and a README:</p>
<pre tabindex="0"><code class="language-stdout" data-lang="stdout">.
├── README.md --&gt; Documentation
├── azure_devops_pipeline.yml --&gt; Azure DevOps pipeline YAML file
├── databricks.yml --&gt; The DAB YAML file with a notebook task
└── demo_notebook.ipynb --&gt; The minimal Databricks notebook
</code></pre><p>At a high level, we define a Databricks notebook. This notebook will be executed as part of a Databricks job defined in the DAB. The DAB will be deployed automatically to our Databricks workspace by the Azure DevOps pipeline.</p>
<h1 id="databricks-notebook">Databricks Notebook</h1>
<p>The notebook that is executed by the workflow consists of just two lines:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">value</span> <span class="o">=</span> <span class="n">dbutils</span><span class="o">.</span><span class="n">widgets</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;demo_parameter&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">value</span><span class="p">)</span>
</span></span></code></pre></div><p>We simply read and print a value from a <a href="https://docs.databricks.com/aws/en/jobs/parameter-use">Databricks notebook parameter</a>. This value is set in the DAB file, and can therefore differ for each environment (e.g. development, test, production). For example, the <code>git_branch</code> for our hypothetical <em>&ldquo;dev&rdquo;</em> environment could be <em>&ldquo;develop&rdquo;</em>.</p>
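<p>As a minimal sketch of how such a parameter could drive environment-specific code, the snippet below builds the key vault name mentioned in the objectives. It assumes a hypothetical notebook parameter <code>environment</code> and uses a hypothetical <code>FakeWidgets</code> class as a local stand-in for <code>dbutils.widgets</code>, which only exists inside Databricks:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">class FakeWidgets:
    """Local stand-in for dbutils.widgets; only get() is mimicked."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name):
        # Return the stored value of the named parameter.
        return self._values[name]


def key_vault_name(widgets, parameter="environment"):
    """Build the key vault name from a (hypothetical) environment parameter."""
    env = widgets.get(parameter)
    return f"my_key_vault_{env}"


print(key_vault_name(FakeWidgets({"environment": "dev"})))
</code></pre><p>Inside a real notebook you would pass <code>dbutils.widgets</code> instead of the stand-in.</p>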
<h1 id="databricks-asset-bundle-dab">Databricks Asset Bundle (DAB)</h1>
<p>Having defined the notebook above, we now define a Databricks job that executes the notebook as a notebook task.</p>
<h2 id="databricks-asset-bundle-yaml">Databricks Asset Bundle YAML</h2>
<p>The code below defines the Databricks job. Pay attention to the following important elements:</p>
<ol>
<li>The DAB defines two variables, <code>git_branch</code> and <code>demo_parameter_value</code>. Their values are set in the target <code>free</code><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</li>
<li>We define a text parameter <code>demo_parameter</code> for the notebook and assign it a value via <code>${var.demo_parameter_value}</code> by referring to the variable created in the previous point.</li>
<li>We use the <code>git_branch</code> variable from the first point to pull the code from the head of the main branch (instead of from a Databricks workspace). The <code>git_url</code> points to our Azure DevOps Git repository<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>.</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">bundle</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;DAB-Demo&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">uuid</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;05622722-fb3a-4a17-8f1f-c3c1d37ececb&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">variables</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">git_branch</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Git branch to use for job source code&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">demo_parameter_value</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Text value to pass as a Databricks notebook parameter&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">presets</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">tags</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">application</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Demo Notebook&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">targets</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">free</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="l">development</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">workspace</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">host</span><span class="p">:</span><span class="w"> </span><span class="l">https://dbc-e667f434-e97e.cloud.databricks.com</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">variables</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">git_branch</span><span class="p">:</span><span class="w"> </span><span class="l">main</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">demo_parameter_value</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Hello, World!&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">jobs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">run_demo_notebook</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">run_demo_notebook_job</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">tasks</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">task_key</span><span class="p">:</span><span class="w"> </span><span class="l">run_demo_notebook_task</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">notebook_task</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">notebook_path</span><span class="p">:</span><span class="w"> </span><span class="l">demo_notebook</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">base_parameters</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">demo_parameter</span><span class="p">:</span><span class="w"> </span><span class="l">${var.demo_parameter_value}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">source</span><span class="p">:</span><span class="w"> </span><span class="l">GIT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">git_source</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">git_url</span><span class="p">:</span><span class="w"> </span><span class="l">https://gontcharovd@dev.azure.com/gontcharovd/databricks-dab-demo/_git/databricks-dab-demo</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">git_provider</span><span class="p">:</span><span class="w"> </span><span class="l">azureDevOpsServices</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">git_branch</span><span class="p">:</span><span class="w"> </span><span class="l">${var.git_branch}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">schedule</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">quartz_cron_expression</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;0 0 7 * * ?&#34;</span><span class="w">  </span><span class="c"># Daily at 7:00 AM UTC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">timezone_id</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;UTC&#34;</span><span class="w">
</span></span></span></code></pre></div><h2 id="authorize-databricks-to-pull-code-from-azure-devops-repo">Authorize Databricks to pull code from Azure DevOps repo</h2>
<p>Databricks needs to authenticate with Azure DevOps to pull the Git repository in the workspace. This requires creating a Personal Access Token (PAT) in Azure DevOps.</p>
<p>In Azure DevOps, navigate to &ldquo;user settings&rdquo; in the top-right corner (next to your account profile picture). Click on &ldquo;Personal access tokens&rdquo;. Create a new token with read/write access for Code for your organization or project. Copy the value.</p>
<p>In Databricks, click on your account profile picture in the top-right corner. Go to &ldquo;Settings&rdquo; and to &ldquo;Linked accounts&rdquo;. Click on &ldquo;Add Git credential&rdquo;. Fill out the fields (see the picture below) and paste the PAT value copied earlier.</p>
<p><img loading="lazy" src="/posts/youtube/databricks-dab-azure-devops-pipelines/authentication.png" type="" alt=""  /></p>
<h2 id="manual-dab-deployment">Manual DAB Deployment</h2>
<p>Now that we have defined the DAB and authorized Databricks to access our Azure DevOps repo, we can deploy the DAB and run the created Databricks job. As a first step, we will deploy manually using the <a href="https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/install">Databricks CLI</a>.</p>
<p>After installation, log in and create a profile &ldquo;free&rdquo;. Replace the <code>host</code> URL with the correct link to your Databricks (free) workspace.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">databricks auth login
</span></span></code></pre></div><p>Let&rsquo;s validate the bundle:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">databricks bundle validate -t free
</span></span></code></pre></div><p>Output:</p>
<pre tabindex="0"><code class="language-stdout" data-lang="stdout">Name: DAB-Demo
Target: free
Workspace:
  Host: https://dbc-e667f434-e97e.cloud.databricks.com
  User: denis@gontcharov.eu
  Path: /Workspace/Users/denis@gontcharov.eu/.bundle/DAB-Demo/free

Validation OK!
</code></pre><p>Everything looks good. Let&rsquo;s deploy the bundle:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">databricks bundle deploy -t free
</span></span></code></pre></div><p>Output:</p>
<pre tabindex="0"><code class="language-stdout" data-lang="stdout">Uploading bundle files to /Workspace/Users/denis@gontcharov.eu/.bundle/DAB-Demo/free/files...
Deploying resources...
Updating deployment state...
Deployment complete!
</code></pre><h2 id="running-the-workflow">Running the workflow</h2>
<p>We can see the final workflow in the Jobs &amp; Pipelines view in the Databricks UI:</p>
<p><img loading="lazy" src="/posts/youtube/databricks-dab-azure-devops-pipelines/workflow.png" type="" alt=""  /></p>
<p>Click on the &ldquo;Play&rdquo; button to execute the job:</p>
<p><img loading="lazy" src="/posts/youtube/databricks-dab-azure-devops-pipelines/notebook_run.png" type="" alt=""  /></p>
<p>Notice how the value <em>&ldquo;Hello, World!&rdquo;</em> came from the DAB file.</p>
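<p>The substitution can be pictured roughly as follows. This is only an illustration of the idea, not how the Databricks CLI implements it; the <code>resolve</code> helper and the <code>TARGET_VARIABLES</code> dictionary are hypothetical names that mirror the values in the DAB file above:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import re

# Variable values per target, copied from the "targets" section of the DAB file.
TARGET_VARIABLES = {
    "free": {"git_branch": "main", "demo_parameter_value": "Hello, World!"},
}


def resolve(text, target):
    """Replace every ${var.NAME} reference with the value set for the target."""
    values = TARGET_VARIABLES[target]
    return re.sub(r"\$\{var\.(\w+)\}", lambda match: values[match.group(1)], text)


print(resolve("${var.demo_parameter_value}", "free"))  # prints: Hello, World!
</code></pre><p>Adding a second target with its own values would change the result without touching the job definition.</p>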
<h1 id="azure-devops-pipeline">Azure DevOps Pipeline</h1>
<p>Now that we have verified that manual deployment works, we want to automate the deployment process. Concretely, we want to redeploy the DAB whenever a commit or merge lands on the main branch. This is accomplished by an Azure DevOps pipeline that we will configure in the next part.</p>
<h2 id="pipeline-yaml">Pipeline YAML</h2>
<p>The code below defines the Azure DevOps pipeline that deploys the resources defined in the DAB to the &ldquo;free&rdquo; target. Notice the following points:</p>
<ol>
<li>The pipeline is triggered whenever one of the files <em>demo_notebook.ipynb</em>, <em>databricks.yml</em>, or <em>azure_devops_pipeline.yml</em> changes on the <code>main</code> branch.</li>
<li>The <code>condition</code> statement restricts the job to a particular branch, here <code>main</code>.</li>
<li>The job steps rely on two variables, <code>DATABRICKS_TOKEN</code> and <code>DATABRICKS_WORKSPACE</code>, defined in the variable group <code>databricks-free-variable-group</code>. We will define these variables later.</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yml" data-lang="yml"><span class="line"><span class="cl"><span class="nt">trigger</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">branches</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">include</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">main</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">paths</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">include</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">demo_notebook.ipynb</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">databricks.yml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">azure_devops_pipeline.yml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">jobs</span><span class="p">:</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">job</span><span class="p">:</span><span class="w"> </span><span class="l">DeployFree</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">displayName</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Deploy to free Databricks workspace&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="l">eq(variables[&#39;Build.SourceBranch&#39;], &#39;refs/heads/main&#39;)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">variables</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">group</span><span class="p">:</span><span class="w"> </span><span class="l">databricks-free-variable-group</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">steps</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">script</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">displayName</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;Install Databricks CLI&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">task</span><span class="p">:</span><span class="w"> </span><span class="l">Bash@3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">displayName</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;Validate Databricks Bundle for $(DATABRICKS_WORKSPACE)&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">inputs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">targetType</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;inline&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">script</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">            export DATABRICKS_TOKEN=&#34;$(DATABRICKS_TOKEN)&#34;
</span></span></span><span class="line"><span class="cl"><span class="sd">            databricks bundle validate -t $(DATABRICKS_WORKSPACE)
</span></span></span><span class="line"><span class="cl"><span class="sd">            </span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">task</span><span class="p">:</span><span class="w"> </span><span class="l">Bash@3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">displayName</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;Deploy Databricks Bundle to $(DATABRICKS_WORKSPACE)&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">inputs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">targetType</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;inline&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">script</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">            export DATABRICKS_TOKEN=&#34;$(DATABRICKS_TOKEN)&#34;
</span></span></span><span class="line"><span class="cl"><span class="sd">            databricks bundle deploy -t $(DATABRICKS_WORKSPACE)</span><span class="w">
</span></span></span></code></pre></div><p>The job consists of three steps:</p>
<ol>
<li>First we install the Databricks CLI on the Azure DevOps pipeline agent that runs the job.</li>
<li>We then validate the DAB like we did manually in the previous part.</li>
<li>Finally, we use the same command that we ran manually in the previous part to deploy the DAB.</li>
</ol>
<p>Note that the Databricks CLI authentication takes place using the environment variable <code>DATABRICKS_TOKEN</code>. We specify the target using the <code>-t</code> flag and the variable <code>DATABRICKS_WORKSPACE</code>. Make sure to push this code to your Azure DevOps repository.</p>
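<p>The same validate-then-deploy sequence could also be scripted outside a pipeline. The sketch below is an assumption, not part of the repository: the helper names are hypothetical, and only the two <code>databricks bundle</code> commands and the <code>DATABRICKS_TOKEN</code> variable come from the steps above.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import os
import subprocess


def bundle_commands(target):
    """Return the validate and deploy commands for one DAB target."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]


def deploy_bundle(target, token, run=subprocess.run):
    """Validate, then deploy; check=True stops at the first failing command."""
    # The CLI reads the token from the environment, as in the pipeline steps.
    env = dict(os.environ, DATABRICKS_TOKEN=token)
    for command in bundle_commands(target):
        run(command, check=True, env=env)


print(bundle_commands("free")[1])
</code></pre><p>The <code>run</code> argument exists only so that the function can be exercised without the CLI installed; calling <code>deploy_bundle("free", token)</code> with the default would invoke the real commands.</p>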
<h2 id="authorize-azure-devops-to-deploy-dabs">Authorize Azure DevOps to deploy DABs</h2>
<p>Remember how we had to authorize Databricks to access Azure DevOps Repos? Now we have to do the same but in the opposite direction: Azure DevOps needs to be authorized to deploy DABs in our Databricks workspace. This requires creating a Databricks PAT and storing it in Azure DevOps.</p>
<p>Go to the Databricks UI and click on your user profile picture in the top-right corner to create a Databricks PAT. Click on &ldquo;Settings&rdquo;, go to &ldquo;Developer&rdquo; and click on &ldquo;Manage&rdquo; under Access Tokens. Generate a new token and copy the value.</p>
<p>Navigate to Azure DevOps and open the &ldquo;Pipelines&rdquo; tab. Go to &ldquo;Library&rdquo; and create a new variable group <code>databricks-free-variable-group</code>. Create a new secret variable <code>DATABRICKS_TOKEN</code> and paste the copied PAT value. Create a second (non-secret) variable <code>DATABRICKS_WORKSPACE</code> and write the value &ldquo;free&rdquo;. This will be the target Databricks workspace in which we will deploy the DAB resources.</p>
<h2 id="creating-the-azure-devops-pipeline">Creating the Azure DevOps Pipeline</h2>
<p>Pushing the pipeline YAML code to the Azure DevOps repo is not sufficient. We have to manually create the pipeline.</p>
<p>In Azure DevOps, navigate back to the &ldquo;Pipelines&rdquo; tab. Click the &ldquo;Create Pipeline&rdquo; button. Select &ldquo;Azure Repos&rdquo; and choose &ldquo;Existing Azure Pipelines YAML file&rdquo;. Select the YAML file containing your Azure DevOps pipeline code.</p>
<h2 id="running-the-azure-devops-pipeline">Running the Azure DevOps Pipeline</h2>
<p>Navigate to the newly created pipeline and click on &ldquo;Run pipeline&rdquo;. When you run the pipeline for the first time, it will request permission to use the variable group. Click on &ldquo;Permit&rdquo;. We see that the three steps of the job completed successfully:</p>
<p><img loading="lazy" src="/posts/youtube/databricks-dab-azure-devops-pipelines/pipeline.png" type="" alt=""  /></p>
<p>That&rsquo;s it! We can now make changes to our pipeline, push our changes to the remote repository, and automatically see them in the Databricks UI.</p>
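<p>To make the trigger rule concrete, here is a simplified sketch of when a push leads to a redeployment. It is an approximation under stated assumptions: it only compares exact file names, whereas real Azure DevOps path filters can also match folders, and the name <code>push_redeploys</code> is hypothetical:</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Branch and paths copied from the trigger section of the pipeline YAML.
TRIGGER_BRANCH = "refs/heads/main"
TRIGGER_PATHS = {
    "demo_notebook.ipynb",
    "databricks.yml",
    "azure_devops_pipeline.yml",
}


def push_redeploys(source_branch, changed_files):
    """Return True if a push to this branch with these files triggers a deploy."""
    on_main = source_branch == TRIGGER_BRANCH
    touches_tracked_file = any(path in TRIGGER_PATHS for path in changed_files)
    return on_main and touches_tracked_file


print(push_redeploys("refs/heads/main", ["databricks.yml"]))  # prints: True
print(push_redeploys("refs/heads/main", ["README.md"]))  # prints: False
</code></pre>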
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="https://docs.databricks.com/aws/en/getting-started/free-edition">Databricks Free Edition</a> only allows one environment (that we call free). In a real application, we would define multiple targets, e.g. dev, test, and prod.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Even though this code is shared as a GitHub repository, the Azure DevOps pipeline will only work with an Azure DevOps Repo. You must create this repo yourself.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    
    <item>
      <title>Hosting Great Expectations Data Docs on Azure Blob Storage</title>
      <link>https://gontcharov.eu/posts/blog/great-expectations-azure/</link>
      <pubDate>Thu, 20 Feb 2025 18:17:42 +0100</pubDate>
      
      <guid>https://gontcharov.eu/posts/blog/great-expectations-azure/</guid>
      <description>&lt;h1 id=&#34;resources&#34;&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Check out the complete code on &lt;a href=&#34;https://github.com/gontcharovd/great_expectations_azure&#34;&gt;GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Browse the GX Data Doc on &lt;a href=&#34;https://gxstorageacc.blob.core.windows.net/$web/index.html&#34;&gt;Azure Blob Storage&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;use-case&#34;&gt;Use Case&lt;/h1&gt;
&lt;p&gt;Last week I &lt;a href=&#34;https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case&#34;&gt;explored Soda as a data quality testing framework&lt;/a&gt; for my large enterprise client. This week I&amp;rsquo;m exploring a more mature alternative called &lt;a href=&#34;https://greatexpectations.io/&#34;&gt;Great Expectations&lt;/a&gt;, or GX for short.&lt;/p&gt;
&lt;p&gt;GX generates neat HTML reports called &lt;a href=&#34;https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs/&#34;&gt;Data Docs&lt;/a&gt; that give an overview of your data quality test results. The client wants to share these reports with the team - but not with the world! As the client is already using Azure, hosting the report files on Azure Blob Storage seems like a good solution.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="resources">Resources</h1>
<ul>
<li>Check out the complete code on <a href="https://github.com/gontcharovd/great_expectations_azure">GitHub</a>.</li>
<li>Browse the GX Data Doc on <a href="https://gxstorageacc.blob.core.windows.net/$web/index.html">Azure Blob Storage</a>.</li>
</ul>
<h1 id="use-case">Use Case</h1>
<p>Last week I <a href="https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case">explored Soda as a data quality testing framework</a> for my large enterprise client. This week I&rsquo;m exploring a more mature alternative called <a href="https://greatexpectations.io/">Great Expectations</a>, or GX for short.</p>
<p>GX generates neat HTML reports called <a href="https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs/">Data Docs</a> that give an overview of your data quality test results. The client wants to share these reports with the team - but not with the world! As the client is already using Azure, hosting the report files on Azure Blob Storage seems like a good solution.</p>
<h1 id="why-azure-blob-storage">Why Azure Blob Storage?</h1>
<h2 id="1-easy-implementation">1. Easy Implementation</h2>
<p>Installing new solutions at enterprises is notoriously difficult. There&rsquo;s often a long procurement process and many budget-approval hoops to jump through. Because the client is already using Azure, it&rsquo;s only a small step to provision an additional Blob Container.</p>
<h2 id="2-familiar-access-control">2. Familiar Access Control</h2>
<p>As the Blob Container becomes part of the client&rsquo;s Azure ecosystem, the existing IT team can easily manage access to the Data Docs using RBAC<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>. There are no new security measures to implement.</p>
<h1 id="solution">Solution</h1>
<p>This solution is a direct implementation of the GX documentation about <a href="https://legacy.017.docs.greatexpectations.io/docs/0.16.16/guides/setup/configuring_data_docs/how_to_host_and_share_data_docs_on_azure_blob_storage/">how to host and share Data Docs on Azure Blob Storage</a>.</p>
<h2 id="sample-data">Sample Data</h2>
<p>The code defines two expectations for the following simple Pandas dataframe:</p>
<ol>
<li>The <code>NumericColumn</code> may only have values between <strong>0</strong> and <strong>90</strong>.</li>
<li>The <code>TextColumn</code> may only have values from <strong>&ldquo;Item 1&rdquo;</strong> to <strong>&ldquo;Item 10&rdquo;</strong>.</li>
</ol>
<pre tabindex="0"><code class="language-stdout" data-lang="stdout">   NumericColumn TextColumn
0             10     Item 1
1             20     Item 2
2             30     Item 3
3             40     Item 4
4             50     Item 5
5             60     Item 6
6             70     Item 7
7             80     Item 8
8             90     Item 9
9            100    Item 10
</code></pre><h2 id="code">Code</h2>
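<p>Before getting to the GX configuration, here is a rough plain-Python illustration of what the two expectations from the Sample Data section check. It deliberately avoids the GX API and is not taken from my repository; the names <code>ROWS</code>, <code>ALLOWED_ITEMS</code> and <code>failing_rows</code> are made up for this sketch.</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Plain-Python stand-in for the two expectations; this is not the GX API.
# Each tuple mirrors one row of the sample data: (NumericColumn, TextColumn).
ROWS = [(10 * i, f"Item {i}") for i in range(1, 11)]
ALLOWED_ITEMS = {f"Item {i}" for i in range(1, 11)}


def failing_rows(rows):
    """Return (row index, reason) for every row that breaks a rule."""
    failures = []
    for index, (number, text) in enumerate(rows):
        if number not in range(0, 91):  # expectation 1: between 0 and 90
            failures.append((index, "NumericColumn out of range"))
        if text not in ALLOWED_ITEMS:  # expectation 2: "Item 1" to "Item 10"
            failures.append((index, "TextColumn not allowed"))
    return failures


print(failing_rows(ROWS))
</code></pre>
<p>With the sample data, only row 9 (value 100) breaks the numeric rule, so at least one failing check should show up in the Data Docs.</p>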
<p>I won&rsquo;t repeat the steps from the GX documentation in detail here. Rather, I&rsquo;ll highlight a couple of important points:</p>
<ul>
<li>I configured the following <code>azure_blob_storage</code> site definition in my <a href="https://github.com/gontcharovd/great_expectations_azure/blob/1788ea3b2f1195d2290ff2c8a4c6f32b0702eb4b/gx/great_expectations.yml#L83">great_expectations.yml</a> file.</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="w">  </span><span class="nt">azure_blob_storage</span><span class="p">:</span><span class="w">  </span><span class="c"># this is a user-selected name - you can select your own</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">SiteBuilder</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">store_backend</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">TupleAzureBlobStoreBackend</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">container</span><span class="p">:</span><span class="w"> </span><span class="l">\$web</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">connection_string</span><span class="p">:</span><span class="w"> </span><span class="l">${AZURE_STORAGE_CONNECTION_STRING}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">site_index_builder</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">DefaultSiteIndexBuilder</span><span class="w">
</span></span></span></code></pre></div><ul>
<li>
<p>If you are running the <code>setup_gx.py</code> file for the first time, don&rsquo;t forget to set <code>do_config = True</code> and update the path in <code>CONTEXT_DIR</code> to match your system.</p>
</li>
<li>
<p>Don&rsquo;t forget to provide your <a href="https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string">Azure Blob Storage connection string</a> for the <code>connection_string</code> option. The configuration above references it through the <code>${AZURE_STORAGE_CONNECTION_STRING}</code> placeholder, so the secret itself is not included in my repository.</p>
</li>
</ul>
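<p>A quick sanity check of the connection string before running GX can save some debugging. The sketch below uses only the standard library and assumes the variable name <code>AZURE_STORAGE_CONNECTION_STRING</code> from the configuration above. It also assumes an account-key style connection string; strings based on a shared access signature use different keys. The helpers <code>parse_connection_string</code> and <code>missing_keys</code> are hypothetical names, not part of GX or the Azure SDK.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import os

# Assumption: an account-key style connection string is used.
REQUIRED_KEYS = {"AccountName", "AccountKey"}


def parse_connection_string(value):
    """Split 'Key=Value;Key=Value' pairs into a dict."""
    parts = {}
    for pair in value.split(";"):
        if not pair:
            continue  # tolerate empty input or a trailing semicolon
        # Split on the first '=' only: a base64 AccountKey may end in '='.
        key, _, val = pair.partition("=")
        parts[key] = val
    return parts


def missing_keys(value):
    """Return the required keys that are absent, in sorted order."""
    return sorted(REQUIRED_KEYS - set(parse_connection_string(value)))


# Report on the environment variable without printing the secret itself.
missing = missing_keys(os.environ.get("AZURE_STORAGE_CONNECTION_STRING", ""))
print("missing keys:", missing or "none")
</code></pre>
<p>The check only looks at the shape of the string. It cannot tell whether the key is still valid or whether the storage account exists.</p>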
<h1 id="result">Result</h1>
<p>I think the result looks pretty neat. The <em>index.html</em> file, along with the other files, is created in a Blob Container named <strong>$web</strong>:</p>
<p><img loading="lazy" src="/posts/blog/great-expectations-azure/container.png" type="" alt=""  /></p>
<p>The final result can be accessed by anyone on the internet <a href="https://gxstorageacc.blob.core.windows.net/$web/index.html#">here</a>. We see that five GX runs have been made, resulting in five Validation Results.</p>
<p><img loading="lazy" src="/posts/blog/great-expectations-azure/report.png" type="" alt=""  /></p>
<p>Note how the Expectation Suites tab gives more information about the Expectation Suites, in this case <strong>panda_expectations</strong>. This feature gives business users clear information about how the data they&rsquo;re using has been tested.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Overall, I quite like the Great Expectations framework so far. Compared to <a href="https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case">Soda Core</a>, there are a couple of points I prefer about GX:</p>
<ol>
<li>GX has more open-source features, e.g. Data Docs.</li>
<li>Although GX is more convoluted, organizing Expectations into Suites helps maintain order as the project grows.</li>
<li>The community behind GX seems sufficiently active.</li>
<li>No data has to be shared outside of the company to access certain features, as opposed to Soda Cloud.</li>
</ol>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Read more about role-based access control (RBAC) on Azure <a href="https://learn.microsoft.com/en-us/azure/role-based-access-control/overview">here</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
