<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Great Expectations on Denis Gontcharov</title>
    <link>https://gontcharov.eu/tags/great-expectations/</link>
    <description>Recent content in Great Expectations on Denis Gontcharov</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Thu, 20 Feb 2025 18:17:42 +0100</lastBuildDate><atom:link href="https://gontcharov.eu/tags/great-expectations/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Hosting Great Expectations Data Docs on Azure Blob Storage</title>
      <link>https://gontcharov.eu/posts/blog/great-expectations-azure/</link>
      <pubDate>Thu, 20 Feb 2025 18:17:42 +0100</pubDate>
      
      <guid>https://gontcharov.eu/posts/blog/great-expectations-azure/</guid>
      <description>&lt;h1 id=&#34;resources&#34;&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Check out the complete code on &lt;a href=&#34;https://github.com/gontcharovd/great_expectations_azure&#34;&gt;GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Browse the GX Data Doc on &lt;a href=&#34;https://gxstorageacc.blob.core.windows.net/$web/index.html&#34;&gt;Azure Blob Storage&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;use-case&#34;&gt;Use Case&lt;/h1&gt;
&lt;p&gt;Last week I &lt;a href=&#34;https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case&#34;&gt;explored Soda as a data quality testing framework&lt;/a&gt; for my large enterprise client. This week I&amp;rsquo;m exploring a more mature alternative called &lt;a href=&#34;https://greatexpectations.io/&#34;&gt;Great Expectations&lt;/a&gt;, or GX for short.&lt;/p&gt;
&lt;p&gt;GX generates neat HTML reports called &lt;a href=&#34;https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs/&#34;&gt;Data Docs&lt;/a&gt; that give an overview of your data quality test results. The client wants to share these reports with the team - but not with the world! As the client is already using Azure, hosting the report files on Azure Blob Storage seems like a good solution.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="resources">Resources</h1>
<ul>
<li>Check out the complete code on <a href="https://github.com/gontcharovd/great_expectations_azure">GitHub</a>.</li>
<li>Browse the GX Data Doc on <a href="https://gxstorageacc.blob.core.windows.net/$web/index.html">Azure Blob Storage</a>.</li>
</ul>
<h1 id="use-case">Use Case</h1>
<p>Last week I <a href="https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case">explored Soda as a data quality testing framework</a> for my large enterprise client. This week I&rsquo;m exploring a more mature alternative called <a href="https://greatexpectations.io/">Great Expectations</a>, or GX for short.</p>
<p>GX generates neat HTML reports called <a href="https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs/">Data Docs</a> that give an overview of your data quality test results. The client wants to share these reports with the team - but not with the world! As the client is already using Azure, hosting the report files on Azure Blob Storage seems like a good solution.</p>
<h1 id="why-azure-blob-storage">Why Azure Blob Storage?</h1>
<h2 id="1-easy-implementation">1. Easy Implementation</h2>
<p>Installing new solutions at enterprises is notoriously difficult. There&rsquo;s often a long procurement process and many budget-approval hoops to jump through. Because the client is already using Azure, it&rsquo;s only a small step to provision an additional Blob Container.</p>
<h2 id="2-familiar-access-control">2. Familiar Access Control</h2>
<p>As the Blob Container becomes part of the client&rsquo;s Azure ecosystem, the existing IT team can easily manage access to the Data Docs using RBAC<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>. There are no new security measures to implement.</p>
<h1 id="solution">Solution</h1>
<p>This solution is a direct implementation of the GX documentation about <a href="https://legacy.017.docs.greatexpectations.io/docs/0.16.16/guides/setup/configuring_data_docs/how_to_host_and_share_data_docs_on_azure_blob_storage/">how to host and share Data Docs on Azure Blob Storage</a>.</p>
<h2 id="sample-data">Sample Data</h2>
<p>The code defines two expectations for the simple pandas DataFrame shown below:</p>
<ol>
<li>The <code>NumericColumn</code> may only have values between <strong>0</strong> and <strong>90</strong>.</li>
<li>The <code>TextColumn</code> may only have values from <strong>&ldquo;Item 1&rdquo;</strong> to <strong>&ldquo;Item 10&rdquo;</strong>.</li>
</ol>
<pre tabindex="0"><code class="language-stdout" data-lang="stdout">   NumericColumn TextColumn
0             10     Item 1
1             20     Item 2
2             30     Item 3
3             40     Item 4
4             50     Item 5
5             60     Item 6
6             70     Item 7
7             80     Item 8
8             90     Item 9
9            100    Item 10
</code></pre><h2 id="code">Code</h2>
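<p>To make the two expectations concrete, here is a minimal sketch in plain pandas rather than GX itself. It rebuilds the sample dataframe and applies the same two rules. The variable names are illustrative and not taken from the repository, which may define the checks differently. GX ships built-in expectations for this kind of rule, such as <code>expect_column_values_to_be_between</code> and <code>expect_column_values_to_be_in_set</code>.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import pandas as pd

# Rebuild the sample dataframe shown above: 10, 20, ..., 100 and "Item 1" ... "Item 10".
df = pd.DataFrame(
    {
        "NumericColumn": [10 * i for i in range(1, 11)],
        "TextColumn": [f"Item {i}" for i in range(1, 11)],
    }
)

# Rule 1: NumericColumn may only hold values between 0 and 90 (bounds included).
numeric_ok = df["NumericColumn"].between(0, 90)

# Rule 2: TextColumn may only hold values from "Item 1" to "Item 10".
allowed_items = [f"Item {i}" for i in range(1, 11)]
text_ok = df["TextColumn"].isin(allowed_items)

# Show the rows that break each rule.
print(df[~numeric_ok])
print(df[~text_ok])
</code></pre>
<p>With the sample data, only the last row (value 100) breaks the numeric rule, while every text value is in the allowed set.</p>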
<p>I won&rsquo;t repeat the steps from the GX documentation in detail here. Rather, I&rsquo;ll highlight a few important points:</p>
<ul>
<li>I configured the following <code>azure_blob_storage</code> site definition in my <a href="https://github.com/gontcharovd/great_expectations_azure/blob/1788ea3b2f1195d2290ff2c8a4c6f32b0702eb4b/gx/great_expectations.yml#L83">great_expectations.yml</a> file.</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="w">  </span><span class="nt">azure_blob_storage</span><span class="p">:</span><span class="w">  </span><span class="c"># this is a user-selected name - you can select your own</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">SiteBuilder</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">store_backend</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">TupleAzureBlobStoreBackend</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">container</span><span class="p">:</span><span class="w"> </span><span class="l">\$web</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">connection_string</span><span class="p">:</span><span class="w"> </span><span class="l">${AZURE_STORAGE_CONNECTION_STRING}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">site_index_builder</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">DefaultSiteIndexBuilder</span><span class="w">
</span></span></span></code></pre></div><ul>
<li>
<p>If you are running the <code>setup_gx.py</code> file for the first time, don&rsquo;t forget to set <code>do_config = True</code> and update the path in <code>CONTEXT_DIR</code> to match your system.</p>
</li>
<li>
<p>Don&rsquo;t forget to provide your <a href="https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string">Azure Blob Storage connection string</a> for the <code>connection_string</code> option. The site definition above expects it in the <code>AZURE_STORAGE_CONNECTION_STRING</code> variable. As this value is secret, it&rsquo;s not included in my repository.</p>
</li>
</ul>
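<p>Because the site definition refers to <code>${AZURE_STORAGE_CONNECTION_STRING}</code>, it can help to check that the variable is set before building the Data Docs. The sketch below is a hypothetical helper, not part of the repository or of GX: the function name <code>read_connection_string</code> and the error message are assumptions made for illustration.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import os

# Name of the variable referenced in the site definition above.
ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"


def read_connection_string(environ=os.environ):
    """Return the connection string, or raise if it is missing or blank.

    Hypothetical helper: `environ` defaults to the process environment,
    but any mapping can be passed in, which makes the check easy to test.
    """
    value = environ.get(ENV_VAR, "").strip()
    if not value:
        raise RuntimeError(f"{ENV_VAR} is not set; export it before running GX")
    return value
</code></pre>
<p>Calling <code>read_connection_string()</code> at the top of a setup script would stop the run early with a readable message instead of failing later during the upload.</p>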
<h1 id="result">Result</h1>
<p>I think the result looks pretty neat. The <em>index.html</em> file, along with the other files, is created in a Blob Container named <strong>$web</strong>:</p>
<p><img loading="lazy" src="/posts/blog/great-expectations-azure/container.png" type="" alt=""  /></p>
<p>The final result can be accessed by anyone on the internet <a href="https://gxstorageacc.blob.core.windows.net/$web/index.html#">here</a>. We see that five GX runs have been made, resulting in five Validation Results.</p>
<p><img loading="lazy" src="/posts/blog/great-expectations-azure/report.png" type="" alt=""  /></p>
<p>Note how the Expectation Suites tab gives more information about the Expectation Suites, in this case <strong>panda_expectations</strong>. This feature gives business users clear information about how the data they&rsquo;re using has been tested.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Overall, I quite like the Great Expectations framework so far. Comparing it to <a href="https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case">Soda Core</a>, there are a couple of points I prefer about GX:</p>
<ol>
<li>GX has more open-source features, e.g. Data Docs.</li>
<li>Although GX is more convoluted, the organization of Expectations into Suites makes it possible to maintain order as the project grows.</li>
<li>The community behind GX seems sufficiently active.</li>
<li>No data has to be shared outside of the company to access certain features, as opposed to Soda Cloud.</li>
</ol>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Read more about Role Based Access Control (RBAC) on Azure <a href="https://learn.microsoft.com/en-us/azure/role-based-access-control/overview">here</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
