<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Data Quality on Denis Gontcharov</title>
    <link>https://gontcharov.eu/tags/data-quality/</link>
    <description>Recent content in Data Quality on Denis Gontcharov</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Tue, 04 Nov 2025 17:16:58 +0100</lastBuildDate><atom:link href="https://gontcharov.eu/tags/data-quality/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>🎥 Watts in Your Data Podcast E7: Datatude with Jim Gavigan</title>
      <link>https://gontcharov.eu/posts/podcast/e07-jim-gavigan/</link>
      <pubDate>Tue, 04 Nov 2025 17:16:58 +0100</pubDate>
      
      <guid>https://gontcharov.eu/posts/podcast/e07-jim-gavigan/</guid>
      <description>&lt;p&gt;In this episode, Denis sits down with Jim Gavigan, founder of Industrial Insight, to discuss Datatude, a framework for measuring your organization&amp;rsquo;s readiness to leverage industrial data effectively.&lt;/p&gt;
&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/uL8asFIKxLo?autoplay=0&amp;amp;controls=1&amp;amp;end=0&amp;amp;loop=0&amp;amp;mute=0&amp;amp;start=0&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; title=&#34;YouTube video&#34;&gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;h1 id=&#34;about-the-guest&#34;&gt;About the Guest:&lt;/h1&gt;
&lt;p&gt;Jim Gavigan brings 30 years of experience in industrial manufacturing, from vibration analysis and control systems to working at Rockwell Automation and OSIsoft. He founded Industrial Insight in 2016 to help companies maximize the value of their time series data.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p>In this episode, Denis sits down with Jim Gavigan, founder of Industrial Insight, to discuss Datatude, a framework for measuring your organization&rsquo;s readiness to leverage industrial data effectively.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/uL8asFIKxLo?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<h1 id="about-the-guest">About the Guest:</h1>
<p>Jim Gavigan brings 30 years of experience in industrial manufacturing, from vibration analysis and control systems to working at Rockwell Automation and OSIsoft. He founded Industrial Insight in 2016 to help companies maximize the value of their time series data.</p>
<h1 id="key-topics">Key Topics:</h1>
<ul>
<li>What is Datatude and why it matters</li>
<li>The five dimensions: Data, Technology, People, Priorities, and Culture</li>
<li>Why companies struggle to build sophisticated analytics on poor foundations</li>
<li>The importance of starting small with concrete, achievable projects</li>
<li>Common pitfalls: prioritizing technology over people and process</li>
<li>How to scale data initiatives across multiple plants</li>
<li>Building the right team and culture for data success</li>
</ul>
<h1 id="key-takeaway">Key Takeaway:</h1>
<p>Stop trying to implement advanced AI and analytics on crappy data. Focus on getting the basics right first: clean data, proper documentation, the right people, and a culture that supports data-driven decisions.</p>
<h1 id="connect-with-jim">Connect with Jim:</h1>
<ul>
<li><a href="https://www.linkedin.com/in/jimgavigan/">LinkedIn</a></li>
<li><a href="https://www.industrialinsightinc.com/">Website</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>🎧 Watts in Your Data Podcast E3: Industrial Time-Series Data Quality and Reliability with Timeseer</title>
      <link>https://gontcharov.eu/posts/podcast/e03-timeseer/</link>
      <pubDate>Tue, 06 May 2025 11:28:43 +0200</pubDate>
      
      <guid>https://gontcharov.eu/posts/podcast/e03-timeseer/</guid>
      <description>&lt;p&gt;Data good enough for operations is not necessarily analytics-ready.&lt;/p&gt;
&lt;p&gt;Exactly one month ago I published the first episode of my Watts in Your Data Podcast, which covers, in my opinion, a topic of critical importance that is all too often overlooked, especially with all the buzz around AI.&lt;/p&gt;
&lt;p&gt;In the most recent episode, I had the pleasure of inviting guest speaker Thomas Dhollander, co-founder of Timeseer.AI. Together we explored critical challenges in industrial time series data reliability and observability.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p>Data good enough for operations is not necessarily analytics-ready.</p>
<p>Exactly one month ago I published the first episode of my Watts in Your Data Podcast, which covers, in my opinion, a topic of critical importance that is all too often overlooked, especially with all the buzz around AI.</p>
<p>In the most recent episode, I had the pleasure of inviting guest speaker Thomas Dhollander, co-founder of Timeseer.AI. Together we explored critical challenges in industrial time series data reliability and observability.</p>
<iframe width="100%" height="180" frameborder="no" scrolling="no" seamless="" src="https://share.transistor.fm/e/7cc3dc6a"></iframe>
<h1 id="key-takeaways">Key takeaways:</h1>
<ul>
<li>Data quality is not just a technical issue — it&rsquo;s a people and process problem, deeply tied to governance and ownership.</li>
<li>Data management at many companies is still reactive — fixing issues only after models break or KPIs look suspicious. When companies scale their data-driven operations, they need to turn to proactive data management to avoid ending up in firefighting mode.</li>
<li>Data maturity varies by company and by industry — utilities and pharma often lead, some other industries may still view data as a byproduct.</li>
<li>Data should be treated like a product — with quality checks, documentation, and accountability — especially as you scale analytics. This is also true for OT data.</li>
<li>AI needs data quality — ML and AI depend on quality inputs and sensor drift or misconfigured tags can quietly corrupt your entire model output. Interestingly, AI is also a key enabler in scaling data quality.</li>
<li>Moving data to the cloud introduces new risks — missing context, inconsistent pipelines, and ownership confusion.</li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>🎧 Watts in Your Data Podcast E1: Introduction</title>
      <link>https://gontcharov.eu/posts/podcast/e01-introduction/</link>
      <pubDate>Tue, 01 Apr 2025 16:08:03 +0200</pubDate>
      
      <guid>https://gontcharov.eu/posts/podcast/e01-introduction/</guid>
      <description>&lt;iframe width=&#34;100%&#34; height=&#34;180&#34; frameborder=&#34;no&#34; scrolling=&#34;no&#34; seamless=&#34;&#34; src=&#34;https://share.transistor.fm/e/dc4005d2&#34;&gt;&lt;/iframe&gt;
&lt;p&gt;Welcome to my podcast! In this very first episode I introduce the topics of this podcast and explain my background in data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://gontcharovd.transistor.fm/&#34;&gt;Follow the show&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gontcharov.eu/&#34;&gt;About Denis Gontcharov&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <content:encoded><![CDATA[<iframe width="100%" height="180" frameborder="no" scrolling="no" seamless="" src="https://share.transistor.fm/e/dc4005d2"></iframe>
<p>Welcome to my podcast! In this very first episode I introduce the topics of this podcast and explain my background in data.</p>
<ul>
<li><a href="https://gontcharovd.transistor.fm/">Follow the show</a></li>
<li><a href="https://gontcharov.eu/">About Denis Gontcharov</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>🎥 Testing Data Quality with Soda Core in Databricks</title>
      <link>https://gontcharov.eu/posts/youtube/databricks-soda-core/</link>
      <pubDate>Sat, 29 Mar 2025 11:13:27 +0100</pubDate>
      
      <guid>https://gontcharov.eu/posts/youtube/databricks-soda-core/</guid>
      <description>&lt;p&gt;In this video I demonstrate how to perform data quality checks on a Delta table in Databricks using &lt;a href=&#34;https://docs.soda.io/soda-core/overview-main.html&#34;&gt;Soda Core&lt;/a&gt;.&lt;/p&gt;
&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/cyasTwPdZEs?autoplay=0&amp;amp;controls=1&amp;amp;end=0&amp;amp;loop=0&amp;amp;mute=0&amp;amp;start=0&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; title=&#34;YouTube video&#34;&gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;br&gt;
&lt;p&gt;Soda Core is the open-source Python package developed by Soda. It can be compared to Great Expectations, but is much simpler in my opinion. I enjoy using Soda in my professional projects and will continue exploring this framework.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p>In this video I demonstrate how to perform data quality checks on a Delta table in Databricks using <a href="https://docs.soda.io/soda-core/overview-main.html">Soda Core</a>.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/cyasTwPdZEs?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<br>
<p>Soda Core is the open-source Python package developed by Soda. It can be compared to Great Expectations, but is much simpler in my opinion. I enjoy using Soda in my professional projects and will continue exploring this framework.</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Hosting Great Expectations Data Docs on Azure Blob Storage</title>
      <link>https://gontcharov.eu/posts/blog/great-expectations-azure/</link>
      <pubDate>Thu, 20 Feb 2025 18:17:42 +0100</pubDate>
      
      <guid>https://gontcharov.eu/posts/blog/great-expectations-azure/</guid>
      <description>&lt;h1 id=&#34;resources&#34;&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Check out the complete code on &lt;a href=&#34;https://github.com/gontcharovd/great_expectations_azure&#34;&gt;GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Browse the GX Data Doc on &lt;a href=&#34;https://gxstorageacc.blob.core.windows.net/$web/index.html&#34;&gt;Azure Blob Storage&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;use-case&#34;&gt;Use Case&lt;/h1&gt;
&lt;p&gt;Last week I &lt;a href=&#34;https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case&#34;&gt;explored Soda as a data quality testing framework&lt;/a&gt; for my large enterprise client. This week I&amp;rsquo;m exploring a more mature alternative called &lt;a href=&#34;https://greatexpectations.io/&#34;&gt;Great Expectations&lt;/a&gt;, or GX for short.&lt;/p&gt;
&lt;p&gt;GX generates neat HTML reports called &lt;a href=&#34;https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs/&#34;&gt;Data Docs&lt;/a&gt; that give an overview of your data quality test results. The client wants to share these reports with the team - but not with the world! As the client is already using Azure, hosting the report files on Azure Blob Storage seems like a good solution.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="resources">Resources</h1>
<ul>
<li>Check out the complete code on <a href="https://github.com/gontcharovd/great_expectations_azure">GitHub</a>.</li>
<li>Browse the GX Data Doc on <a href="https://gxstorageacc.blob.core.windows.net/$web/index.html">Azure Blob Storage</a>.</li>
</ul>
<h1 id="use-case">Use Case</h1>
<p>Last week I <a href="https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case">explored Soda as a data quality testing framework</a> for my large enterprise client. This week I&rsquo;m exploring a more mature alternative called <a href="https://greatexpectations.io/">Great Expectations</a>, or GX for short.</p>
<p>GX generates neat HTML reports called <a href="https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs/">Data Docs</a> that give an overview of your data quality test results. The client wants to share these reports with the team - but not with the world! As the client is already using Azure, hosting the report files on Azure Blob Storage seems like a good solution.</p>
<h1 id="why-azure-blob-storage">Why Azure Blob Storage?</h1>
<h2 id="1-easy-implementation">1. Easy Implementation</h2>
<p>Installing new solutions at enterprises is notoriously difficult. There&rsquo;s often a long procurement process and many budget-approval hoops to jump through. Because the client is already using Azure, it&rsquo;s only a small step to provision an additional Blob Container.</p>
<h2 id="2-familiar-access-control">2. Familiar Access Control</h2>
<p>As the Blob Container becomes part of the client&rsquo;s Azure ecosystem, the existing IT team can easily manage access to the Data Docs using RBAC<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>. There are no new security measures to implement.</p>
<h1 id="solution">Solution</h1>
<p>This solution is a direct implementation of the GX documentation about <a href="https://legacy.017.docs.greatexpectations.io/docs/0.16.16/guides/setup/configuring_data_docs/how_to_host_and_share_data_docs_on_azure_blob_storage/">how to host and share Data Docs on Azure Blob Storage</a>.</p>
<h2 id="sample-data">Sample Data</h2>
<p>The code defines two expectations for the following simple Pandas dataframe:</p>
<ol>
<li>The <code>NumericColumn</code> may only have values between <strong>0</strong> and <strong>90</strong>.</li>
<li>The <code>TextColumn</code> may only have values from <strong>&ldquo;Item 1&rdquo;</strong> to <strong>&ldquo;Item 10&rdquo;</strong>.</li>
</ol>
<pre tabindex="0"><code class="language-stdout" data-lang="stdout">   NumericColumn TextColumn
0             10     Item 1
1             20     Item 2
2             30     Item 3
3             40     Item 4
4             50     Item 5
5             60     Item 6
6             70     Item 7
7             80     Item 8
8             90     Item 9
9            100    Item 10
</code></pre><h2 id="code">Code</h2>
<p>I won&rsquo;t repeat those steps in detail here. Rather, I&rsquo;ll highlight a couple of important points:</p>
<ul>
<li>I configured the following <code>azure_blob_storage</code> site definition in my <a href="https://github.com/gontcharovd/great_expectations_azure/blob/1788ea3b2f1195d2290ff2c8a4c6f32b0702eb4b/gx/great_expectations.yml#L83">great_expectations.yml</a> file.</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="w">  </span><span class="nt">azure_blob_storage</span><span class="p">:</span><span class="w">  </span><span class="c"># this is a user-selected name - you can select your own</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">SiteBuilder</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">store_backend</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">TupleAzureBlobStoreBackend</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">container</span><span class="p">:</span><span class="w"> </span><span class="l">\$web</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="nt">connection_string</span><span class="p">:</span><span class="w"> </span><span class="l">${AZURE_STORAGE_CONNECTION_STRING}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">site_index_builder</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">class_name</span><span class="p">:</span><span class="w"> </span><span class="l">DefaultSiteIndexBuilder</span><span class="w">
</span></span></span></code></pre></div><ul>
<li>
<p>If you are running the <code>setup_gx.py</code> file for the first time, don&rsquo;t forget to set <code>do_config = True</code> and update the path in <code>CONTEXT_DIR</code> to match your system.</p>
</li>
<li>
<p>Don&rsquo;t forget to provide your <a href="https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string">Azure Blob Storage connection string</a> for the <code>connection_string</code> option, here through the <code>AZURE_STORAGE_CONNECTION_STRING</code> environment variable. As this value is secret, it&rsquo;s not included in my repository.</p>
</li>
</ul>
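<p>As a small safeguard, the environment variable referenced in the site definition can be checked before GX runs. The sketch below is only an illustration and not part of the repository: the helper name <code>require_connection_string</code> is hypothetical, and it assumes nothing beyond the variable being called <code>AZURE_STORAGE_CONNECTION_STRING</code>, as in the YAML above.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import os

ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"  # name used in great_expectations.yml


def require_connection_string(environ=os.environ):
    """Return the connection string, or raise if it is missing or empty.

    Hypothetical helper: it only reads the environment and never prints the secret.
    """
    value = environ.get(ENV_VAR, "").strip()
    if not value:
        raise RuntimeError(f"Set {ENV_VAR} before building the Data Docs.")
    return value
</code></pre>
<p>Calling <code>require_connection_string()</code> at the top of a setup script should make a missing secret fail early with a clear message, rather than somewhere deeper in the run.</p>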
<h1 id="result">Result</h1>
<p>I think the result looks pretty neat. The <em>index.html</em> file, along with the other files, is created in the <strong>$web</strong> Blob Container:</p>
<p><img loading="lazy" src="/posts/blog/great-expectations-azure/container.png" type="" alt=""  /></p>
<p>The final result can be accessed by anyone on the internet <a href="https://gxstorageacc.blob.core.windows.net/$web/index.html#">here</a>. We see that five GX runs have been made, resulting in five Validation Results.</p>
<p><img loading="lazy" src="/posts/blog/great-expectations-azure/report.png" type="" alt=""  /></p>
<p>Note how the Expectation Suites tab gives more information about the Expectation Suites, in this case <strong>panda_expectations</strong>. This feature gives business users clear information about how the data they&rsquo;re using has been tested.</p>
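<p>To make the two expectations from the Sample Data section concrete, here is a sketch in plain Python, without GX, of what they check. The function name <code>find_unexpected</code> and the constants are hypothetical, and this is not how GX implements its expectations.</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Sample data from the "Sample Data" section
numeric_column = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
text_column = [f"Item {i}" for i in range(1, 11)]

# Allowed values, as stated by the two expectations
ALLOWED_NUMERIC = range(0, 91)  # integers 0 to 90 inclusive (integer data only)
ALLOWED_TEXT = {f"Item {i}" for i in range(1, 11)}  # "Item 1" to "Item 10"


def find_unexpected(values, allowed):
    """Return the values that fall outside the allowed set, in original order."""
    return [value for value in values if value not in allowed]


unexpected_numeric = find_unexpected(numeric_column, ALLOWED_NUMERIC)
unexpected_text = find_unexpected(text_column, ALLOWED_TEXT)

print(unexpected_numeric)  # [100]
print(unexpected_text)     # []
</code></pre>
<p>The value 100 in the last row violates the first expectation, so a validation run on this dataframe should report a failure for <code>NumericColumn</code>.</p>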
<h1 id="conclusion">Conclusion</h1>
<p>Overall, I quite like the Great Expectations framework so far. Compared to <a href="https://gontcharov.eu/posts/exploring-soda-data-quality-framework/#use-case">Soda Core</a>, there are a couple of points I prefer about GX:</p>
<ol>
<li>GX has more open-source features, e.g. Data Docs.</li>
<li>Although GX is more convoluted, the organization of Expectations into Suites helps maintain order as the project grows.</li>
<li>The community behind GX seems sufficiently active.</li>
<li>No data is shared outside of the company to access certain features, as opposed to Soda Cloud.</li>
</ol>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Read more about Role Based Access Control (RBAC) on Azure <a href="https://learn.microsoft.com/en-us/azure/role-based-access-control/overview">here</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    
    <item>
      <title>Exploring Soda Data Quality Testing on Databricks</title>
      <link>https://gontcharov.eu/posts/blog/exploring-soda-data-quality-framework/</link>
      <pubDate>Fri, 14 Feb 2025 10:08:53 +0100</pubDate>
      
      <guid>https://gontcharov.eu/posts/blog/exploring-soda-data-quality-framework/</guid>
      <description>&lt;h1 id=&#34;use-case&#34;&gt;Use Case&lt;/h1&gt;
&lt;p&gt;For my current engagement I&amp;rsquo;m tasked with developing an automated data quality framework for a large industrial enterprise in the renewable energy sector. The client has over a hundred independent SCADA systems from various vendors gathering energy production data. All this data has to flow into one central repository to be analyzed with Databricks. The client is obligated to ensure high data quality for contractual reporting to external parties. Failure to deliver incurs high financial penalties.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="use-case">Use Case</h1>
<p>For my current engagement I&rsquo;m tasked with developing an automated data quality framework for a large industrial enterprise in the renewable energy sector. The client has over a hundred independent SCADA systems from various vendors gathering energy production data. All this data has to flow into one central repository to be analyzed with Databricks. The client is obligated to ensure high data quality for contractual reporting to external parties. Failure to deliver incurs high financial penalties.</p>
<h1 id="soda">Soda</h1>
<p>Soda built their product offering, Soda Cloud, on top of the open-source Python package Soda Core<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>. In a nutshell, their solution allows you to:</p>
<ol>
<li>Test your data pipelines</li>
<li>Monitor data quality</li>
<li>Create reports</li>
</ol>
<p><img loading="lazy" src="/posts/blog/exploring-soda-data-quality-framework/soda_architecture.png" type="" alt=""  />
<em>Image source: <a href="https://www.soda.io/integrations/databricks">Exploring the Soda Data Quality Platform</a></em></p>
<h1 id="simple-workflow">Simple Workflow</h1>
<p>To test the framework, I defined a simple workflow that performs a number of data quality checks on two large datasets.</p>
<h2 id="installation">Installation</h2>
<p>First I install the framework in my Databricks notebook:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">pip install -i https://pypi.cloud.soda.io soda-spark-df
</span></span></code></pre></div><p>Next, I import and instantiate the <code>Scan</code> class, set a scan definition name, and set the data source to be tested to a Spark dataframe:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">soda.scan</span> <span class="kn">import</span> <span class="n">Scan</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Create a scan object</span>
</span></span><span class="line"><span class="cl"><span class="n">scan</span> <span class="o">=</span> <span class="n">Scan</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Set a scan definition</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Use a scan definition to configure which data to scan,</span>
</span></span><span class="line"><span class="cl"><span class="c1"># and when and how to execute the scan.</span>
</span></span><span class="line"><span class="cl"><span class="n">scan</span><span class="o">.</span><span class="n">set_scan_definition_name</span><span class="p">(</span><span class="s2">&#34;Data Completeness&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">scan</span><span class="o">.</span><span class="n">set_data_source_name</span><span class="p">(</span><span class="s2">&#34;spark_df&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Attach a Spark session</span>
</span></span><span class="line"><span class="cl"><span class="n">scan</span><span class="o">.</span><span class="n">add_spark_session</span><span class="p">(</span><span class="n">spark</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="configuration">Configuration</h2>
<p>I like the simplicity of Soda&rsquo;s YAML configuration that is stored in my Databricks workspace.</p>
<p>The connection to Soda Cloud&rsquo;s dashboard and datasources is configured in a <em>soda_conf.yml</em> file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-YAML" data-lang="YAML"><span class="line"><span class="cl"><span class="nt">soda_cloud</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">host</span><span class="p">:</span><span class="w"> </span><span class="l">cloud.soda.io</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">api_key_id</span><span class="p">:</span><span class="w"> </span><span class="l">2bcda34c-xxxx-xxxx-xxxx-xxxxxxxxxxxx</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">api_key_secret</span><span class="p">:</span><span class="w"> </span><span class="l">zuNLl1k55YM_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</span><span class="w">
</span></span></span></code></pre></div><p>The actual data quality checks are defined in a <em>checks.yml</em> file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-YAML" data-lang="YAML"><span class="line"><span class="cl"><span class="nt">checks for table_one</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">missing_count(site_id) = 0</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">Ensure there are no null values in the site ID column</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">duplicate_count(site_id) = 0</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">Ensure there are no duplicate site ID&#39;s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">checks for table_two</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">missing_count(device_id) = 0</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">Ensure there are no null values in the device ID column</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">num_devices &gt; 50</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">num_devices query</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">        SELECT COUNT(DISTINCT device_id) FROM table_two</span><span class="w">
</span></span></span></code></pre></div><p><em>Note how I&rsquo;m defining a custom data quality check with SQL code.</em></p>
<p>The configuration is then added to the <code>scan</code> instance:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">scan</span><span class="o">.</span><span class="n">add_sodacl_yaml_file</span><span class="p">(</span><span class="s2">&#34;/Workspace/Users/Denis/Soda/soda_settings/checks.yml&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">scan</span><span class="o">.</span><span class="n">add_configuration_yaml_file</span><span class="p">(</span><span class="s2">&#34;/Workspace/Users/Denis/Soda/soda_settings/soda_conf.yml&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="result">Result</h2>
<p>Executing the checks and viewing the logs is straightforward:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">scan</span><span class="o">.</span><span class="n">execute</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">scan</span><span class="o">.</span><span class="n">get_logs_text</span><span class="p">())</span>
</span></span></code></pre></div><p>This is an extract of the output:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">INFO   <span class="p">|</span> Scan summary:
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span> 2/4 checks PASSED:
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>     table_one in spark_df
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>       Ensure there are no null values in the site_id column <span class="o">[</span>PASSED<span class="o">]</span>
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>     table_one in spark_df
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>       Ensure there are no null values in the site_id column <span class="o">[</span>PASSED<span class="o">]</span>
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span> 2/4 checks FAILED:
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>     table_two in spark_df
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>       Ensure there are no duplicate site_id<span class="err">&#39;</span>s <span class="o">[</span>FAILED<span class="o">]</span>
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>         check_value: <span class="m">1</span>
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>     table_two
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>       num_devices &gt; <span class="m">50</span> <span class="o">[</span>FAILED<span class="o">]</span>
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span>         check_value: 41.0
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span> Oops! <span class="m">2</span> failures. <span class="m">0</span> warnings. <span class="m">0</span> errors. <span class="m">2</span> pass.
</span></span><span class="line"><span class="cl">INFO   <span class="p">|</span> Sending results to Soda Cloud
</span></span></code></pre></div><p>We see that only two of the four checks pass. The other two fail:</p>
<ul>
<li>The <code>site_id</code> values are not unique for every row.</li>
<li>There are fewer than 50 distinct <code>device_id</code> values.</li>
</ul>
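<p>For intuition, the four checks can be approximated in plain Python on a few made-up rows. Everything in this sketch is hypothetical, including the data and the helper names, and Soda&rsquo;s own counting rules (for example for duplicates) may differ in detail; only the pass/fail logic is meant to carry over.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import operator
from collections import Counter

# Made-up rows standing in for the two tables
table_one = [{"site_id": "A"}, {"site_id": "B"}, {"site_id": "B"}]
table_two = [{"device_id": "d1"}, {"device_id": "d2"}, {"device_id": "d2"}]


def missing_count(rows, column):
    """Number of rows where the column is None."""
    return sum(1 for row in rows if row[column] is None)


def duplicate_count(rows, column):
    """Number of distinct non-null values that occur more than once."""
    counts = Counter(row[column] for row in rows if row[column] is not None)
    return sum(1 for n in counts.values() if n != 1)


def distinct_count(rows, column):
    """Equivalent of SELECT COUNT(DISTINCT column), ignoring nulls."""
    return len({row[column] for row in rows if row[column] is not None})


# Each entry: (check name, measured value, comparison, threshold)
checks = [
    ("missing_count(site_id) = 0", missing_count(table_one, "site_id"), operator.eq, 0),
    ("duplicate_count(site_id) = 0", duplicate_count(table_one, "site_id"), operator.eq, 0),
    ("missing_count(device_id) = 0", missing_count(table_two, "device_id"), operator.eq, 0),
    ("num_devices > 50", distinct_count(table_two, "device_id"), operator.gt, 50),
]

results = {name: compare(value, threshold) for name, value, compare, threshold in checks}
for name, passed in results.items():
    print(f"{name}: {'PASSED' if passed else 'FAILED'}")
</code></pre>
<p>With these rows, two checks pass and two fail, which mirrors the shape of the scan summary above.</p>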
<h1 id="reporting">Reporting</h1>
<p>What I like about Soda Cloud is the out-of-the-box visualization of my test results. At a glance, I can see how my data quality varies from day to day, troubleshoot issues, view data, and even send alerts.</p>
<p><img loading="lazy" src="/posts/blog/exploring-soda-data-quality-framework/soda_dashboard.png" type="" alt=""  />
<em>Screenshot of my Soda Dashboard</em></p>
<h1 id="verdict">Verdict</h1>
<p>Soda looks promising, but it&rsquo;s too early to tell whether it fits my current use cases.</p>
<p>What I like:</p>
<ul>
<li>Minimalistic design</li>
<li>Intuitive configuration</li>
<li>Convenient dashboard</li>
</ul>
<p>What I <em>don&rsquo;t</em> like:</p>
<ul>
<li>Data is sent off-premises</li>
<li>Financial cost<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></li>
<li>Not a very mature product yet<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></li>
</ul>
<p>Overall, I&rsquo;m excited to continue exploring Soda as a potential alternative to <a href="https://greatexpectations.io/">Great Expectations</a>!</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Soda Core <a href="https://docs.soda.io/soda-core/overview-main.html">documentation</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Even though the cost is low, it still has to go through a lengthy procurement process.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>I spotted some errors and typos in the documentation.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
