In a meeting today:

If we make our data completeness test sufficiently granular, it will help us discover the root cause.

No no no.

A data completeness test, e.g. “sensor TT01-PVO02 delivers yesterday’s data for column X”, only shows that data is missing. It says nothing about why the data is missing, no matter how granular the test:

  • Was the asset offline?
  • Did the historian run out of disk space?
  • Did the certificate on the jump host’s proxy expire?
  • Did the data land in the raw (bronze) layer but never get processed into the intermediate (silver) layer?
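The point can be made concrete: however precise the predicate, a completeness test reduces to a boolean. A minimal sketch, assuming a simple list-of-dicts data layout (the function name and sample data are hypothetical, only the tag `TT01-PVO02` and column `X` come from the example above):

```python
from datetime import date, timedelta

def completeness_test(readings, tag, column, day):
    """Granular completeness check: did `tag` deliver any non-null
    values for `column` on `day`? The answer is only pass/fail --
    it cannot distinguish an offline asset from a full historian
    disk, an expired certificate, or a stalled bronze-to-silver job."""
    return any(
        r["tag"] == tag and r["date"] == day and r.get(column) is not None
        for r in readings
    )

# Hypothetical sample: the sensor delivered data two days ago, but not yesterday.
yesterday = date.today() - timedelta(days=1)
readings = [
    {"tag": "TT01-PVO02", "date": yesterday - timedelta(days=1), "X": 21.7},
]

print(completeness_test(readings, "TT01-PVO02", "X", yesterday))  # False
```

Every failure cause listed above produces the exact same `False`, which is why granularity alone cannot reveal the root cause.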

I like to split this problem into diagnosis and troubleshooting. The tests are diagnosis: they show what is wrong. The subsequent (often manual) troubleshooting reveals why it went wrong.

I have not yet seen a practical automated solution for the troubleshooting phase, i.e. one that can cross the IT–OT boundary.