In a meeting today:
If we make our data completeness test sufficiently granular, it will help us discover the root cause.
No no no.
A data completeness test, e.g. “sensor TT01-PVO02 delivers yesterday’s data for column X”, only highlights that data isn’t there. It says nothing about why the data isn’t there, no matter how granular the test is:
- Was the asset offline?
- Did the historian run out of disk space?
- Did the certificate on the jump host’s proxy expire?
- Did the data land in the raw (bronze) layer but never get processed onto the intermediate (silver) layer?
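A minimal sketch of what such a test amounts to, assuming the data arrives as timestamped rows per sensor tag (the function name and row shape here are hypothetical, not from any particular framework):

```python
from datetime import date, timedelta

def completeness_test(rows, sensor_tag, column):
    """Return True if yesterday's data for `sensor_tag` has a non-null `column`.

    rows: list of dicts like {"tag": "TT01-PVO02", "ts": date, "X": 42.0}.
    Note what this test *cannot* do: on failure it only reports that the
    data is absent; it carries no information about the root cause.
    """
    yesterday = date.today() - timedelta(days=1)
    hits = [
        r for r in rows
        if r["tag"] == sensor_tag
        and r["ts"] == yesterday
        and r.get(column) is not None
    ]
    return len(hits) > 0
```

However granular you make `sensor_tag` and `column`, a failing result is still just a boolean: it tells you *that*, never *why*.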
I like to split this problem into diagnosis and troubleshooting. The tests are the diagnosis: they show what is wrong. The subsequent (often manual) troubleshooting reveals why it went wrong.
I have not yet seen a practical automated solution for the troubleshooting phase (that can cross the IT-OT boundary).