In a meeting today:

If we make our data completeness test sufficiently granular, it will help us discover the root cause.

No no no.

A data completeness test, e.g. “sensor TT01-PVO02 delivers yesterday’s data for column X”, only shows that data is missing. It says nothing about why the data is missing, no matter how granular the test:

  • Was the asset offline?
  • Did the historian run out of disk space?
  • Did the certificate on the jump host’s proxy expire?
  • Did the data land in the raw (bronze) layer but never get processed into the intermediate (silver) layer?
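The point can be made concrete: however precise the predicate, a completeness test reduces to a boolean. A minimal sketch, assuming a simple list-of-dicts data layout (the function name and sample data are hypothetical, only the tag `TT01-PVO02` and column `X` come from the example above):

```python
from datetime import date, timedelta

def completeness_test(readings, tag, column, day):
    """Granular completeness check: did `tag` deliver any non-null
    values for `column` on `day`? The answer is only pass/fail --
    it cannot distinguish an offline asset from a full historian
    disk, an expired certificate, or a stalled bronze-to-silver job."""
    return any(
        r["tag"] == tag and r["date"] == day and r.get(column) is not None
        for r in readings
    )

# Hypothetical sample: the sensor delivered data two days ago, but not yesterday.
yesterday = date.today() - timedelta(days=1)
readings = [
    {"tag": "TT01-PVO02", "date": yesterday - timedelta(days=1), "X": 21.7},
]

print(completeness_test(readings, "TT01-PVO02", "X", yesterday))  # False
```

Every failure cause listed above produces the exact same `False`, which is why granularity alone cannot reveal the root cause.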

I like to split this problem into diagnosis and troubleshooting. The tests are diagnosis: they show what is wrong. The subsequent (often manual) troubleshooting reveals why it went wrong.

I have not yet seen a practical automated solution for the troubleshooting phase, i.e. one that can cross the IT–OT boundary.