Last week I had an interesting call with a Data Governance expert. One particular thing he said really struck me:
“We were testing so much incoming data that we were burning tens of thousands of euros every month on compute.”
This is a profound problem. In my current project, we’re expected to test the data quality of a new system against the existing one before commissioning it. This means we want to know whether the new system performs just as well as the old one, which in turn means comparing all the signals.
At what point do we have to ask ourselves if this is computationally feasible? I don’t think it’s possible to analyze all incoming data if you work with high-frequency SCADA time series.
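To put that in perspective, here is a rough back-of-envelope sketch. All the numbers in it (signal count, sampling rate, bytes per sample) are assumptions for illustration, not figures from any real plant:

```python
# Back-of-envelope estimate of how much data a full signal-by-signal
# comparison would have to touch. All figures below are assumptions.

signals = 20_000          # assumed number of SCADA tags/signals
sample_rate_hz = 1        # assumed sampling frequency per signal
bytes_per_sample = 16     # assumed: timestamp + value + quality flag

samples_per_day = signals * sample_rate_hz * 60 * 60 * 24
bytes_per_day = samples_per_day * bytes_per_sample

print(f"Samples per day: {samples_per_day:,}")               # ~1.7 billion
print(f"Raw volume per day: {bytes_per_day / 1e9:.1f} GB")   # ~27.6 GB
```

Even under these modest assumptions, comparing every one of those points against a second system, for every test run, is where the compute bill comes from.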
A smarter testing strategy is required.