This is part three in our AI-Ready Data Pipelines series. Read part one and part two here.
Consent is not the only threat your data will face as it journeys through your systems. Data can be damaged or incorrectly collected, routed, or transformed in many ways: application errors, network errors or, $deity forbid, human errors. Back in the day, bad data was less of a risk because datasets were few and mostly curated, and data was essentially a passive subject; the data consumer could root out contaminated records before turning them into valuable insights.
Nowadays, however, data is an active subject: it is used to make real-time decisions that tailor user experiences, optimize business processes, and detect danger. And while algorithms are generally reliable and perform well in common use cases, they lack the intuition that would make an analyst raise an eyebrow or two before acting on a suspicious figure. It is therefore vital that the data they consume is reliable, reducing the chances that you end up in the news over an embarrassing AI-related incident.
Let’s picture a content recommendation system that consumes product-interaction events from both page views and the checkout process. What happens if, due to an error, product page views are mistakenly tagged as generic page views, with no category or price information? What happens if, in a hasty update, the category identifier is sent where the product identifier should be? These errors, which would probably pass a superficial schema check, can severely impede the effectiveness of the recommendation algorithm, with potentially undesirable results. Bad data happens, so the reliability of your data is tied to how quickly and effectively you can prevent or resolve issues.
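To make that swap concrete, here is a minimal sketch in Python. The event fields and id prefixes are entirely hypothetical, but they show why a presence-and-type schema check waves the broken event through while a value-level rule catches it:

```python
# Hypothetical event payloads -- field names and shapes are illustrative,
# not taken from any particular tracking SDK.
good_event = {
    "event_type": "product_page_view",
    "product_id": "prod-8841",
    "category_id": "cat-12",
    "price": 34.99,
}

# Hasty update: the category identifier ends up in the product_id field.
swapped_event = {
    "event_type": "product_page_view",
    "product_id": "cat-12",  # wrong value, right type
    "category_id": "cat-12",
    "price": 34.99,
}

REQUIRED_FIELDS = {"event_type": str, "product_id": str, "category_id": str, "price": float}

def superficial_schema_check(event: dict) -> bool:
    """Checks only field presence and types -- both events pass this."""
    return all(
        field in event and isinstance(event[field], expected_type)
        for field, expected_type in REQUIRED_FIELDS.items()
    )

def semantic_check(event: dict) -> bool:
    """A contract-style rule that also inspects values: product ids and
    category ids live in different namespaces, so a swap is detectable."""
    return (
        event["product_id"].startswith("prod-")
        and event["category_id"].startswith("cat-")
    )

assert superficial_schema_check(good_event) and superficial_schema_check(swapped_event)
assert semantic_check(good_event) and not semantic_check(swapped_event)
```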
Data Contracts
Enter data contracts. Once we establish datasets for specific purposes, it is up to the data consumers to define what good data means: imposing strict control over the schema and providing an exhaustive suite of data quality validation tests that catch known common issues and are expanded as new errors are identified (a minimal sketch of what such a contract might look like follows the list below). This gives you several options:
- Filter out bad data – Either fix it on the fly or block its delivery entirely, ensuring the dataset holds only pristine data, even if it is potentially incomplete.
- Real-time alerting – Either enabling engineers to identify data issues before they deploy changes or alerting data consumers immediately if there is an issue in your production environment, reducing the time it takes to fix it from potentially months to days, if not hours.
- A solid definition of data requirements – Data requirements can be scattered across slides, spreadsheets, documents, emails, or even text messages, stored in a myriad of places, which makes them hard to track and easy to lose in the mists of time. Data contracts are standard, centralized, and version-controlled, making it easier for all parties involved to know what good data looks like and guaranteeing timely communication of changes.
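As a rough illustration, not tied to any specific data contract tooling (the field names, rules, and alerting hook are all assumptions), a contract can be as simple as a version-controlled schema plus a suite of quality checks, with an enforcement layer deciding whether to drop a record or raise an alert:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# A check takes a record and returns an error message, or None if the record is fine.
Check = Callable[[dict[str, Any]], str | None]

@dataclass
class DataContract:
    """A version-controlled definition of what good data looks like."""
    name: str
    version: str
    schema: dict[str, type]                       # required fields and their types
    quality_checks: list[Check] = field(default_factory=list)

    def validate(self, record: dict[str, Any]) -> list[str]:
        errors = [
            f"missing or mistyped field: {name}"
            for name, expected in self.schema.items()
            if not isinstance(record.get(name), expected)
        ]
        errors += [msg for check in self.quality_checks if (msg := check(record))]
        return errors

# Hypothetical contract for the product-interaction events from the example above.
product_view_contract = DataContract(
    name="product_page_view",
    version="1.2.0",
    schema={"event_type": str, "product_id": str, "category_id": str, "price": float},
    quality_checks=[
        lambda r: None if str(r.get("product_id", "")).startswith("prod-")
        else "product_id not in product namespace",
        lambda r: None if isinstance(r.get("price"), (int, float)) and r["price"] > 0
        else "non-positive or missing price",
    ],
)

def enforce(record: dict[str, Any], contract: DataContract) -> bool:
    """Filter out bad data and alert in real time; returns True if the record may pass."""
    errors = contract.validate(record)
    if errors:
        # Placeholder alert hook -- in practice this would page the producing team.
        print(f"[{contract.name} v{contract.version}] rejected record: {errors}")
        return False
    return True
```

Because the contract is just code, it can live in version control alongside the pipeline, and the same definition can run both in CI, catching issues before a change ships, and in the production path, alerting the moment a bad record appears.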
The primary cause of bad data is bad communication, so a standard, automated definition of what data should look like not only ensures your algorithms train and run on the best possible data; it also shortens the path to getting good data in the first place.