This is part three in our AI-Ready Data Pipelines series. Read part one and part two here.
Consent is not the only threat your data will face as it journeys through your systems. Data can be damaged or incorrectly collected, routed, or transformed in many ways: application errors, network errors or, $deity forbid, human errors. Back in the day, bad data was less of a risk because datasets were few and mostly curated, and data was essentially a passive subject; the data consumer could root out contaminated records before turning them into valuable insights.
Nowadays, however, data is an active subject: it is used to make real-time decisions that tailor user experiences, optimize business processes, and detect danger. And while algorithms are generally reliable and perform well in common use cases, they lack the intuition that would make an analyst raise an eyebrow or two before acting on a suspicious figure. It is therefore vital that the data they consume is reliable, reducing the chances that you end up in the news over an embarrassing AI-related incident.
Let’s picture a content recommendation system that consumes product-interaction events from both page views and the checkout process. What happens if, due to an error, product page views are mistakenly tagged as generic page views, with no category or price information? What happens if, in a hasty update, the category identifier is sent where the product identifier should be? These errors, which would probably pass a superficial schema check, can severely impede the effectiveness of the recommendation algorithm, with potentially undesirable results. Bad data happens, so the reliability of your data is tied to how quickly and effectively you can prevent or resolve issues.
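To make that swap concrete, here is a minimal sketch in Python. The event fields and id prefixes are entirely hypothetical, but they show why a presence-and-type schema check waves the broken event through while a value-level rule catches it:

```python
# Hypothetical event payloads -- field names and shapes are illustrative,
# not taken from any particular tracking SDK.
good_event = {
    "event_type": "product_page_view",
    "product_id": "prod-8841",
    "category_id": "cat-12",
    "price": 34.99,
}

# Hasty update: the category identifier ends up in the product_id field.
swapped_event = {
    "event_type": "product_page_view",
    "product_id": "cat-12",  # wrong value, right type
    "category_id": "cat-12",
    "price": 34.99,
}

REQUIRED_FIELDS = {"event_type": str, "product_id": str, "category_id": str, "price": float}

def superficial_schema_check(event: dict) -> bool:
    """Checks only field presence and types -- both events pass this."""
    return all(
        field in event and isinstance(event[field], expected_type)
        for field, expected_type in REQUIRED_FIELDS.items()
    )

def semantic_check(event: dict) -> bool:
    """A contract-style rule that also inspects values: product ids and
    category ids live in different namespaces, so a swap is detectable."""
    return (
        event["product_id"].startswith("prod-")
        and event["category_id"].startswith("cat-")
    )

assert superficial_schema_check(good_event) and superficial_schema_check(swapped_event)
assert semantic_check(good_event) and not semantic_check(swapped_event)
```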
Data Contracts
Enter data contracts. Once we establish datasets for specific purposes, it is up to the data consumers to define what good data means: imposing strict control over the schema and providing an exhaustive suite of data quality validation tests that catch known common issues and are expanded as new errors are identified (a minimal sketch of what such a contract might look like follows the list below). This gives you several options:
- Filter out bad data – Either fix it on the fly or block its delivery entirely, ensuring the dataset holds only pristine data, even if it is potentially incomplete.
- Real-time alerting – Either enabling engineers to identify data issues before they deploy changes or alerting data consumers immediately if there is an issue in your production environment, reducing the time it takes to fix it from potentially months to days, if not hours.
- A solid definition of data requirements – Data requirements can be scattered across slides, spreadsheets, documents, emails, or even text messages, stored in a myriad of places, which makes them hard to track and easy to lose in the mists of time. Data contracts are standard, centralized, and version-controlled, making it easier for all parties involved to know what good data looks like and guaranteeing timely communication of changes.
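As a rough illustration, not tied to any specific data contract tooling (the field names, rules, and alerting hook are all assumptions), a contract can be as simple as a version-controlled schema plus a suite of quality checks, with an enforcement layer deciding whether to drop a record or raise an alert:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# A check takes a record and returns an error message, or None if the record is fine.
Check = Callable[[dict[str, Any]], str | None]

@dataclass
class DataContract:
    """A version-controlled definition of what good data looks like."""
    name: str
    version: str
    schema: dict[str, type]                       # required fields and their types
    quality_checks: list[Check] = field(default_factory=list)

    def validate(self, record: dict[str, Any]) -> list[str]:
        errors = [
            f"missing or mistyped field: {name}"
            for name, expected in self.schema.items()
            if not isinstance(record.get(name), expected)
        ]
        errors += [msg for check in self.quality_checks if (msg := check(record))]
        return errors

# Hypothetical contract for the product-interaction events from the example above.
product_view_contract = DataContract(
    name="product_page_view",
    version="1.2.0",
    schema={"event_type": str, "product_id": str, "category_id": str, "price": float},
    quality_checks=[
        lambda r: None if str(r.get("product_id", "")).startswith("prod-")
        else "product_id not in product namespace",
        lambda r: None if isinstance(r.get("price"), (int, float)) and r["price"] > 0
        else "non-positive or missing price",
    ],
)

def enforce(record: dict[str, Any], contract: DataContract) -> bool:
    """Filter out bad data and alert in real time; returns True if the record may pass."""
    errors = contract.validate(record)
    if errors:
        # Placeholder alert hook -- in practice this would page the producing team.
        print(f"[{contract.name} v{contract.version}] rejected record: {errors}")
        return False
    return True
```

Because the contract is just code, it can live in version control alongside the pipeline, and the same definition can run both in CI, catching issues before a change ships, and in the production path, alerting the moment a bad record appears.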
The primary cause of bad data is bad communication, so a standard, automated definition of what data should look like not only ensures your algorithms train and run on the best possible data; it also shortens the path to getting good data in the first place.