
Book Summary: Driving Data Quality with Data Contracts

Hello, my name is Cynthia, and welcome to this summary of “Driving Data Quality with Data Contracts” by Andrew Jones. For the last two weeks, I’ve delved deep into learning about Data Contracts and I’m excited to share this with you.

In this book I discovered how data contracts can solve problems that many companies have with their data. Getting good-quality data is a big challenge for everyone, even after a lot of money has been spent on it. Interestingly, while each company’s data may be different, the challenges are much the same everywhere!

Common Challenges

Key challenges that most companies face, and that data contracts can help solve, include:

  1. Inconsistency Across Data Sources: With multiple sources of data, inconsistencies often arise. Data contracts standardize the format and quality of data, ensuring that each data source conforms to a predetermined standard.
  2. Lack of Trust in Data: It’s not uncommon for stakeholders to be unsure about the accuracy or integrity of the data they’re working with. Data contracts act as a form of guarantee, assuring that the data meets certain quality and accuracy thresholds.
  3. Ambiguity in Data Interpretation: Without a clear understanding of what each data point represents, users can interpret data in varied ways. Data contracts define the semantics of data, eliminating ambiguity and promoting a consistent understanding.
  4. Costly Data Cleaning Efforts: Poor data quality can lead to expensive and time-consuming data cleaning initiatives. By ensuring data quality at the source with data contracts, organizations can minimize or eliminate these additional costs.
  5. Poor Data Integration: Merging data from different sources can be challenging when there’s no standard format or structure. Data contracts facilitate seamless integration by stipulating the format and quality that each data source must adhere to.
  6. Difficulty in Compliance and Auditing: Ensuring data compliance with industry standards or regulations can be complex. Data contracts can include compliance checkpoints, making audits smoother and reducing the risk of non-compliance penalties.

[Figure: data-book-summary-1]

The diagram shows a typical data pipeline and how at each stage the lack of defined expectations ultimately results in the consumers losing trust in business-critical data-driven products.

[Figure: data-book-summary-2]

Despite improvements in technology across three generations of data platform architecture, we still have the bottleneck of a central team (data engineering) with a long backlog of datasets to make available to the organization.

Definition

Let us first familiarize ourselves with some important terminology.

| Term | Definition |
| --- | --- |
| Data Contract | A formal agreement between data providers and data consumers about the format, quality, and other properties of the data being exchanged. |
| Schema | Defines the structure of the data, including the fields, data types, and relationships. It acts as a blueprint for the data to be exchanged and helps in validating the data against the defined structure. |
| Metadata | Data about the data. It includes information like data lineage, timestamps, data quality metrics, and other details that describe the characteristics and context of the data. |
| Data Quality | A measure of the condition or caliber of data, which considers aspects such as accuracy, completeness, reliability, relevance, and timeliness. Ensuring high data quality is pivotal for making informed decisions. |
| Data Lineage | Visualization of the flow and transformation of data as it moves through the various stages of a system or process. It helps in understanding the origins, movements, and calculations applied to the data. |
| Data Provider | The entity or system that produces or supplies data. Data providers are responsible for ensuring that the data meets the agreed-upon standards and specifications outlined in the data contract. |
| Data Generator/Producer/Provider | These terms are often used interchangeably to describe the entity, system, or process that creates, supplies, or makes data available for use. They are responsible for maintaining the quality, accuracy, and security of the data according to the agreed-upon standards in the data contract. |
| Data Consumer | This entity, application, or individual utilizes the data provided by the data generator/producer/provider. Data consumers use the data for various purposes such as analysis, reporting, or to feed into other systems or processes. They rely on the data contract to understand the format, quality, and characteristics of the data they are consuming. |
| Service Level Agreement (SLA) | A commitment between the data provider and the data consumer on the level of service, including data availability, timeliness, and quality. |
| Data Governance | The practice of managing and organizing data to ensure data quality, security, and compliance with policies and regulations. It involves defining roles, responsibilities, and processes related to data management. |
| Versioning | The practice of keeping multiple versions of data to track changes and updates over time. It helps in managing and controlling modifications to the data. |
| Data Validation | The process of checking and ensuring that the data meets the predefined standards and specifications before it is shared or used. |
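
To make the timeliness part of an SLA a little more concrete, here is a minimal Python sketch of a freshness check. The function name `is_fresh` and the six-hour window are my own illustrative assumptions and do not come from the book; a real SLA would define its own limits and measurement points.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(
    last_updated: datetime,
    max_delay: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the dataset was refreshed within the agreed window."""
    # Default to the current UTC time; tests can pass a fixed `now`.
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_delay


# Hypothetical SLA: the dataset may be at most six hours old.
MAX_DELAY = timedelta(hours=6)

checked_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
refreshed_at = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
print(is_fresh(refreshed_at, MAX_DELAY, now=checked_at))
```

A check like this could run on a schedule and alert the data provider when the window is missed.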

One-Minute Summary

For a Technical Audience

Data Contract as a Formal Agreement:

A Data Contract is a formal agreement detailing the specification of data exchanged between a Data Provider and a Data Consumer. It serves as the backbone for smooth, reliable, and secure data exchange, involving various key concepts:

  1. Schema and Metadata:
    • Specification: The contract defines the schema and metadata of the data, specifying the format, type, and structure, ensuring that data is consistent and usable.
  2. Data Quality and Integrity:
    • Requirements: It outlines the quality requirements, such as accuracy, completeness, and timeliness, maintaining the reliability and integrity of the data exchanged. Data Lineage and Data Validation are integral to this process.
  3. Compliance and Governance:
    • Standards and Regulations: The contract addresses compliance with data standards, regulations, and governance policies, mitigating risks associated with data misuse and breaches. Data Governance and Service Level Agreements (SLA) play crucial roles in ensuring compliance and setting expectations on data availability and quality.
  4. Dispute Resolution:
    • Clarity and Accountability: It provides a clear framework for resolving disputes and ensuring accountability, fostering trust and collaboration between data producers and consumers.
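
To picture how these pieces might sit together, here is a minimal sketch of a contract as plain Python dataclasses. Everything in it is an assumption for illustration: the class names `FieldSpec` and `DataContract`, the attribute names, and the example `orders` dataset are hypothetical, and this is not the format used in the book or in any published specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field in the schema: name, expected Python type, and whether it is required."""
    name: str
    dtype: type
    required: bool = True
    description: str = ""  # semantics, so consumers interpret the field consistently


@dataclass(frozen=True)
class DataContract:
    """Hypothetical in-memory shape of a contract; not a standard format."""
    name: str
    version: str
    owner: str                            # the data provider accountable for the data
    fields: tuple[FieldSpec, ...]         # schema
    min_completeness: float = 1.0         # quality: share of non-null required values
    max_delay_hours: int = 24             # SLA: how stale the data may be
    contains_personal_data: bool = False  # governance flag

    def field_names(self) -> list[str]:
        """Names of all fields, in schema order."""
        return [f.name for f in self.fields]


# Illustrative contract for an assumed "orders" dataset.
orders_contract = DataContract(
    name="orders",
    version="1.0.0",
    owner="checkout-team",
    fields=(
        FieldSpec("order_id", str, description="Unique order identifier"),
        FieldSpec("amount", float, description="Order total"),
        FieldSpec("coupon_code", str, required=False),
    ),
    min_completeness=0.99,
    max_delay_hours=6,
)

print(orders_contract.field_names())
```

Keeping the contract as data rather than prose is what allows tooling to read it; in practice teams often store it as YAML or JSON in version control.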

Deep Dive

Data Contracts: Two Main Aspects

The solution data contracts provide consists of two aspects.

First, they set up a contract-backed architecture which makes it easier to create and use good quality data through self-served, autonomous tooling.

Second, they facilitate a shift in data culture, emphasizing data explicitly generated to meet use cases, fostering collaboration between data generators and consumers, and prioritizing data quality over quantity.

In essence, data contracts provide a structured framework that not only streamlines data-related processes through automation but also promotes a cultural shift that prioritizes data’s actual utility and quality over mere volume.

[Figure: data-book-summary-3]

1. Data Producers:

2. Data Consumers:

3. Data Stewards, Contract Architects (or Data Architects):

What does it mean to you?

For Technical Team 💻

What Does It Mean? To the technical team, a data contract is similar to how a software contract or an API works. Just as APIs have specifications detailing how they should function, data contracts specify how data should be structured and formatted, and which quality standards it should meet.

Day-to-Day Impact:

Having a data contract means that the integration of new datasets into existing systems becomes more streamlined. Instead of manually reviewing and adjusting data, your ETL processes can automatically validate incoming data against the contract. Data pipelines become less prone to breaking due to inconsistent or unexpected data, which in turn reduces system downtimes and debugging sessions.
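
As a rough idea of what validating incoming data against the contract could look like inside an ETL step, here is a small Python sketch. The `CONTRACT` dictionary, the field names, and both function names are hypothetical; in practice you would more likely rely on a schema or validation library than on hand-written checks.

```python
from typing import Any

# Hypothetical contract: field name -> (expected type, required?)
CONTRACT: dict[str, tuple[type, bool]] = {
    "order_id": (str, True),
    "amount": (float, True),
    "coupon_code": (str, False),
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = []
    for name, (expected_type, required) in CONTRACT.items():
        value = record.get(name)
        if value is None:
            if required:
                errors.append(f"missing required field '{name}'")
            continue
        if not isinstance(value, expected_type):
            errors.append(
                f"field '{name}' expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about are flagged too.
    for name in sorted(record.keys() - CONTRACT.keys()):
        errors.append(f"unexpected field '{name}'")
    return errors


def split_batch(records):
    """Separate a batch into records that pass and records that are rejected."""
    valid, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append((record, errors))
        else:
            valid.append(record)
    return valid, rejected
```

Rejected records can be routed to a quarantine table together with their error messages, so the pipeline keeps running while the producer is notified.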

Analogy: Think of it as writing code with strong typing. Just as you’d want to know the type of a variable before processing it, with data contracts, you’ll know the “type” and quality of data you’re working with.

Why Do You Need It? Data contracts, when combined with automated validation tools and data quality monitoring systems, help in the early detection of anomalies. This ensures that only clean and compliant data enters your systems. Data cataloging tools that work in tandem with data contracts provide clarity about the data’s origin, transformations, and quality, thus making data more discoverable and trustworthy.
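
One simple form of quality monitoring is to measure how complete each field is and compare that with a threshold written into the contract. The sketch below assumes records arrive as Python dictionaries; the function names and the example thresholds are made up for illustration.

```python
def completeness(records, field_name):
    """Share of records in which field_name is present and not None."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field_name) is not None)
    return filled / len(records)


def check_quality(records, thresholds):
    """Compare each field's completeness with its contract threshold.

    Returns {field: measured completeness} for the fields that fall short.
    """
    failures = {}
    for field_name, minimum in thresholds.items():
        measured = completeness(records, field_name)
        if measured < minimum:
            failures[field_name] = measured
    return failures


# Hypothetical batch and thresholds.
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3},
    {"id": 4, "email": "b@example.com"},
]
print(check_quality(batch, {"id": 1.0, "email": 0.9}))
```

Running a check like this on every batch is one way an anomaly, such as a field that suddenly stops being populated, gets caught before consumers see it.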

**Takeaways**

When data contracts are in place, you will need to adjust your data integration and processing workflows. Be prepared to set up automated validation processes that inspect incoming data against the contract criteria. This might require you to adopt new tools or adapt existing ones to integrate seamlessly with the data contract’s specifications. Further, you’ll find it beneficial to collaborate more closely with the data generators and data consumers to ensure that any data produced is in line with the contract from the get-go. As the volume of data continues to grow, proactively embedding these validation and integration checks based on data contracts will save you countless hours of debugging and data cleansing down the road.


For Business Team 📈

What Does It Mean? For the business team, a data contract is like a service-level agreement (SLA) but for data. It ensures that the data they receive meets certain quality and structural standards, much like how an SLA guarantees a level of service.

Day-to-Day Impact:

With data contracts in place, reports and dashboards become more reliable. You won’t have to spend time verifying and cross-checking data sources, or reconciling discrepancies. This leads to quicker insights and the ability to act on them in a timely manner.

Analogy: It’s like ordering a customized product. You specify your requirements, and you expect the delivered product to match those. Data contracts ensure your data “deliveries” meet your specifications.

Why Do You Need It? Data visualization tools and business intelligence platforms, when fed with contract-compliant data, yield more accurate representations of business metrics. The integration of data contracts with these platforms ensures that data meets business expectations in terms of quality and structure, leading to more meaningful and actionable insights.

**Takeaways**

With data contracts implemented, your day-to-day operations will be infused with a higher degree of data reliability. However, this also means you need to have a clearer understanding of the specifications outlined in these contracts. It’s not just about consuming data; it’s about knowing the quality and reliability benchmarks that data adheres to. You’ll need to establish more regular communication lines with the technical and compliance teams, ensuring that your data needs and the evolving business requirements are continuously reflected in the data contracts. As you receive more consistent data, you can then focus on making accurate business decisions rather than spending time questioning data integrity.


For Compliance Team ⚖️

What Does It Mean? To the compliance team, a data contract represents a binding agreement that ensures data adheres to regulations, privacy laws, and internal policies.

Day-to-Day Impact:

Data contracts simplify audit trails. When data is ingested into the system, it’s already aligned with regulatory standards. This reduces the need for retrospective adjustments and simplifies compliance reporting.

Analogy: It’s like a safety checklist for a manufacturing unit. Before a product rolls out, it needs to meet certain safety standards. Data contracts are your “safety checklist” for data.

Why Do You Need It? Regulatory technology (RegTech) platforms, when combined with data contracts, offer real-time compliance monitoring. They ensure that data entering the systems meets regulatory standards, thus minimizing the risks of breaches. Data masking and tokenization technologies, in coordination with data contracts, also ensure that sensitive information is treated with the utmost care, reinforcing data privacy.
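
To show how masking could be driven by the contract, here is a minimal Python sketch that tokenizes every field the contract flags as sensitive. The `SENSITIVE_FIELDS` set, the function names, and the use of a keyed hash (HMAC-SHA256) are my assumptions for illustration; real tokenization services often work differently, for example with reversible token vaults.

```python
import hashlib
import hmac

# Hypothetical: field names that the contract marks as sensitive.
SENSITIVE_FIELDS = {"email", "phone"}


def tokenize(value: str, secret: bytes) -> str:
    """Replace a value with a keyed hash: stable for joins, not readable."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict, secret: bytes) -> dict:
    """Return a copy of the record with sensitive fields tokenized."""
    masked = dict(record)  # leave the caller's record untouched
    for name in record.keys() & SENSITIVE_FIELDS:
        if masked[name] is not None:
            masked[name] = tokenize(str(masked[name]), secret)
    return masked


# The secret would come from a secrets manager, never from source code.
example = mask_record({"id": 1, "email": "a@example.com"}, secret=b"demo-only-key")
print(example["id"], len(example["email"]))
```

Because the same input and key always give the same token, masked datasets can still be joined on the tokenized field.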

**Takeaways**

As a member of the compliance team, data contracts will become your allies in ensuring regulatory adherence. However, this means you’ll need to be actively involved in drafting, reviewing, and updating these contracts. You’ll have to work more closely with both the technical and business teams, understanding their requirements and ensuring that the data contracts address regulatory mandates effectively. The static, periodic compliance checks might evolve into real-time monitoring, and you’ll need to be adept at using RegTech platforms that align with data contract validations. Your role will be proactive rather than reactive, emphasizing preventive measures over corrective actions.


For Strategy and Management Team 🧐

Day-to-Day Impact: Strategies become data-driven, and management can be assured that their strategic decisions are based on a single version of the truth. Forecasting becomes more accurate, and scenario planning becomes more reliable.

Why Do You Need It & What Technology Enables This? Advanced analytics and AI models yield better results when they operate on high-quality data. Data contracts, when integrated with data lakes and analytics platforms, ensure that these models receive data of a predetermined quality and format. This harmonization, in turn, leads to better predictive insights, helping strategy and management teams in proactive decision-making.

**Takeaways**

With the introduction of data contracts, you, in the strategy and management team, will need to place greater emphasis on data-driven decision-making. This means fostering an environment where data accuracy and reliability are paramount. You should familiarize yourself with the nuances of data contracts to understand the foundation upon which your strategic insights are built. Additionally, your engagement with other teams will need to be more frequent, ensuring that the data contracts align with the organization’s long-term vision and objectives. While you’ll enjoy the benefits of more reliable data, it also necessitates a proactive approach in ensuring that data contracts keep pace with the evolving business landscape.

❓ FAQs

1. What is a data contract?


2. Why are data contracts important?


3. How does a data contract differ from a data model or schema?


4. Who is responsible for creating and maintaining a data contract?


5. How do data contracts facilitate automation?


6. What happens if there’s a breach of the data contract?


7. Can data contracts be revised or updated?


8. How do data contracts impact data culture in an organization?


9. How can an organization start implementing data contracts?

About this Summary

Again, this summary is here to give you a quick peek and hopefully get you interested in learning more about data contracts.

A one-size-fits-all explanation just doesn’t cut it for people from different backgrounds. So, I’m aiming to share stories and insights that connect with different folks, because I believe that’s the way to make a real impact and build real understanding.

If you’re keen for more details, you can always get in touch with me or John Thomas - Staff Engineer of D&A Platforms and an expert reviewer of this book.

Further Readings

The book is structured into ten chapters, divided into three distinct sections. Today, we delved into the first two sections, discussing the reasons behind “Why” Data Contracts and exploring the “What” of Data Contracts. Additionally, we touched upon the “Culture Change” associated with Data Contracts, which represents one of the core facets of the concept. The final section, “Data Architecture and Data Contract,” provides insights into the practical design and incorporation of data contracts within an organization.

I am also preparing for a series of technical reviews in the near future. If you’re keen on deepening your understanding of data contracts, don’t miss out on the upcoming series.

And finally, your feedback and thoughts are always welcome!

Good resources

GitHub - paypal/data-contract-template: Template for a data contract used in a data mesh.

Data Mesh Manager

Data Contract Specification

Technical Stack to look into: