
Data Contracts

Last updated Oct 11, 2024

WIP: This is an unfinished Note

Below are my uncleaned notes; I'm sharing them in the hope they're helpful to others. Once I have more time, I'll turn them into a proper blog post or article. For now, it's a bit messy.

Data Contracts are API-like agreements between software/data engineers who own services and data consumers who understand how the business works. The goal is to generate well-modeled, high-quality, trusted, real-time data.

# Summary

  • Definition & Purpose: Data Contracts serve as formal agreements between data producers and consumers, aiming to provide consistent, high-quality, real-time data. They allow for the decoupling of databases and services from analytics and ML needs, ensuring that modifications in the schema don’t break production due to built-in validation and enforcement mechanisms.
  • Usage & Tools: Companies like Convoy leverage tools like Protobuf and Apache Kafka for abstracting CRUD transactions, defining schemas based on data needs rather than source input. Confluent builds on Kafka with their Schema Registry, using concepts like Semantic Layer and Analytics API (via GraphQL) to attain similar goals. These contracts complement, rather than replace, data pipelines and the Modern Data Stack.
  • Characteristics & Patterns: Data Contracts encompass defined interfaces detailing data structure, type, and semantics; validation for adherence; metadata management; automated contract tests; and versioning akin to schema evolution. They differ from Data Mesh, which focuses on organizational structure but doesn’t prescribe specific data emissions or validation.
  • Discussion & Insights: Chad Sanderson likens data contracts to real-world agreements, emphasizing the balance between defining what the data is and how its quality is enforced versus understanding why the data is needed. These contracts are seen as a progressive shift in the data engineering realm, akin to the evolution of Agile methodologies in software development.

It’s an abstraction that allows engineers to decouple their databases and services from analytics and ML requirements. Because the schema is validated and enforced, modifying it won’t cause production-breaking incidents.

Illustration by Chad Sanderson in The Rise of Data Contracts.

Chad Sanderson said that at Convoy they use Protobuf and Apache Kafka to abstract the CRUD transactions. They define the schema based on what they need, not based on what they get from the source. This is the same idea as Software-Defined Assets, which define Data Assets in a declarative manner, where you set the expectation up front.
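
As a rough illustration of “define the schema based on what you need” (my own sketch using Pydantic v2, not Convoy’s actual setup; the event and its fields are hypothetical):

```python
from datetime import datetime
from pydantic import BaseModel, field_validator

# Hypothetical event, defined by what consumers need,
# not by what the source database happens to store.
class ShipmentCreated(BaseModel):
    shipment_id: str
    origin: str
    destination: str
    created_at: datetime
    weight_kg: float

    @field_validator("weight_kg")
    @classmethod
    def weight_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError("weight_kg must be positive")
        return v

# Producers construct this model before emitting to the stream;
# invalid events fail here instead of breaking consumers downstream.
event = ShipmentCreated(
    shipment_id="s-123",
    origin="Seattle",
    destination="Portland",
    created_at=datetime.now(),
    weight_kg=420.0,
)
```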

Confluent also built similar functionality on top of Kafka with their Schema Registry, and terms such as Semantic Layer and Analytics API (with GraphQL) try to achieve similar things.

Data Contracts are not meant to replace data pipelines and the Modern Data Stack, which take a more batch-oriented approach. They are good for fast prototyping: with some knowledge about the data, you can start defining data contracts.

Also interesting is the differentiation from Data Mesh, which is an organizational framework with a microservice approach to data. Data Mesh doesn’t specify which data should be emitted, nor does it validate that the data emitted from production is correct or conforms to a consumer’s expectations.

Also, data contracts are a form of Data Governance, a term that is very vague and becomes more concrete with explicit contracts. You can also use Great Expectations to set expectations for your data, which I believe is a good way to start.
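
A minimal sketch of what that could look like, assuming the classic pandas-flavored Great Expectations API (the API has changed across versions; the columns are made up):

```python
import great_expectations as ge
import pandas as pd

# Wrap a DataFrame so expectation methods become available
df = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [9.99, 25.00, 3.50],
}))

# The "contract": explicit, machine-checkable expectations
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Validate the whole suite at once
result = df.validate()
print(result.success)
```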

# Patterns

  1. Defined Interface: The structure, type, and semantics of data.
  2. Validation: Ensuring data adheres to the defined contract.
  3. Metadata Management: Storing additional information about the data, like who produced it and when.
  4. Contract Tests: Automated tests to ensure data adheres to the contract.
  5. Versioning: Just like schema evolution, data contracts may also have versions.
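
To make these patterns concrete, here is a minimal hand-rolled sketch using JSON Schema (my own illustration, not a prescribed tooling choice; all field names are hypothetical):

```python
import jsonschema

# 1 & 5: a defined interface, with an explicit contract version
ORDER_CONTRACT_V1 = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "order_created",
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "produced_by": {"type": "string"},   # 3: metadata (who produced it)
        "produced_at": {"type": "string", "format": "date-time"},  # 3: when
    },
    "required": ["order_id", "amount", "produced_by", "produced_at"],
}

# 2 & 4: validation, reusable as an automated contract test
def validate_order(event: dict) -> None:
    jsonschema.validate(instance=event, schema=ORDER_CONTRACT_V1)

validate_order({
    "order_id": "o-1",
    "amount": 12.5,
    "produced_by": "checkout-service",
    "produced_at": "2022-09-09T12:00:00Z",
})
```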

A recent conversation on Data Contracts reminds me of “Don’t Fall for the Hype” (see Repeated Terms in Tech).

Aren’t data contracts just another word for data quality and for what we’ve always done with schema evolution, drift, and general data governance? In any case, data quality and the change management around data schemas will not go away; on the contrary. With ever-increasing complexity, they remain essential and need to be integrated into a well-thought-through data modeling architecture.

# From the Discussion on YouTube w/ Chad Sanderson vs Ethan Aaron

Chad Sanderson speaks in Data Contract Battle Royale w/ Chad Sanderson vs Ethan Aaron - YouTube (which was a hot topic before the weekend of 2022-09-09; see also Chad Sanderson says a data contract).

If we draw the line to Orchestration vs. Choreography, to me it’s just a new term for a declarative way of orchestration.

E.g., in declarative pipelining we say:

In short, an imperative pipeline tells the system how to proceed at each step in a procedural manner. In contrast, a declarative data pipeline does not dictate the order of execution but instead lets each step/task find the best time and way to run. The how is taken care of by the tool, framework, or platform it runs on.
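
A minimal sketch of the declarative style, assuming Dagster’s `@asset` decorator (the asset names are made up):

```python
from dagster import asset

# Imperative style would be: load_orders(); transform(); write();
# in a fixed, hand-ordered sequence.

# Declarative style: we describe *what* each asset is and what it
# depends on; the framework figures out *how* and *when* to run it.
@asset
def orders():
    return [{"order_id": "o-1", "amount": 12.5}]

@asset
def daily_revenue(orders):
    # the dependency on `orders` is declared via the parameter name
    return sum(o["amount"] for o in orders)
```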

Important

It’s similar, but the how, which is figured out by Dagster or by Terraform for Kubernetes, is here the contract! It is what’s defined inside the Software-Defined Asset. The data quality part of the contract could just be a set of Great Expectations rules.

Ethan Aaron says his problem with Data Contracts is that you focus on defining the interface/contract too early. E.g., if you have a big task done by several teams or people, you need a contract to agree on an interface. I’d argue that’s exactly what Data Products are: instead of agreeing on some artificial contract, agree on the product, so the tools and teams can be totally distinct.

mehdio writes about it here: Data Contracts — From Zero To Hero | by mehdio | Sep 2022 | Towards Data Science. He says that Apache Kafka could be the interface that defines the contract. My comments on mehdio’s article and data contracts:

I like the points from Chad Sanderson:

  • Data Contract: what the data is and how we enforce its quality.
  • Data Product: why we need this data.

My takeaway from data contracts so far:
Instead of building artificial interfaces, we should use data assets/products as interfaces, defining the contract of these data assets in a declarative way. The teams producing and consuming could be two separate teams, companies, or whatnot. This is what Dagster is doing with Software-Defined Assets and why these are so powerful. I also see data contracts tightly coupled with data quality: basically a set of Great Expectations rules.
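
To connect this with the Kafka-as-interface idea above, a rough sketch assuming the confluent-kafka client (broker address and topic name are made up; in practice the payload would be a contract-validated event like the Pydantic model earlier):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Only contract-validated events get serialized onto the topic,
# which makes the topic itself the interface between teams.
event = {"shipment_id": "s-123", "weight_kg": 420.0}
producer.produce("shipments.v1", value=json.dumps(event).encode("utf-8"))
producer.flush()
```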


The interest in Data Mesh and data contracts is born out of related frustrations and a desire to enforce some degree of consistency between, among other things, disconnected products and services. (RW The Conglomerate - by Benn Stancil - benn.substack)

# How Data Contracts stick

Podcast Why You’ll Need Data Contracts (w/ Chad Sanderson + Prukalpa) — The Analytics Engineering Podcast — Overcast.

# Why Data Contracts didn’t stick

Another good read on why they didn’t stick, by Daniel Beach: Are Data Contracts For Real?

Two tools, both of which I am familiar with and have used, are suggested for the actual implementation and encoding of Data Contracts.

Daniel says:

It appears to me that Data Contracts are, at a simplified level, an idea of using existing tooling (they picked the wrong ones) to enforce the “ideal” way to “solve common data problems” by ensuring nothing can “just change.”


I agree with Daniel that the idea sounds good, and we, as data engineers, have been fighting with bad data for decades. We just called it schema change or evolution (Schema Evolution).

IMO, data quality tools integrated into orchestrators are the way to go, especially if the orchestrator is data-asset-driven in a declarative way. This means you can create assertions on top of data assets (your dbt tables, your data marts), not on data pipelines. So every time a data asset gets updated, you are certain the “contract” (the assertions) holds.
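
A sketch of what such an assertion on a data asset could look like, assuming a recent Dagster version with asset checks (the asset and the check are hypothetical):

```python
from dagster import AssetCheckResult, asset, asset_check

@asset
def customers():
    return [{"customer_id": "c-1"}, {"customer_id": "c-2"}]

# The "contract": evaluated whenever the asset is materialized,
# attached to the asset itself rather than to a pipeline.
@asset_check(asset=customers)
def customer_ids_are_unique(customers):
    ids = [c["customer_id"] for c in customers]
    return AssetCheckResult(passed=len(ids) == len(set(ids)))
```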

Another option is to push this one level further down, into the database or the Data Lake Table Format.

# Data Contract CLI

See the Data Contract CLI, which provides a specification in the form of a YAML file to define the contract, along with a CLI tool.


Origin: RW The Rise of Data Contracts - by Chad Sanderson; YouTube discussion between Chad and Ethan.
Created 2022-09-09