Schema Evolution
Automatic Schema Evolution is a key feature of data lake table formats. It addresses a persistent data engineering problem: changing a table's schema after data has already been written. Essentially, Schema Evolution lets you add new columns or widen data types without disrupting existing structures.
Renaming or reordering columns is also possible, though it may affect backward compatibility. You alter a single table, and the table format propagates the change across all distributed files, without requiring a complete rewrite of the table and its underlying files.
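To make the "no rewrite" idea concrete, here is a minimal sketch in plain Python. It is not how any particular table format is implemented; `old_file`, `new_file`, `table_schema`, and `read_with_schema` are hypothetical names, and the assumption is simply that old files stay untouched while rows are projected onto the current schema at read time.

```python
from typing import Any

# Hypothetical in-memory "data files": each keeps the rows exactly as written.
old_file = [{"id": 1, "amount": 10}]                       # written before the change
new_file = [{"id": 2, "amount": 12.5, "currency": "EUR"}]  # written after the change

# Current table schema after adding `currency` and widening `amount` from int to float.
table_schema = {"id": int, "amount": float, "currency": str}


def read_with_schema(rows: list[dict[str, Any]], schema: dict[str, type]) -> list[dict[str, Any]]:
    """Project stored rows onto the current schema at read time."""
    projected = []
    for row in rows:
        out = {}
        for column, column_type in schema.items():
            value = row.get(column)  # a column missing from an old file reads as None
            out[column] = None if value is None else column_type(value)
        projected.append(out)
    return projected


# Old and new files are read through the same schema; nothing on "disk" is rewritten.
rows = read_with_schema(old_file, table_schema) + read_with_schema(new_file, table_schema)
```

Only the schema metadata changed here: the old row gains a `None` for `currency` and a widened `amount` when it is read, while `old_file` itself is left as it was.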
Explore the Patterns of Schema Evolution:
- Backward Compatibility: Ensures a new schema can read data written with an older schema.
- Forward Compatibility: Allows an old schema to read data written with a newer schema.
- Full Compatibility: The change is both backward and forward compatible.
- Breaking Changes: Modifications that break compatibility, for example by leaving old data unreadable.
- Additive Changes: Introduce new fields while leaving existing data intact.
- Versioning: Manages diverse schema versions to accommodate varied data structures.
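The compatibility patterns above can be sketched as a small check. This is a simplified illustration loosely modeled on reader/writer schema resolution, not a real library API: `Field`, `can_read`, and `classify` are hypothetical names, and the only rules assumed are "a field missing from the written data needs a default" and "types must match exactly".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type: str
    has_default: bool = False


Schema = dict[str, Field]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if data written with `writer` can be read using `reader`."""
    for name, field in reader.items():
        written = writer.get(name)
        if written is None:
            # The reader expects a field the writer never wrote: it needs a default.
            if not field.has_default:
                return False
        elif written.type != field.type:
            return False
    return True


def classify(old: Schema, new: Schema) -> str:
    """Classify the change from `old` to `new` using the patterns in the list."""
    backward = can_read(reader=new, writer=old)  # new schema reads old data
    forward = can_read(reader=old, writer=new)   # old schema reads new data
    if backward and forward:
        return "full"
    if backward:
        return "backward"
    if forward:
        return "forward"
    return "breaking"


v1 = {"id": Field("long"), "email": Field("string")}
add_optional = {**v1, "country": Field("string", has_default=True)}
add_required = {**v1, "country": Field("string")}
drop_email = {"id": Field("long")}
retype_id = {"id": Field("string"), "email": Field("string")}

results = {
    "add_optional": classify(v1, add_optional),  # "full"
    "add_required": classify(v1, add_required),  # "forward"
    "drop_email": classify(v1, drop_email),      # "backward"
    "retype_id": classify(v1, retype_id),        # "breaking"
}
```

Under these assumptions, an additive change is fully compatible only when the new field has a default; a new required field is forward compatible but not backward compatible.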
Schema Registry tools, such as the one used with Kafka, track schema versions and enforce compatibility between them. The Data Vault model also offers strategies for handling schema evolution.
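As a rough illustration of what a schema registry does, the sketch below keeps versioned schemas per subject and rejects a new version that is not backward compatible with the latest one. It is an in-memory toy, not the API of any real registry; `InMemorySchemaRegistry`, `IncompatibleSchemaError`, and the dict-based schema shape are assumptions made for the example.

```python
class IncompatibleSchemaError(Exception):
    """Raised when a new schema version would break readers of existing data."""


class InMemorySchemaRegistry:
    def __init__(self):
        self._versions = {}  # subject -> list of schemas, oldest first

    def register(self, subject, schema):
        """Store `schema` as the next version of `subject` and return its version number."""
        versions = self._versions.setdefault(subject, [])
        if versions and not self._is_backward_compatible(schema, versions[-1]):
            raise IncompatibleSchemaError(f"{subject}: not backward compatible with v{len(versions)}")
        versions.append(schema)
        return len(versions)

    def latest(self, subject):
        """Return (version number, schema) for the newest version of `subject`."""
        versions = self._versions[subject]
        return len(versions), versions[-1]

    @staticmethod
    def _is_backward_compatible(new, old):
        # Simplified rule: every field in the new schema must exist in the old one
        # with the same type, or carry a default to fill in for old data.
        for name, spec in new.items():
            if name not in old:
                if "default" not in spec:
                    return False
            elif old[name]["type"] != spec["type"]:
                return False
        return True


registry = InMemorySchemaRegistry()
v1 = {"id": {"type": "long"}}
v2 = {"id": {"type": "long"}, "country": {"type": "string", "default": "unknown"}}
bad = {"id": {"type": "string"}}

first = registry.register("orders-value", v1)   # 1
second = registry.register("orders-value", v2)  # 2
try:
    registry.register("orders-value", bad)
    rejected = False
except IncompatibleSchemaError:
    rejected = True  # the type change on `id` is refused
```

The design choice worth noting is that the compatibility check sits at registration time, so a breaking schema is caught before any producer starts writing data with it.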
See Data Contracts, a term often used synonymously with Schema Evolution, for a broader understanding of this concept.
Other noteworthy tools in this domain include Protobuf, Schemata, and buz.
# Context and relation to data contracts
Schema evolution was initially used by Kafka and its Schema Registry service, whereas the term Data Contract only came up recently, around 2022-09-22.
Both have similar origin stories: handling a database schema that changes all the time. But even before we had the term schema evolution, we had to manage schema changes and database change management. I'd argue we had to coordinate less with the outside world and more with internal DWH customers or business people who needed the data.
Traditional databases have schema evolution challenges, while modern distributed systems have data contracts to maintain.
# Tools
- avrotize: Avrotize is a command-line tool for converting data structure definitions between different schema formats, using Apache Avro Schema as the integration schema model.
- More on Schema Registry
Origin: Schema Evolution
References: Schema Drift
Created 2022-08-24