Welcome to the third and final installment of our series “Data Modeling: The Unsung Hero of Data Engineering.” If you’ve journeyed with us from Part 1, where we dove into the importance and history of data modeling, or joined us in Part 2 to explore various approaches and techniques, I’m delighted you’ve stuck around.
In this third part, we’ll delve into data architecture patterns and their influence on data modeling. We’ll explore general and specialized patterns, debating the merits of various approaches like batch vs. streaming and lakehouse vs. warehouse, and the role of a semantic layer in complex data architecture.
We’ll also survey the landscape of data modeling tools, comparing commercial and open-source options, and ponder the potential of AI in data modeling. To wrap up, we’ll introduce data modeling frameworks like ADAPT™ and BEAM, designed to guide effective data model creation. Please join me as we take this exciting journey toward understanding data architecture better.
Data Architecture Pattern with Data Modeling
Bringing together the best of the individual approaches and techniques and knowing the common problems, we must always keep the bigger data architecture picture in mind. Sometimes you want a data vault modeling for your first layer when your source system is constantly changing, or you need a dimensional model in your last layer to build data apps on top of it or allow self-serve analytics. But how are you doing this?
For these, we need to look at data architecture per se. In this chapter, I will list some of the most common architectural patterns I’ve seen, but without going into all details of each.
General Purpose Data Architecture Pattern (Medallion, Core, etc.)
Let’s start with the one that suited me best, the one you have seen in one form or another. When I started at a consulting firm called Trivadis, and here it was called “Foundational Architecture of Data Warehouse”. We followed best practices such as
We would design these layers from the top-down approach discussed above and decide the data modeling technique for each layer depending on requirements.
Let’s have a detailed look at each layer, as these are fundamental for every data architecture project, and they will help us understand why we’d want to model different layers differently. The following layers or areas belong to a complete Data Warehouse (DWH) architecture but can be implemented into any data lake or analytics product you use.
Data from various source systems is first loaded into the Staging Area.
- In this first area, the data is stored as it is delivered; therefore, the stage tables’ structure corresponds to the interface to the source system.
- No relationships exist between the individual tables.
- Each table contains the data from the final delivery, which will be deleted before the next delivery.
- For example: In a grocery store, the Staging Area corresponds to the loading dock where suppliers (source systems) deliver their goods (data). Only the latest deliveries are stored there before being transferred to the next area.
It must be cleaned before the delivered data is loaded into the Core. Most of these cleaning steps are performed in the Cleansing Area.
- Faulty data must be filtered, corrected, or complemented with singleton (default) values.
- Data from different source systems must be transformed and integrated into a unified form.
- This layer also contains only the data from the final delivery.
- For example: In a grocery store, the Cleansing Area can be compared to the area where the goods are commissioned for sale. The goods are unpacked, vegetables and salad are washed, the meat is portioned, possibly combined with multiple products, and everything is labeled with price tags. The quality control of the delivered goods also belongs in this area.
The data from the different source systems are brought together in a central area, the Core, through the Staging and Cleansing Area and stored there for extended periods, often several years.
- A primary task of the Core is to integrate the data from different sources and store it in a thematically structured way rather than separated by origin.
- Often, thematic sub-areas in the Core are called “Subject Areas.”
- The data is stored in the Core so that historical data can be determined at any later point in time.
- The Core should be the only data source for the Data Marts.
- Direct access to the Core by users should be avoided as much as possible.
In the Data Marts, subsets of the data from the Core are stored in a form suitable for user queries.
- Each Data Mart should only contain the data relevant to each application or a unique view of the data. This means several Data Marts are typically defined for different user groups and BI applications.
- This reduces the complexity of the queries, increasing the acceptance of the DWH system among users.
- For example, The Data Marts are the grocery store’s market stalls or sales points. Each market stand offers a specific selection of goods, such as vegetables, meat, or cheese. The goods are presented so that they are accepted, i.e., purchased, by the respective customer group.
Not every architecture is the same
Only some data warehouses or data engineering projects have precisely this structure. Some areas are combined, such as the Staging and Cleansing areas, or differently named. The Core is sometimes referred to as the “Integration Layer” or “(Core) Data Warehouse.”
However, the overall system must be divided into different areas to decouple the other tasks, such as data cleaning, integration, historization, and user queries. In this way, the complexity of the transformation steps between the individual layers can be reduced.
Why use it today?
Isn’t it amazing that something from 2016 is still so current? That’s why data modeling is getting into vogue again, as it has never been entirely outdated.
Databricks renamed these layers with bronze, silver, and gold to understand it may be a little better and called it Medallion Architecture, but it’s something every BI engineer works with every day. In essence, it’s the same concept.
Or if we look at the Data Engineering Lifecycle introduced by the Fundamentals of Data Engineering, we see a similar picture but on an even higher level. You could apply the different layers to the Storage layer in the image below.
Specialized Data Architecture Patterns
In this chapter, we look at patterns that might not be considered in formal data modeling, but each of its decisions will highly influence the data modeling part. Besides the general data architecture, I call these specialized data architecture patterns as these are higher-level data architecture decisions.
Batch vs. Streaming
An obvious decision you need to take early on is if you need real-time data for critical data application or batch with near real-time micro batching every minute, the hour is enough.
Still, to this day, steaming is primarily optional. Suppose you tell the business team you can achieve near real-time with hourly batching; they will be happy. The latest up-to-date data will little influence your data analytics. Most cases are looking at historical data anyway.
Nevertheless, some business-critical data solutions need real-time. However, be aware that the effort and challenges will be much more significant as you can only partially recover if a stream fails. But the good idea is to set up your data pipeline as even based on the streaming approach, and therefore can go lower and lower with the latency of your batch to achieve near real-time.
Data Lake/Lakehouse vs. Data Warehouse Pattern
As Srinivasan Venkatadri says correctly: “Data modeling, in general, should also talk about files (open formats), Lakehouses, and techniques to convert or extract structural data from semi or unstructured would help.” This is precisely where data lake/lakehouse or data warehouse patterns apply. With these, you need to think about these questions.
I will only go into some detail here, as I have written extensively about these topics. But the decision to go with a data lake or lakehouse will likely influence the data modeling to a more open source strategy and tools. Also, as you are dumping files into your lake, you need to work with large distributed files, where Table Format makes a lot of sense, to get database-like features on top of files.
Semantic Layer (In-memory vs. Persistence or Semantic vs. Transformation Layer)
The data lake decision is also highly influential with the new Rise of the Semantic Layer, where you run a logical layer on top of your files in the case of a data lake and the possibility of a data warehouse on top of your data marts.
This is an interesting one, as we talked in part 1 and part 2 about the Conceptual, Logical, and Physical Data Models where a Semantic Layer could replace the logical model. The semantic layer stores the queries in declarative YAMLs, and data not persisted in physical storage is executed at run time or when the data is fetched. Integrating the semantic layer into complex data architecture and overall data modeling makes sense. It can serve a wide range of data consumers with different requirements, such as direct SQL, API (REST), or even GraphQL, with the great benefit of writing business logic only once. Here again, as business logic is the biggest treasure of data modeling, we have many new options to model our data with semantic layers.
The counter architecture pattern is to go with a transformation layer that typically can be achieved with dbt or a Data Orchestrator, where you persist each step into physical storage. Mainly to gain faster speed querying these data sets and reusing the data sets with other marts or data apps. The transformation layer and its transformational modeling make a decision between table formats vs. in-memory formats, query push-downs, and data virtualization.
Modern/Open Data Stack Pattern
Next up is Modern Data Stack architecture, or Open Data Stack. This pattern is basically for sheer choice we have nowadays to choose from the open data stack. It is its architecture to choose the right tool best suited for the requirements at hand in the company.
It matters if you chose an open source tool, v0.5, that might get out of order in a couple of years or if you chose the v0.1 that made the crackdown the line. But these are complicated bets, and only experienced data architects and people working in the field can make decisions based on intuition for the right way a project is progressing.
It is also to say no to yesterday’s new shiny tool, resist the urge, and wait until the product is more mature. But at the same time to take risks and bet on a tool that is open to success and embraces open-source philosophy instead of building a worse copy in-house. Each enterprise company either uses a closed-source solution to handle the data engineering lifecycle or makes its stack or newer option, betting on a developed framework in the open.
Imperative vs. Declarative Pattern
The declarative approach is manifesting itself more and more. Started within the fronted revolution where react was declaring components, Kubernetes did it for DevOps, Dagster revolutionized orchestration with Software-Defined Assets or us at Airbyte, where we created a Low-Code Connector developer kit to create data integrations in minutes by filling out a YAML file.
It is only possible because of drastically reduced complexity and the need to write many boilerplates. The declarative way describes what and the how is taken care of in the framework.
Notebooks vs. Data Pipelines Pattern
A pattern directly related to orchestration is the notebooks versus data pipeline pattern, where you can write and run a pipeline solely a notebook or a mix of mature, unit-tested data pipelines with a stable orchestrator with all bells and whistles included.
Centralized and Decentralized Pattern
Suppose you are in the field for a while. How many cycles have you been through from doing everything server side, switching everything to server rendering, and forth and back with the battle of client vs. server, microservices vs. monoliths, and lately, central cloud data warehouses vs. a yet-to-show decentralized Data Mesh. And many more forth and back will follow.
Whatever you choose, start with simplicity. You can always add complexity later when the product and solution mature.
There are so many more patterns I will write about but at some time later. For now, I leave you with pointers above, and I’ll come back to it sometime later.
Data Modeling Tools
Popular data modeling tools include Sqldbm, DBDiagrams, Enterprise Architect, and SAP PowerDesigner. These tools are widely used in the industry and offer powerful features such as data modeling, profiling, and visualization.
Open-source data modeling tools such as MySQL Workbench and OpenModelSphere are free and offer essential features for creating data models. They are helpful for small projects and provide an opportunity for data engineers to learn data modeling skills.
Choosing the right data modeling tool depends on the organization’s needs, budget, and project size. Large organizations may require expensive enterprise-level tools, while small businesses may opt for open-source tools. Selecting a tool that is easy to use, has the needed features, and is compatible with the organization’s database management system is essential.
Other tools are Ellie.ai, whose key features are Data Product Design, Data Modeling, Business Glossary, Collaboration, Reusability, and Open API.
dbt can be seen as a transformation modeling tool. Dagster can be used as a DAG modeling tool. And so forth. But you can also use ExaliDraw for Markdown-based drawing or draw.io (lots of templates for AWS, Azure, etc.) to draw architectures.
If you struggle to think in dbt tables and SQL is not the SQL is not the right language. One problem, SQL is a declarative language, which is a blessing and a curse. Especially if you do recurring queries, SQL gets nasty spaghetti coded, which again dbt helps with Jinja Templates, but as it’s not a language, without much in-built support. Reconfigured (not free) was built for people without years of experience, focusing heavily on business logic.
What about ChatGPT?
With all the hype of generative AI, specifically ChatGPT, it asks if AI can model our data.
If we recap the information from this series, most of it condenses into translating business requirements into a data model or semantics, also called business logic. As Chad Sanderson mentioned in his post:
The hard part of data development is understanding how code translates to the real world. Every business has a unique way of storing data. One customer ID could be stored in a MySQL DB. Another could be imported from Mixpanel as nested JSON, and a third might be collected from a CDP. All three IDs and their properties are slightly (or significantly) different and must be integrated.
As Chad continues, and I strongly agree, he says, “As smart as ChatGPT might be, it would need to understand the semantics of how these IDs coalesce into something meaningful to automate any step of modeling or ETL. The algorithm must understand the real world, grok how the business works, and dynamically tie the data model to that understanding.”
This is not AI or ML anymore; that is more intuition, long-term experience in working in the field, and experiencing some of the challenges to get a feeling for it.
If there is a place where you do not want to use automation, then it would be in data modeling and the overall data architecture. Here we need discussions and a deep understanding (domain knowledge) of the field and data modeling.
On the other hand, based on your data model, you can use AI to generate schemas or the physical model of a database. Brainstorm with ChatGPT based on your created data model if something is missing or if it sees something that can be changed. Sometimes you get great insights from it by providing the solution you came up with.
Maybe most importantly, a data engineer will always be needed to assess the outcome, understand and make sense of the data in front of him and understand the overall data flow of the organization. That can only be delegated for a while.
Data Modeling Frameworks
Besides tools, there are also helpful frameworks that help you model your data, asking the right questions.
ADAPT says that more than existing data modeling techniques like ER and dimensional modeling is required for OLAP database design. That’s why ADAPT is a modeling technique designed specifically for OLAP databases. It addresses the unique needs of OLAP data modeling. The basic building blocks of ADAPT are cubes and dimensions, which are the core objects of the OLAP multidimensional data model.
Although ADAPT was created for OLAP cubes in the old days, most techniques and frameworks also apply to regular data modeling nowadays. There are nine ADAPT database objects, and their symbols illustrate how to use logos with simple examples.
Why ADAPT over ER and dimensional modeling
ADAPT is considered superior to ER and dimensional modeling for several reasons:
- Incorporation of Both Data and Process: ADAPT incorporates both data and process in its approach, which is particularly useful for designing (OLAP) data marts.
- Logical Modeling: ADAPT emphasizes logical modeling. This prevents the designer from jumping to solutions before fully understanding the problem.
- Enhanced Communication: ADAPT enhances communication among project team members, providing an everyday basis for discussion. This improved communication leads to higher-quality software applications and data models.
- Comprehensive Representation: ADAPT allows for the representation of an (OLAP) application in its entirety without compromising the design due to the limitations of a modeling technique designed for another purpose.
In summary, ADAPT is a says to be a more flexible, comprehensive, and communication-enhancing modeling technique for OLAP databases compared to ER and dimensional modeling.
BEAM for Agile Data Warehousing
BEAM, or Business Event Analysis & Modeling, is a method for agile requirement gathering designed explicitly for Data Warehouses, created by Lawrence Corr and Jim Stagnitto in the Agile Data Warehouse Design book. BEAM centers requirement analysis around business processes instead of solely focusing on reports.
It uses an inclusive, collaborative modeling notation to document business events and dimensions in a tabular format. This format is easily understood by business stakeholders and easily implemented by developers. The idea is to facilitate interaction among team members, enabling them to think dimensionally from the get-go and foster a sense of ownership among business stakeholders.
The principles of BEAM include:
- Modelstorming: Business intelligence is driven by what users ask about their business. The technical setting is secondary. Storyboarding the data warehouse to discover and plan iterative development
- Asking stories with 7W: Telling dimensional data stories using the 7Ws (who, what, when, where, how many, why, and how—model by example, not abstraction; using data story themes.
- Visual modeling: Sketching timelines, charts, and grids to model complex process measurement – simply
- Business Driven: Well-documented data warehouses that take years to deploy will always need to be updated. Business users will look elsewhere. Usually with business units: “I need it now, or I’d rather stick with Excel solution ….”
- Customer Collaboration: End users’ business knowledge is your greatest resource.
- Responding to Change: Promoting change through the above actions, leading to weekly delivery cycles.
- Agile design documentation: Enhancing star schemas with BEAM dimensional shorthand notation
Lawrence Corr emphasizes the importance of asking the right questions or “data stories.” For instance, a customer’s product purchase could trigger questions about the order date, purchase and delivery locations, the quantity bought, purchase reason, and buying channel. A comprehensive picture of the business process is formed by carefully addressing these questions and providing the basis for technical specifications.
Common Data Model
There are examples of standard data models, so you do not need to start from scratch. The concept behind these approaches is to transform data contained within those databases into a standard format (data model) and a common representation (terminologies, vocabularies, coding schemes), then perform systematic analyses using a library.
For example, every model needs dimensions such as customer, region, etc. Some references I found you see below:
- Microsoft started The Common Data Model (CDM).
- OMOP Common Data Model for health care. It allows for systematically analyzing disparate observational databases.
- Elastic has its Common Schema (ECS) Reference.
- A bit more advanced, but a DBML (Database Markup Language), an open-source DSL language designed to define and document database schemas and structures, tries to create a standard for determining these models. DBML is intended to be simple, consistent, and highly readable. Works well with dbdiagram.io.
Applying Data Modeling / Best Practices?
At the end of all we learned, how do you apply data modeling in practice? This series gave you a good introduction and some pointers to look for when you start with data modeling.
Below I found the Best Practices for Data Modeling, which guides you through some of the critical steps from one, designing the data model to eleven, verifying and testing the application of your data analytics:
- Design the data model for visualization
- Recognize the demands of the business and aim for relevant results
- Establish a single source of truth
- Start with simple data modeling and expand later
- Double-check each step of your data modeling
- Organize business queries according to dimensions, data, filters, and order
- Perform computations beforehand to prevent disputes with end customers
- Search for a relationship rather than just a correlation
- Using contemporary tools and methods
- Enhanced data modeling for improved business results
- Verify and test the application of your data analytics
Another excellent best practice for dbt can be found in How to Write a High-Quality Data Model From Start to Finish Using dbt by Madison or applying dimension model with Kimball and dbt in Building a Kimball dimensional model with dbt (GitHub) by Jonathan.
The Future of Data Modeling
As I’ve delved into the intricacies of data modeling in the previous parts of this series, it’s clear that we’re witnessing a revolution in the way we perceive, manage, and interact with data. The digital age is characterized by information overload, and data modeling provides the framework to harness this data and transform it into valuable insights.
Reflecting on the future of data modeling, I can’t help but feel a sense of optimism mixed with anticipation. It’s thrilling to envision a world where data-driven decisions are the norm rather than the exception, and I genuinely believe we’re on the right track.
Emerging technologies like AI and machine learning promise to streamline further and automate the process of data modeling. There’s potential for AI to take on a more active role in data modeling, translating complex business logic into coherent data structures.
This vision, however, doesn’t mean we can become complacent. It’s more crucial than ever for data professionals to stay on top of evolving industry trends and techniques. And then, there’s the matter of the various data modeling tools available. The future will likely expand the number of open-source and proprietary options. But at the end of the day, selecting the right tool will always come down to your specific requirements, constraints, and the nature of your data.
We also must remember the importance of data architecture patterns. As data grows in volume and complexity, finding the most suitable architecture becomes increasingly critical. The choice between batch vs. streaming or data lake vs. data warehouse could significantly impact your data modeling efforts. So decisions around implementing a semantic layer, opting for a modern/open data stack, or navigating between centralized and decentralized patterns.
As I wrap up this series of data modeling, I encourage you all to keep learning, experimenting, and pushing the boundaries of what’s possible with your data. Remember, the essence of data modeling is simplicity, no matter how complex the underlying data might be. The future of data modeling is constant change. But one is inevitable; it will be critical for every company. I’m excited to see how the field evolves and how we, as data practitioners, continue to drive this evolution :).
Learning more about Data Modeling
Below are some resources and helpful comments I gathered from you all; thank you for the valuable feedback throughout writing this article.
- MongoDB Courses and Trainings, offering comprehensive guides to understanding and mastering MongoDB. Link
- Book: “Agile Data Warehouse Design: Collaborative Dimensional Modeling, from Whiteboard to Star Schema” by Lawrence Corr. This is a foundational text for understanding dimensional modeling in an agile context. Link
- Book: “Data Warehouse Blueprints: Business Intelligence in der Praxis” by Dani Schnider, Claus Jordan, Peter Welker, and Joachim Wehner. This is an excellent resource for German speakers. Link
- Book: “Data and Reality” - a timeless guide to data modeling, recommended by Jenna Jordan.
- Video: “Data Modeling in the Modern Data Stack” - a valuable resource for understanding the current state of data modeling. Link
- Article: “Introducing Entity-Centric Data Modeling for Analytics” on Preset - a good read for understanding an entity-centric approach to data modeling. Link
- Website: AgileData.io by Shane Gibson - a resource for reducing the complexity of managing data for Leaders, Analysts, and Consultants. Link
- Podcasts: “Shane Gibson - Making Data Modeling Accessible - The Joe Reis Show” on Spotify. Link
- Use well-defined ontologies that describe your business and relationships between components using common industry concepts, as suggested by Rayner Däppen. Link to comment
- Keep your data models updated, aligned, documented, validated, and verified with the business. This will ensure the models accurately reflect the current state of the company.
- Consider where to build the semantic/metrics layer to allow for fast, interactive analytics/dashboards and how to make it available to multiple tools to avoid various definitions. Link to comment
- From “Data Modeling in the Modern Data Stack” - computation is now the expensive part of data modeling, not storage. A hybrid approach is often used in modern data stacks to balance complexity, computational cost, data redundancy, and adaptability.
Originally published at Airbyte.com