Data Engineering Lifecycle

Last updated by Simon Späti

A data engineer oversees the entire data engineering process: gathering data from diverse sources and making it available for downstream applications. Doing this well requires a deep understanding of each stage in the data engineering lifecycle, plus the ability to evaluate data tools against criteria such as cost, speed, flexibility, scalability, user-friendliness, reusability, and interoperability.
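
One way to make such a tool evaluation concrete is a weighted decision matrix. The sketch below is only an illustration: the criteria come from the paragraph above, while the weights, the 1-5 rating scale, and the tool names `tool_a` and `tool_b` are hypothetical assumptions, not recommendations.

```python
from __future__ import annotations

# Criteria taken from the text; the weights are illustrative assumptions.
WEIGHTS = {
    "cost": 0.20,
    "speed": 0.15,
    "flexibility": 0.15,
    "scalability": 0.20,
    "user_friendliness": 0.10,
    "reusability": 0.10,
    "interoperability": 0.10,
}


def weighted_score(ratings: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Return the weighted average of the ratings across all criteria."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(ratings[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical tools rated 1 (poor) to 5 (excellent), for illustration only.
candidates = {
    "tool_a": {"cost": 4, "speed": 3, "flexibility": 5, "scalability": 3,
               "user_friendliness": 4, "reusability": 4, "interoperability": 5},
    "tool_b": {"cost": 2, "speed": 5, "flexibility": 3, "scalability": 5,
               "user_friendliness": 3, "reusability": 3, "interoperability": 3},
}

# Rank candidates from best to worst weighted score.
ranking = sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)
for name in ranking:
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```

The weights encode what matters for a given team, so changing them can change the ranking. The value of the exercise is mostly in making those trade-offs explicit.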


Illustration of the data engineering lifecycle, from Fundamentals of Data Engineering

Another perspective can be seen in this Tweet:

For more insights, see Data Engineering Architecture, for example the architecture published by A16z.

Case Study: Open Data Stack Project

The Open Data Stack project exemplifies practical application, incorporating key lifecycle components like ingestion, transformation, analytics, and machine learning.
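
To show how these lifecycle components hand data to one another, here is a minimal sketch in plain Python. It is not taken from the Open Data Stack project: the function names, the in-memory `RAW_EVENTS` source, and the record layout are all hypothetical, and the machine learning stage is left out for brevity.

```python
from collections import Counter

# Hypothetical raw events standing in for an external source system.
RAW_EVENTS = [
    {"user": "ana", "action": "view "},
    {"user": "ben", "action": "BUY"},
    {"user": "ana", "action": "buy"},
    {"user": None, "action": "view"},  # incomplete record, dropped in transform()
]


def ingest(source):
    """Ingestion: pull records from a source into the pipeline unchanged."""
    return list(source)


def transform(records):
    """Transformation: drop incomplete records and normalize the action field."""
    return [
        {"user": r["user"], "action": r["action"].strip().lower()}
        for r in records
        if r["user"] and r["action"]
    ]


def analyze(records):
    """Analytics: count how often each action occurs."""
    return Counter(r["action"] for r in records)


# Each stage consumes the output of the previous one.
action_counts = analyze(transform(ingest(RAW_EVENTS)))
print(dict(action_counts))
```

In a real stack each function would typically be a separate tool or job. The stages are kept as small, composable steps here so that any one of them can be swapped without touching the others.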

Further reading: The Evolution of The Data Engineer: Past, Present & Future.

# Undercurrents

These are the foundational elements of the lifecycle, pervasive throughout its various stages: security, data management, DataOps, data architecture, orchestration, and software engineering. The lifecycle cannot function effectively without these integral undercurrents.

Below are these core principles of the engineering lifecycle, supplemented with my own thoughts and additions.

# Data Lifecycle

Related: the Data Lifecycle and the Data Canvas.

# Let’s not repeat ourselves

With every hype cycle, we tend to repeat ourselves with ever-newer tech.

But let’s integrate new data tech into the engineering lifecycle instead of creating new siloed work.

The diagram below illustrates how, along the hype cycle and its chasm, engineering behavior tends to skip fundamentals, adopting ever-new tools instead of sustaining architectural patterns that work.

```mermaid
graph LR
    subgraph "Engineering Behavior"
        %% Main path: from a quick fix to a sustainable architecture
        P1[Problem Discovery] -->|"Search for Quick Solution"| P2[Build/Adopt New Tool]
        P2 -->|"Technical Debt Accumulates"| P3[Maintenance Challenges]
        P3 -->|"Research Existing Solutions"| P4[Discovery of Established Patterns]
        P4 -->|"Integration & Optimization"| P5[Sustainable Architecture]

        %% Dashed edges: drivers that push teams toward building or adopting a new tool
        P6[NIH Syndrome] -.->|"Not Invented Here"| P2
        P7[Learning Curve Avoidance] -.->|"Skip Fundamentals"| P2
    end

    %% Styling for the nodes defined above
    classDef engBehavior fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px
    class P1,P2,P3,P4,P5,P6,P7 engBehavior
```

Created 2022-12-21