Data Virtualization / Data Federation
Data Virtualization is particularly useful when you have multiple source systems built on different technologies, all with relatively fast response times. If your source systems aren't under heavy load from operational applications, Data Virtualization is worth considering. This approach lets you avoid moving, copying, or pre-aggregating data. Instead, you create a Semantic Layer where you build your business models (such as cubes), and queries against this virtualization layer are routed to the appropriate data source.
For example, Dremio uses Apache Arrow, a columnar in-memory technology, to cache and optimize heavily in memory, resulting in impressively fast response times.
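To make the idea concrete, here is a minimal sketch of a semantic layer in Python. Two in-memory SQLite databases stand in for independent source systems (the "ERP" and "CRM" names, view names, and queries are all illustrative, not from any real product); the layer maps business view names to a source plus a query, so nothing is copied and each request is executed against its source on demand.

```python
import sqlite3

# Two independent "source systems" (in-memory SQLite stands in for,
# say, an ERP database and a CRM database; names are illustrative).
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
erp.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Acme')")

# The semantic layer: logical business views mapped to (source, query).
# No data is moved; each view resolves against its source when queried.
SEMANTIC_LAYER = {
    "total_revenue": (erp, "SELECT SUM(amount) FROM orders"),
    "customer_names": (crm, "SELECT name FROM customers"),
}

def query_view(view_name):
    """Resolve a business view and push its query down to the source."""
    source, sql = SEMANTIC_LAYER[view_name]
    return source.execute(sql).fetchall()

print(query_view("total_revenue"))  # -> [(150.0,)]
```

Real platforms add query optimization, joins across sources, and caching on top of this basic indirection, but the core contract is the same: consumers ask for business views, not physical tables.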
# Tools
- Dremio → Apache Arrow in-memory technology
- Cisco Data Virtualization (formerly Composite Software, now part of TIBCO)
- Denodo Platform for Data Virtualization
- Informatica Data Virtualization
- IBM Big SQL
- Incorta
# Data Virtualization vs. Data Federation
Data Virtualization is the broader concept that includes:
- Creating an abstraction layer over multiple data sources
- Providing a unified view of data
- Managing query optimization and execution
- Handling caching and performance optimization
- Creating semantic models and business views
Data Federation is specifically the technique within data virtualization that:
- Handles the distributed query execution
- Manages connections to different data sources
- Routes queries to appropriate sources
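The federation responsibilities above can be sketched with a toy dispatcher. This is a deliberately naive illustration (the catalog, database names, and string-based table extraction are assumptions for the example, not how production engines parse SQL): a catalog records which source owns which table, and incoming queries are routed accordingly.

```python
import sqlite3

# Two source systems; names ("sales", "hr") are purely illustrative.
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE orders (id INTEGER)")
sales_db.execute("INSERT INTO orders VALUES (1)")

hr_db = sqlite3.connect(":memory:")
hr_db.execute("CREATE TABLE employees (id INTEGER)")
hr_db.execute("INSERT INTO employees VALUES (7)")

# Federation catalog: which source system owns which table.
CATALOG = {"orders": sales_db, "employees": hr_db}

def route(sql):
    """Route a query to the source that owns the referenced table.
    Naive parsing: grab the first word after FROM (uppercase only)."""
    table = sql.split("FROM")[1].split()[0].strip().lower()
    return CATALOG[table].execute(sql).fetchall()

print(route("SELECT id FROM orders"))  # -> [(1,)]
```

A real federation engine also plans cross-source joins, manages connection pools, and pushes filters down to each source, but routing by catalog lookup is the essential step.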
# Alternatives, or When Not to Use a Data Lake or Lakehouse
Data virtualization can be an alternative strategy for addressing data inconsistencies and reducing data governance costs. Other use cases include:
- Rapid prototyping for batch data movement
- Self-service analytics via a virtual sandbox
- Compliance with regulatory constraints on data movement
This concept is often mentioned in conjunction with Data Lakes, as you can effectively create one by connecting various sources into a single, virtual Data Lake. Instead of moving data from point A to B, you have a Semantic Layer where you define your business logic, and queries are pushed down to the source systems only when needed. Pre-caching mechanisms built on Apache Arrow, an optimized in-memory columnar format, can deliver ultra-fast response times. However, this approach does require significant memory resources.
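The pre-caching idea can be illustrated with a stdlib-only stand-in (Apache Arrow itself is a columnar memory format used by engines like Dremio; this sketch only shows the result-caching behavior, and the class, TTL value, and `slow_source` function are hypothetical): a cached query result is served from memory until it expires, so repeated queries skip the round-trip to the source.

```python
import time

class QueryCache:
    """Stdlib stand-in for the in-memory caching idea behind
    virtualization engines: keep a query's result for a TTL so
    repeated queries avoid hitting the source system again."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # sql -> (timestamp, rows)

    def get(self, sql, fetch_fn):
        entry = self._store.get(sql)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]           # cache hit: no source round-trip
        rows = fetch_fn(sql)          # cache miss: push down to source
        self._store[sql] = (time.monotonic(), rows)
        return rows

calls = []
def slow_source(sql):
    calls.append(sql)                 # simulate an expensive source query
    return [("result",)]

cache = QueryCache(ttl_seconds=60)
cache.get("SELECT 1", slow_source)
cache.get("SELECT 1", slow_source)    # second call served from memory
print(len(calls))                     # -> 1
```

This also makes the memory trade-off visible: every cached result set lives in RAM until it expires, which is why virtualization layers with aggressive caching need substantial memory.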
“You don’t need to be Google or Netflix to generate (maybe not a Petabyte) but a Terabyte of useless data with destructive patterns. The same goes for wasted computing. You can have the best semantic view, but if you don’t mind the technical layer (e.g: leverage materialized view, be mindful of the refresh time), the cost can go bananas.” - Mehdi Ouazza on “semantic view” vs “technical layer” and computing Link
Dremio takes this concept further by building entire Data Lakehouses on top of it. For more information, see Build an open data lakehouse with Dremio and Airbyte. They also have an interesting concept called Data Reflections (see Getting Started With Data Reflections by Dremio).
# Virtualization and Comparison to Data Warehouse/Lake
For more information on this topic, see Data Virtualization vs Data Warehouse.
# Open Questions
- What’s the difference between a Semantic Layer and Data Virtualization?
- How does a Data Lakehouse relate to a Semantic Layer? Dremio calls itself a lakehouse platform, yet offers many semantic layer features; are the two essentially the same?
- The differences are discussed in Alternatives, or When Not to Use a Data Lake or Lakehouse
Origin:
OLAP, what’s coming next? | ssp.sh