# Data Virtualization / Data Federation

Data Virtualization is particularly useful when you have multiple source systems built on different technologies, all with relatively fast response times. If those systems aren't under heavy operational load, Data Virtualization is worth considering: it lets you avoid moving, copying, or pre-aggregating data. Instead, you create a Semantic Layer where you build your business models (such as cubes), and queries against this virtualization layer are routed to the appropriate data source.
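
As a rough illustration, here is a minimal sketch using DuckDB from Python: two local files stand in for independent source systems, and the business model is just a view, so no data is copied until query time. The file paths, schemas, and the `revenue_by_country` model are all invented for the example.

```python
import duckdb

con = duckdb.connect()

# Two independent "source systems": a Parquet file and a CSV file
# (hypothetical paths and schemas).
con.execute("""
    CREATE VIEW orders AS
    SELECT * FROM read_parquet('warehouse/orders.parquet')
""")
con.execute("""
    CREATE VIEW customers AS
    SELECT * FROM read_csv_auto('crm/customers.csv')
""")

# The Semantic Layer: a business model defined as a view, no data copied.
con.execute("""
    CREATE VIEW revenue_by_country AS
    SELECT c.country, sum(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country
""")

# Only at query time is data actually read from the underlying sources.
print(con.execute("SELECT * FROM revenue_by_country").fetchall())
```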

For example, Dremio uses Apache Arrow, which does extensive caching and optimization in memory, resulting in impressively fast response times.
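
Arrow itself is a columnar in-memory format rather than a cache, but the caching pattern it enables looks roughly like the following sketch with `pyarrow` (the table contents and the `get_table` helper are invented): the first access pays the cost of fetching from the source, and every subsequent query runs against the columnar copy held in memory.

```python
import pyarrow as pa
import pyarrow.compute as pc

cache: dict[str, pa.Table] = {}

def fetch_from_source() -> pa.Table:
    # Stand-in for an expensive query against a remote source system.
    return pa.table({
        "region": ["EU", "US", "EU", "APAC"],
        "amount": [120.0, 80.0, 45.0, 200.0],
    })

def get_table(name: str) -> pa.Table:
    if name not in cache:        # first hit goes to the source...
        cache[name] = fetch_from_source()
    return cache[name]           # ...repeated hits are served from memory

sales = get_table("sales")
eu_only = sales.filter(pc.equal(sales["region"], "EU"))
print(pc.sum(eu_only["amount"]).as_py())  # 165.0
```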

# Tools

# Data Virtualization vs. Data Federation

Data Virtualization is the broader concept that includes:

- Creating an abstraction layer over multiple data sources
- Providing a unified view of data
- Managing query optimization and execution
- Handling caching and performance optimization
- Creating semantic models and business views

Data Federation is specifically the technique within data virtualization that (see the sketch after this list):

- Handles the distributed query execution
- Manages connections to different data sources
- Routes queries to appropriate sources
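
A minimal sketch of that routing idea in Python, with two in-memory SQLite databases standing in for separate source systems (the `catalog` mapping, table names, and data are all invented):

```python
import sqlite3

# Two stand-in source systems, each with its own database and schema.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Beta")])

erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
erp.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 99.0), (1, 10.0), (2, 5.0)])

# The federation layer: a catalog maps each table to the source that owns it,
# so every subquery is routed to the right system.
catalog = {"customers": crm, "orders": erp}

def scan(table: str) -> list[tuple]:
    return catalog[table].execute(f"SELECT * FROM {table}").fetchall()

# Distributed execution: fetch from both sources, join in the federation layer.
names = dict(scan("customers"))                      # id -> name
totals: dict[int, float] = {}
for customer_id, amount in scan("orders"):
    totals[customer_id] = totals.get(customer_id, 0.0) + amount

for cid, total in totals.items():
    print(names[cid], total)                         # Acme 109.0, Beta 5.0
```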

# Alternatives, or When Not to Use a Data Lake or Lakehouse

Data virtualization can be an alternative strategy for addressing data inconsistencies and reducing data governance costs. Other use cases include:

  1. Rapid prototyping for batch data movement
  2. Self-service analytics via a virtual sandbox
  3. Compliance with regulatory constraints on data movement

This concept is often mentioned in conjunction with Data Lakes, as you can effectively create one by connecting various sources into a single virtual Data Lake. Instead of moving data from point A to B, you have a Semantic Layer where you define your business logic, and queries are pushed down to the source systems only when needed. Pre-caching mechanisms based on Apache Arrow, an optimized in-memory technology, can deliver ultra-fast response times. However, this approach does require significant memory resources.
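
The "pushed down only when needed" part is essentially predicate pushdown: the virtualization layer embeds the filter into the query it sends to the source instead of pulling the whole table over the wire and filtering locally. A minimal sketch (the table, columns, and `scan_with_pushdown` helper are invented):

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (country TEXT, amount REAL)")
source.executemany("INSERT INTO events VALUES (?, ?)",
                   [("CH", 10.0), ("DE", 20.0), ("CH", 5.0)])

def scan_with_pushdown(table: str, predicate: str | None = None) -> list[tuple]:
    # Push the filter into the SQL executed by the source system,
    # so only matching rows ever leave it.
    sql = f"SELECT * FROM {table}"
    if predicate:
        sql += f" WHERE {predicate}"
    return source.execute(sql).fetchall()

print(scan_with_pushdown("events", "country = 'CH'"))  # [('CH', 10.0), ('CH', 5.0)]
```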

“You don’t need to be Google or Netflix to generate (maybe not a Petabyte) but a Terabyte of useless data with destructive patterns. The same goes for wasted computing. You can have the best semantic view, but if you don’t mind the technical layer (e.g: leverage materialized view, be mindful of the refresh time), the cost can go bananas.” - Mehdi Ouazza on “semantic view” vs “technical layer” and computing Link

Dremio takes this concept further by building entire Data Lakehouses on top of it. For more information, see Build an open data lakehouse with Dremio and Airbyte. They also have an interesting concept called Data Reflections (see Getting Started With Data Reflections by Dremio).
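
A reflection is essentially a precomputed materialization that matching queries are transparently redirected to. The following is not Dremio's implementation, just a rough sketch of the idea in Python with DuckDB (table names and data are invented); it also illustrates the refresh cost from the quote above:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES ('EU', 10.0), ('EU', 5.0), ('US', 7.0)) t(region, amount)
""")

# Reflection-like materialization: precompute the aggregate once...
con.execute("""
    CREATE TABLE orders_by_region AS
    SELECT region, sum(amount) AS revenue FROM orders GROUP BY region
""")

def revenue(region: str) -> float:
    # ...and serve matching queries from it instead of rescanning orders.
    row = con.execute(
        "SELECT revenue FROM orders_by_region WHERE region = ?", [region]
    ).fetchone()
    return row[0] if row else 0.0

print(revenue("EU"))  # 15.0

# The catch from the quote above: when base data changes, the
# materialization must be refreshed, and that refresh costs compute too.
con.execute("""
    CREATE OR REPLACE TABLE orders_by_region AS
    SELECT region, sum(amount) AS revenue FROM orders GROUP BY region
""")
```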

# Virtualization and Comparison to Data Warehouse/Lake

For more information on this topic, see Data Virtualization vs Data Warehouse.

# Open Questions


Origin: OLAP, what’s coming next? | ssp.sh