🧠 Second Brain

XTable (formerly: OneTable)

Last updated Mar 17, 2024

On 2023-02-02, Onehouse announced OneTable. Onehouse customers can now query their Apache Hudi tables as Delta Lake and Apache Iceberg tables, unlocking native accelerations and advanced features inside Databricks and Snowflake.

OneTable is an omni-directional converter for data lake table formats that facilitates interoperability across data processing systems and query engines. Currently, OneTable supports the widely adopted open-source table formats Apache Hudi, Apache Iceberg, and Delta Lake.

OneTable simplifies data lake operations by leveraging a common model for table representation. This allows users to write data in one format while still benefiting from integrations and features available in other formats. For instance, OneTable enables existing Hudi users to seamlessly work with Databricks’s Photon Engine or query Iceberg tables with Snowflake. Creating transformations from one format to another is straightforward and only requires the implementation of a few interfaces, which we believe will facilitate the expansion of supported source and target formats in the future.

More on Announcing OneTable.

The code: GitHub - onetable-io/onetable.

# History

The project was accepted by Apache on 2024-03-14 and now lives under the XTable name (GitHub, Website).

# Summary

Onehouse customers can now query their Hudi tables as Apache Iceberg and/or Delta Lake tables, unlocking native performance optimizations in everything from popular cloud query engines to cutting-edge open-source projects.

At the base of a data platform’s hierarchy of needs sits the fundamental need to ingest, store, manage, and transform data. Onehouse provides this foundational data infrastructure as a service to ingest and manage data in our customers’ lakes. As these data lakes grow in size and variety within an organization, it becomes imperative to decouple the foundational data infrastructure from the compute engines that process the data. Different teams can then leverage specialized compute frameworks, e.g. Apache Flink (stream processing), Ray (machine learning), or Dask (Python data processing), to solve the problems important to their organization. Decoupling allows developers to use one or more of these frameworks on a single instance of their data stored in an open format, without the tedium of copying it into another service where compute and storage are tightly coupled. Apache Hudi, Apache Iceberg, and Delta Lake have emerged as the leading open-source projects providing this decoupled storage layer. They offer a powerful set of primitives that add transaction and metadata layers (popularly referred to as table formats) in cloud storage, around open file formats like Apache Parquet.

# OneTable Sync: Working mechanism

If we step back and try to understand Hudi, Iceberg, and Delta Lake, fundamentally, they are similar. When the data is written to a distributed file system, these three formats have the same structure. They have a data layer, a set of Apache Parquet files, and a metadata layer on top of these data files, providing the necessary abstraction. Here is what it looks like at a high level.

Structure of the 3 table formats

The OneTable sync process translates table metadata by leveraging the existing APIs of the table formats. It reads the current metadata for a source table and writes out metadata for one or more target table formats. The resulting metadata is stored under a dedicated directory in the base path of your table, such as _delta_log for Delta Lake, metadata for Iceberg, and .hoodie for Hudi. This enables your existing data to be interpreted as if it were originally written using Delta, Hudi, or Iceberg.

For instance, if you are using Spark to read data, you can just use:
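A minimal PySpark sketch of such a read, assuming a Spark session whose classpath already includes the Hudi, Delta Lake, or Iceberg connector. The helper name `read_synced_table` is hypothetical; only the `spark.read.format(...).load(...)` chain is standard Spark.

```python
def read_synced_table(spark, table_format, base_path):
    """Read one table base path through the chosen table format.

    `spark` is an existing SparkSession. The helper name and the
    allow-list below are illustrative assumptions, not OneTable API.
    """
    allowed = {"hudi", "delta", "iceberg"}
    if table_format not in allowed:
        raise ValueError(f"unsupported table format: {table_format!r}")
    # Standard DataFrameReader chain. After a sync, the same base path
    # carries metadata for each target format, so only the format changes.
    return spark.read.format(table_format).load(base_path)


# Usage (requires a running SparkSession named `spark`):
# df = read_synced_table(spark, "delta", "path/to/data")
```

The point of the sketch is that the path stays fixed while the format string varies, since the underlying Parquet files are shared.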

More of it in OneTable: Interoperability for Hudi, Iceberg, Delta | Medium.
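Since each format keeps its metadata in its own directory under the table base path (`_delta_log`, `metadata`, `.hoodie`), a quick way to see which formats a table can currently be read as is to check for those directories. The sketch below assumes a table on a local filesystem; the function name `formats_present` is hypothetical, and a bare directory check does not validate the metadata itself.

```python
from pathlib import Path

# Metadata directory per table format, as named in the text above.
METADATA_DIRS = {
    "delta": "_delta_log",
    "iceberg": "metadata",
    "hudi": ".hoodie",
}


def formats_present(base_path):
    """Return the formats whose metadata directory exists under base_path."""
    base = Path(base_path)
    return sorted(
        fmt for fmt, dirname in METADATA_DIRS.items()
        if (base / dirname).is_dir()
    )
```

For tables in cloud object storage the same check would go through the storage client's listing call instead of `pathlib`.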

High-level overview of OneTable

# How to: Sync Process

To start using OneTable, clone the GitHub repository into your environment and build the required jars from source with Maven. The official documentation has a more detailed guide.
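The clone, build, and sync steps can be sketched as follows. This is an outline under stated assumptions, not the official procedure: the config keys (`sourceFormat`, `targetFormats`, `datasets`, `tableBasePath`, `tableName`) and the `--datasetConfig` flag are recalled from the project documentation and may differ between versions, and the jar path, bucket, and table name in the usage are placeholders. The helpers only build the config text and the command lines; they do not execute anything.

```python
import shlex


def render_dataset_config(source_format, target_formats, table_base_path, table_name):
    """Render a sync config as YAML text (key names are assumptions)."""
    lines = [f"sourceFormat: {source_format}", "targetFormats:"]
    lines += [f"  - {fmt}" for fmt in target_formats]
    lines += [
        "datasets:",
        f"  - tableBasePath: {table_base_path}",
        f"    tableName: {table_name}",
    ]
    # Plain string rendering: values are not quoted or escaped.
    return "\n".join(lines) + "\n"


def sync_commands(repo_url, bundled_jar, config_path):
    """Return the clone, build, and sync commands as argument lists."""
    return [
        ["git", "clone", repo_url],
        # Run inside the cloned checkout.
        ["mvn", "clean", "package"],
        # `bundled_jar` is whichever bundled utilities jar the build produces.
        ["java", "-jar", bundled_jar, "--datasetConfig", config_path],
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_dataset_config("HUDI", ["DELTA", "ICEBERG"],
                                "s3://my-bucket/my_table", "my_table"))
    for cmd in sync_commands("https://github.com/onetable-io/onetable",
                             "path/to/bundled-utilities.jar", "my_config.yaml"):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it easy to pass them to `subprocess.run` later without shell-quoting issues.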




Origin: LinkedIn // Twitter
Created 2023-05-24