This article provides a comprehensive perspective on lakehouses: their essential building blocks, high-level architecture, and key considerations for building your own open data lakehouse.
What is a Data Lakehouse?
A data lakehouse is a data management architecture that combines key capabilities of data lakes and data warehouses into a unified platform. It combines the benefits of a data lake, such as low-cost storage and broad data access, with the benefits of a data warehouse, such as data structure, performance, and management features. Lakehouses are increasingly built on open data formats and open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake to provide flexibility and interoperability.

What is Apache Iceberg?
Apache Iceberg is an open table format designed to manage large-scale data lakehouses and enable high-performance analytics on open data formats. It allows files to be treated as logical table entities, making it well-suited for lakehouse architectures.
With Iceberg, users can store data in cloud object stores and process or query it with many different engines, gaining flexibility and interoperability across platforms.
Iceberg provides key features such as ACID transactions, hidden partitioning, time travel, and schema evolution, ensuring high performance and data integrity.
Additionally, Apache Iceberg is backed by a strong open-source community, making it a reliable, versatile, and open solution for modern data management needs.
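To make these features concrete, here is a minimal sketch (not production code) using PySpark with the Apache Iceberg runtime. The catalog name (`demo`), warehouse path, table name, and package version are illustrative assumptions; adjust them to your environment. It creates an Iceberg table with hidden partitioning, commits an ACID write, evolves the schema, and reads an earlier snapshot via time travel.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    # Assumed Iceberg runtime package; match the version to your Spark/Scala build.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A local file-system ("hadoop") catalog named `demo` for illustration only;
    # in a real lakehouse this would point at an object store and a shared catalog.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")

# Hidden partitioning: partition by day(ts) without exposing a partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT, category STRING, ts TIMESTAMP)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# ACID write: each commit produces a new table snapshot.
spark.sql("INSERT INTO demo.db.events "
          "VALUES (1, 'click', TIMESTAMP '2024-01-01 10:00:00')")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN user_id BIGINT")

# Time travel: inspect the snapshot history, then read the table as of an earlier snapshot.
first = (spark.sql("SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at")
         .first()["snapshot_id"])
spark.sql(f"SELECT * FROM demo.db.events VERSION AS OF {first}").show()
```

Because the table's state lives in the catalog and in open files under the warehouse path, the same table could be read by any other Iceberg-aware engine pointed at that catalog, which is where the interoperability mentioned above comes from.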
More articles about Apache Iceberg here:
Introduction to Iceberg Lakehouses
For a presentation on Iceberg Lakehouse, check this link: https://videos.qlik.com/watch/1RTmeTkbJAUaq8wjokKne1.
Data Lakehouse Features and Benefits
A lakehouse platform gives data analysts and AI engineers access to the most recent and broadest data sets for business intelligence, analytics, generative AI, and machine learning. Having a single system to manage also simplifies the enterprise data infrastructure and lets analysts and data scientists work more efficiently.
Data Lakehouse vs Data Warehouse vs Data Lake
Historically, we’ve had two primary options for a big data repository: a data lake or a data warehouse. To support analytics, AI, data science, and machine learning, you have likely had to maintain both systems simultaneously and link them together, which often leads to data duplication, security challenges, and additional infrastructure expense. Data lakehouses can help overcome these issues.
Data Lakehouse Architecture
A data lakehouse typically consists of six key layers, described below: an ingestion layer, a storage layer, a physical data (file format) layer, a table format/metadata layer, a catalog layer, and a query/compute layer.
Components of a Lakehouse Architecture
The list below dives into the details of each of these layers to give a better understanding of the lakehouse architecture.
- Ingestion Layer: Ingests data from various sources into the lakehouse through batch and real-time pipelines that use change data capture (CDC) or streaming. It should make it easy to load high volumes of data into the lakehouse in near real time.
- Storage Layer: Stores all types of data (structured, semi-structured, unstructured) in a single unified platform, typically a cloud object store such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. Data can be organized into raw, transformed, and business-ready buckets, with the necessary transformation and cleansing applied along the way.
- Physical Data Layer: Open file formats define how a lakehouse writes and reads data. They focus on efficient storage and compression and significantly impact speed and performance, specifying how the raw bytes representing records and columns are organized and encoded on disk or in a cloud object store such as Amazon S3. Common open file formats for lakehouses include Apache Parquet, Apache Avro, and ORC.
- Table Format/Metadata Layer: The differentiating factor between a data lake and a lakehouse is the table format, or table metadata layer. It provides an abstraction on top of the physical data layer that makes it easier to organize, query, and update data. Common open table formats include Apache Iceberg, Apache Hudi, and Delta Lake; they store the information about which objects belong to a table, enabling SQL engines to see a collection of files as a table with rows and columns that can be queried and updated transactionally.
- Catalog Layer: A catalog is a central registry within the lakehouse that tracks and manages the metadata of the tables underneath. It acts as the source of truth for the current state of each table, including its schema, partitions, and data locations, allowing different compute engines to access and manipulate lakehouse tables consistently. Examples include the AWS Glue Data Catalog, Snowflake Open Catalog, Polaris, Unity Catalog, the Hive catalog, Project Nessie, and REST catalogs.
- Query/Compute Layer: Provides the processing power to analyze and query the data stored in the storage layer, typically through distributed processing engines such as Apache Spark, Presto, or Hive, or other cloud data engines that handle large datasets efficiently. This layer enables users to access and analyze lakehouse data with diverse tools and applications such as query engines, BI dashboards, data science platforms, and SQL clients. A minimal end-to-end sketch tying these layers together follows this list.
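To illustrate how these layers fit together, here is a minimal sketch using the PyIceberg library. The catalog endpoint, warehouse bucket, namespace, and table name are hypothetical, and a real deployment would also need object-store credentials. It registers an Iceberg table in a REST catalog (catalog layer), appends Arrow rows that land as Parquet files in the object store (physical data and storage layers), and scans the table back (query/compute layer).

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType

# Catalog layer: the REST catalog is the source of truth for table metadata.
# The URI and warehouse below are placeholders for illustration only.
catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://my-lakehouse-bucket/warehouse",
    },
)

# Table format / metadata layer: define a schema and register an Iceberg table.
schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=False),
    NestedField(field_id=2, name="category", field_type=StringType(), required=False),
)
catalog.create_namespace("analytics")  # assumes the namespace does not exist yet
table = catalog.create_table("analytics.events", schema=schema)

# Physical data + storage layers: appended rows are written as Parquet files
# under the warehouse path, and Iceberg tracks them in manifest/metadata files.
rows = pa.table({"id": [1, 2], "category": ["click", "view"]})
table.append(rows)

# Query/compute layer: any Iceberg-aware engine can now read the same table;
# here we simply scan it back into an Arrow table from Python.
print(table.scan().to_arrow())
```

Because the table's current state lives in the catalog and its data lives as open files in the object store, the same table can then be queried by any of the engines mentioned in the query/compute layer above.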
Article source: www.qlik.com.

For information about Qlik™, click here: qlik.com.
For specific and specialized solutions from QQinfo, click here: QQsolutions.
To keep up with the latest news in the field, unique solutions explained, and our personal perspectives on the world of management, data, and analytics, click here: QQblog!