Apache Parquet has become the de facto standard for storing data used in analytics workloads, and has seen very broad adoption as a free and open-source storage format. When used as the underlying storage layer for Apache Iceberg, Parquet is also a foundational building block in modern lakehouse architectures, which enable warehouse-like capabilities on cost-effective object storage.
Basic Definition: What is Apache Parquet?
Apache Parquet is a data file format designed to support fast and efficient data processing and retrieval for complex data, with several notable characteristics:
1. Columnar: Unlike row-based formats such as CSV or Avro, Apache Parquet is column-oriented. Let’s take a moment to explain what this means. Data is often generated and more easily conceptualized in rows. Business users might be used to thinking of data in terms of Excel spreadsheets, where we can see all the data relevant to a specific record in one neat and organized row; however, for large-scale analytical querying, columnar storage offers significant advantages in terms of cost and performance.
2. Open-source: Parquet is free to use and open source under the Apache License 2.0, and is compatible with Hadoop and most other modern data processing frameworks.
3. Self-describing: In addition to data, a Parquet file contains metadata including schema and structure. Each file stores both the data and the standards used for accessing each record – making it easier to decouple services that write, store, and read Parquet files.
Advantages of Parquet Columnar Storage – Why Should You Use It?
The characteristics of the Apache Parquet file format described above yield several distinct benefits when it comes to storing and analyzing large volumes of data. Let’s look at a few of them in more detail.
Flexible and Efficient Compression
File compression is the act of taking a file and making it smaller. In Parquet, compression is performed column by column, and the format is built to support flexible compression options and extensible encoding schemes per data type – e.g., different encodings can be used for compressing integer and string data.
Parquet data can be compressed using these encoding methods:
- Dictionary encoding: this is enabled automatically and dynamically for data with a small number of unique values.
- Bit packing: integers are usually stored with 32 or 64 bits dedicated to each value. Bit packing allows small integers to be stored more efficiently.
- Run length encoding (RLE): when the same value occurs multiple times, a single value is stored once along with the number of occurrences. Parquet implements a combined version of bit packing and RLE, in which the encoding switches based on which produces the best compression results.
Enabling High-Performance Querying
As opposed to row-based file formats like CSV, Parquet is optimized for performance. When running queries against Parquet files, you can quickly narrow in on only the relevant data. Moreover, the amount of data scanned will be significantly smaller, resulting in less I/O usage.
To understand this, let’s look a bit deeper into how Parquet files are structured.
As we mentioned above, Parquet is a self-describing format, so each file contains both data and metadata. A Parquet file is composed of a header, one or more row groups, and a footer holding the file metadata. Within each row group, values from the same column are stored together in column chunks.
This structure is well-optimized both for fast query performance and for low I/O (minimizing the amount of data scanned). For example, if you have a table with 1,000 columns, you will usually only query using a small subset of them. Using Parquet files enables you to fetch only the required columns and their values, load those into memory, and answer the query. If a row-based file format like CSV were used, the entire table would have to be loaded into memory, resulting in increased I/O and worse performance.
Support for Schema Evolution
When using columnar file formats like Parquet, users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. In these cases, Parquet supports automatic schema merging among these files.
Non-proprietary Storage Reduces Vendor Lock-in
As we’ve mentioned above, Apache Parquet is part of the open-source Apache Hadoop ecosystem. Development efforts around it are active, and it is being constantly improved and maintained by a strong community of users and developers.
Storing your data in open formats means you avoid vendor lock-in and increase your flexibility, compared to proprietary file formats used by many modern high-performance data warehouses. This means you can use a broad range of query engines, within the same data lake or a lakehouse architecture rather than being tied down to a specific data warehouse vendor.
Designed for Modern Analytical Workloads
Complex data such as logs and event streams would need to be represented as a table with hundreds or thousands of columns, and many millions of rows. Storing this table in a row-based format such as CSV would mean:
- Queries will take longer to run since more data needs to be scanned, rather than only querying the subset of columns we need to answer a query (which typically requires aggregating based on dimension or category)
- Storage will be more costly since CSVs are not compressed as efficiently as Parquet
Parquet provides better compression and improved performance out of the box, and as we’ve seen above, it enables you to query data vertically – column by column.
Apache Parquet Use Cases – When Should You Use It?
While this isn’t a comprehensive list, a few telltale signs that you should be storing data in Parquet include:
- When you’re working with very large amounts of data. Parquet is built for performance and effective compression. Various benchmarking tests comparing processing times for SQL queries on Parquet versus formats such as Avro or CSV have found that querying Parquet results in significantly faster queries.
- When your full dataset has many columns, but you only need to access a subset. Due to the growing complexity of the business data you are recording, you might find that instead of collecting 20 fields for each data event you’re now capturing 100+. While this data is easy to store in object storage such as Amazon S3, querying it will require scanning a significant amount of data if stored in row-based formats. Parquet’s columnar and self-describing nature allows you to only pull the required columns needed to answer a specific query, reducing the amount of data processed.
- When you want multiple services to consume the same data from object storage. While database vendors prefer you store your data in a proprietary format that only their tools can read, modern data architecture is biased towards decoupling storage from compute. If you want to work with multiple analytics services to answer different use cases, you should store data in Parquet.
Parquet vs ORC
Apache Parquet and Optimized Row Columnar (ORC) are two popular big data file formats. Both have unique advantages depending on your use case:
- ORC offers better write efficiency: ORC is often better suited for write-heavy operations and tends to provide better writing speeds than Parquet, especially when dealing with evolving schemas.
- Parquet is better suited for reading data: Parquet excels in write-once, read-many analytics scenarios, offering highly efficient data compression and decompression. It supports data skipping, which allows queries to read specific column values without scanning entire rows, minimizing I/O. This makes Parquet useful in scenarios where you have a large number of columns in the dataset and need to access only specific subsets of the data.
- Compatibility: ORC is highly compatible with the Hive ecosystem, providing benefits like ACID transaction support when working with Apache Hive. However, Parquet offers broader accessibility, supporting multiple programming languages like Java, C++, and Python, making it usable in almost any big data setting. It is also used across multiple query engines including both open-source options such as Presto, Trino, and Flink; as well as modern data platforms such as Amazon Athena, Snowflake, Databricks and Google BigQuery.
- Compression: Both ORC and Parquet offer multiple compression options and support schema evolution. However, Parquet is often chosen over ORC when compression is the primary criterion, as it results in smaller file sizes with extremely efficient compression and encoding schemes. It can also support specific compression schemes on a per-column basis, further optimizing stored data.
Parquet, Iceberg, and the Lakehouse
Parquet is an analytics-optimized file format for storage – but at the end of the day, these are still just data files stored on object storage. To make this useful in analytics and AI workloads, organizations need to build a data management, governance, and indexing layer which allows ubiquitous access to data (similar to what’s available natively in a data warehouse). This is where table formats like Apache Iceberg come into play, creating a powerful combination that forms the backbone of open lakehouse architectures.
Apache Iceberg provides a management layer on top of data lake/lakehouse storage, adding database-like capabilities such as ACID transactions, schema evolution, and time travel to data stored in cost-effective cloud object storage. The underlying storage for Iceberg will most commonly be Parquet, although columnar formats such as ORC are also supported.
Parquet and Iceberg are complementary technologies. Think of Parquet as handling the “how” of data storage—defining how bytes are organized and compressed—while Iceberg manages the “what” of data organization—tracking schemas, partitions, and metadata across collections of Parquet files.
Both Parquet and Iceberg are open technologies, which makes them a good choice for organizations looking to avoid vendor lock-in. You can query Parquet-based Iceberg tables using any compatible engine – such as Snowflake, Apache Spark, Trino, Presto, or Amazon Athena. This interoperability allows data teams to choose the best tools for each use case while working with a unified data layer.
Together, Parquet and Iceberg enable organizations to adopt an open lakehouse architecture which dramatically reduces costs compared to traditional data warehouses while supporting both real-time streaming and batch analytics workloads. The result is a flexible, high-performance architecture that scales with business needs and adapts to evolving data requirements, built on open standards that ensure long-term viability and platform independence.
Next Step: Go From Parquet Storage to Open Data Lakehouse with Qlik
Qlik Open Lakehouse is a fully managed capability within Qlik Talend Cloud that makes it easy, effortless, and cost-effective for users to ingest, process, and optimize large amounts of data in Apache Iceberg-based lakehouses.
With Qlik Open Lakehouse, you can ingest streaming or batch data in Apache Parquet format on Amazon S3 – and automatically load and optimize data directly in Apache Iceberg tables with just a few clicks. Qlik’s Adaptive Iceberg Optimizer continuously manages and optimizes your Iceberg tables to ensure high performance and low costs, regardless of which query engine you are using downstream.
Article source: https://www.qlik.com/blog/
For information about Qlik™, click here: qlik.com.
For specific and specialized solutions from QQinfo, click here: QQsolutions.
To stay up to date with the latest news in the field, explanations of unique solutions, and our personal perspectives on the world of management, data, and analytics, click here: QQblog!
