What is Apache ORC?

Apache ORC (Optimized Row Columnar) is a highly efficient, open-source columnar storage format designed for massive analytical workloads in the Hadoop ecosystem. While Apache Parquet grew largely out of the Apache Spark and Impala communities, ORC was created explicitly to accelerate the original behemoth of big data analytics: Apache Hive.

Before the introduction of ORC in 2013, Hive queried data stored in inefficient text formats like CSV or RCFile. These formats caused immense CPU and disk I/O bottlenecks. The introduction of ORC dramatically reduced the size of Hadoop clusters by compressing data significantly better than older formats and accelerating Hive query speeds by orders of magnitude.

The Architectural Layout of ORC

Like Parquet, ORC is a columnar format. It groups values of the same column together to optimize analytical queries that only require a subset of the total columns. However, ORC employs a highly specific internal structure engineered for maximum read efficiency.

Stripes and Streams

An ORC file is physically divided into massive chunks called Stripes (typically 250MB in size). A Stripe is large enough to ensure highly efficient, contiguous disk reads from slow spinning hard drives (which were standard in legacy Hadoop environments).

Inside each Stripe, the data is further divided into Streams. A column in ORC is not stored as a single monolithic block. Instead, it is separated into distinct streams based on data types. For instance, an Integer column might be split into a Data Stream (containing the actual numbers) and a Present Stream (a highly compressed bitmask indicating exactly which rows contain NULL values). This intricate separation allows the query engine to completely avoid processing NULL values during complex aggregations.

Aggressive Compression and Indexing

The primary architectural differentiator for ORC is its intense focus on minimizing file size. In many benchmarks, ORC provides slightly better raw compression ratios than Parquet, making it highly attractive for organizations storing immense historical archives.

Type-Specific Encoding

ORC utilizes distinct encoding strategies tailored precisely to the data type. It does not treat all data as generic bytes. It uses Dictionary Encoding for repetitive strings, Delta Encoding for monotonically increasing integers (like timestamps or IDs), and highly optimized bit-packing. By applying the exact mathematical compression algorithm suited to the specific data structure, ORC minimizes the physical footprint aggressively.

Lightweight Indexes

Every ORC file contains incredibly detailed, built-in metadata. At the end of the file, the File Footer stores the schema and aggregate statistics.

Crucially, ORC maintains a Stripe Footer and Row Group Indexes. By default, ORC generates statistics (MIN, MAX, SUM, and COUNT) for every 10,000 rows. When a query engine like Trino executes a SQL filter (e.g., WHERE age > 65), it evaluates these lightweight indexes to precisely skip thousands of irrelevant rows without ever decompressing the actual Data Streams. This Predicate Pushdown capability is fundamental to modern data lake performance.

ORC vs Parquet in the Modern Lakehouse

Today, Apache ORC and Apache Parquet share nearly identical use cases. Both are highly optimized columnar formats supporting deep compression and predicate pushdown.

The choice between them historically depended entirely on the computation engine. Organizations deeply entrenched in the Apache Hive and Hortonworks ecosystem standardized on ORC, as Hive was heavily optimized to read it natively. Conversely, organizations standardizing on Apache Spark heavily favored Parquet, as Spark’s Catalyst Optimizer and Tungsten Engine were explicitly engineered around Parquet’s structure.

In the modern Open Data Lakehouse era, the ecosystem has largely consolidated around Parquet, primarily because dominant open table formats like Delta Lake and Apache Iceberg adopted Parquet as their default underlying storage standard. However, Iceberg fully supports ORC as a valid physical data file format, ensuring that legacy Hadoop organizations migrating to modern architectures can leverage their existing ORC datasets without forcing massive data rewrites.

Summary of Technical Value

Apache ORC fundamentally solved the storage bottleneck of the original Hadoop era. By providing a highly intricate columnar structure featuring type-specific compression and lightweight indexing, it allowed legacy engines to perform fast analytics over massive historical datasets. It remains a powerful, deeply optimized storage format for organizations managing massive-scale data lakes.

Learn More

To learn more about the Data Lakehouse, read the book “Lakehouse for Everyone” by Alex Merced. You can find this and other books by Alex Merced at books.alexmerced.com.