Today Apache ORC became a top-level project at the Apache Software
Foundation. This is a major milestone for the project and reflects the
momentum it has built.
Back in January 2013, we created the ORC file format as part of an
initiative to massively speed up Apache Hive and improve the storage
efficiency of data stored in Apache Hadoop. We added it as a feature
of Hive for two reasons:
- To ensure that it would be well integrated with Hive
- To ensure that storing data in ORC format would be as simple as
adding “stored as ORC” to your table definition, as in the example
below.
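For example, defining an ORC-backed table in Hive takes a single
clause in the DDL (the table and column names below are purely
illustrative):

    -- Hive DDL: the STORED AS ORC clause is all that is needed
    CREATE TABLE orders (
      order_id BIGINT,
      customer STRING,
      total    DECIMAL(10,2)
    ) STORED AS ORC;

From that point on, Hive reads and writes the table's data in ORC
format transparently.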
Over the last two years, many of the features that we’ve added to
Hive, such as vectorization, ACID, predicate push down, and LLAP, have
supported ORC first and been extended to other storage formats later.
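Hive’s ACID transactions are a concrete example: they currently work
only on bucketed tables stored as ORC. A minimal sketch (the table
name, columns, and bucket count are illustrative; the properties
assume Hive 0.14 or later):

    -- ACID tables in Hive currently require bucketing and ORC storage
    CREATE TABLE events (
      id      BIGINT,
      payload STRING
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');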
The growing use and acceptance of ORC has encouraged additional Hadoop
execution engines, such as Apache Pig, MapReduce, Cascading, and
Apache Spark, to support reading and writing ORC. However, there are
concerns that depending on the large Hive jar that contains ORC pulls
in many of the other projects that Hive itself depends on. To better
support these non-Hive users, we decided to split ORC off from Hive
into a separate project. This will allow us not only to continue
supporting Hive, but also to provide a much more streamlined jar,
along with documentation and help for users outside of Hive.
Although Hadoop and its ecosystem are largely written in Java, there
are many applications in other languages that would like to access
ORC files in HDFS natively. Hortonworks, HP, and Microsoft are
developing a pure C++ ORC reader and writer that enables C++
applications to read and write ORC files efficiently without
Java. That code will also be moved into Apache ORC and released
together with the Java implementation.