Writing a Spark DataFrame to ORC files

Spark can write a number of different file formats to HDFS. One of those is ORC, a columnar file format featuring great compression and improved query performance through Hive.

You’ll need to create a HiveContext in order to use the ORC data source in Spark. First, define some properties in your pom.xml:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <scala.core>2.10</scala.core>
  <spark.version>1.6.1</spark.version>
</properties>

Include spark-hive in addition to your other project dependencies:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.core}</artifactId>
  <version>${spark.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_${scala.core}</artifactId>
  <version>${spark.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_${scala.core}</artifactId>
  <version>${spark.version}</version>
</dependency>

Then in your code:

import org.apache.spark.sql.SaveMode

// create a new hive context from the spark context
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
// create the data frame and write it to orc
// output will be a directory of orc files
val df = hiveContext.createDataFrame(rdd)
df.write.mode(SaveMode.Overwrite).format("orc")
  .save("/tmp/myapp.orc/")

If you want your table to be accessible via Hive, the directory could be the location of an internal Hive table, such as /apps/hive/warehouse/some.db/some_table/, or a location that a Hive external table points to.
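As a sketch of the external-table approach, you could run DDL like the following in Hive against the directory written above. The table and column names here are hypothetical; they must match the schema of the DataFrame you wrote.

```sql
-- hypothetical schema; adjust columns to match your DataFrame
CREATE EXTERNAL TABLE my_orc_table (
  id   INT,
  name STRING
)
STORED AS ORC
LOCATION '/tmp/myapp.orc/';
```

Because the table is external, dropping it in Hive leaves the ORC files in place.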

Alternatively, if you want to handle the table creation entirely within Spark with the data stored as ORC, just register a Spark SQL temp table and run some HQL to create the table.

df.registerTempTable("my_temp_table")
hiveContext.sql("CREATE TABLE new_table_name STORED AS ORC AS SELECT * FROM my_temp_table")
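To verify a write, you can load the ORC directory back into a DataFrame with the same HiveContext. This sketch assumes the /tmp/myapp.orc/ path used earlier:

```scala
// read the ORC files back and inspect the result
val orcDf = hiveContext.read.format("orc").load("/tmp/myapp.orc/")
orcDf.printSchema()
orcDf.show(5)
```

Reading through the ORC data source this way also benefits from predicate pushdown, so filters applied to orcDf can skip row groups that don't match.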
