Run Spark Shell with a Different User Directory

I needed to be able to run Spark’s shell using a different home directory due to a permissions issue with my normal HDFS home directory. A little-known feature of Spark is that it allows overriding Hadoop configuration properties!
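As a sketch of the mechanism: any property passed to spark-shell with the spark.hadoop. prefix is stripped of that prefix and copied into the Hadoop Configuration the shell uses. The specific property and path below are illustrative placeholders, not the exact ones from my setup:

```shell
# Any --conf key prefixed with "spark.hadoop." is applied (minus the
# prefix) to the underlying Hadoop Configuration.
# Property and path here are placeholders.
spark-shell --conf "spark.hadoop.hadoop.tmp.dir=/tmp/my-scratch-dir"
```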

Apache Storm and ulimits

Running Apache Storm in production requires increasing the nofile and nproc default ulimits.
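On most Linux distributions these limits are raised in /etc/security/limits.conf. The user name and values below are illustrative placeholders, not recommendations from the original post:

```
# /etc/security/limits.conf -- example entries for a "storm" user
# (user name and limit values are placeholders)
storm  soft  nofile  128000
storm  hard  nofile  128000
storm  soft  nproc   65536
storm  hard  nproc   65536
```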

ClassCastException submitting Spark apps to HDInsight

I recently ran into an issue submitting Spark applications to an HDInsight cluster. The job would run fine until it attempted to use files in blob storage and then blow up with an exception: java.lang.ClassCastException: org.apache.xerces.parsers.XIncludeAwareParserConfiguration cannot be cast to org.apache.xerces.xni.parser.XMLParserConfiguration.

Creating an Avro table in Hive automatically

My goal was to create a process for importing data into Hive using Sqoop 1.4.6. It needed to be simple (or easily automated) and use a robust file format.

Importing data from a Relational Database into Hive should be easy. It’s not. When you’re first getting started there are a lot of snags along the way including configuration issues, missing jar files, formatting problems, schema issues, data type conversion, and … the list goes on. This post shines some light on a way to use command line tools to import data as Avro files and create Hive schemas plus solutions for some of the problems along the way.
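A minimal sketch of the command-line approach, assuming a MySQL source; the connection string, credentials, table name, and paths are all placeholders:

```shell
# Import a table from MySQL into HDFS as Avro data files.
# Connection string, credentials, table name, and paths are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table customers \
  --as-avrodatafile \
  --target-dir /user/me/customers

# Point an external Hive table at the imported files. The .avsc schema
# (generated by Sqoop, or extracted with avro-tools getschema) is
# assumed to have been uploaded to HDFS first.
hive -e "CREATE EXTERNAL TABLE customers
         STORED AS AVRO
         LOCATION '/user/me/customers'
         TBLPROPERTIES ('avro.schema.url'='hdfs:///user/me/schemas/customers.avsc')"
```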

Writing a Spark DataFrame to ORC files

Spark includes the ability to write multiple different file formats to HDFS. One of those is ORC, a columnar file format featuring great compression and improved query performance through Hive.
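A minimal sketch in Java, assuming an existing DataFrame named df; the output path is a placeholder:

```java
// Assumes "df" is an existing DataFrame; the path is a placeholder.
// format("orc").save(...) writes one ORC file per partition under
// the given directory.
df.write()
  .format("orc")
  .mode(SaveMode.Overwrite)   // replace any previous output
  .save("/user/me/output/orc");
```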

Using HBase within Storm

There is a lot of documentation around Apache Storm and Apache HBase but not so much about how to use the hbase-client inside of Storm. In this post, I’ll outline:

  1. Information about my dev environment
  2. Setting up your Storm project to use the HBase client
  3. Managing connections to HBase in Storm
  4. Reading one row (Get)
  5. Reading many rows (Scan)
  6. Writing one row (Put)
  7. Writing many rows in a batch of Puts

Please note, this post assumes you are already comfortable with Storm and HBase terminology. If you are just starting out with Storm, take a look at my example project on GitHub: storm-stlhug-demo.

Also, when writing to HBase from Storm, consider storm-hbase; it is a great way to start streaming data into HBase. However, if you need to write to multiple tables or get into more advanced scenarios, you will need to understand how to write your own HBase bolts.
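As a sketch of the connection-management pattern (items 3 and 4 above), using the HBase 1.x client API; the table and field names are placeholders, and error handling is trimmed for brevity:

```java
// Table/field names are placeholders; assumes the HBase 1.x client
// is on the classpath.
public class LookupBolt extends BaseRichBolt {
    // Connection is heavyweight and thread-safe: create one per bolt
    // instance in prepare(), never one per tuple.
    private Connection connection;
    private Table table;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        try {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            table = connection.getTable(TableName.valueOf("my_table"));
        } catch (IOException e) {
            throw new RuntimeException("Failed to connect to HBase", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // Read one row (Get) keyed by a field from the incoming tuple.
            Get get = new Get(Bytes.toBytes(tuple.getStringByField("rowKey")));
            Result result = table.get(get);
            // ... emit something based on "result" ...
        } catch (IOException e) {
            // report the failure / fail the tuple
        }
    }

    @Override
    public void cleanup() {
        try { table.close(); connection.close(); } catch (IOException ignored) { }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}
```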

Example running MapReduce on tez and parsing arguments

I ran into some trouble executing a simple MapReduce program on Tez. I kept reading about the special parameter you could pass to your MR job to make it automatically switch frameworks without modifying the configs for your entire cluster, but I couldn’t get the argument to be parsed correctly.

After doing some research, I found that your MapReduce class must extend Configured and implement Tool. With that in place, the generic options are handled correctly and your remaining arguments are parsed for you automatically.
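A minimal sketch of that pattern (the class name is a placeholder); ToolRunner consumes the generic Hadoop options before handing the remaining arguments to run():

```java
// Extending Configured and implementing Tool lets ToolRunner parse the
// generic Hadoop options (e.g. -D key=value) before run() is called.
public class MyJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();  // already populated with generic options
        Job job = Job.getInstance(conf, "my-job");
        job.setJarByClass(MyJob.class);
        // ... configure mapper, reducer, and input/output paths from args ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
    }
}
```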

Apache Zeppelin on the Hortonworks 2.3 sandbox

A few notes from playing with Zeppelin on the Hortonworks HDP 2.3 sandbox.

Getting started with Storm: Logging

Logging within Storm uses the Simple Logging Facade for Java (SLF4J).
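A minimal sketch of the usual pattern (the bolt class name is a placeholder; other bolt methods are omitted):

```java
// Standard SLF4J usage inside a Storm component; MyBolt is a placeholder.
public class MyBolt extends BaseRichBolt {
    private static final Logger LOG = LoggerFactory.getLogger(MyBolt.class);

    @Override
    public void execute(Tuple tuple) {
        // Parameterized messages avoid string concatenation when the
        // log level is disabled.
        LOG.debug("processing tuple from {}", tuple.getSourceComponent());
        // ...
    }

    // prepare() and declareOutputFields() omitted for brevity
}
```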

Tick tuples within Storm

Tick tuples are useful when you want to execute some logic within your bolts on a schedule. For example, refreshing a cache every 5 minutes. Within your bolt, first enable receiving the tick tuples and configure how often you want to receive them:

```java
@Override
public Map<String, Object> getComponentConfiguration() {
    // configure how often a tick tuple will be sent to our bolt
    Config conf = new Config();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 300);
    return conf;
}
```

Next create the isTickTuple helper method to determine whether or not we’ve received a tick tuple:
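The excerpt ends here; the helper conventionally checks the tuple’s source component and stream against Storm’s system constants, along these lines:

```java
// A tick tuple arrives from the system component on the tick stream.
private static boolean isTickTuple(Tuple tuple) {
    return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
        && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
}
```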