Kit Menke is a Data Engineer / Software Architect from St. Louis, MO USA. He works mostly with open source Big Data technologies like Apache Hadoop, Apache Storm, Apache HBase, Apache Hive, and Apache Spark.
He is a member of the St. Louis Hadoop User group and has spoken at multiple conferences (DataWorks Summit and StampedeCon).
I needed to run Spark’s shell with a different home directory because of a permissions issue with my normal HDFS home directory. A little-known feature of Spark is that it lets you override Hadoop configuration properties!
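A minimal sketch of that mechanism: Spark copies any property prefixed with `spark.hadoop.` into the Hadoop `Configuration` it builds, so Hadoop settings can be overridden per invocation without editing `core-site.xml` or `hdfs-site.xml`. The specific property and path below are illustrative assumptions, not necessarily the ones from the original post.

```shell
# Override a Hadoop property just for this shell session.
# dfs.user.home.dir.prefix controls the prefix HDFS uses for user
# home directories (default /user); the /tmp/staging value is a
# placeholder for whatever directory you have permissions on.
spark-shell \
  --conf spark.hadoop.dfs.user.home.dir.prefix=/tmp/staging
```

The same `--conf spark.hadoop.<property>=<value>` trick works with `spark-submit` as well.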
Running Apache Storm in production requires increasing the nofile and nproc default ulimits.
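On Linux these limits are typically raised in `/etc/security/limits.conf`. The entries below are a hypothetical example assuming Storm runs as a dedicated `storm` user; the numbers are placeholders to tune for your workload.

```shell
# /etc/security/limits.conf — raise open-file and process limits
# for the storm user (requires a re-login to take effect)
storm    soft    nofile    128000
storm    hard    nofile    128000
storm    soft    nproc     65536
storm    hard    nproc     65536
```

Verify the new limits from a fresh session for that user with `ulimit -n` and `ulimit -u`.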
I recently ran into an issue submitting Spark applications to an HDInsight cluster. The job would run fine until it attempted to use files in blob storage, then blow up with an exception:
java.lang.ClassCastException: org.apache.xerces.parsers.XIncludeAwareParserConfiguration cannot be cast to org.apache.xerces.xni.parser.XMLParserConfiguration.
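A `ClassCastException` between two classes from the same library usually means two different versions of that library (here Xerces) are on the classpath. One common workaround for such conflicts — offered here as an assumption, not necessarily the fix from the post — is to tell Spark to prefer the application’s jars over the cluster’s bundled ones:

```shell
# Prefer user-supplied jars over the cluster's versions for both
# driver and executors; class and jar names are placeholders.
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.MyJob \
  myjob.jar
```

If that causes other conflicts, shading the offending dependency in the application jar is the more surgical alternative.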
My goal was to create a process for importing data into Hive using Sqoop 1.4.6. It needed to be simple (or easily automated) and use a robust file format.
Importing data from a relational database into Hive should be easy. It’s not. When you’re first getting started there are a lot of snags along the way, including configuration issues, missing jar files, formatting problems, schema issues, data type conversion, and … the list goes on. This post shines some light on how to use command-line tools to import data as Avro files and create Hive schemas, plus solutions to some of the problems along the way.
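As a sketch of the kind of import described above: Sqoop can pull a table from a relational database as Avro data files and write the Avro schema alongside them. The connection string, credentials, table, and target directory below are placeholders.

```shell
# Import one table as Avro files into HDFS; -P prompts for the
# password instead of putting it on the command line.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --as-avrodatafile \
  --target-dir /user/etl/orders
```

The generated `.avsc` schema file can then be used in a Hive `CREATE EXTERNAL TABLE … STORED AS AVRO` statement pointing at the target directory.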