Monday, November 7, 2011

Hadoop cluster with Ubuntu server and Juju

A while back I started experimenting with Juju and was intrigued by the notion of services instead of machines.

A bit of background on Juju from their website:

  • Formerly called Ensemble, juju is DevOps Distilled™. Through the use of charms (renamed from formulas), juju provides you with shareable, re-usable, and repeatable expressions of DevOps best practices. You can use them unmodified, or easily change and connect them to fit your needs. Deploying a charm is similar to installing a package on Ubuntu: ask for it and it’s there, remove it and it’s completely gone.

I come from a DevOps background and know first hand the troubles and tribulations of deploying production services, webapps, etc.  One that's particularly "thorny" is hadoop.

To deploy a hadoop cluster by hand, we would need to download the dependencies ( java, etc. ), download hadoop, configure it and deploy it.  This process is somewhat different depending on the type of node that you're deploying ( e.g. namenode, jobtracker, etc. ).  It is a multi-step process that requires too much human intervention, and it is difficult to automate and reproduce.  Imagine building a 10, 20 or 50 node cluster using this method.  It gets frustrating quickly and is prone to mistakes.

With this experience in mind ( and a lot of reading ), I set out to deploy a hadoop cluster using a Juju charm.

First things first, let's install Juju.  Follow the Getting Started documentation on the Juju site here.

According to the Juju documentation, we just need to follow some file naming conventions for what they call "hooks" ( executable scripts in your language of choice that perform certain actions ).  These hooks control the installation, relationships, start, stop, etc. of your charm.  We also need to describe the charm in a file called metadata.yaml.  The metadata.yaml file describes the charm, its interfaces, and what it requires and provides, among other things.  More on this file later when I show you the one for hadoop-master and hadoop-slave.

Armed with a bit of knowledge and a desire for simplicity, I decided to split the hadoop cluster in two:

  • hadoop-master ( namenode and jobtracker )
  • hadoop-slave ( datanode and tasktracker )
I know this is not an all-encompassing list, but it will take care of a good portion of deployments, and the Juju charms are easy enough to modify that you can work your own changes into them.

One of my colleagues, Brian Thomason, did a lot of packaging for these charms, so my job is now easier.  The configuration for the packages has been distilled down to three questions:

  1. namenode ( leave blank if you are the namenode )
  2. jobtracker ( leave blank if you are the jobtracker )
  3. hdfs data directory ( leave blank to use the default: /var/lib/hadoop-0.20/dfs/data )
Due to the magic of Ubuntu packaging, we can even "preseed" the answers to those questions to avoid being asked about them ( and stopping the otherwise automatic process ). We'll use the utility debconf-set-selections for this.  Here is a piece of the code that I use to preseed the values in my charm:
  • echo debconf hadoop/namenode string ${NAMENODE}| /usr/bin/debconf-set-selections
  • echo debconf hadoop/jobtracker string ${JOBTRACKER}| /usr/bin/debconf-set-selections
  • echo debconf hadoop/hdfsdatadir string ${HDFSDATADIR}| /usr/bin/debconf-set-selections
The variable names should be self-explanatory.
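
The three preseed lines all share one shape: an owner, a question name, a type and a value.  Here is a minimal sketch of a helper that builds those lines.  The function names preseed_line and preseed are hypothetical and not part of the charm, and the helper only prints the line when debconf-set-selections is not installed, so it can be tried on any machine.

```shell
#!/bin/bash
# Sketch only: preseed_line and preseed are illustrative names, not part of the charm.

# Build one line in the format debconf-set-selections reads:
#   <owner> <question> <type> <value>
preseed_line() {
    question="$1"
    type="$2"
    value="$3"
    printf 'debconf %s %s %s\n' "${question}" "${type}" "${value}"
}

# Feed the line to debconf-set-selections when it is available;
# otherwise just print it so the helper can be tried anywhere.
preseed() {
    if [ -x /usr/bin/debconf-set-selections ]; then
        preseed_line "$@" | /usr/bin/debconf-set-selections
    else
        preseed_line "$@"
    fi
}
```

In an install hook this could be called as: preseed hadoop/namenode string "${NAMENODE}"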

Thanks to Brian's work, I now just have to install the packages ( hadoop-0.20-namenode and hadoop-0.20-jobtracker ).  Let's put all of this together into a Juju charm.

  • Create a directory for the hadoop-master charm ( mkdir hadoop-master )
  • Make a directory for the hooks of this charm ( mkdir hadoop-master/hooks )
  • Let's start with the always needed metadata.yaml file ( hadoop-master/metadata.yaml ):
ensemble: formula
name: hadoop-master
revision: 1
summary: Master Node for Hadoop
description: |
  The Hadoop Distributed Filesystem (HDFS) requires one unique server, the
  namenode, which manages the block locations of files on the
  filesystem.  The jobtracker is a central service which is responsible
  for managing the tasktracker services running on all nodes in a
  Hadoop Cluster.  The jobtracker allocates work to the tasktracker
  nearest to the data with an available work slot.
provides:
  hadoop-master:
    interface: hadoop-master

  • Every Juju charm has an install script ( in our case: hadoop-master/hooks/install ).  This is an executable file in your language of choice that Juju will run when it's time to install your charm.  Anything and everything that needs to happen for your charm to install, needs to be inside of that file.  Let's take a look at the install script of hadoop-master:
#!/bin/bash
# Here do anything needed to install the service
# i.e. apt-get install -y foo  or  bzr branch http://myserver/mycode /srv/webroot


##################################################################################
# Set debugging
##################################################################################
set -ux
juju-log "install script"


##################################################################################
# Add the repositories
##################################################################################
export TERM=linux
# Add the Hadoop PPA
juju-log "Adding ppa"
apt-add-repository ppa:canonical-sig/thirdparty
juju-log "updating cache"
apt-get update


##################################################################################
# Calculate our address ( hostname -f returns the fully qualified hostname, not an IP )
##################################################################################
juju-log "calculating ip"
IP_ADDRESS=`hostname -f`
juju-log "Private IP: ${IP_ADDRESS}"


##################################################################################
# Preseed our Namenode, Jobtracker and HDFS Data directory
##################################################################################
NAMENODE="${IP_ADDRESS}"
JOBTRACKER="${IP_ADDRESS}"
HDFSDATADIR="/var/lib/hadoop-0.20/dfs/data"
juju-log "Namenode: ${NAMENODE}"
juju-log "Jobtracker: ${JOBTRACKER}"
juju-log "HDFS Dir: ${HDFSDATADIR}"

echo debconf hadoop/namenode string ${NAMENODE}| /usr/bin/debconf-set-selections
echo debconf hadoop/jobtracker string ${JOBTRACKER}| /usr/bin/debconf-set-selections
echo debconf hadoop/hdfsdatadir string ${HDFSDATADIR}| /usr/bin/debconf-set-selections


##################################################################################
# Install the packages
##################################################################################
juju-log "installing packages"
apt-get install -y hadoop-0.20-namenode
apt-get install -y hadoop-0.20-jobtracker


##################################################################################
# Open the necessary ports
##################################################################################
if [ -x /usr/bin/open-port ];then
   open-port 50010/TCP
   open-port 50020/TCP
   open-port 50030/TCP
   open-port 50105/TCP
   open-port 54310/TCP
   open-port 54311/TCP
   open-port 50060/TCP
   open-port 50070/TCP
   open-port 50075/TCP
   open-port 50090/TCP
fi


  • There are a few other files that we need to create ( start and stop ) to get the hadoop-master charm installed.  Let's see those files:
    • start
#!/bin/bash
# Here put anything that is needed to start the service.
# Note that currently this is run directly after install
# i.e. 'service apache2 start'

set -x
service hadoop-0.20-namenode status && service hadoop-0.20-namenode restart || service hadoop-0.20-namenode start
service hadoop-0.20-jobtracker status && service hadoop-0.20-jobtracker restart || service hadoop-0.20-jobtracker start

    • stop
#!/bin/bash
# This will be run when the service is being torn down, allowing you to disable
# it in various ways..
# For example, if your web app uses a text file to signal to the load balancer
# that it is live... you could remove it and sleep for a bit to allow the load
# balancer to stop sending traffic.
# rm /srv/webroot/server-live.txt && sleep 30

set -x
juju-log "stop script"
service hadoop-0.20-namenode stop
service hadoop-0.20-jobtracker stop
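
One thing to note about the "status && restart || start" one-liner in the start hook: if the restart itself fails, the start branch runs as well.  An explicit if/else avoids that.  The following is a minimal sketch; ensure_running and the SERVICE_CMD override are hypothetical names introduced here, and the override exists only so the function can be exercised without hadoop installed.

```shell
#!/bin/bash
# Sketch only: ensure_running and SERVICE_CMD are illustrative names.

# Restart the service if it is already running, start it otherwise.
# SERVICE_CMD defaults to the real `service` command.
ensure_running() {
    svc="$1"
    if "${SERVICE_CMD:-service}" "${svc}" status >/dev/null 2>&1; then
        "${SERVICE_CMD:-service}" "${svc}" restart
    else
        "${SERVICE_CMD:-service}" "${svc}" start
    fi
}
```

The start hook would then call ensure_running hadoop-0.20-namenode and ensure_running hadoop-0.20-jobtracker.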

Let's go back to the metadata.yaml file and examine it in more detail:

ensemble: formula
name: hadoop-master
revision: 1
summary: Master Node for Hadoop
description: |
  The Hadoop Distributed Filesystem (HDFS) requires one unique server, the
  namenode, which manages the block locations of files on the
  filesystem.  The jobtracker is a central service which is responsible
  for managing the tasktracker services running on all nodes in a
  Hadoop Cluster.  The jobtracker allocates work to the tasktracker
  nearest to the data with an available work slot.
provides:
  hadoop-master:
    interface: hadoop-master

The provides section at the end tells Juju that this charm provides an interface named hadoop-master that can be used in relationships with other charms ( in our case we'll use it to connect the hadoop-master charm with the hadoop-slave charm that we'll be writing a bit later ).  For this relationship to work, we need to let Juju know what to do when it is established ( more detailed information about relationships in charms can be found here ).

Per the Juju documentation, relation hooks are named after the relation, so ours is called hadoop-master-relation-joined.  Like the other hooks, it is an executable script in your language of choice.  Let's see what that file looks like:

#!/bin/sh
# This must be renamed to the name of the relation. The goal here is to
# affect any change needed by relationships being formed
# This script should be idempotent.

set -x

juju-log "joined script started"

# Calculate our IP Address
IP_ADDRESS=`unit-get private-address`

# Preseed our Namenode, Jobtracker and HDFS Data directory
NAMENODE="${IP_ADDRESS}"
JOBTRACKER="${IP_ADDRESS}"
HDFSDATADIR="/var/lib/hadoop-0.20/dfs/data"

relation-set namenode="${NAMENODE}" jobtracker="${JOBTRACKER}" hdfsdatadir="${HDFSDATADIR}"



juju-log "$JUJU_REMOTE_UNIT joined"

Your charm directory should now look something like this:
natty/hadoop-master
natty/hadoop-master/metadata.yaml
natty/hadoop-master/hooks/install
natty/hadoop-master/hooks/start
natty/hadoop-master/hooks/stop
natty/hadoop-master/hooks/hadoop-master-relation-joined

This charm should now be complete.  It's not too exciting yet, as it doesn't have its hadoop-slave counterpart, but it is a complete charm.
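
That layout can be scaffolded in a few commands.  Here is a sketch, assuming the natty series directory used above.  make_charm_skeleton is a hypothetical helper name, not a Juju command, and it only creates empty files; the hooks are marked executable because Juju expects hooks to be executable scripts.

```shell
#!/bin/bash
# Sketch only: make_charm_skeleton is an illustrative helper, not a Juju command.

# Create <repo>/natty/<charm>/ with metadata.yaml and empty, executable hooks.
# Relation hooks are named after the relation, so it is passed separately.
make_charm_skeleton() {
    repo="$1"
    charm="$2"
    relation="$3"
    dir="${repo}/natty/${charm}"
    mkdir -p "${dir}/hooks"
    touch "${dir}/metadata.yaml"
    for hook in install start stop "${relation}-relation-joined"; do
        touch "${dir}/hooks/${hook}"
        chmod +x "${dir}/hooks/${hook}"
    done
}
```

For the charm in this post, that would be: make_charm_skeleton . hadoop-master hadoop-master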

The latest version of the hadoop-master charm can be found here if you want to get it.

The hadoop-slave charm is almost the same as the hadoop-master charm with some exceptions.  Those I'll leave as an exercise for the reader.
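
As a starting point for that exercise, the slave side needs to read the values that the master publishes with relation-set.  Below is a rough sketch of how a relation hook on the hadoop-slave side ( for example a relation-changed hook ) might do that with Juju's relation-get hook tool.  This is not taken from the actual hadoop-slave charm; read_master_settings and the RELATION_GET override are hypothetical names, and the override exists only so the logic can be tried outside of a hook.

```shell
#!/bin/bash
# Sketch only, NOT the real hadoop-slave hook. read_master_settings and
# RELATION_GET are illustrative names.

# Read the keys the master set with relation-set and print the matching
# preseed lines. Returns 1 if the master has not published them yet.
read_master_settings() {
    getter="${RELATION_GET:-relation-get}"
    namenode="$("${getter}" namenode)"
    jobtracker="$("${getter}" jobtracker)"
    if [ -z "${namenode}" ] || [ -z "${jobtracker}" ]; then
        echo "master settings not available yet" >&2
        return 1
    fi
    echo "debconf hadoop/namenode string ${namenode}"
    echo "debconf hadoop/jobtracker string ${jobtracker}"
}
```

Inside a hook, the output could be piped to /usr/bin/debconf-set-selections before the slave packages are installed.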

The hadoop-slave charm can be found here if you want to get it.

Once you have both charms ( hadoop-master and hadoop-slave ) you can easily deploy your cluster by typing:

  • juju bootstrap   # ( creates/bootstraps the Juju environment )
  • juju deploy --repository . local:natty/hadoop-master # ( deploys hadoop-master )
  • juju deploy --repository . local:natty/hadoop-slave # ( deploys hadoop-slave )
  • juju add-relation hadoop-slave hadoop-master # ( connects the hadoop-slave to the hadoop-master )
As you can see, once you have the charm written and tested, deploying the cluster is really a matter of a few commands.  The above example gives you one hadoop-master ( namenode, jobtracker ) and one hadoop-slave ( datanode, tasktracker ).

To add another node to this existing hadoop cluster, we run:

  • juju add-unit hadoop-slave # ( this adds one more slave )
Run the above command multiple times to continue to add hadoop-slave nodes to your cluster.
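
Since each juju add-unit call adds one unit, growing the cluster by several slaves is a simple loop.  A sketch follows; add_slaves and the JUJU override are hypothetical names, and the override is there only to allow a dry run with echo.

```shell
#!/bin/bash
# Sketch only: add_slaves and JUJU are illustrative names.

# Add COUNT hadoop-slave units, one add-unit call per unit.
# JUJU is left unquoted on purpose so it can hold "echo juju" for a dry run.
add_slaves() {
    count="$1"
    i=0
    while [ "${i}" -lt "${count}" ]; do
        ${JUJU:-juju} add-unit hadoop-slave
        i=$((i + 1))
    done
}
```

Setting JUJU="echo juju" before calling add_slaves 3 prints the three commands instead of running them.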

Juju allows you to catalog the steps needed to get your service/application installed, configured and running properly.  Once your knowledge has been captured in a Juju charm, it can be re-used by you or others without much knowledge of what's needed to get the application/service running.

In the DevOps world, this code re-usability can save time, effort and money by providing self-contained charms that each deliver a service or application.
