Sunday 10 January 2016

Hadoop & SSH

There are many excellent resources online that explain how to install a single-node pseudo-distributed Hadoop setup, like here. But many of these existing instructions are for older versions of Hadoop, and there have been some minor changes since then. In this post I will explain how I installed Hadoop 2.6.2 on Ubuntu 15.10, along with some cool stuff about SSH.

Install Java
Hadoop requires a working Java 1.6+ installation. Get it here.
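On Ubuntu, one common way (and the one I'm assuming here, since it installs under /usr/lib/jvm/java-8-oracle) is Oracle Java 8 from the webupd8team PPA; any suitable JDK works just as well:

    # add the PPA and install Oracle Java 8
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer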

Add a new user
We will use a dedicated user account for running Hadoop. While that's not required, it is recommended because it keeps your original user account clean and keeps everything Hadoop-related under one account that is easier to lock down.
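Something along these lines should do it, using the standard Ubuntu user management tools:

    # create the "hadoop" group and the "hduser" user inside it
    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hduser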

This will add the user "hduser" and the group "hadoop" to your local machine. If you would like to give hduser super-user privileges, add hduser to the sudoers list, i.e. the sudo user group, by using something like:
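    # give hduser sudo rights (optional)
    sudo adduser hduser sudo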

Stuff about SSH

For secure communication we need to create an asymmetric key pair consisting of a private and a public key. The private key is kept on the computer you log in from, so the key pair is created on that machine itself to avoid the hassle of moving the private key later to another location over (possibly) insecure channels.

A passphrase can be added during the creation of the key pair. Its job is to unlock the private key, which allows the incoming encrypted data to be decrypted. An SSH key passphrase is a secondary form of security that gives you a little time to change the keys if your original keys are stolen. It can be left blank if there is a high number of connections between the hosts; otherwise the user has to enter the passphrase to unlock the private key for every connection. This applies to hduser, as you don't want to enter a passphrase every time Hadoop interacts with its nodes.
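For hduser the key pair can therefore be generated with an empty passphrase, roughly like this (run as hduser; -P "" sets the blank passphrase):

    # switch to hduser and create an RSA key pair with no passphrase
    su - hduser
    ssh-keygen -t rsa -P ""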

The public key is added to the .ssh/authorized_keys file on all the computers you want to log in to.
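For this single-node setup the machine we want to log in to is the local machine itself, so something like the following should be enough (the file name assumes the default id_rsa key):

    # authorize the new public key for logins to this machine, then test it
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    ssh localhost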

It is a good idea to disable the PasswordAuthentication option when configuring sshd if you don't need it. A lot of people with SSH servers use weak passwords, and many online attackers will look for an SSH server and then start guessing passwords at random. If PasswordAuthentication is turned on while configuring sshd, then anyone can try brute-forcing the password and gain access to the system running the SSH server. By disabling it we can make sure that only approved systems can gain access. After disabling PasswordAuthentication, you will need to manually add the newly created public key to the .ssh/authorized_keys file of the remote host to gain access to it. Needless to say, you most probably won't be able to copy the public SSH key using ssh-copy-id, as that requires password authentication to be enabled.
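The option lives in /etc/ssh/sshd_config; the relevant line should look something like this:

    # /etc/ssh/sshd_config -- allow key-based logins only
    PasswordAuthentication no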

After configuring sshd, you will have to restart it by using something like this:
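    # restart the SSH server so the new configuration takes effect
    sudo service ssh restart    # or: sudo systemctl restart ssh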
During the creation of the keys, if you have named your key something other than id_rsa (or another standard name), say "OfficeKey", then you will have to use the -i option when you use ssh, like this:
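    # point ssh at the non-default private key ("user" and "remote-host" are placeholders)
    ssh -i ~/.ssh/OfficeKey user@remote-host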
This will prevent errors like "Permission denied (publickey)." from popping up.

Installation

Download Hadoop and then extract it to /usr/local/.
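Something like the following should work (the mirror URL and the target paths are my assumptions; adjust them to taste):

    # download Hadoop 2.6.2, unpack it to /usr/local and hand it over to hduser
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz
    sudo tar -xzf hadoop-2.6.2.tar.gz -C /usr/local/
    sudo mv /usr/local/hadoop-2.6.2 /usr/local/hadoop
    sudo chown -R hduser:hadoop /usr/local/hadoop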

Configuration

Step 1: Add the following to the ~/.bashrc file of user hduser (if you are using bash, that is).
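Roughly these lines (the Hadoop and Java paths match the locations used in this post; change them if yours differ):

    # Hadoop environment for hduser
    export HADOOP_HOME=/usr/local/hadoop
    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin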
Step 2:
Regarding disabling IPv6: it was already disabled in Hadoop 2.6.2, as defined in the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
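If you want to check, the line responsible should look roughly like this (the exact wording may differ between releases):

    # make the JVM prefer IPv4 sockets
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"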

In $HADOOP_HOME/etc/hadoop/hadoop-env.sh change the value of JAVA_HOME to the directory where your Java has been installed. For my system I had to change it from:
    export JAVA_HOME=${JAVA_HOME}
to:
    export JAVA_HOME=/usr/lib/jvm/java-8-oracle

Step 3:
Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS. So we need to assign it a directory with the correct permissions.
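For example (the /app/hadoop/tmp path is just the one I'm assuming here; any directory owned by hduser works, as long as it matches hadoop.tmp.dir in core-site.xml below):

    # create the base temporary directory and give it to hduser
    sudo mkdir -p /app/hadoop/tmp
    sudo chown hduser:hadoop /app/hadoop/tmp
    sudo chmod 750 /app/hadoop/tmp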


Update $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following lines within the configuration tags.
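Roughly the following (the temporary directory and the port in the HDFS URI are my assumptions; use whatever you picked above):

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>Base for Hadoop's temporary directories</description>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
      <description>URI of the default (HDFS) filesystem</description>
    </property>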


First copy mapred-site.xml.template into mapred-site.xml and then update $HADOOP_HOME/etc/hadoop/mapred-site.xml.
Add the following lines within the configuration tags.
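Something like this, which tells MapReduce to run on top of YARN:

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
      <description>Run MapReduce jobs on the YARN framework</description>
    </property>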


Add these within the configuration tags in yarn-site.xml:
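    <!-- roughly; this enables the auxiliary shuffle service that MapReduce jobs need -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>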


Update $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following lines within the configuration tags.
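Something like this; a replication factor of 1 is enough for a single-node setup:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default number of block replications</description>
    </property>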

Formatting HDFS filesystem
(via namenode)

This has to be done only the first time you set up a new Hadoop cluster.
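Run it as hduser; the command is roughly:

    # format the HDFS filesystem via the NameNode (wipes anything already stored in HDFS)
    hdfs namenode -format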

Start daemons

To start the HDFS, YARN, and MapReduce daemons, type:
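    # roughly, assuming $HADOOP_HOME/sbin is on the PATH (see the .bashrc step above)
    start-dfs.sh                                  # NameNode, DataNode, SecondaryNameNode
    start-yarn.sh                                 # ResourceManager, NodeManager
    mr-jobhistory-daemon.sh start historyserver   # MapReduce JobHistory server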

A tool named jps can list the running Java processes, which lets you check that all the Hadoop daemons are up.
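For example (the exact list depends on what you started):

    jps
    # typically shows something like:
    # NameNode, DataNode, SecondaryNameNode,
    # ResourceManager, NodeManager, JobHistoryServer, Jps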
Stop daemons

To stop the MapReduce, YARN, and HDFS daemons, type:
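    # roughly the reverse of the start-up sequence
    mr-jobhistory-daemon.sh stop historyserver
    stop-yarn.sh
    stop-dfs.sh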