Sunday, 10 January 2016

Hadoop & SSH

There are many excellent resources online that explain how to set up a single-node, pseudo-distributed Hadoop installation, like here. But many of the existing instructions target older versions of Hadoop, and there have been some minor changes since then. In this post I will explain how I installed Hadoop 2.6.2 on Ubuntu 15.10, along with some cool stuff about SSH.

Install Java
Hadoop requires a working Java installation (Java 7 or later is recommended for Hadoop 2.x). Get it here.

Add a new user
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required, it is recommended because it keeps your original user account clean and keeps the hadoop user account isolated.

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
This will add the user "hduser" and the group "hadoop" to your local machine. If you would like to give hduser super-user privileges, add hduser to the sudoers list, i.e. the sudo group, by using:
sudo adduser hduser sudo
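
A quick sanity check of the new account (id is a standard coreutils command); you can switch to it with su when running Hadoop commands later:
id hduser        # should list hadoop among the groups
su - hduser      # switch to the hduser account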

Stuff about SSH

For secure communication we need an asymmetric key pair consisting of a private and a public key. The private key stays on the computer you log in from, so the key pair is created right there, which avoids the hassle of later moving the private key somewhere else over (possibly) insecure channels.

A pass-phrase can be added during the creation of the key pair. Its job is to unlock the private key, which in turn allows the incoming encrypted data to be decrypted. An SSH key pass-phrase is a secondary layer of security that buys you a little time to rotate your keys if the originals are stolen. It can be left blank if there is a high number of connections between the hosts; otherwise the user has to enter the pass-phrase to unlock the private key on every connection. That is exactly the situation for hduser, as you don’t want to enter a pass-phrase every time Hadoop interacts with its nodes.

The public key is added to the .ssh/authorized_keys file on all the computers you want to log in to.
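
Putting this together for our single-node setup, a minimal sketch using OpenSSH's default file names and an empty pass-phrase (as discussed above):
su - hduser
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost    # should log in without prompting for a password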

It is a good idea to disable the PasswordAuthentication option when configuring sshd if you don't need it. Many people run SSH servers with weak passwords, and online attackers routinely look for an SSH server and then start guessing passwords at random. If PasswordAuthentication is turned on, anyone can try to brute-force a password and gain access to the system running the SSH server. By disabling it we make sure that only systems holding an approved key can get in. After disabling PasswordAuthentication, you will need to manually add the newly created public key to the .ssh/authorized_keys file of the remote host to gain access to it. Needless to say, you most probably won't be able to copy the public key using ssh-copy-id, as it requires password authentication to be enabled.
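
On Ubuntu the server configuration lives in /etc/ssh/sshd_config; the relevant lines look something like this:
# /etc/ssh/sshd_config
PubkeyAuthentication yes
PasswordAuthentication no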

After changing the configuration you will have to restart sshd:
sudo systemctl restart ssh
During the creation of the keys, if you have named your key something other than id_rsa (or another standard name), say "OfficeKey", then you will have to use the -i option when you use ssh, like this:
ssh -i /path/to/privatekey user@host
This will prevent errors like "Permission denied (publickey)." from popping up.
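
Alternatively, to avoid typing -i on every connection, the key can be pinned to a host in ~/.ssh/config (the host names below are just placeholders):
# ~/.ssh/config
Host office
    HostName office.example.com
    User hduser
    IdentityFile ~/.ssh/OfficeKey
After this, a plain "ssh office" picks up the right key automatically.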

Installation

Download Hadoop and then extract it to /usr/local/.
# This location for the extraction of the tarball becomes HADOOP_HOME later below.
cd /usr/local
sudo tar xzf hadoop-2.6.2.tar.gz      # assumes the downloaded tarball is here
sudo mv hadoop-2.6.2 hadoop
sudo chown -R hduser:hadoop hadoop    # hand the tree over to the hduser account
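
A quick check that the layout and ownership came out right (bin, etc, sbin and friends are what the 2.6.2 tarball typically ships with):
ls /usr/local/hadoop        # bin  etc  include  lib  libexec  sbin  share  ...
ls -ld /usr/local/hadoop    # owner should now be hduser:hadoop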

Configuration

Step 1: Add the following to the ~/.bashrc file of user hduser (that is, if you are on bash).
# You have to assign the path where you extracted the tarball as HADOOP_HOME.
export HADOOP_HOME=/usr/local/hadoop
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
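
Reload the file so the variables take effect in the current shell, and verify:
source ~/.bashrc
echo $HADOOP_HOME   # should print /usr/local/hadoop
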
Step 2:
Regarding disabling IPv6: in Hadoop 2.6.2 this is already taken care of out of the box, as defined in $HADOOP_HOME/etc/hadoop/hadoop-env.sh, so no extra steps are needed.

In $HADOOP_HOME/etc/hadoop/hadoop-env.sh, change the value of JAVA_HOME to the directory where your Java has been installed. On my system I had to change it from:
    export JAVA_HOME=${JAVA_HOME}
to:
    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
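
If you are not sure where your JVM lives, resolving the java binary usually reveals it (assuming java is already on your PATH):
readlink -f "$(which java)"
# e.g. /usr/lib/jvm/java-8-oracle/jre/bin/java, so JAVA_HOME is /usr/lib/jvm/java-8-oracle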

Step 3:
Hadoop’s default configuration uses hadoop.tmp.dir as the base temporary directory, both for the local file system and for HDFS, so we need to point it at a directory with the correct permissions.
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
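
If that worked, the permissions come out as below (mode 750 maps to rwxr-x---):
ls -ld /app/hadoop/tmp
# drwxr-x--- 2 hduser hadoop 4096 ... /app/hadoop/tmp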


Update $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following lines within the configuration tags.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost/</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem. This is used
  instead of the deprecated fs.default.name.</description>
</property>


First copy mapred-site.xml.template to mapred-site.xml, then update $HADOOP_HOME/etc/hadoop/mapred-site.xml:
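cd $HADOOP_HOME/etc/hadoop
cp mapred-site.xml.template mapred-site.xml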
Add the following lines within the configuration tags.
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task. (This is a legacy MRv1 setting; with YARN as the
  framework below it is effectively ignored, but it does no harm.)
  </description>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs.
  If needed, set this to "local" to run Hadoop in local mode for
  debugging, so mapper and reducer tasks run in a single JVM instead
  of separate JVMs.</description>
</property>


Add these within the configuration tags in $HADOOP_HOME/etc/hadoop/yarn-site.xml:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>localhost</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>


Update $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following lines within the configuration tags.
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified at create time.
  </description>
</property>

Formatting the HDFS filesystem (via the NameNode)

This has to be done only once, when you first set up a new Hadoop cluster; run it as hduser. (Re-formatting the NameNode of an existing cluster wipes its metadata, so don't repeat it later.)
hdfs namenode -format

Start daemons

To start the HDFS, YARN, and MapReduce daemons, type:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

The jps tool (shipped with the JDK) lists the running Java processes, which is a quick way to check that the Hadoop daemons came up.
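On a healthy single-node setup its output looks roughly like this (the PIDs will differ):
$ jps
21923 NameNode
22088 DataNode
22260 SecondaryNameNode
22437 ResourceManager
22559 NodeManager
22749 JobHistoryServer
22813 Jps
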
Stop daemons

To stop the MapReduce, YARN, and HDFS daemons, type:
mr-jobhistory-daemon.sh stop historyserver
stop-yarn.sh
stop-dfs.sh