This walkthrough demonstrates the installation of the fully-distributed version of Hadoop on CentOS 7. I have pieced it together from numerous sites and Stack Overflow questions, none of which was wholly correct for the versions I had installed.

It is as much documentation for me as it is a tutorial, but corrections, additions and notes on anything I've omitted are welcome.

I hope you find it helpful!

0. Before you begin

Dependencies

Ensure that Java is installed and is compatible with the version of Hadoop that you are intending to install. A good reference site to check can be found here.

I will be continuing with:

java -version
# java version "1.8.0_121"
# Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
# Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
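If a suitable JDK is not installed yet, the CentOS 7 base repositories carry OpenJDK 8; the -devel package also provides jps, which we use later (exact package names may differ on your setup):

sudo yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel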

Hosts

Your /etc/hosts file should contain the IP addresses and aliases for all of the computers you wish to be part of the cluster, e.g.:

xxx.xxx.x.xx IMPERATORIS1
xxx.xxx.x.xx IMPERATORIS2
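A quick way to confirm the aliases resolve on each machine is to ping them once (hostnames here match the example above):

ping -c 1 IMPERATORIS1
ping -c 1 IMPERATORIS2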

User

Add a dedicated user:

sudo groupadd hadoop
sudo useradd -g hadoop hadoop
sudo passwd hadoop
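You can confirm the account was created with a quick check (the uid/gid numbers will vary on your system):

id hadoop
# uid=1001(hadoop) gid=1001(hadoop) groups=1001(hadoop)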

SSH Config

Key-based SSH access must be set up from the master node to the hadoop user on every node in the cluster. We use ssh-copy-id here because the users' home directories are not shared over NFS.

su hadoop
ssh-keygen -t rsa # make sure to enter a passphrase!
ssh-copy-id -i ~/.ssh/id_rsa.pub IMPERATORIS1
ssh-copy-id -i ~/.ssh/id_rsa.pub IMPERATORIS2
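You can confirm the keys were copied correctly by logging in to each node as the hadoop user; you should not be prompted for the account password (the key passphrase may still be requested until ssh-agent is set up, below):

ssh IMPERATORIS2 hostname
# IMPERATORIS2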

To make sure these SSH connections can operate without prompting for the passphrase each time, we'll use a tool built for the job. Please see my post on ssh-agent to set this up for the mapred and yarn users.
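As a minimal sketch of what that post covers, loading the key into an agent for the current shell looks like this; after ssh-add you enter the passphrase once per session:

eval "$(ssh-agent -s)"   # start an agent for this shell
ssh-add ~/.ssh/id_rsa    # prompts for the passphrase once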

1. Download

Go to the release page and download your desired release’s binaries, e.g.:

cd /opt
sudo wget http://mirror.stjschools.org/public/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
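Optionally, verify the download against the checksum published alongside it on the release page before extracting (the exact checksum file and value depend on the release and mirror):

sha256sum hadoop-2.7.3.tar.gz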

Extract the archive, then remove the tarball:

sudo tar zxvf hadoop-2.7.3.tar.gz
sudo rm hadoop-2.7.3.tar.gz

For 3rd-party packages I like to use the naming convention /opt/<software>/<software>-<version>, which allows quick switching between versions by changing environment variables, so I take this step as well:

sudo mkdir hadoop
sudo mv hadoop-2.7.3/ hadoop/

Make the hadoop user and group the owner:

sudo chown -R hadoop:hadoop hadoop/

Wherever you put it, make sure you enter the correct HADOOP_HOME corresponding to it below.

2. Configuration

Add the Hadoop-specific variables to the end of your ~/.bashrc as well, making sure to include $HADOOP_HOME/sbin so the daemon scripts are on your PATH.

# Hadoop Environment Variables
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Source and verify your version:

source ~/.bashrc
hadoop version
# Hadoop 2.7.3
# Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
# Compiled by root on 2016-08-18T01:41Z
# Compiled with protoc 2.5.0
# From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
# This command was run using /opt/hadoop/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar

Common Configuration

We'll need to configure a couple of XML files:

cd $HADOOP_HOME/etc/hadoop/

Open core-site.xml and fill in your master hostname and Hadoop install path:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://IMPERATORIS1:9000/</value>
   </property>
   <property>
      <name>dfs.permissions</name>
      <value>false</value>
   </property>
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/opt/hadoop/hadoop-2.7.3/dfs/tmp</value>
   </property>
</configuration>

Open hdfs-site.xml and fill in your Hadoop install path:

<configuration>
   <property>
      <name>dfs.data.dir</name>
      <value>/opt/hadoop/hadoop-2.7.3/dfs/name/data</value>
      <final>true</final>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>/opt/hadoop/hadoop-2.7.3/dfs/name</value>
      <final>true</final>
   </property>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>
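The tmp, name and data directories referenced above are not always created for you. If the daemons later complain about missing storage directories, you can pre-create them (paths here match the values in the XML) and make sure the hadoop user owns them:

mkdir -p /opt/hadoop/hadoop-2.7.3/dfs/tmp
mkdir -p /opt/hadoop/hadoop-2.7.3/dfs/name/data
sudo chown -R hadoop:hadoop /opt/hadoop/hadoop-2.7.3/dfs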

Open mapred-site.xml and fill in your master hostname and port.
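In the stock 2.7.3 tarball this file usually ships only as a template, so you may need to create it first:

cp mapred-site.xml.template mapred-site.xml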

<configuration>
   <property>
      <name>mapred.job.tracker</name>
      <value>IMPERATORIS1:9001</value>
   </property>
</configuration>

Open the hadoop-env.sh and set the Java home to your install location:

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-openjdk

Also add these lines at the beginning of the script:

# Set Hadoop-specific environment variables here.
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

3. Installing on Slaves

Copy the install directory to each slave (/opt/hadoop should exist on each):

cd /opt/hadoop
sudo scp -r hadoop-2.7.3 IMPERATORIS2:/opt/hadoop/
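Because the copy runs as root, the files on the slave will be owned by root. On each slave, restore ownership to the hadoop user (assuming the same user and group exist there):

sudo chown -R hadoop:hadoop /opt/hadoop/hadoop-2.7.3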

4. Configure Master Node

Some additional configuration is needed on the master node only: the masters file should contain the master hostname, and the slaves file should list each slave hostname, one per line:

cd /opt/hadoop/hadoop-2.7.3/etc/hadoop
sudo vim masters
IMPERATORIS1
sudo vim slaves
IMPERATORIS2

5. Start Hadoop

As the hadoop user:

hadoop namenode -format
# DEPRECATED: Use of this script to execute hdfs command is deprecated.
#  Instead use the hdfs command for it.
#
#  17/01/01 17:29:59 INFO namenode.NameNode: STARTUP_MSG:
#  /************************************************************
#  STARTUP_MSG: Starting NameNode
#  STARTUP_MSG:   host = IMPERATORIS1/192.168.0.16
#  STARTUP_MSG:   args = [-format]
#  STARTUP_MSG:   version = 2.7.3
#  STARTUP_MSG:
#
#  ... <omitted>
#
# 17/01/01 17:30:01 INFO util.ExitUtil: Exiting with status 0
# 17/01/01 17:30:01 INFO namenode.NameNode: SHUTDOWN_MSG:
#  /************************************************************
#  SHUTDOWN_MSG: Shutting down NameNode at IMPERATORIS1/xxx.xxx.x.xx
#  ************************************************************/
cd $HADOOP_HOME/sbin
sudo sh start-all.sh
# This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
# 17/01/01 17:48:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# Starting namenodes on [IMPERATORIS1]
# IMPERATORIS1: starting namenode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-root-namenode-IMPERATORIS1.out
# IMPERATORIS2: starting datanode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-IMPERATORIS2.out
# Starting secondary namenodes [0.0.0.0]
# 0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-IMPERATORIS1.out
# 17/01/01 17:48:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# starting yarn daemons
# starting resourcemanager, logging to /opt/hadoop/hadoop-2.7.3/logs/yarn-root-resourcemanager-IMPERATORIS1.out
# IMPERATORIS2: starting nodemanager, logging to /opt/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-IMPERATORIS2.out

Check that everything is running. On the master node:

which jps
# /usr/lib/jvm/java-openjdk/bin/jps
sudo /usr/lib/jvm/java-openjdk/bin/jps
# 28478 Jps
# 27944 SecondaryNameNode
# 27568 NameNode
# 28123 ResourceManager

On the slaves:

which jps
# /usr/bin/jps
sudo /usr/bin/jps
# 23529 NodeManager
# 23431 DataNode
# 23668 Jps
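For one more sanity check, the default web UI ports in 2.7.x are 50070 for the NameNode and 8088 for the ResourceManager; from any machine that can reach the master (hostname as in the examples above):

# NameNode web UI - should print 200 if it is up
curl -s -o /dev/null -w "%{http_code}\n" http://IMPERATORIS1:50070
# ResourceManager web UI - should print 200 if it is up
curl -s -o /dev/null -w "%{http_code}\n" http://IMPERATORIS1:8088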

You’re now all ready to go!
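If you want to go one step further, a quick HDFS smoke test as the hadoop user (file names and paths here are just examples) confirms the cluster accepts writes:

hdfs dfs -mkdir -p /user/hadoop
echo "hello hadoop" > /tmp/hello.txt
hdfs dfs -put /tmp/hello.txt /user/hadoop/
hdfs dfs -ls /user/hadoop
# should list hello.txt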