BigData Investigation 7 – Installing Apache Hadoop in Local (Standalone) Mode

In this post I will explain how to download Apache Hadoop and install it on CentOS 7 Linux in Local (Standalone) Mode.

In earlier posts I used the Cloudera QuickStart VM to describe how to create MapReduce applications with Python and Hadoop Streaming. Using a pre-configured Hadoop cluster like the Cloudera QuickStart VM is a convenient way to get started with Hadoop, but installing Hadoop myself helps me better understand its internals. Local (Standalone) Mode is one of the three supported Hadoop cluster modes. I will explain how to install Hadoop in the other two modes in future posts.

Step 1 – Install a new CentOS 7 server: I have created a new virtual machine (VM) on VirtualBox with the same memory size (4 GB) and disk size (64 GB) as the Cloudera QuickStart VM. VirtualBox is a free and easy-to-use hypervisor for x86 processors provided by Oracle. You can use VirtualBox, any other hypervisor, a bare-metal machine, or a machine running in a cloud; it does not matter.

Download the CentOS 7 DVD ISO image here. The image is 4 GB, so the download may take some time. Finally, the ISO image needs to be mounted into the DVD drive of the VM. The screenshot shows the settings of my new VM before I pressed the Start button.
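
If you prefer the command line to the VirtualBox GUI, roughly the same VM can be created with VBoxManage. This is only a sketch of what I did via the GUI; the VM name, controller name, and ISO filename are my own choices:

VBoxManage createvm --name hadoop --ostype RedHat_64 --register
VBoxManage modifyvm hadoop --memory 4096 --cpus 2
VBoxManage createmedium disk --filename hadoop.vdi --size 65536   # size in MB, i.e. a 64 GB disk
VBoxManage storagectl hadoop --name SATA --add sata
VBoxManage storageattach hadoop --storagectl SATA --port 0 --device 0 --type hdd --medium hadoop.vdi
VBoxManage storageattach hadoop --storagectl SATA --port 1 --device 0 --type dvddrive --medium CentOS-7-x86_64-DVD.iso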

I have selected a Minimal Install, because I want to understand which RPMs are prerequisites for Hadoop. Enable the network to allow SSH login to the VM. See my kickstart file on GitHub for more details on the CentOS settings.
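
For reference, a kickstart file for such an installation could look like the minimal sketch below. This is not my actual file (see GitHub for that); the timezone and the placeholder password are my assumptions:

# Minimal CentOS 7 kickstart sketch: Minimal Install with networking enabled
lang en_US.UTF-8
keyboard us
timezone Europe/Berlin
network --bootproto=dhcp --activate
rootpw --plaintext changeme
clearpart --all
autopart
%packages
@^minimal
%end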

Reboot the machine after the installation is completed. Then configure port forwarding and log on to the new server. See my earlier post ‘BigData Investigation 3: Installing the Cloudera QuickStart VM on VirtualBox’ for more details on port forwarding for VirtualBox.
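
For a NAT-attached VM, the forwarding rule can be added with VBoxManage. A minimal sketch: I forward a host port (2222 here, my choice) to the guest’s SSH port 22 and then connect to 127.0.0.1 on the host:

VBoxManage modifyvm hadoop --natpf1 "ssh,tcp,,2222,,22"
ssh -p 2222 storageulf@127.0.0.1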

login as: storageulf
storageulf@127.0.0.1's password:

[storageulf@hadoop ~]$

Next we update the RPMs to get the latest fixes. There were 83 updates and a new kernel available when I wrote this post.

[storageulf@hadoop ~]$ sudo yum update --assumeyes
...
[sudo] password for storageulf:
Loaded plugins: fastestmirror
...
Transaction Summary
================================================================================
Install   1 Package
Upgrade  83 Packages
...
Complete!

[storageulf@hadoop ~]$

Finally, we reboot the server to activate the new kernel.

[storageulf@hadoop ~]$ reboot

Step 2 – Install Java: Hadoop requires a Java Runtime Environment (JRE), which is not included in a CentOS 7 Minimal Install. Installing the JRE pulls in a few prerequisite RPMs.

[storageulf@hadoop ~]$ which java
/usr/bin/which: no java in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/storageulf/.local/bin:/home/storageulf/bin)

[storageulf@hadoop ~]$ sudo yum install java-1.8.0-openjdk --assumeyes
Loaded plugins: fastestmirror
...
Transaction Summary
================================================================================
Install  1 Package (+27 Dependent packages)
...
Complete!

[storageulf@hadoop ~]$ which java
/usr/bin/java

[storageulf@hadoop ~]$ java -version
openjdk version "1.8.0_101"
OpenJDK Runtime Environment (build 1.8.0_101-b13)
OpenJDK 64-Bit Server VM (build 25.101-b13, mixed mode)

[storageulf@hadoop ~]$

Step 3 – Get Hadoop from hadoop.apache.org: Now it is time to download Hadoop. The latest stable releases are available here; when I wrote this post, the latest stable release was Hadoop 2.7.2. The file is about 200 MB, so it takes another few minutes to download.

[storageulf@hadoop ~]$ mkdir hadoop

[storageulf@hadoop ~]$ cd hadoop

[storageulf@hadoop hadoop]$ wget http://apache.mirror.iphh.net/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
-bash: wget: command not found

[storageulf@hadoop hadoop]$ sudo yum install wget --assumeyes
Loaded plugins: fastestmirror
...
Complete!

[storageulf@hadoop hadoop]$ wget http://apache.mirror.iphh.net/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
--2016-08-21 20:15:57--  http://apache.mirror.iphh.net/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
...
2016-08-21 20:20:54 (697 KB/s) - ‘hadoop-2.7.2.tar.gz’ saved [212046774/212046774]

[storageulf@hadoop hadoop]$

Next we need to get the signature file and the KEYS file to validate the integrity of the downloaded tarball.

[storageulf@hadoop hadoop]$ wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz.asc
...
2016-08-21 20:38:00 (41.7 MB/s) - ‘hadoop-2.7.2.tar.gz.asc’ saved [535/535]

[storageulf@hadoop hadoop]$ wget https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
...
2016-08-21 20:38:12 (244 KB/s) - ‘KEYS’ saved [198493/198493]

[storageulf@hadoop hadoop]$

Finally we import the KEYS file and check the integrity of the downloaded tarball. The ‘not certified with a trusted signature’ warning is acceptable for our purposes, because we downloaded the KEYS file from apache.org itself and not from a mirror site.

[storageulf@hadoop hadoop]$ ls -l
total 207284
-rw-rw-r--. 1 storageulf storageulf 212046774 Jan 26  2016 hadoop-2.7.2.tar.gz
-rw-rw-r--. 1 storageulf storageulf       535 Jan 26  2016 hadoop-2.7.2.tar.gz.asc
-rw-rw-r--. 1 storageulf storageulf    198493 Mar 18 00:42 KEYS

[storageulf@hadoop hadoop]$ gpg --import KEYS
gpg: directory `/home/storageulf/.gnupg' created
...
gpg:               imported: 33  (RSA: 25)
gpg: no ultimately trusted keys found

[storageulf@hadoop hadoop]$ gpg --verify hadoop-2.7.2.tar.gz.asc
gpg: Signature made Tue 26 Jan 2016 01:36:28 AM CET using RSA key ID C36C5F0F
gpg: Good signature from "Vinod Kumar Vavilapalli (I am also known as @tshooter.) <vinodkv@apache.org>"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: 6AE7 0A2A 38F4 66A5 D683  F939 255A DF56 C36C 5F0F

[storageulf@hadoop hadoop]$

Step 4 – Install Hadoop: Finally we need to extract the downloaded tarball and configure three environment variables: JAVA_HOME, HADOOP_HOME, and PATH. I point JAVA_HOME at /etc/alternatives/jre_openjdk, a version-independent symlink maintained by the alternatives system, so the setting survives Java updates.

[storageulf@hadoop hadoop]$ tar xzf hadoop-2.7.2.tar.gz

[storageulf@hadoop hadoop]$ ls
hadoop-2.7.2  hadoop-2.7.2.tar.gz  hadoop-2.7.2.tar.gz.asc  KEYS

[storageulf@hadoop hadoop]$ ls hadoop-2.7.2
bin  include  libexec      NOTICE.txt  sbin
etc  lib      LICENSE.txt  README.txt  share

[storageulf@hadoop hadoop]$ ls -l /etc/alternatives/jre_1.8.0_openjdk
lrwxrwxrwx. 1 root root 59 Aug 21 19:55 /etc/alternatives/jre_1.8.0_openjdk -> /usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.101-3.b13.el7_2.x86_64

[storageulf@hadoop hadoop]$ cat >> ~/.bash_profile
export JAVA_HOME=/etc/alternatives/jre_openjdk
export HADOOP_HOME=/home/storageulf/hadoop/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

[storageulf@hadoop hadoop]$ tail -3 ~/.bash_profile
export JAVA_HOME=/etc/alternatives/jre_openjdk
export HADOOP_HOME=/home/storageulf/hadoop/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

[storageulf@hadoop hadoop]$ . ~/.bash_profile

[storageulf@hadoop hadoop]$ env | grep -i hadoop
HOSTNAME=hadoop.storageulf
HADOOP_HOME=/home/storageulf/hadoop/hadoop-2.7.2
PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/storageulf/.local/bin:/home/storageulf/bin:/home/storageulf/hadoop/hadoop-2.7.2/bin:/home/storageulf/hadoop/hadoop-2.7.2/sbin
PWD=/home/storageulf/hadoop

[storageulf@hadoop hadoop]$

Now Hadoop is ready to use. Without any further configuration changes, the new Hadoop cluster runs in Local (Standalone) Mode.

[storageulf@hadoop hadoop]$ hadoop version
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /home/storageulf/hadoop/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

[storageulf@hadoop hadoop]$
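
As a quick smoke test, you could run one of the bundled example MapReduce jobs. The Apache single-node setup guide suggests running the grep example against the Hadoop configuration files; I have omitted the job output here:

mkdir input
cp $HADOOP_HOME/etc/hadoop/*.xml input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
cat output/*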

Ulf’s Conclusion

Getting Hadoop from hadoop.apache.org and installing it on a new CentOS 7 Linux server is pretty easy. On top of the CentOS 7 Minimal Install we just need a JRE and a tarball with the latest Hadoop release. By default, Hadoop is configured to run in Local (Standalone) Mode.

In the next post I will use my new system to explain how to run Hadoop Streaming on a cluster in Local (Standalone) Mode.

Changes:
2016/10/07 added link – “how to run Hadoop Streaming on a cluster in Local (Standalone) Mode” => BigData Investigation 8 – Using Hadoop Streaming on Hadoop Cluster in Local (Standalone) Mode
