BigData Investigation 9 – Installing Apache Hadoop in Pseudo-Distributed Mode

In this post I will explain how to configure Apache Hadoop in Pseudo-Distributed Mode. In an earlier post I explained how to install Apache Hadoop in Local (Standalone) Mode. Now I will apply the configuration changes required to turn that cluster into a Pseudo-Distributed Mode cluster.

Step 1 – Install Apache Hadoop in Local (Standalone) Mode: As a starting point we need an Apache Hadoop cluster in Local (Standalone) Mode. In an earlier post I described how to create such a cluster on CentOS 7 Linux running on VirtualBox (BigData Investigation 7 – Installing Apache Hadoop in Local (Standalone) Mode). Before executing the next steps, I cloned my Linux server in VirtualBox, just in case I need a Hadoop cluster in Local (Standalone) Mode again in the future.

Step 2 – Configure password-less ssh: My Hadoop cluster is running Hadoop 2.7.2. The documentation to configure Pseudo-Distributed Mode is here. I start with the configuration of password-less ssh to localhost. Ssh is required by the helper scripts which start and stop the Hadoop services, and by some other utilities. The Hadoop framework itself does not require ssh.

Log in as the user who installed the Hadoop code – on my system ‘storageulf’ – and then ssh to localhost. After adding the fingerprint of the server to the list of known hosts, the ssh command asks for a password. Abort the ssh command with ‘CTRL-C’.

login as: storageulf
storageulf@127.0.0.1's password:
Last login: Mon Sep 26 03:39:56 2016 from gateway

[storageulf@hadoop ~]$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is 01:4f:be:db:f7:60:2a:d5:ee:67:0b:aa:2e:60:f9:e7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
storageulf@localhost's password:

[storageulf@hadoop ~]$

So we need to create an ssh key pair with the Linux ssh-keygen command and add the public key to the authorized_keys file. Now we can ssh to localhost without being prompted for a password. Finally, log out from the remote shell to return to the login shell.

[storageulf@hadoop ~]$ ssh-keygen -t rsa -b 2048 -P '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/storageulf/.ssh/id_rsa.
Your public key has been saved in /home/storageulf/.ssh/id_rsa.pub.
The key fingerprint is:
ec:f0:9a:52:78:fc:f4:04:fd:e6:7c:8a:ed:af:9c:b6 storageulf@hadoop.storageulf
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|         .       |
|       .. .      |
|     o. S. .     |
|    . ++. . o    |
|     o ooo +     |
|    .  o. .++..  |
|     .o   .oE*.  |
+-----------------+

[storageulf@hadoop ~]$ ls ~/.ssh/
id_rsa  id_rsa.pub  known_hosts

[storageulf@hadoop ~]$ touch .ssh/authorized_keys
[storageulf@hadoop ~]$ chmod 0600 ~/.ssh/authorized_keys
[storageulf@hadoop ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

[storageulf@hadoop ~]$ ssh localhost
Last login: Tue Sep 27 11:58:10 2016 from gateway

[storageulf@hadoop ~]$ logout
logout
Connection to localhost closed.

[storageulf@hadoop ~]$
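
As an aside: on systems where OpenSSH’s ssh-copy-id utility is available, the manual touch/chmod/cat steps can be replaced by a single command, and the result can be verified non-interactively. A minimal sketch (not from my original session):

ssh-copy-id -i ~/.ssh/id_rsa.pub localhost      # appends the public key to ~/.ssh/authorized_keys
ssh -o BatchMode=yes localhost true && echo ok  # BatchMode fails instead of prompting if the key is not accepted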

Step 3 – Configure Hadoop environment variables: Next we need to configure JAVA_HOME in hadoop-env.sh. The scripts which start the Hadoop services fail if JAVA_HOME is not configured properly.

[storageulf@hadoop ~]$ set | grep JAVA_HOME
JAVA_HOME=/etc/alternatives/jre_openjdk

[storageulf@hadoop ~]$ which hadoop
~/hadoop/hadoop-2.7.2/bin/hadoop

[storageulf@hadoop ~]$ cd ~/hadoop/hadoop-2.7.2/etc/hadoop/

[storageulf@hadoop hadoop]$ cp hadoop-env.sh hadoop-env.orig

[storageulf@hadoop hadoop]$ vi hadoop-env.sh

[storageulf@hadoop hadoop]$ diff hadoop-env.sh hadoop-env.orig
25,26c25
< ###export JAVA_HOME=${JAVA_HOME}
< JAVA_HOME=/etc/alternatives/jre_openjdk
---
> export JAVA_HOME=${JAVA_HOME}

[storageulf@hadoop hadoop]$
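
For those who prefer a non-interactive edit over vi, the same change can be scripted with sed. A sketch that keeps the export statement intact (it assumes the default ‘export JAVA_HOME=${JAVA_HOME}’ line is still present):

cd ~/hadoop/hadoop-2.7.2/etc/hadoop/
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/etc/alternatives/jre_openjdk|' hadoop-env.sh
grep JAVA_HOME hadoop-env.sh    # confirm the substitution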

Step 4 – Configure Hadoop Distributed File System (HDFS): To configure HDFS, we need to overwrite two default settings. First we need to set the default filesystem to ‘hdfs://localhost’ in ‘core-site.xml’. The documentation also specifies a port (‘hdfs://localhost:9000’), but I decided not to specify one. The web is full of examples; some specify port 9000, others specify no port, but I have not found a single argument for port 9000. Therefore I decided to go with the default port, which is 8020/tcp.

Second, by default HDFS creates three replicas of each file. For our single-node cluster we need to set the number of replicas to ‘1’ in hdfs-site.xml. On a new system both files contain only comments, so we simply overwrite them. (The ‘cat > file’ commands below read the new content from the terminal; end the input with ‘CTRL-D’.)

[storageulf@hadoop hadoop]$ pwd
/home/storageulf/hadoop/hadoop-2.7.2/etc/hadoop

[storageulf@hadoop hadoop]$ cat > core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost</value>
    </property>
</configuration>

[storageulf@hadoop hadoop]$ cat > hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

[storageulf@hadoop hadoop]$
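
Whether the new settings are actually picked up can be checked with the hdfs getconf subcommand. A quick sanity check (commands only, not from my original session):

hdfs getconf -confKey fs.defaultFS     # should print hdfs://localhost
hdfs getconf -confKey dfs.replication  # should print 1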

Step 5 – Format the filesystem: Now we are ready to format the filesystem. The complete output of the hdfs format command is available on GitHub.

[storageulf@hadoop hadoop]$ cd

[storageulf@hadoop ~]$ which hdfs
~/hadoop/hadoop-2.7.2/bin/hdfs

[storageulf@hadoop ~]$ hdfs namenode -format
16/09/27 14:46:01 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop.storageulf/10.0.2.15
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.2
...
STARTUP_MSG:   java = 1.8.0_101
************************************************************/
16/09/27 14:46:01 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
16/09/27 14:46:01 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-d3cb987d-000a-4a81-8c1c-28aaf4be867b
...
16/09/27 14:46:02 INFO blockmanagement.BlockManager: defaultReplication         = 1
...
16/09/27 14:46:02 INFO namenode.FSNamesystem: fsOwner             = storageulf (auth:SIMPLE)
16/09/27 14:46:02 INFO namenode.FSNamesystem: supergroup          = supergroup
16/09/27 14:46:02 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/09/27 14:46:02 INFO namenode.FSNamesystem: HA Enabled: false
16/09/27 14:46:02 INFO namenode.FSNamesystem: Append Enabled: true
16/09/27 14:46:02 INFO namenode.FSDirectory: ACLs enabled? false
16/09/27 14:46:02 INFO namenode.FSDirectory: XAttrs enabled? true
16/09/27 14:46:02 INFO namenode.FSDirectory: Maximum size of an xattr: 16384
...
16/09/27 14:46:02 INFO common.Storage: Storage directory /tmp/hadoop-storageulf/dfs/name has been successfully formatted.
16/09/27 14:46:02 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/09/27 14:46:02 INFO util.ExitUtil: Exiting with status 0
16/09/27 14:46:02 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop.storageulf/10.0.2.15
************************************************************/

[storageulf@hadoop ~]$
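
One detail worth noting in the output above: the NameNode metadata was written to /tmp/hadoop-storageulf/dfs/name, because hadoop.tmp.dir defaults to /tmp/hadoop-${user.name} and the HDFS storage directories derive from it. Since /tmp may be cleaned at reboot, a persistent setup would set hadoop.tmp.dir in core-site.xml before formatting. A sketch (the path is a hypothetical example; I kept the default on my test system):

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/storageulf/hadoop/tmp</value>
    </property>
</configuration>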

Step 6 – Configure MapReduce and YARN: To run MapReduce on YARN in Pseudo-Distributed Mode we need to set a few additional parameters. Again, the respective configuration files contain only comments, so we overwrite them completely.

[storageulf@hadoop ~]$ cd hadoop/hadoop-2.7.2/etc/hadoop/

[storageulf@hadoop hadoop]$ cat > mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

[storageulf@hadoop hadoop]$ cat > yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
[storageulf@hadoop hadoop]$ cd ~

[storageulf@hadoop ~]$

Step 7 – Start the services: Finally we are ready to start the services. We start HDFS (start-dfs.sh), YARN (start-yarn.sh), and the MapReduce Job History Server (mr-jobhistory-daemon.sh).

[storageulf@hadoop ~]$ which start-dfs.sh
~/hadoop/hadoop-2.7.2/sbin/start-dfs.sh

[storageulf@hadoop ~]$ start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/hadoop-storageulf-namenode-hadoop.storageulf.out
localhost: starting datanode, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/hadoop-storageulf-datanode-hadoop.storageulf.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 01:4f:be:db:f7:60:2a:d5:ee:67:0b:aa:2e:60:f9:e7.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/hadoop-storageulf-secondarynamenode-hadoop.storageulf.out

[storageulf@hadoop ~]$ which start-yarn.sh
~/hadoop/hadoop-2.7.2/sbin/start-yarn.sh

[storageulf@hadoop ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/yarn-storageulf-resourcemanager-hadoop.storageulf.out
localhost: starting nodemanager, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/yarn-storageulf-nodemanager-hadoop.storageulf.out

[storageulf@hadoop ~]$ which mr-jobhistory-daemon.sh
~/hadoop/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh

[storageulf@hadoop ~]$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/mapred-storageulf-historyserver-hadoop.storageulf.out

[storageulf@hadoop ~]$
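
The matching stop scripts live in the same sbin directory and shut the services down again. A sketch of a clean shutdown, in reverse order of the startup:

mr-jobhistory-daemon.sh stop historyserver
stop-yarn.sh
stop-dfs.sh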

Step 8 – Check results: Now our Pseudo-Distributed cluster is completely configured. Let’s check which processes are running. There are six Hadoop processes running: namenode, datanode, and secondarynamenode for HDFS; resourcemanager and nodemanager for YARN; and historyserver for the MapReduce Job History Server.

[storageulf@hadoop ~]$ ps x
  PID TTY      STAT   TIME COMMAND
14937 ?        S      0:00 sshd: storageulf@pts/0
14938 pts/0    Ss     0:00 -bash
16257 ?        Sl     0:05 /etc/alternatives/jre_openjdk/bin/java -Dproc_namenode ...
16378 ?        Sl     0:04 /etc/alternatives/jre_openjdk/bin/java -Dproc_datanode ... 
16533 ?        Sl     0:03 /etc/alternatives/jre_openjdk/bin/java -Dproc_secondarynamenode ...
16685 pts/0    Sl     0:09 /etc/alternatives/jre_openjdk/bin/java -Dproc_resourcemanager ...
16799 ?        Sl     0:07 /etc/alternatives/jre_openjdk/bin/java -Dproc_nodemanager ...
17123 pts/0    Sl     0:05 /etc/alternatives/jre_openjdk/bin/java -Dproc_historyserver ...
17315 pts/0    R+     0:00 ps x
[storageulf@hadoop ~]$
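
If a full JDK is installed (jps ships with the JDK, not with the JRE), the same check is more compact with jps, which lists the running JVMs by their main class:

jps    # should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, JobHistoryServer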

The six services listen on plenty of TCP ports and establish a few sessions among themselves. I will study the usage of these ports in a later post.

[storageulf@hadoop ~]$ lsof -i -P
COMMAND   PID       USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    16257 storageulf  183u  IPv4  71367      0t0  TCP *:50070 (LISTEN)
java    16257 storageulf  200u  IPv4  71381      0t0  TCP localhost:8020 (LISTEN)
java    16257 storageulf  210u  IPv4  72422      0t0  TCP localhost:8020->localhost:34896 (ESTABLISHED)
java    16378 storageulf  183u  IPv4  72179      0t0  TCP *:50010 (LISTEN)
java    16378 storageulf  187u  IPv4  72190      0t0  TCP localhost:41897 (LISTEN)
java    16378 storageulf  210u  IPv4  72406      0t0  TCP *:50075 (LISTEN)
java    16378 storageulf  213u  IPv4  72413      0t0  TCP *:50020 (LISTEN)
java    16378 storageulf  226u  IPv4  72418      0t0  TCP localhost:34896->localhost:8020 (ESTABLISHED)
java    16533 storageulf  194u  IPv4  73412      0t0  TCP *:50090 (LISTEN)
java    16685 storageulf  193u  IPv6  74413      0t0  TCP *:8031 (LISTEN)
java    16685 storageulf  204u  IPv6  74434      0t0  TCP *:8030 (LISTEN)
java    16685 storageulf  214u  IPv6  74639      0t0  TCP *:8032 (LISTEN)
java    16685 storageulf  224u  IPv6  76657      0t0  TCP *:8088 (LISTEN)
java    16685 storageulf  231u  IPv6  78101      0t0  TCP *:8033 (LISTEN)
java    16685 storageulf  241u  IPv6  78121      0t0  TCP hadoop.storageulf:8031->hadoop.storageulf:44274 (ESTABLISHED)
java    16799 storageulf  200u  IPv6  78083      0t0  TCP *:39078 (LISTEN)
java    16799 storageulf  211u  IPv6  78092      0t0  TCP *:8040 (LISTEN)
java    16799 storageulf  221u  IPv6  78096      0t0  TCP *:13562 (LISTEN)
java    16799 storageulf  222u  IPv6  78111      0t0  TCP *:8042 (LISTEN)
java    16799 storageulf  231u  IPv6  78113      0t0  TCP hadoop.storageulf:44274->hadoop.storageulf:8031 (ESTABLISHED)
java    17123 storageulf  200u  IPv4  78420      0t0  TCP *:10033 (LISTEN)
java    17123 storageulf  211u  IPv4  78444      0t0  TCP *:19888 (LISTEN)
java    17123 storageulf  218u  IPv4  78449      0t0  TCP *:10020 (LISTEN)

[storageulf@hadoop ~]$
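
Three of the listening ports in the output above belong to the standard Hadoop 2.x web interfaces: 50070 (NameNode), 8088 (YARN ResourceManager), and 19888 (MapReduce Job History Server). A quick reachability check from the shell (commands only, not from my original session):

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:50070/   # NameNode web UI
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8088/    # ResourceManager web UI
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:19888/   # Job History Server web UI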

Ulf’s Conclusion

Apache Hadoop installs by default in Local (Standalone) Mode. Turning a freshly installed Apache Hadoop cluster into Pseudo-Distributed Mode is straightforward. It only requires adjusting a few configuration files and starting the services. Each Hadoop service (e.g. HDFS NameNode, HDFS DataNode, HDFS Secondary NameNode, YARN ResourceManager, YARN NodeManager, MapReduce Job History Server) runs as a separate Linux process in its own Java Virtual Machine (JVM).

In the next post I will explain how to run the example Hadoop Streaming application from earlier posts on my new Pseudo-Distributed Mode Apache Hadoop cluster, to better understand the differences between Local (Standalone) Mode and Pseudo-Distributed Mode.

Changes:
2016/10/28 added link – “on my new Pseudo-Distributed Mode Apache Hadoop Cluster” => BigData Investigation 10 – Using Hadoop Streaming on Hadoop Cluster in Pseudo-Distributed Mode
