BigData Investigation 10 – Using Hadoop Streaming on Hadoop Cluster in Pseudo-Distributed Mode

In this post I will explain how to run the Hadoop Streaming utility on a Hadoop Cluster in Pseudo-Distributed Mode. Hadoop Streaming turns executables or scripts into a MapReduce job and submits that job to a Hadoop cluster. In an earlier post I explained how to run Hadoop Streaming in Standalone (Local) Mode. Standalone (Local) Mode runs everything in a single Java Virtual Machine (JVM). Pseudo-Distributed Mode is also a single-node cluster, but the Hadoop services run in separate JVMs, which makes it closer to a production configuration than Standalone (Local) Mode.
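
The mapper and reducer that I will use below are the max_temperature Python scripts from the Hadoop Book. In case you have not seen a streaming job before, here is a minimal sketch of what such a pair of scripts looks like: the mapper reads NCDC records from stdin and writes tab-separated key/value pairs to stdout, and the reducer reads the sorted pairs and keeps the maximum value per key. This is only a simplified illustration using the fixed-width offsets of the book's NCDC sample data; the scripts that actually ship with the book may differ in detail.

A possible mapper (simplified max_temperature_map.py):

#!/usr/bin/env python
# Sketch of a streaming mapper: stdin -> tab-separated key/value pairs on stdout.
import re
import sys

for line in sys.stdin:
    record = line.strip()
    # NCDC fixed-width record: year, air temperature and quality code
    year, temp, quality = record[15:19], record[87:92], record[92:93]
    if temp != "+9999" and re.match("[01459]", quality):
        print("%s\t%s" % (year, int(temp)))

A matching reducer (simplified max_temperature_reduce.py):

#!/usr/bin/env python
# Sketch of a streaming reducer: the input arrives sorted by key, so all
# values for one year are adjacent and the maximum can be tracked in one pass.
import sys

last_key, max_val = None, None
for line in sys.stdin:
    key, val = line.strip().split("\t")
    if last_key is not None and key != last_key:
        print("%s\t%s" % (last_key, max_val))
        max_val = None
    last_key = key
    max_val = int(val) if max_val is None else max(max_val, int(val))

if last_key is not None:
    print("%s\t%s" % (last_key, max_val))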

The starting point for this blog post is a single-node Apache Hadoop Cluster configured in Pseudo-Distributed Mode. In the previous post (BigData Investigation 9 – Installing Apache Hadoop in Pseudo-Distributed Mode) I described how to configure such a cluster. I highly recommend reading that post before you continue with this one.

As a first step we need to log in as the user who configured Hadoop – storageulf on my system – and create a home directory for that user in the Hadoop Distributed File System (HDFS). The default HDFS home directory path is ‘/user/<username>’.

login as: storageulf
storageulf@127.0.0.1's password:

[storageulf@hadoop ~]$ hdfs dfs -ls /
Found 1 items
drwxrwx---   - storageulf supergroup          0 2016-09-27 15:30 /tmp

[storageulf@hadoop ~]$ hdfs dfs -mkdir /user

[storageulf@hadoop ~]$ hdfs dfs -mkdir /user/storageulf

[storageulf@hadoop ~]$ hdfs dfs -ls /
Found 2 items
drwxrwx---   - storageulf supergroup          0 2016-09-27 15:30 /tmp
drwxr-xr-x   - storageulf supergroup          0 2016-09-27 19:20 /user

[storageulf@hadoop ~]$ hdfs dfs -ls /user
Found 1 items
drwxr-xr-x   - storageulf supergroup          0 2016-09-27 19:20 /user/storageulf

Next we need to copy the example data from the Linux filesystem to HDFS. For my BigData investigation I am using the example data and example code that come with the Hadoop Book. On my system all example files are already available under /home/storageulf/hadoop-book in the Linux filesystem. You can get all example files with the following command: ‘git clone https://github.com/tomwhite/hadoop-book.git’. See this post for more details. I am copying all example files to HDFS in case I want to try other examples later on, and to get some more files into HDFS.

[storageulf@hadoop ~]$ ls
directory  hadoop  hadoop-book  output

[storageulf@hadoop ~]$ hdfs dfs -put hadoop-book /user/storageulf

[storageulf@hadoop ~]$ hdfs dfs -ls /user/storageulf
Found 1 items
drwxr-xr-x   - storageulf supergroup          0 2016-09-27 19:22 /user/storageulf/hadoop-book

[storageulf@hadoop ~]$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - storageulf supergroup          0 2016-09-27 19:22 hadoop-book

[storageulf@hadoop ~]$

Here is a quick check to validate the path to the example data which I need for this post.

[storageulf@hadoop ~]$ hdfs dfs -ls hadoop-book/input/ncdc/sample.txt
-rw-r--r--   1 storageulf supergroup        529 2016-09-27 19:22 hadoop-book/input/ncdc/sample.txt

[storageulf@hadoop ~]$

Finally we need to run the Hadoop job. We need to slightly adjust the syntax of the command which we used on the cluster in Local (Standalone) Mode – see that post for details. For your convenience I am copying that command here.

[storageulf@hadoop ~]$ hadoop jar /home/storageulf/hadoop/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-input ~/hadoop-book/input/ncdc/sample.txt \
-output ~/output \
-mapper ~/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_map.py \
-reducer ~/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_reduce.py

To run the same Python scripts in Pseudo-Distributed Mode we only need to point the input to the file with the example data in HDFS. That’s it. The mapper and the reducer scripts are still read from the Linux filesystem. To Hadoop it is actually the option, not the path prefix, that decides: -input and -output are resolved against HDFS (the cluster’s default filesystem), while -mapper and -reducer refer to executables on the Linux filesystem. The path prefixes (‘/user/…’ vs. ‘/home/…’) just make this visible. As usual I made the complete output available on GitHub.

[storageulf@hadoop ~]$ hadoop jar /home/storageulf/hadoop/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-input /user/storageulf/hadoop-book/input/ncdc/sample.txt \
-output output2 \
-mapper /home/storageulf/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_map.py \
-reducer /home/storageulf/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_reduce.py
...
16/09/27 20:02:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
...
16/09/27 20:02:59 INFO mapred.FileInputFormat: Total input paths to process : 1
...
16/09/27 20:03:54 INFO streaming.StreamJob: Output directory: output2

The result of the job is available in the file ‘part-00000’, and its content is – as expected – the same as on the cluster in Local (Standalone) Mode.

[storageulf@hadoop ~]$ hdfs dfs -ls /user/storageulf/
Found 4 items
drwxr-xr-x   - storageulf supergroup          0 2016-09-27 19:22 /user/storageulf/hadoop-book
drwxr-xr-x   - storageulf supergroup          0 2016-09-27 19:57 /user/storageulf/output
drwxr-xr-x   - storageulf supergroup          0 2016-09-27 20:01 /user/storageulf/output1
drwxr-xr-x   - storageulf supergroup          0 2016-09-27 20:03 /user/storageulf/output2

[storageulf@hadoop ~]$ hdfs dfs -ls /user/storageulf/output2
Found 2 items
-rw-r--r--   1 storageulf supergroup          0 2016-09-27 20:03 /user/storageulf/output2/_SUCCESS
-rw-r--r--   1 storageulf supergroup         17 2016-09-27 20:03 /user/storageulf/output2/part-00000

[storageulf@hadoop ~]$ hdfs dfs -cat /user/storageulf/output2/part-00000
1949    111
1950    22

[storageulf@hadoop ~]$
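
As a quick cross-check without the cluster, the same map/sort/reduce chain can be reproduced by piping the sample file through the two scripts locally. Here is a small sketch of that idea (it assumes Python 3, because it relies on the input argument of subprocess.check_output, and it assumes the two scripts are executable; the hard-coded paths match my home directory layout):

#!/usr/bin/env python
# Sketch: run the map -> sort -> reduce chain locally, without Hadoop,
# and compare the output with part-00000 from the streaming job.
import subprocess

MAPPER = "/home/storageulf/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_map.py"
REDUCER = "/home/storageulf/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_reduce.py"
SAMPLE = "/home/storageulf/hadoop-book/input/ncdc/sample.txt"

with open(SAMPLE, "rb") as sample:
    mapped = subprocess.check_output([MAPPER], stdin=sample)      # map phase
shuffled = subprocess.check_output(["sort"], input=mapped)        # shuffle/sort
result = subprocess.check_output([REDUCER], input=shuffled)       # reduce phase
print(result.decode())

If everything is set up as described, this should print the same two lines as the cat of part-00000 above.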

Last but not least I checked whether files stored in HDFS can be found in the Linux filesystem, but that is not the case.

[storageulf@hadoop ~]$ sudo updatedb
[sudo] password for storageulf:

[storageulf@hadoop ~]$ locate sample.txt
/home/storageulf/hadoop-book/ch06-mr-dev/input/ncdc/micro/sample.txt
/home/storageulf/hadoop-book/ch18-crunch/src/test/resources/sample.txt
/home/storageulf/hadoop-book/input/ncdc/sample.txt
/home/storageulf/hadoop-book/input/ncdc/sample.txt.gz
/home/storageulf/hadoop-book/input/ncdc/micro/sample.txt
/home/storageulf/hadoop-book/input/ncdc/micro-tab/sample.txt
[storageulf@hadoop ~]$
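
The reason is that HDFS does not store files one-to-one in the Linux filesystem: the DataNode keeps the file content as numbered block files (blk_…) in its data directory, so the original file names never show up for locate. The following sketch lists those block files; it assumes the default data directory file://${hadoop.tmp.dir}/dfs/data with hadoop.tmp.dir left at its default of /tmp/hadoop-<username>. If dfs.datanode.data.dir was set to something else in hdfs-site.xml, adjust DATA_DIR accordingly.

#!/usr/bin/env python
# Sketch: list HDFS block files (blk_*) as they are stored on the local disk.
# DATA_DIR assumes the default DataNode data directory; adjust it if
# dfs.datanode.data.dir points somewhere else on your system.
import getpass
import os

DATA_DIR = "/tmp/hadoop-%s/dfs/data" % getpass.getuser()

for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        if name.startswith("blk_"):
            path = os.path.join(root, name)
            print("%10d bytes  %s" % (os.path.getsize(path), path))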

Ulf’s Conclusion

Running the Hadoop Streaming example application on our home-made Hadoop Cluster in Pseudo-Distributed Mode delivers the same results as on the home-made Hadoop Cluster in Local (Standalone) Mode and on the Cloudera QuickStart VM. Again, this was expected, but it is good to know. Having configured Apache Hadoop in Standalone (Local) Mode and Pseudo-Distributed Mode and having run the Hadoop Streaming example on both configurations, I now feel pretty comfortable with Hadoop. I am by far not an expert, but I feel ready to dive deeper into Hadoop & Co.

In the next post I will introduce the WebUIs of the various services which are running in a Hadoop Cluster. I have already peeked at the ports where the Hadoop services are listening and found some interesting information which improves the understanding of Hadoop.

 
