BigData Investigation 3 – Installing the Cloudera QuickStart VM on VirtualBox

In this post I will show how to install the Cloudera QuickStart VM on VirtualBox. I need a Hadoop cluster to try the examples in the Hadoop Book. Appendix A of the book describes how to install Hadoop. However, it also suggests using a virtual machine (VM) that comes with a preconfigured, single-node Hadoop cluster. In particular, the book refers to the Cloudera QuickStart VM.

Cloudera is a company that provides Hadoop-based software, support and services. The Cloudera Hadoop distribution is called ‘Cloudera’s Distribution including Apache Hadoop (CDH)’. Cloudera is to Hadoop what Red Hat or SUSE are to Linux. Please note that there are many other Hadoop-based offerings.

Step 1 – Download the VM: Download the Cloudera QuickStart VM from the Cloudera website. At the time of writing this post, the Cloudera QuickStart VM is available for Docker, KVM, VMware and VirtualBox. I have downloaded the VM for VirtualBox with CDH 5.7. The zipped VM is about 5 GB in size, so the download takes a few minutes. The archive needs to be extracted before the VM can be imported into VirtualBox.

Step 2 – Import the VM: VirtualBox is a free, easy-to-use hypervisor for x86 processors provided by Oracle. It runs on Linux, Solaris, OS X and Windows. Open the ‘File->Import Appliance’ menu of the VirtualBox GUI to import the extracted VM.

Browse to the directory where the VM was extracted, select it and press the ‘Open’ button.

Walk through the next panels without changing any settings. The last screen shows the configuration of the virtual server. Press the ‘Import’ button to import the VM.

It takes a few minutes to complete the import of the VM.
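The GUI import above can also be scripted with VirtualBox's command-line tool VBoxManage. The following is only a minimal sketch under assumptions: the file name in OVF_FILE is a placeholder (use the .ovf file from the directory where the download was extracted), and run is a hypothetical helper that merely prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: import the extracted appliance from the command line.
# Assumption: OVF_FILE is a placeholder; replace it with the .ovf file
# found in the directory where the download was extracted.
OVF_FILE="${OVF_FILE:-cloudera-quickstart-vm.ovf}"

# Hypothetical helper: print the command by default and execute it
# only when DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# VBoxManage import reads the appliance description and creates the VM.
run VBoxManage import "$OVF_FILE"
```

Run it once as is to check the printed command, then with DRY_RUN=0 to perform the import.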

Step 3 – Start the VM: Next the VM needs to be started. Select the QuickStart VM and press the ‘Start’ button with the green arrow at the top of the VirtualBox GUI.

Booting the Cloudera QuickStart VM takes some time because many processes are started. Finally, the Linux desktop appears and shows a web page with the Hadoop cluster status and instructions for the QuickStart VM. I skip the Cloudera tutorial because I want to try the examples in the Hadoop Book. Maybe I will look into the tutorial later.
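Starting the VM does not require the GUI either. The sketch below assumes a placeholder VM name in VM_NAME (the real name is whatever ‘VBoxManage list vms’ reports after the import) and again uses a hypothetical run helper that only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: list the registered VMs and start the QuickStart VM.
# Assumption: VM_NAME is a placeholder; take the real name from the
# output of "VBoxManage list vms".
VM_NAME="${VM_NAME:-cloudera-quickstart-vm}"

# Hypothetical helper: print the command by default and execute it
# only when DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Show the names of all registered VMs.
run VBoxManage list vms

# Start the VM with its normal window; "--type headless" would start it
# without a window, which is sufficient if you only work via SSH.
run VBoxManage startvm "$VM_NAME" --type gui
```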

Step 4 – Enabling SSH: I prefer to work on the VM via PuTTY and SSH. A port forwarding rule for SSH needs to be configured to allow SSH connections from my laptop to the virtual machine. Select the VM in VirtualBox and click on ‘Network’.

Expand the ‘Advanced’ options and press the ‘Port Forwarding’ button.

The next panel lets you specify port forwarding rules. Press the symbol in the upper right corner to add a new rule. In my example I mapped port 2000/tcp of my laptop to port 22/tcp of the VM. Port 22/tcp is the default port for SSH. Click the ‘OK’ button to activate the new rule.
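The same forwarding rule can be expressed on the command line. This is a sketch under assumptions: VM_NAME is a placeholder, the rule name ‘ssh’ is chosen freely, the first network adapter of the VM is assumed to be in NAT mode, and natpf_rule and run are hypothetical helpers. The rule string has the form name,protocol,host IP,host port,guest IP,guest port; the two IP fields are left empty here.

```shell
#!/bin/sh
# Sketch: forward host port 2000/tcp to guest port 22/tcp (SSH).
# Assumption: VM_NAME is a placeholder for the imported VM's name.
VM_NAME="${VM_NAME:-cloudera-quickstart-vm}"

# Hypothetical helper: print the command by default and execute it
# only when DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: build a TCP rule string from rule name,
# host port and guest port, leaving host IP and guest IP empty.
natpf_rule() {
  printf '%s,tcp,,%s,,%s' "$1" "$2" "$3"
}

RULE="$(natpf_rule ssh 2000 22)"

# --natpf1 adds the rule to the first network adapter (assumed NAT).
run VBoxManage modifyvm "$VM_NAME" --natpf1 "$RULE"
```

Note that modifyvm works on a powered-off VM; for a running VM, ‘VBoxManage controlvm’ with the natpf1 subcommand and the same rule string is the counterpart.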

Step 5 – SSH login: I use PuTTY to open an SSH session from my Windows laptop to the Linux server. Open PuTTY, then enter the IP address (127.0.0.1) and the port (2000). The port needs to be the same as the Host Port specified in the previous step. Press the ‘Open’ button to start an SSH session to the VM.

PuTTY opens an SSH session and I can easily log on as user ‘cloudera’ using password ‘cloudera’.
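On a machine with an OpenSSH client, the same login works without PuTTY. The sketch below reuses the address, port and user from this step; run is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: connect to the VM through the forwarded port with OpenSSH.
SSH_HOST="127.0.0.1"   # the forwarded port is reached via the local host
SSH_PORT="2000"        # host port of the port forwarding rule
SSH_USER="cloudera"    # user of the QuickStart VM

# Hypothetical helper: print the command by default and execute it
# only when DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Interactive login; ssh prompts for the password.
run ssh -p "$SSH_PORT" "$SSH_USER@$SSH_HOST"

# Alternatively, run a single command on the VM without a login shell.
run ssh -p "$SSH_PORT" "$SSH_USER@$SSH_HOST" hadoop version
```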
A quick test shows that Hadoop is ready to use. The descriptions of classpath and credential seem to be mixed up in the output below, though this is what I got. I checked twice.

[cloudera@quickstart ~]$ which hadoop
/usr/bin/hadoop

[cloudera@quickstart ~]$ hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
  credential           interact with credential providers
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

[cloudera@quickstart ~]$

Ulf’s Conclusion

The Cloudera QuickStart VM is indeed easy to install. I am happy that I now have a system on which I can try the exercises of the Hadoop Book on my own.

In the next post I will explain the basics of MapReduce.

Changes:
2016/09/09 – added link – “basics of MapReduce” => BigData Investigation 4 – MapReduce Explained
