In this post I will show how to install the Cloudera QuickStart VM on VirtualBox. I need a Hadoop cluster to try the examples in the Hadoop Book. Appendix A of the book describes how to install Hadoop, but it also hints at using a virtual machine (VM) that comes with a pre-configured, single-node Hadoop cluster. In particular, the book refers to the Cloudera QuickStart VM.
Cloudera is a company that provides Hadoop-based software, support and services. Its Hadoop distribution is called ‘Cloudera’s Distribution including Apache Hadoop (CDH)’. Cloudera is to Hadoop roughly what Red Hat or SUSE are to Linux. Please note that there are many other Hadoop-based offerings.
Step 1 – Download the VM: First, download the Cloudera QuickStart VM. At the time of writing this post, the Cloudera QuickStart VM is available for Docker, KVM, VMware and VirtualBox. I have downloaded the VM for VirtualBox with CDH 5.7. The size of the zipped VM is about 5 GB, so the download takes a few minutes. The downloaded archive needs to be extracted before the VM can be imported into VirtualBox.

Step 2 – Import the VM: VirtualBox is a free and easy-to-use hypervisor for x86 processors provided by Oracle. It runs on Linux, Solaris, OS X and Windows. Open the ‘File->Import Appliance’ menu of the VirtualBox GUI to import the extracted VM.
Browse to the directory where the VM was extracted, select it and press the ‘Open’ button.
Walk through the next panels without changing any settings. The last screen shows the configuration of the virtual server. Press the ‘Import’ button to import the VM.
It takes a few minutes to complete the import of the VM.
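For readers who prefer the command line, the same import can probably be done with VBoxManage, the command-line tool that ships with VirtualBox. This is only a sketch: the file name is a placeholder (use the .ovf file from your extracted download), and `import_appliance` is a helper name made up for this example.

```shell
# Placeholder path: point this at the .ovf file from the extracted download.
OVF_FILE="${OVF_FILE:-./cloudera-quickstart-vm.ovf}"

import_appliance() {
    # Import with the appliance's default settings, like clicking through
    # the GUI panels unchanged, then list the VMs VirtualBox knows about.
    VBoxManage import "$1" && VBoxManage list vms
}

if command -v VBoxManage >/dev/null 2>&1 && [ -f "$OVF_FILE" ]; then
    import_appliance "$OVF_FILE"
else
    echo "Skipping import: VBoxManage or $OVF_FILE not found"
fi
```

The name printed by `VBoxManage list vms` is the one to use in any later VBoxManage command.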
Step 3 – Start the VM: Next the VM needs to be started. Select the QuickStart VM and press the ‘Start’ button with the green arrow at the top of the VirtualBox GUI.
Booting the Cloudera QuickStart VM takes some time because many processes are started. Finally the Linux desktop appears and shows a web page with the Hadoop cluster status and instructions for the QuickStart VM. I skip the Cloudera tutorial because I want to try the examples in the Hadoop Book. Maybe I will look into the tutorial later.
Step 4 – Enabling SSH: I prefer to work on the VM via PuTTY and SSH. A port forwarding rule for SSH needs to be configured to allow SSH connections from my laptop to the virtual machine. Select the VM in VirtualBox and click on ‘Network’.
Expand the ‘Advanced’ options and press the ‘Port Forwarding’ button.
The next panel allows you to specify port forwarding rules. Press the symbol in the upper right corner to add a rule. In my example I mapped port 2000/tcp of my laptop to port 22/tcp of the VM. Port 22/tcp is the default port for SSH. Click the ‘OK’ button to activate the new rule.
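The same rule can also be expressed on the command line. The following is a minimal sketch, not something taken from the GUI walkthrough above: the VM name is a placeholder, the rule name ‘ssh’ is my own choice, and both IP fields are left empty on the assumption that they were left blank in the GUI dialog as well.

```shell
# Placeholder VM name: check the real one with `VBoxManage list vms`.
VM_NAME="${VM_NAME:-cloudera-quickstart-vm}"
HOST_PORT=2000    # port on the laptop, as in the post
GUEST_PORT=22     # default SSH port inside the VM

# Rule format: name,protocol,host IP,host port,guest IP,guest port.
# Both IP fields are left empty here.
RULE="ssh,tcp,,${HOST_PORT},,${GUEST_PORT}"

if command -v VBoxManage >/dev/null 2>&1; then
    # controlvm changes a running VM; for a powered-off VM use
    # VBoxManage modifyvm "$VM_NAME" --natpf1 "$RULE" instead.
    VBoxManage controlvm "$VM_NAME" natpf1 "$RULE"
else
    echo "VBoxManage not found; the rule would be: $RULE"
fi
```

The `natpf1` part refers to the first network adapter of the VM, which is assumed to be the NAT adapter here.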
Step 5 – SSH login: I use PuTTY to SSH from my Windows laptop to the Linux server. Open PuTTY and enter the IP address (127.0.0.1) and the port (2000). The port needs to be the same as the Host Port specified in the previous step. Press the ‘Open’ button to start an SSH session to the VM.
PuTTY opens an SSH session, and I can log on as user ‘cloudera’ with the password ‘cloudera’.
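On a machine with an OpenSSH client instead of PuTTY (Linux or OS X, for example), the login should look roughly like the sketch below. The host, port and user are the values used in this post; `vm_ssh` is a hypothetical helper name introduced only for this example.

```shell
# Values from the post: forwarded host port 2000, user 'cloudera'.
SSH_USER="cloudera"
SSH_HOST="127.0.0.1"
SSH_PORT=2000

# vm_ssh is a hypothetical helper. Without arguments it opens an
# interactive session; with arguments it runs that command in the VM.
vm_ssh() {
    ssh -p "$SSH_PORT" "${SSH_USER}@${SSH_HOST}" "$@"
}
```

Calling `vm_ssh` prompts for the password and opens a shell, while `vm_ssh which hadoop` runs a single command in the VM and returns.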
A quick test shows that Hadoop is ready to use. The descriptions for classpath and credential seem to be mixed up in the output, though this is what I got. I checked twice.
    [cloudera@quickstart ~]$ which hadoop
    /usr/bin/hadoop
    [cloudera@quickstart ~]$ hadoop
    Usage: hadoop [--config confdir] COMMAND
           where COMMAND is one of:
      fs                   run a generic filesystem user client
      version              print the version
      jar <jar>            run a jar file
      checknative [-a|-h]  check native hadoop and compression libraries availability
      distcp <srcurl> <desturl> copy file or directories recursively
      archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
      classpath            prints the class path needed to get the
      credential           interact with credential providers
                           Hadoop jar and the required libraries
      daemonlog            get/set the log level for each daemon
      trace                view and modify Hadoop tracing settings
     or
      CLASSNAME            run the class named CLASSNAME

    Most commands print help when invoked w/o parameters.
    [cloudera@quickstart ~]$
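A slightly stronger check than the usage message is to run two of the subcommands listed above. This is a sketch meant to be run inside the VM; `check_hadoop` is a made-up helper name, and I would expect the listing to show the root directory of HDFS on the QuickStart VM, although that depends on the configured default filesystem.

```shell
# check_hadoop is a hypothetical helper; run it inside the VM.
check_hadoop() {
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop is not on the PATH"
        return 1
    fi
    # Print the installed version, then list the root directory
    # of the default filesystem.
    hadoop version && hadoop fs -ls /
}

check_hadoop || echo "Hadoop check failed"
```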
The Cloudera QuickStart VM is indeed easy to install. I am happy that I now have a system where I can try the exercises of the Hadoop book on my own.
In the next post I will explain the basics of MapReduce.
2016/09/09 – added link – “basics of MapReduce” => BigData Investigation 4 – MapReduce Explained