There are plenty of online courses available that introduce Hadoop. Though, as an old hand, I prefer a book. I browsed my preferred online bookstore and ordered “Hadoop: The Definitive Guide” by Tom White.
I chose this book for several reasons. First, the book provides plenty of code examples, which can be downloaded from GitHub. Second, Appendix A includes detailed instructions on how to install Hadoop on a single machine. Both are very important to me, because I want to try many examples on a live system. Third, the Hadoop Distributed Filesystem (HDFS) is covered in detail in one of the first chapters of the book. This is an additional plus, given my interest in the storage aspects of BigData.
The book is structured in five parts. Part I introduces the fundamental components: MapReduce, HDFS, YARN and Hadoop I/O. This is a nice surprise. I had heard about Pig, Hive, Flume, Spark, HBase, Oozie, ZooKeeper and a lot of other stuff with fancy names. I am relieved that only four components are required to get started with Hadoop. Part II discusses MapReduce in depth, and Part III completes the basics by describing the installation and administration of Hadoop clusters.
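To get a feel for what the MapReduce model actually does before diving into the book's Java examples, here is a minimal sketch in plain Python. This is not Hadoop code, just an illustration of the map, shuffle and reduce phases that Hadoop distributes across a cluster, using the classic word-count example:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"])  # 2
```

In real Hadoop the mapper and reducer run as separate tasks on different nodes, and the shuffle moves data over the network, but the programming model is exactly this simple.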
Parts IV and V cover advanced topics. Part IV goes into the details of the additional components mentioned above (e.g. Pig, Hive, Spark): ten chapters, each dedicated to a different component. Part V presents some interesting case studies on health care and life science. Finally, an appendix provides supplemental information.
I have already taken a quick glance at Part I. The MapReduce chapter includes plenty of examples which I want to try on a live system, so I urgently need to set up my own Hadoop cluster.
In the next post I will explain how to set up a single-node Hadoop cluster.
2016/09/02 – added link – “how to setup a single-node Hadoop cluster” => BigData Investigation 3 – Installing the Cloudera QuickStart VM on VirtualBox