Swathi V


Hadoop Hangover: Launch a hadoop cluster CDH4 using Apache Whirr

This post explains how to launch a CDH4 MRv1 or CDH4 YARN cluster on EC2 instances. It is said that, with the help of Whirr, you can launch a cluster in a matter of 5 minutes! That is true if, and only if, everything works out well!

Hopefully, this article helps you in that regard.

So, let’s row the boat…

  • Download the stable release of Apache Whirr, i.e. whirr-0.8.1.tar.gz.
  • Extract the tarball
  • $ tar -xzvf whirr-0.8.1.tar.gz
    $ cd whirr-0.8.1
  • Generate the key pair (the whirr.private-key-file and whirr.public-key-file settings in the properties file must point to whichever key pair you use)
  • $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
  • Create a properties file that describes the cluster you want to launch.
  • # Cluster name goes here
    whirr.cluster-name=testcluster
     
    # Change the number of machines in the cluster here
    # Using 3 DN+TT nodes and 1 JT+NN node
    # Ganglia is configured
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode+ganglia-monitor+ganglia-metad,3 hadoop-datanode+hadoop-tasktracker+ganglia-monitor
     
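    # Install Java: choose ONE install function and comment out the other,
    # since setting the same key twice is ambiguous
    whirr.java.install-function=install_openjdk
    #whirr.java.install-function=install_oab_java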
     
    ## Install CDH4 MRV1
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    whirr.env.REPO=cdh4
     
    # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
    whirr.provider=aws-ec2
    whirr.hardware-id=c1.xlarge
     
    # Credentials should go here
    whirr.identity=XXXXXXXXXXXXXXXXX
    whirr.credential=XXXXXXXXXXXXXXXXXXXX
    whirr.cluster-user=whirr
    whirr.private-key-file=/home/ubuntu/.ssh/yourKey
    whirr.public-key-file=/home/ubuntu/.ssh/yourKey.pub
  • Now let me tell you how to avoid headaches!
    • Cluster name: keep it simple. Avoid names like testCluster or testCluster1, i.e. no capital letters and no digits.
    • Decide judiciously on the number of datanodes you want.
    • The launch may fail if Java is not installed, so make sure the image has Java. This properties file takes care of that.
    • It is better to go ahead with MRv1 for now and switch to MRv2 later, when a production-stable release is available.
    • This is the minimal set of configurations for launching a Hadoop cluster, but you can do a lot of performance tuning on top of it.
    • I launched this cluster from an EC2 instance. Initially I faced errors regarding the user; setting the configuration below solved the problem.
    • whirr.cluster-user=whirr
    • Set proper permissions on ~/.ssh and the whirr-0.8.1 folder before launching.
  • Well, we are ready to launch the cluster. Save the properties file as ‘whirr_cdh.properties’ in the whirr-0.8.1 folder.
  • $ cd whirr-0.8.1
    $ bin/whirr launch-cluster --config whirr_cdh.properties
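
Before running the launch command, a few of the headaches listed above can be caught with a small pre-flight check. The following is only a sketch: the function names (valid_cluster_name, prop_value, instance_count, preflight) are made up for illustration and are not part of Whirr, and the script assumes a simple key=value properties file like the one shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers for a Whirr properties file.
# None of these functions are part of Whirr; the names are illustrative only.
LC_ALL=C; export LC_ALL   # make the [a-z] range below mean ASCII lowercase only

# Succeeds only if the name consists of lowercase letters (no caps, no digits).
valid_cluster_name() {
  case "$1" in
    "" | *[!a-z]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Prints the value of key $2 in properties file $1 (last uncommented match).
prop_value() {
  awk -F= -v k="$2" '$1 == k { sub(/^[^=]*=/, ""); v = $0 } END { print v }' "$1"
}

# Prints the total number of instances requested by a
# whirr.instance-templates value such as "1 a+b,3 c+d".
instance_count() {
  printf '%s\n' "$1" | tr ',' '\n' | awk '{ n += $1 } END { print n + 0 }'
}

# Runs the checks against a properties file and reports problems on stderr.
preflight() {
  props="$1"
  status=0
  name=$(prop_value "$props" whirr.cluster-name)
  valid_cluster_name "$name" || { echo "bad cluster name: $name" >&2; status=1; }
  for key in whirr.private-key-file whirr.public-key-file; do
    file=$(prop_value "$props" "$key")
    [ -f "$file" ] || { echo "missing $key: $file" >&2; status=1; }
  done
  echo "instances requested: $(instance_count "$(prop_value "$props" whirr.instance-templates)")"
  return "$status"
}
```

After sourcing the script, "preflight whirr_cdh.properties" would report a bad cluster name or a missing key file, and print how many EC2 instances the template asks for, before any instance is started.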

The console output shows links to the NameNode and JobTracker web UIs. At the end, it also prints how to ssh to the instances.

  • By now, the cluster files should have been generated. In ~/.whirr/testcluster you will see these files: instances, hadoop-proxy.sh and hadoop-site.xml
  • Start the proxy
  • $ cd ~/.whirr/testcluster
    $ sh hadoop-proxy.sh
  • Open another terminal and point Hadoop at the generated configuration directory (the directory, not the hadoop-site.xml file itself). You should then be able to access HDFS.
  • $ export HADOOP_CONF_DIR=~/.whirr/testcluster
    $ hadoop fs -ls /
  • Alternatively, you can download the Hadoop tarball and run, from its folder,
  • $ bin/hadoop --config ~/.whirr/testcluster fs -ls /
  • Okay! I know you will not be satisfied until you see a web UI.
  • Now, launch Firefox (version 3.0 or later).
    Download the FoxyProxy extension from this link: https://addons.mozilla.org/en-US/firefox/addon/2464.
    Steps to configure FoxyProxy and access the UI:
    Select Tools > FoxyProxy > Options.
    Click the “Add New Proxy” button.
    Select “Manual Proxy Configuration”.
    Enter “localhost” in the “Host or IP Address” field.
    Enter “6666” in the “Port” field.
    Click the “General” tab at the top of the dialog box.
    Enter “EC2” in the “Proxy Name” field.
    Click the “URL Patterns” tab at the top of the dialog box.
    Click the “Add New Pattern” button.
    Enter “EC2” in the “Pattern Name” field.
    Enter “*compute-1.amazonaws.com*, *.ec2.internal*, *.compute-1.internal*” in the “URL pattern” field (not case sensitive).
    Select the “Whitelist” and “Wildcards” radio buttons.
    Click the “OK” button to dismiss the new URL pattern dialog box.
    Click the “OK” button to dismiss the new proxy dialog box.
    Completely disable FoxyProxy for now.
    After closing, you should see 2 proxy names: default and EC2.
    Click “Use proxy EC2 for all URLs” in the FoxyProxy pop-up menu.
    Copy the JobTracker URL (it can be seen while running the proxy, ec2-***-**-***-**.********.amazonaws.com) and paste it into the browser.
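
If you would rather check the web UIs from the terminal before configuring a browser, something like the sketch below may help. It assumes that hadoop-proxy.sh is still running and that it listens as a SOCKS proxy on localhost:6666 (the port used in the FoxyProxy settings above), and that the UIs are on the usual MRv1 ports, 50030 for the JobTracker and 50070 for the NameNode. The function names and the JOBTRACKER_HOST and NAMENODE_HOST variables are placeholders of mine, not something Whirr provides.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assumes hadoop-proxy.sh is running and listening
# as a SOCKS proxy on localhost:6666 (the port used in the FoxyProxy setup).
PROXY_PORT="${PROXY_PORT:-6666}"

# Prints the URL of a web UI given a host name and a port.
ui_url() {
  printf 'http://%s:%s/\n' "$1" "$2"
}

# Prints the HTTP status code returned by a URL fetched through the proxy.
ui_status() {
  curl --silent --output /dev/null --write-out '%{http_code}\n' \
       --max-time 10 --socks5-hostname "localhost:$PROXY_PORT" "$1"
}

# Example (set the placeholder hosts to the names printed by Whirr):
#   ui_status "$(ui_url "$JOBTRACKER_HOST" 50030)"   # JobTracker UI
#   ui_status "$(ui_url "$NAMENODE_HOST" 50070)"     # NameNode UI
```

The --socks5-hostname option makes curl resolve the host name through the proxy, which matters here because the internal EC2 names are usually not resolvable from your own machine.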

So, we are good to go!

  • If you want to launch MRv2 (YARN) instead, use this properties file.
  • ## Cluster name goes here.
    whirr.cluster-name=yarncluster
     
    # Change the number of machines in the cluster here
    whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2 hadoop-datanode+yarn-nodemanager
     
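    # Install Java: choose ONE install function and comment out the other,
    # since setting the same key twice is ambiguous
    whirr.java.install-function=install_openjdk
    #whirr.java.install-function=install_oab_java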
     
    ## Install CDH4 Yarn
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    whirr.yarn.configure-function=configure_cdh_yarn
    whirr.yarn.start-function=start_cdh_yarn
    whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory
    whirr.env.REPO=cdh4
    whirr.env.MAPREDUCE_VERSION=2
     
    # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
    whirr.provider=aws-ec2
    whirr.hardware-id=c1.xlarge
     
    # Credentials should go here
    whirr.identity=XXXXXXXXXXXXXXXXX
    whirr.credential=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    whirr.cluster-user=whirr
    whirr.private-key-file=/home/ubuntu/.ssh/yourKey
    whirr.public-key-file=/home/ubuntu/.ssh/yourKey.pub

The rest of the process is the same!
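
As a rough sketch of that process for the YARN cluster, the wrapper below prints (or runs) the whirr command for a given action. It assumes the MRv2 properties were saved under the hypothetical name whirr_yarn.properties and that Whirr was extracted to ~/.whirr-style defaults as in this post, i.e. ~/whirr-0.8.1. run_whirr, WHIRR_HOME and DRY_RUN are names I made up for illustration. It also shows destroy-cluster, which is worth running when you are done, since EC2 instances are billed for as long as they run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the whirr command line; not part of Whirr.
WHIRR_HOME="${WHIRR_HOME:-$HOME/whirr-0.8.1}"   # assumed extraction directory
DRY_RUN="${DRY_RUN:-1}"                          # 1 = only print the command

# $1 = action (launch-cluster or destroy-cluster), $2 = properties file
run_whirr() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$WHIRR_HOME/bin/whirr $1 --config $2"
  else
    "$WHIRR_HOME/bin/whirr" "$1" --config "$2"
  fi
}

# Example:
#   run_whirr launch-cluster whirr_yarn.properties    # start the YARN cluster
#   run_whirr destroy-cluster whirr_yarn.properties   # shut it down afterwards
```

With DRY_RUN left at 1 the function only echoes the command, so you can check the paths first and then set DRY_RUN=0 to actually launch.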

Happy Learning!
 

Reference: Hadoop Hangover: Launch a hadoop cluster CDH4 using Apache Whirr from our JCG partner Swathi V at the * Techie(S)pArK * blog.




