
Install and Configure Apache Hadoop to Run in a Pseudo-Distributed Mode

This blog entry describes how we can install Apache Hadoop 1.0.3 from a binary distribution on a computer that runs a Unix-like operating system. We also learn how we can configure Apache Hadoop to run in a pseudo-distributed mode. Finally, we learn how we can start and stop the Hadoop daemons.

Installing Apache Hadoop from a Binary Distribution

This section describes how we can install Apache Hadoop 1.0.3 from a binary distribution on a computer running a Unix-like operating system. However, these instructions can also be applied when we are installing Apache Hadoop on a computer running Windows. Also, we should remember that the Windows version of Apache Hadoop should be used only for development purposes because the distributed operation mode has not been tested on Windows.

Requirements

Before we can start the installation of Apache Hadoop, we have to ensure that the software required by Apache Hadoop is installed. These requirements are described in the following, and a quick way to verify them is given at the end of this section:

  • Java 1.6.X must be installed. Oracle’s version is preferred.
  • SSH must be installed and sshd must be running if the pseudo-distributed mode is used.

Note: If Apache Hadoop is installed on a computer that runs the Windows operating system, Cygwin must be installed. Check out the installation instructions of Cygwin for more details. If the pseudo-distributed mode is used, openssh must be selected from the Net category during the Cygwin installation. After Cygwin has been installed, set up an SSH server by running the command ssh-host-config -y at the command prompt.
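
Before we start the installation, we can verify that these requirements are fulfilled. A minimal sanity check on a Unix-like system looks as follows (the exact output naturally depends on the installed versions):

java -version
ssh -V
ps -ef | grep sshd

The first command must report a 1.6.x Java version, and the last command must show a running sshd process if we want to use the pseudo-distributed mode.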

Installation from a Binary Package

We can install Apache Hadoop 1.0.3 from a binary package by following these steps:

  1. Download the binary distribution of Apache Hadoop 1.0.3
  2. Unpack the binary distribution
  3. Set the environment variables

Our first step is to download the binary distribution and unpack it. We can do this by running the following commands at the command prompt:

wget www.us.apache.org/dist/hadoop/common/hadoop-1.0.3/hadoop-1.0.3-bin.tar.gz
tar xzf hadoop-1.0.3-bin.tar.gz

After we have unpacked the binary distribution, we have to set the values of the required environment variables. These environment variables are described in the following, and a configuration sketch is given after the list:

  • JAVA_HOME describes the home directory of the used JDK. We must use the conf/hadoop-env.sh file for setting the value of this environment variable. This file can be used to set other environment variables as well.
  • HADOOP_HOME describes the home directory of our Apache Hadoop installation.
  • HADOOP_LOG_DIR describes the directory in which the log files of Apache Hadoop are written. The default log directory is ${HADOOP_HOME}/logs.
  • PATH describes the directories where executable binaries are located. We should add the ${HADOOP_HOME}/bin directory to our path.
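
The following sketch shows one way to set these environment variables. The directories used below are only examples and must be replaced with the real paths of our own environment. First, we set the JAVA_HOME environment variable in the conf/hadoop-env.sh file:

# conf/hadoop-env.sh (the JDK directory is an example)
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

Second, we set the other environment variables in our shell profile (for example, in the ~/.bashrc file):

# ~/.bashrc (the installation directory is an example)
export HADOOP_HOME=$HOME/hadoop-1.0.3
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PATH=$PATH:$HADOOP_HOME/bin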

Configuring Apache Hadoop’s Pseudo-Distributed Mode

By default, Apache Hadoop runs in a standalone mode that can be used for debugging purposes. However, it is also possible to run Apache Hadoop on a single node by using the so-called pseudo-distributed mode. This means that each Hadoop daemon (NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker) runs in a separate Java process. We can configure Apache Hadoop to run in a pseudo-distributed mode by following these steps:

  1. Configure SSHD
  2. Configure Apache Hadoop

Configuring SSHD

We have to ensure that Apache Hadoop can open SSH connections to localhost without using a password. We can test if this is possible by running the following command at the command prompt:

ssh localhost

If we have not already configured key-based authentication, our connection attempt should fail. We can fix this by following these steps:

  1. Create a new SSH key with empty password
  2. Add the created key to authorized keys

We can do this by running the following commands at the command prompt:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
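
If the passwordless connection still fails after these commands, the problem is often caused by too permissive file permissions; many SSH server configurations refuse key-based authentication if the .ssh directory or the authorized_keys file can be modified by other users. We might have to tighten the permissions by running the following commands at the command prompt:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys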

After this is done, we should verify that everything is working by trying to open an SSH connection to localhost without using a password. If everything works as expected, we are ready to move to the next step of the configuration process.

Configuring Apache Hadoop

There are three configuration files that we should be interested in when we are configuring Apache Hadoop to run in a pseudo-distributed mode. These configuration files are found in the conf directory, and they are described in the following:

  • core-site.xml contains the common configuration of Apache Hadoop, such as the default file system.
  • hdfs-site.xml contains the configuration of the HDFS daemons.
  • mapred-site.xml contains the configuration of the MapReduce daemons.

We can make the required changes to the default configuration of Apache Hadoop by following these steps:

  1. Set the default file system of Apache Hadoop.
  2. Set the replication factor of HDFS to one. This states that by default only one copy of each file is stored in HDFS.
  3. Configure the JobTracker.

Setting the Default File System

The configuration of Apache Hadoop's default file system must be placed in the conf/core-site.xml configuration file. We can configure the default file system by setting the value of the fs.default.name property. The content of the conf/core-site.xml configuration file looks as follows:

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>
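
By default, Apache Hadoop stores its runtime data under the directory configured by the hadoop.tmp.dir property, which points to a directory under /tmp. Because the contents of /tmp are often removed when the computer is rebooted, we might want to set this property as well. We can do this by adding the following property inside the <configuration> element of the conf/core-site.xml file (the directory used below is only an example):

     <property>
         <name>hadoop.tmp.dir</name>
         <value>/home/hadoop/hadoop-data</value>
     </property>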

Setting the Replication Factor of HDFS

The configuration settings of the HDFS daemons are found in the conf/hdfs-site.xml configuration file. We can set the replication factor of HDFS by setting the value of the dfs.replication property. The content of the conf/hdfs-site.xml file looks as follows:

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

Configuring the JobTracker

The configuration of the MapReduce daemons is found in the conf/mapred-site.xml configuration file. We can configure the JobTracker by setting the value of the mapred.job.tracker property. The content of the conf/mapred-site.xml file looks as follows:

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>

Running Apache Hadoop in Pseudo-Distributed Mode

When we are running Apache Hadoop in the pseudo-distributed mode for the first time, we have to format a new distributed file system before we can start the Apache Hadoop daemons. We can do this by running the following command at the command prompt:

hadoop namenode -format

That is it. We can now start using our Apache Hadoop instance in a pseudo-distributed mode. The essential scripts are described in the following, and a short usage sketch is given after the list:

  • bin/start-all.sh starts all Apache Hadoop daemons.
  • bin/stop-all.sh stops all Apache Hadoop daemons.
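
Because we added the bin directory of our Apache Hadoop installation to our PATH, we can run these scripts from any directory. A short usage sketch looks as follows: we start the daemons, verify that they are running by using the jps command of the JDK (it should list the NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker processes), and stop the daemons again.

start-all.sh
jps
stop-all.sh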

Note: If we are running Apache Hadoop on OS X Lion or Mountain Lion, we might have to make certain changes to the Hadoop configuration files because of a known JRE bug. The fix for this problem is described in the blog entry How To Set up Hadoop on OS X Lion 10.7.

Apache Hadoop provides web interfaces that can be used to monitor the current state of the cluster and to access its log information. These interfaces are described in the following:

  • The web interface of the NameNode displays the DataNodes of the cluster and provides some basic statistics about the cluster. It also provides a filesystem browser. This web interface is found at http://localhost:50070.
  • The web interface of the JobTracker can be used to track the progress of running jobs, and to check the statistics and logs of finished jobs. The default URL of this interface is http://localhost:50030.
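
After the Hadoop daemons have been started, we can also run a quick smoke test from the command prompt by copying a file to HDFS and listing the target directory. The file and directory names used below are only examples:

hadoop fs -mkdir /test
hadoop fs -put $HADOOP_HOME/conf/core-site.xml /test
hadoop fs -ls /test

If the last command lists the copied file, our pseudo-distributed setup is working as expected.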

Cool, But What Can We Do with This Thing?

The homepage of Apache Hadoop provides an answer to this question:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

WTF? So, what can we really do with this thing? I will answer this question in my next blog entry, which provides a concrete example that describes how we can create Hadoop MapReduce jobs with Spring Data Apache Hadoop.
