This blog entry describes how we can install Apache Hadoop 1.0.3 from a binary distribution on a computer that runs a Unix-like operating system. We also learn how we can configure Apache Hadoop to run in pseudo-distributed mode. Finally, we learn how we can start and stop the Hadoop daemons.
Installing Apache Hadoop from a Binary Distribution
This section describes how we can install Apache Hadoop 1.0.3 from a binary distribution on a computer running a Unix-like operating system. However, these instructions also apply when we are installing Apache Hadoop on a computer running Windows. We should remember that the Windows version of Apache Hadoop should be used only for development purposes, because the distributed operation mode has not been tested on Windows.
Before we can start the installation of Apache Hadoop, we have to ensure that the software required by Apache Hadoop is installed. These requirements are described in the following:
- Java 1.6.X must be installed. Oracle’s version is preferred.
- SSH must be installed and sshd must be running if the pseudo-distributed mode is used.
Note: If Apache Hadoop is installed on a computer that runs the Windows operating system, Cygwin must be installed. Check out the installation instructions of Cygwin for more details. If pseudo-distributed mode is used, openssh must be selected from the Net category during the Cygwin installation. After Cygwin is installed, set up an SSH server by running the command ssh-host-config -y at the command prompt.
Installation from a Binary Package
We can install Apache Hadoop 1.0.3 from a binary package by following these steps:
- Download the binary distribution of Apache Hadoop 1.0.3
- Unpack the binary distribution
- Set the environment variables
Our first step is to download the binary distribution and unpack it. We can do this by running the following commands at the command prompt:
wget http://www.us.apache.org/dist/hadoop/common/hadoop-1.0.3/hadoop-1.0.3-bin.tar.gz
tar xzf hadoop-1.0.3-bin.tar.gz
After we have unpacked the binary distribution, we have to set the values of the required environment variables. These environment variables are described in the following:
- JAVA_HOME describes the home directory of the used JDK. We must set the value of this environment variable in the conf/hadoop-env.sh file. This file can be used to set other environment variables as well.
- HADOOP_HOME describes the home directory of our Apache Hadoop installation.
- HADOOP_LOG_DIR describes the directory in which the log files of Apache Hadoop are written. The default log directory is ${HADOOP_HOME}/logs.
- PATH describes the directories where executable binaries are located. We should add the ${HADOOP_HOME}/bin directory to our path.
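The environment variables listed above can be sketched as a few lines in our shell startup file. The installation path below is an assumption; adjust it to match the directory where the tarball was unpacked. Remember that JAVA_HOME itself belongs in conf/hadoop-env.sh rather than the shell profile:

```shell
# Hypothetical paths -- adjust to the actual installation directory.
export HADOOP_HOME="$HOME/hadoop-1.0.3"       # where the tarball was unpacked
export HADOOP_LOG_DIR="${HADOOP_HOME}/logs"   # matches the default location
export PATH="${PATH}:${HADOOP_HOME}/bin"      # makes the hadoop command available

# JAVA_HOME must be set in ${HADOOP_HOME}/conf/hadoop-env.sh, for example:
# export JAVA_HOME=/usr/lib/jvm/java-6-oracle
```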
Configuring Apache Hadoop’s Pseudo-Distributed Mode
By default, Apache Hadoop runs in standalone mode, which can be used for debugging purposes. However, it is also possible to run Apache Hadoop on a single node by using the so-called pseudo-distributed mode. This means that each Hadoop daemon runs in a separate Java process. We can configure Apache Hadoop to run in pseudo-distributed mode by following these steps:
- Configure SSHD
- Configure Apache Hadoop
Configuring SSHD
We have to ensure that Apache Hadoop can open SSH connections to localhost without using a password. We can test whether this is possible by running the following command at the command prompt:
ssh localhost
If we have not already configured key-based authentication, our connection attempt should fail. We can fix this by following these steps:
- Create a new SSH key with empty password
- Add the created key to authorized keys
We can do this by running the following commands at the command prompt:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
After this is done, we should verify that everything is working by trying to open an SSH connection to localhost without using a password. If everything works as expected, we are ready to move on to the next step of the configuration process.
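We can also verify the setup non-interactively. The sketch below uses SSH's BatchMode option, which makes the connection fail outright instead of prompting for a password:

```shell
# Verify passwordless SSH: BatchMode forbids password prompts, so the
# connection fails immediately if key-based authentication is not working.
check_ssh() {
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null; then
        echo "passwordless SSH to $1: OK"
    else
        echo "passwordless SSH to $1: FAILED"
    fi
}

check_ssh localhost
```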
Configuring Apache Hadoop
There are three configuration files that we should be interested in when we are configuring Apache Hadoop to run in pseudo-distributed mode (the configuration files are found in the conf directory). These configuration files are described in the following:
- core-site.xml contains the properties of Apache Hadoop Core. These properties are common to both MapReduce and HDFS. A list of the available configuration properties can be found in Apache Hadoop’s documentation.
- hdfs-site.xml contains the properties of HDFS. Apache Hadoop’s documentation provides a list of the valid configuration properties.
- mapred-site.xml contains the properties of MapReduce. A full list of the configuration properties can be found in the documentation of Apache Hadoop.
We can make the required changes to the default configuration of Apache Hadoop by following these steps:
- Set the default file system of Apache Hadoop.
- Set the replication factor of HDFS to one. This states that by default only one copy of each file is stored in HDFS.
- Configure the JobTracker.
Setting the Default File System
The configuration of Apache Hadoop’s default file system must be placed in the conf/core-site.xml configuration file. We can configure the default file system by setting the value of the fs.default.name property. The content of the conf/core-site.xml configuration file looks as follows:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Setting the Replication Factor of HDFS
The configuration settings of the HDFS daemons are found in the conf/hdfs-site.xml configuration file. We can set the replication factor of HDFS by setting the value of the dfs.replication property. The content of the conf/hdfs-site.xml file looks as follows:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Configuring the JobTracker
The configuration of the MapReduce daemons is found in the conf/mapred-site.xml configuration file. We can configure the JobTracker by setting the value of the mapred.job.tracker property. The content of the conf/mapred-site.xml file looks as follows:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>
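If we prefer to script the configuration, the three files can also be written from the shell with here-documents. This is only a convenience sketch; it assumes the current directory is our Hadoop installation directory and writes exactly the three configurations described above:

```shell
# Sketch: write the three pseudo-distributed configuration files.
# Assumes the current directory is the Hadoop installation directory.
mkdir -p conf

cat > conf/core-site.xml <<'EOF'
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF

cat > conf/hdfs-site.xml <<'EOF'
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF

cat > conf/mapred-site.xml <<'EOF'
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>
EOF
```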
Running Apache Hadoop in Pseudo-Distributed Mode
When we are running Apache Hadoop for the first time in pseudo-distributed mode, we have to format a new distributed file system before we can start the Apache Hadoop daemons. We can do this by running the following command at the command prompt:
hadoop namenode -format
That is it. We can now start using our Apache Hadoop instance in pseudo-distributed mode. The essential scripts are described in the following:
- bin/start-all.sh starts all Apache Hadoop daemons.
- bin/stop-all.sh stops all Apache Hadoop daemons.
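After running bin/start-all.sh, we can confirm that all five daemons of a Hadoop 1.x pseudo-distributed cluster came up by inspecting the output of the jps command. The helper below is a hypothetical convenience function (not part of Hadoop) that checks a jps listing for the expected daemon names:

```shell
# Hypothetical helper: check the output of jps for the five Hadoop 1.x daemons.
check_daemons() {
    # $1: the output of the jps command
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # grep -w avoids matching NameNode inside SecondaryNameNode
        if echo "$1" | grep -qw "$daemon"; then
            echo "$daemon: running"
        else
            echo "$daemon: NOT running"
        fi
    done
}

# Typical usage after bin/start-all.sh:
# check_daemons "$(jps)"
```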
Note: If we are running Apache Hadoop on OS X Lion or Mountain Lion, we might have to make certain changes to the Hadoop configuration files because of a known JRE bug. A fix for this problem is described in the blog entry How To Set up Hadoop on OS X Lion 10.7.
Apache Hadoop provides web interfaces that can be used to monitor the current state of the cluster and to access its log information. These interfaces are described in the following:
- The web interface of the NameNode displays the DataNodes of the cluster and provides some basic statistics about the cluster. It also provides a filesystem browser. This web interface can be found at http://localhost:50070.
- The web interface of the JobTracker can be used to track the progress of running jobs, and to check the statistics and logs of finished jobs. The default URL of this interface is http://localhost:50030.
Cool, But What Can We Do with This Thing?
The homepage of Apache Hadoop provides an answer to this question:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
WTF? So, what can we really do with this thing? I will answer this question in my next blog entry, which has a concrete example describing how we can create Hadoop MapReduce jobs with Spring Data Apache Hadoop.