This tutorial describes how we can create a Hadoop MapReduce Job with Spring Data Apache Hadoop. As an example we will analyze the data of a novel called The Adventures of Sherlock Holmes and find out how many times the last name of Sherlock’s loyal sidekick Dr. Watson is mentioned in the novel.
Note: This blog entry assumes that we have already installed and configured an Apache Hadoop instance.
We can create a Hadoop MapReduce Job with Spring Data Apache Hadoop by following these steps:
- Get the required dependencies by using Maven.
- Create the mapper component.
- Create the reducer component.
- Configure the application context.
- Load the application context when the application starts.
These steps are explained in more detail in the following sections. We will also learn how we can run the created Hadoop job.
Getting the Required Dependencies with Maven
We can download the required dependencies with Maven by adding the dependency declarations of Spring Data Apache Hadoop and Apache Hadoop Core to our POM file. We can declare these dependencies by adding the following lines to our pom.xml file:
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-hadoop</artifactId>
<version>1.0.0.RELEASE</version>
</dependency>
<!-- Apache Hadoop Core -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.0.3</version>
</dependency>
Creating the Mapper Component
A mapper is a component that divides the original problem into smaller problems that are easier to solve. We can create a custom mapper component by extending the Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> class and overriding its map() method. The type parameters of the Mapper class are described in the following:
- KEYIN describes the type of the key that is provided as an input to the mapper component.
- VALUEIN describes the type of the value that is provided as an input to the mapper component.
- KEYOUT describes the type of the mapper component’s output key.
- VALUEOUT describes the type of the mapper component’s output value.
Each type parameter must implement the Writable interface. Apache Hadoop provides several implementations of this interface. A list of existing implementations is available in the API documentation of Apache Hadoop.
Our mapper processes the contents of the input file one line at a time and produces key-value pairs where the key is a single word of the processed line and the value is always one. Our implementation of the map() method has the following steps:
- Split the given line into words.
- Iterate through each word and remove all Unicode characters that are not either letters or numbers.
- Create an output key-value pair by calling the write() method of the Mapper.Context class and providing the required parameters.
The source code of the WordMapper class looks as follows:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer lineTokenizer = new StringTokenizer(line);
        while (lineTokenizer.hasMoreTokens()) {
            String cleaned = removeNonLettersOrNumbers(lineTokenizer.nextToken());
            word.set(cleaned);
            context.write(word, new IntWritable(1));
        }
    }

    /**
     * Replaces all Unicode characters that are not either letters or numbers with
     * an empty string.
     * @param original The original string.
     * @return A string that contains only letters and numbers.
     */
    private String removeNonLettersOrNumbers(String original) {
        return original.replaceAll("[^\\p{L}\\p{N}]", "");
    }
}
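To see what the tokenizing and cleaning steps produce without starting Hadoop, the same logic can be tried in a plain Java class. This is only a sketch: the MapperLogicDemo class and its keysFor() method are hypothetical names that are not part of the example application, and the sample sentence is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of the example application.
class MapperLogicDemo {

    // Same regular expression as in WordMapper: drops every character that is
    // not a Unicode letter (\p{L}) or a Unicode number (\p{N}).
    static String removeNonLettersOrNumbers(String original) {
        return original.replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Returns the keys that the map() logic would emit for one input line.
    static List<String> keysFor(String line) {
        List<String> keys = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            keys.add(removeNonLettersOrNumbers(tokenizer.nextToken()));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Prints [My, dear, Watson, said, he]
        System.out.println(keysFor("\"My dear Watson,\" said he."));
    }
}
```

Two side effects of this approach are worth knowing. A token such as Watson's is cleaned to Watsons, so it ends up under a different key than Watson. A token that contains only punctuation is cleaned to an empty string, which is still emitted as a key.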
Creating the Reducer Component
A reducer is a component that removes the unwanted intermediate values and passes forward only the relevant key-value pairs. We can implement our reducer by extending the Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> class and overriding its reduce() method. The type parameters of the Reducer class are described in the following:
- KEYIN describes the type of the key that is provided as an input to the reducer. The value of this type parameter must match the KEYOUT type parameter of the used mapper.
- VALUEIN describes the type of the value that is provided as an input to the reducer component. The value of this type parameter must match the VALUEOUT type parameter of the used mapper.
- KEYOUT describes the type of the output key of the reducer component.
- VALUEOUT describes the type of the output value of the reducer component.
Our reducer processes each key-value pair produced by our mapper and creates a key-value pair that contains the answer to our question. We can implement the reduce() method by following these steps:
- Verify that the input key contains the wanted word.
- If the key contains the wanted word, count how many times the word was found.
- Create a new output key-value pair by calling the write() method of the Reducer.Context class and providing the required parameters.
The source code of the WordReducer class looks as follows:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    protected static final String TARGET_WORD = "Watson";

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        if (containsTargetWord(key)) {
            // Sum the ones emitted by the mapper for this key.
            int wordCount = 0;
            for (IntWritable value: values) {
                wordCount += value.get();
            }
            context.write(key, new IntWritable(wordCount));
        }
    }

    private boolean containsTargetWord(Text key) {
        return key.toString().equals(TARGET_WORD);
    }
}
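The grouping and summing that happens between the mapper and the reducer can be imitated in memory with plain Java collections. The following is a rough sketch under that assumption and not how Hadoop implements the shuffle phase. The ReducerLogicDemo class and its methods are hypothetical names, and the input keys are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the example application.
class ReducerLogicDemo {

    static final String TARGET_WORD = "Watson";

    // Groups the emitted keys and sums a one for each occurrence. This stands
    // in for the grouping done by Hadoop plus the summing loop of reduce().
    static Map<String, Integer> countByKey(List<String> mapperKeys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : mapperKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Keeps only the entry whose key equals the target word, which mirrors
    // the containsTargetWord() check of WordReducer.
    static Map<String, Integer> reduce(List<String> mapperKeys) {
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : countByKey(mapperKeys).entrySet()) {
            if (entry.getKey().equals(TARGET_WORD)) {
                output.put(entry.getKey(), entry.getValue());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Prints {Watson=2}
        System.out.println(reduce(Arrays.asList("Holmes", "Watson", "said", "Watson", "watson")));
    }
}
```

Because the comparison uses equals(), it is case-sensitive: the lowercase key watson in the sample input is not counted.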
Configuring the Application Context
Because the version of Spring Data Apache Hadoop used in this blog entry does not support Java configuration, we have to configure the application context of our application by using XML. We can do this by following these steps:
- Create a properties file that contains the values of configuration properties.
- Configure a property placeholder that fetches the values of configuration properties from the created property file.
- Configure Apache Hadoop.
- Configure the executed Hadoop job.
- Configure the job runner that runs the created Hadoop job.
Creating the Properties File
Our properties file contains the values of our configuration parameters. We can create this file by following these steps:
- Specify the value of the fs.default.name property. The value of this property must match the configuration of our Apache Hadoop instance.
- Specify the value of the mapred.job.tracker property. The value of this property must match the configuration of our Apache Hadoop instance.
- Specify the value of the input.path property.
- Add the value of the output.path property to the properties file.
The contents of the application.properties file look as follows:
mapred.job.tracker=localhost:9001
input.path=/input/
output.path=/output/
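The values of this file are later referenced in the XML configuration with the ${...} syntax. As a rough illustration of the idea, the sketch below loads the same key-value pairs with java.util.Properties and replaces placeholders in a string. It is not Spring's implementation; the PlaceholderDemo class and its methods are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the example application or Spring.
class PlaceholderDemo {

    // Matches ${some.key} and captures the key.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Parses properties file content given as a string.
    static Properties load(String content) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Replaces every ${key} in the template with the matching property value.
    static String resolve(String template, Properties properties) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer resolved = new StringBuffer();
        while (matcher.find()) {
            String value = properties.getProperty(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for " + matcher.group(1));
            }
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        Properties properties = load(
                "mapred.job.tracker=localhost:9001\ninput.path=/input/\noutput.path=/output/\n");
        // Prints mapred.job.tracker=localhost:9001
        System.out.println(resolve("mapred.job.tracker=${mapred.job.tracker}", properties));
    }
}
```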
Configuring the Property Placeholder
We can configure the needed property placeholder by adding the following element to the applicationContext.xml file:
Configuring Apache Hadoop
We can use the configuration namespace element for providing configuration parameters to Apache Hadoop. In order to execute our job by using our Apache Hadoop instance, we have to configure the default file system and the JobTracker. We can configure the default file system and the JobTracker by adding the following element to the applicationContext.xml file:
<hdp:configuration>
    fs.default.name=${fs.default.name}
    mapred.job.tracker=${mapred.job.tracker}
</hdp:configuration>
Configuring the Hadoop Job
We can configure our Hadoop job by following these steps:
- Configure the input path that contains the input files of the job.
- Configure the output path of the job.
- Configure the name of the main class.
- Configure the name of the mapper class.
- Configure the name of the reducer class.
Note: If the configured output path exists, the execution of the Hadoop job fails. This is a safety mechanism that ensures that the results of a MapReduce job cannot be overwritten accidentally.
We have to add the following job declaration to our application context configuration file:
<hdp:job
    input-path="${input.path}"
    output-path="${output.path}"
    jar-by-class="net.petrikainulainen.spring.data.apachehadoop.Main"
    mapper="net.petrikainulainen.spring.data.apachehadoop.WordMapper"
    reducer="net.petrikainulainen.spring.data.apachehadoop.WordReducer"/>
Configuring the Job Runner
The job runner is responsible for executing the jobs after the application context has been loaded. We can configure our job runner by following these steps:
- Configure the job runner.
- Configure the executed jobs.
- Configure the job runner to run the configured jobs when it is started.
The declaration of our job runner bean looks as follows:
Loading the Application Context When the Application Starts
We can execute the created Hadoop job by loading the application context when our application is started. We can do this by creating a new ClassPathXmlApplicationContext object and providing the name of our application context configuration file as a constructor parameter. The source code of our Main class looks as follows:
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Main {

    public static void main(String[] arguments) {
        // Loading the context is enough: the job runner starts the configured job.
        ApplicationContext ctx = new ClassPathXmlApplicationContext("applicationContext.xml");
    }
}
Running the MapReduce Job
We have now learned how we can create a Hadoop MapReduce job with Spring Data Apache Hadoop. Our next step is to execute the created job. The first thing we have to do is to download The Adventures of Sherlock Holmes. We must download the plain text version of this novel manually since the website of Project Gutenberg is blocking download utilities such as wget.
After we have downloaded the input file, we are ready to run our MapReduce job. We can run the created job by starting our Apache Hadoop instance in pseudo-distributed mode and following these steps:
- Upload our input file to HDFS.
- Run our MapReduce job.
Uploading the Input File to HDFS
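Our next step is to upload our input file to HDFS. Assuming that the downloaded file is named pg1661.txt and is found in the current directory, we can do this by running the following command at the command prompt:
hadoop dfs -put pg1661.txt /input/pg1661.txt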
We can check that everything went fine by running the following command at the command prompt:
hadoop dfs -ls /input
If the file was uploaded successfully, we should see the following directory listing:
-rw-r--r-- 1 xxxx supergroup 594933 2012-08-05 12:07 /input/pg1661.txt
Running Our MapReduce Job
We have two alternative methods for running our MapReduce job:
- We can execute the main() method of the Main class from our IDE.
- We can build a binary distribution of our example project by running the command mvn assembly:assembly at the command prompt. This creates a zip package in the target directory. We can run the created MapReduce job by unzipping this package and using the provided startup scripts.
Note: If you are not familiar with the Maven assembly plugin, you might want to read my blog entry that describes how you can create a runnable binary distribution with the Maven assembly plugin.
The outcome of our MapReduce job does not depend on the method that is used to run it. The output of our job should be written to the configured output directory of HDFS.
Note: If the execution of our MapReduce job fails because the output directory exists, we can delete the output directory by running the following command at the command prompt:
hadoop dfs -rmr /output
We can check the output of our job by running the following command at the command prompt:
hadoop dfs -ls /output
This command lists the files found in the /output directory of HDFS. If everything went fine, we should see a similar directory listing:
-rw-r--r-- 3 xxxx supergroup 0 2012-08-05 12:31 /output/_SUCCESS
-rw-r--r-- 3 xxxx supergroup 10 2012-08-05 12:31 /output/part-r-00000
Now we will finally find out the answer to our question. We can get the answer by running the following command at the command prompt:
hadoop dfs -cat /output/part-r-00000
If everything went fine, we should see the following output:
Watson	81
We now know that the last name of Dr. Watson was mentioned 81 times in the novel The Adventures of Sherlock Holmes.
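If we want to sanity check this number without Hadoop, the same tokenizing and cleaning rules can be applied to a local copy of the text. The sketch below is one way to do it, assuming that the input fits comfortably on a single machine. The LocalWordCountCheck class is a hypothetical name and the sample text in main() is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of the example application.
class LocalWordCountCheck {

    // Counts the tokens that equal the target word after the same cleaning
    // step that WordMapper applies (only letters and numbers are kept).
    static int count(Reader input, String targetWord) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    String cleaned = tokenizer.nextToken().replaceAll("[^\\p{L}\\p{N}]", "");
                    if (cleaned.equals(targetWord)) {
                        count++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "\"Watson,\" said Holmes.\nDr. Watson nodded; Watson's pipe was lit.";
        // Prints 2 because Watson's is cleaned to Watsons and is not counted.
        System.out.println(count(new StringReader(sample), "Watson"));
    }
}
```

To check the real input file, we could pass a reader created with java.nio.file.Files.newBufferedReader() for the downloaded text file instead of the StringReader.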
What is Next?
My next blog entry about Apache Hadoop describes how we can create a streaming MapReduce job by using Hadoop Streaming and Spring Data Apache Hadoop.
PS. A fully functional example application that was described in this blog entry is available on GitHub.


Konfusius August 27, 2012 at 9:22 am
My background: I know Maven and did some Projects at the university, but I’m new to Spring and Hadoop, so I needed to take a closer look at the spring configuration part.
Here are a few questions and hints which might improve your great article:
– “…Holmes and find out how many the last name of Sherlock’s…” -> “…Holmes and find out how many ”’times”’ the last name of Sherlock’s…”
– maybe create the Maven project with the archetype:generate command, and add the dependencies afterwards
mvn archetype:generate \
-DarchetypeArtifactId=maven-archetype-quickstart \
-DgroupId=com.company.division \
-DartifactId=appName \
-Dversion=1.0-SNAPSHOT \
-Dpackage=com.company.division.appName
– Creating the Mapper Component: You could add a note that we will configure the location of the classes in the application.xml afterwards + a proposal to create a package for those classes
– Path for the files applicationContext.xml and application.properties: You could add a note that the path to the applicationContext.xml will be configured in the Main class afterwards and that the convention is that the file goes into /src/main/resources/META-INF/applicationContext.xml
Links to basic explanations for the bare minimum of an applicationContext skeleton might be useful, including the needed xmlns definitions.
– funny typo: the “napper class”
– for the “hadoop dfs –ls /input” command you have used a “–” instead of “-” (I don’t know the english terms) which will give copypasting readers some headache. (same for “hadoop dfs –cat”)
– A link to instructions for the assembly target would be useful.
– Add the instruction to execute the command with Maven:
mvn exec:java -Dexec.mainClass="com.company.division.appName.Main"
Thank you for the great Tutorial
Petri August 27, 2012 at 9:47 am
Hi Konfusius,
Thank you for your comment. I appreciate that you took the time to write me a note about these errors and improvement ideas. I fixed the typos and I will also make changes to the article (and code) later.
Saurabh October 23, 2012 at 1:10 pm
Hi Petri,
Thanks for this very helpful tutorial. I built a sample using Spring-Hadoop based on the steps you suggested. For running the job, I am using the binary distribution mechanism and running the job using the startup script. One thing that I noted was that when running the job in this way, I don’t see the job appearing on the JobTracker user interface. Any ideas why?
Also, the logs show, “Unable to load native-hadoop library for your platform… using builtin-java classes where applicable”. Will you be able to provide any elaboration on this.
Thanks again.
Petri October 23, 2012 at 1:45 pm
Hi Saurabh,
It is nice to hear that you find the tutorial helpful.
I have a few questions concerning your problem with the JobTracker UI:
The log line you mentioned is written to the log when the native Hadoop libraries are not found. Check the Native Hadoop Libraries section of the Hadoop documentation for more information.
Petri February 26, 2013 at 6:52 pm
It seems that if a JobTracker is not configured, the job will not be visible in the JobTracker user interface.
Unmesh Upadhye November 3, 2012 at 11:14 am
Dear Sir
It gives me more confidence to work on hadoop pls. send me more information regarding
hadoop.
regards
Unmesh
Petri November 4, 2012 at 7:11 pm
Hi Unmesh,
It is nice to know that this blog entry was useful to you. Also, thanks for pointing out that you want to read more about Apache Hadoop.
Amar December 6, 2012 at 8:27 am
Hey Petri,
Thanks for a great article.
I have executed your project (got it from github). it executed successfully. I used gradle to build the project.
But surprised to see no output directory created. I have input directory and data in HDFS. Can you please help me out. Tried many things like changed Mapper and reducer. Hadoop parameters are correct as per my cluster.
Hoping for your quick response,
Amar
Petri December 6, 2012 at 10:52 am
Hi Amar,
I have got a few questions for you:
Unfortunately I don’t have access to a Hadoop cluster but I have got a local installation of Hadoop 1.0.3 which runs in the pseudo-distributed mode. If you could answer to my email and send me your Gradle build script, I can test it in my local environment.
Amar December 6, 2012 at 3:23 pm
Hi Petri,
Really appreciate quick response. I am using Hadoop 1.0.3 version only. Yes I am using Spring data hadoop 1.0.0. Also I did try it using maven but no luck :(. Surprising thing is it get executed successfully without any error. If I remove hadoop bean and a driver class then it works properly. Am I missing any configuration stuff. My application context is:
fs.default.name=hdfs://Ubuntu05:54310
simple job runner
Configures the reference to the actual Hadoop job.
I have hard coded some properties. Also tried with a property file. But no luck at all.
Thanks,
Amar
Petri December 6, 2012 at 3:45 pm
Hi Amar,
It seems that the configuration of the job runner has changed between 1.0.0.M2 (The Spring Data Apache Hadoop version used in this blog entry) and 1.0.0.RC1. Have you checked out the instructions provided in the Running a Hadoop Job section of the Spring Data Apache Hadoop reference documentation?
Amar December 10, 2012 at 9:09 am
Hi Petri,
Changing the version to M2 worked. I heartily appreciate your help.
Thanks,
Amar
Petri December 10, 2012 at 9:58 am
Amar,
It is good to hear that you were able to solve your problem. I will update this blog entry and the example application when I have got time to do it.
ArunDhaJ December 11, 2012 at 7:15 am
Hi,
I’m trying out spring-data-hadoop and running it as a webapplication on tomcat. I configured everything and when running the Job from my servlet I get the following exception
SEVERE: PriviledgedActionException as:tomcat7 cause:java.io.IOException: Failed to set permissions of path: /hadoop_ws/mapred/staging/tomcat71391258236/.staging to 0700
here hadoop.tmp.dir=/hadoop_ws
I know tomcat7 user doesn’t have access to the above directory, but I’m not sure how to pass this exception.
I tried the following, but no luck:
1. started tomcat as the user that have permission to /hadoop_ws directory. I changed TOMCAT7_USER and TOMCAT7_GROUP
2. hadoop dfs -chmod -R /hadoop_ws
3. changed mapreduce.jobtracker.staging.root.dir to different folder and set 777 permission.
None of the above approaches worked. All the examples I find on the internet either configure the MapReduce jobs in XML as in this post, or are standalone applications that run as the logged-in user.
Any help highly appreciated.
Thanks.
juan pablo December 13, 2012 at 11:04 pm
run: hadoop fs -chmod -R 777 /
Regards,
JP
Rafa February 3, 2013 at 8:21 pm
Hello Petri,
I followed your nice tutorial but with a change.
I used spring hadoop 1.0.0.RC2.
I needed a change in the JobRunner definition in the Spring config file. I added
otherwise no output directory was created.
I’ll go to your next blog entry.
Rafa
Petri February 3, 2013 at 8:34 pm
Hi Rafa,
Thank you for your comment.
As you found out, the configuration of the job runner bean has changed between the 1.0.0.M2 and 1.0.0.RC2 versions of Spring Data Hadoop.
I was supposed to update these blog entries but I have been busy with other activities. Thank you for pointing out that I should get it done as soon as possible.
Petri February 9, 2013 at 2:49 pm
I updated the Spring Data Apache Hadoop version to 1.0.0.RC2.
Amar February 14, 2013 at 11:54 am
Hi Petri,
I am doing loading data in bulk to hbase with Hbase MapReduce. Here I can configure HFileOuputFormat. Is there any way to configure same with spring application context?
Hoping for your quick response,
Amar
Petri February 16, 2013 at 11:19 pm
I have not personally used HBase and that is why I cannot give a definitive answer to your question. However, the following resources might be useful to you:
Bill February 26, 2013 at 1:45 am
Hi Petri,
Great tutorial!!!
I followed the steps and works locally perfectly with startup.sh.
When deploy in a master-slave cluster and run $>hadoop jar mapreduce.jar
the job starts, the tasks start on both nodes but in the map phase I got:
org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201302251437_0004_m_000001_0: java.lang.RuntimeException: java.lang.ClassNotFoundException: net.petrikainulainen.spring.data.apachehadoop.WordMapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Any idea?
Thanks
Petri February 26, 2013 at 9:57 am
Hi Bill,
thank you for your comment.
The root cause of your problem is that the mapper class is not found from the classpath. Did you use my example application or did you create your own application?
I would start by checking out that the configuration of your job is correct (check that the mapper and reducer classes are found). Also, if you created your own application, you could try to test my example application and see if it is working. I have tested it by running Hadoop in a pseudo distributed mode and it would be interesting to hear if it works correctly in your environment.
Bill February 26, 2013 at 5:28 pm
Hi again Petri,
Thanks for your reply.
I am using your example. I have a linux host with two virtual machines(Vmware)
one as the master node and once as a slave node, cluster tested successful.
My steps below:
#HOST
1) Configure application.properties
fs.default.name=hdfs://master:54310
input.path=/user/billbravo/gutenberg
output.path=/user/billbravo/gutenberg/output
2) Make assembly(Yours example of course)
$host>mvn assembly:assembly
3) Run example locally(Successful)
$host>unzip target/mapreduce-bin.zip
$host>cd mapreduce
$host>sh startup.sh
4) Copy to master node
$host>scp mapreduce-bin.zip billbravo@master:
5) Run in the cluster
$host>ssh billbravo@master
#MASTER NODE
$master>unzip mapreduce-bin.zip
$master>cd mapreduce
$master>hadoop jar mapreduce.jar
In this step occurs the previously comment error:
org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201302251437_0004_m_000001_0: java.lang.RuntimeException: java.lang.ClassNotFoundException: net.petrikainulainen.spring.data.apachehadoop.WordMapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
My goal is to understand how to make a Hadoop application (with Spring, of course), run it locally for debugging and then deploy it to a remote cluster like Amazon EMR.
Thanks again for your attention
:)
2) Copy assembly to master node(The cluster has been tested previously successful)
$scp target/mapreduce-bin.zip billbravo@master:
Petri February 26, 2013 at 7:25 pm
Hi Bill,
you are welcome (about my attention). These kinds of puzzles are always nice because I usually learn a lot of things by solving them. :)
I found the problem. I updated this blog entry and made two changes to my example:
I also updated Spring Data Apache Hadoop to version 1.0.0.RELEASE.
This should solve your problem.
Bill February 27, 2013 at 12:59 am
Hi Petri,
It works perfectly!
I found a work around copying mapreduce.jar to the Distributed Cache
and using the option hdp:cache in applicationContext.xml:
But your solution is cleaner.
Greetings
Petri February 27, 2013 at 9:33 am
Hi Bill,
It is good to hear that this solved your problem. :)
theCaffebig March 22, 2013 at 2:18 pm
great tutorial. Many thanks.
Could you please provide some information around configuring org.apache.avro.mapred.AvroJob.
My mapred job uses AvroInputformat.
Any information around configuring AvroJob using Spring Data would be very helpful.
Thanks
Petri March 23, 2013 at 1:40 pm
Thank you for your comment. It is nice to hear that this blog post was useful to you.
Also, thank you for providing me an idea for a new blog post. I have added your idea to my to-do list and I will probably write something about it after I have finished my Spring Data Solr tutorial (four blog posts left).
Subha April 4, 2013 at 11:17 am
Hi Petri,
I was following your tutorial step by step.i have hadoop version 0.20.0 installed.but running the program it is throwing class not found exception for the mapper class.
can you please help me.i am new to this technology.
Petri April 6, 2013 at 11:13 am
The requirements of Spring Data Apache Hadoop are described in its reference manual. It contains the following paragraph:
Spring for Apache Hadoop requires JDK level 6.0 (just like Hadoop) and above, Spring Framework 3.0 (3.2 recommended) and above and Apache Hadoop 0.20.2 (1.0.4 recommended) and above. SHDP supports and is tested daily against various Hadoop distributions, such as Cloudera CDH3 (CD3u5) and CDH4 (CDH4.1u3 MRv1) distributions and Greenplum HD (1.2). Any distro compatible with Apache Hadoop 1.0.x should be supported.
I am not sure what is the difference between 0.20.2 (minimum supported version) and 0.20.0 (your version) but it might not be possible to use Spring Data Apache Hadoop with Hadoop 0.20.0. Did you try to run the example application of this blog post or did you create your own application?
subha April 8, 2013 at 4:12 pm
Hi Petri,
I have Hadoop 0.20.2 (sorry for the wrong information). I am running this example in the Eclipse IDE. When I run it against local files on my disk, it works fine. When I run the same MapReduce job against HDFS, it throws a mapper class not found exception.
Check the link below for the log.
http://stackoverflow.com/questions/15786155/sppring-data-hadoop-map-reduce-job-thrwoing-no-class-found-exception?noredirect=1#comment22445005_15786155
Petri April 8, 2013 at 8:29 pm
This happened to me as well and I had to make some changes to my example application. These changes are described in this comment. Compare these files with the configuration files of your project:
I hope that this solves your problem.
subha April 9, 2013 at 9:20 am
i have the same files in my project and using spring-data-hadoop 1.0.0 release version.still facing the problem :( .Is there any other files i need to change..??
Petri April 9, 2013 at 10:04 am
This commit fixed the problem in my example application. As you can see, I made changes only to the files mentioned in my previous comment (and updated the version number of Spring Data Apache Hadoop to the pom.xml).
Is there any chance that I can see your code? I cannot figure out what is wrong and seeing the code would definitely help.
subha April 9, 2013 at 11:39 am
Thanks for your inputs.
I have commented out mapred.job.tracker=${mapred.job.tracker} in applicationContext.xml and now it is working fine, but I don't know why this is happening. Do you know the reason?
Petri April 9, 2013 at 12:30 pm
The reason for this might be that when the job tracker is not set, Hadoop will use the default job tracker which could be the local one. In other words, your map reduce job might not be executed in the Hadoop cluster. Check out this forum thread for more details about this.
Javier April 14, 2013 at 11:08 pm
Hello, Petri.
My current configuration includes one linux machine with the NameNode, TaskTracker and JobTracker daemons, another linux with a DataNode and a third Windows 7 machine with Eclipse/Netbeans for development.
At least with this configuration if you want to run the mapreduce job from the development machine it is mandatory to include the jar attribute in the applicationContext.xml file:
Other way the classes will not be available in the execution cluster. Hope it helps.
jv
Petri April 14, 2013 at 11:39 pm
Hi Javier,
Thank you for your comment.
Unfortunately Wordpress decided to remove the XML which you added to your comment (I should probably figure out if I can disable that feature since it is quite annoying).
Anyway, the configuration of my example application uses the jar-by-class attribute which did the trick when I last ran into this problem.
However, your comment made me check if you can explicitly specify the name of the jar file. I read the schema of the Spring for Apache Hadoop configuration and found out that you can do this by using the jar attribute.
This information is indeed useful because if the jar file of the map reduce job cannot be resolved when the jar-by-class attribute is used, you can always use the jar attribute for this purpose.
Again, thanks for pointing this out.