
Creating a Hadoop MapReduce Job with Spring Data Apache Hadoop


This tutorial describes how we can create a Hadoop MapReduce Job with Spring Data Apache Hadoop. As an example we will analyze the data of a novel called The Adventures of Sherlock Holmes and find out how many times the last name of Sherlock’s loyal sidekick Dr. Watson is mentioned in the novel.

Note: This blog entry assumes that we have already installed and configured the used Apache Hadoop instance.

We can create a Hadoop MapReduce Job with Spring Data Apache Hadoop by following these steps:

  1. Get the required dependencies by using Maven.
  2. Create the mapper component.
  3. Create the reducer component.
  4. Configure the application context.
  5. Load the application context when the application starts.

These steps are explained in more detail in the following sections. We will also learn how we can run the created Hadoop job.

Getting the Required Dependencies with Maven

We can download the required dependencies with Maven by adding the dependency declarations of Spring Data Apache Hadoop and Apache Hadoop Core to our POM file. We can declare these dependencies by adding the following lines to our pom.xml file:

<!-- Spring Data Apache Hadoop -->
<dependency>
	<groupId>org.springframework.data</groupId>
	<artifactId>spring-data-hadoop</artifactId>
	<version>1.0.0.RELEASE</version>
</dependency>
<!-- Apache Hadoop Core -->
<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-core</artifactId>
	<version>1.0.3</version>
</dependency>

Creating the Mapper Component

A mapper is a component that divides the original problem into smaller problems that are easier to solve. We can create a custom mapper component by extending the Mapper&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt; class and overriding its map() method. The type parameters of the Mapper class are described in the following:

  • KEYIN describes the type of the key that is provided as an input to the mapper component.
  • VALUEIN describes the type of the value that is provided as an input to the mapper component.
  • KEYOUT describes the type of the mapper component’s output key.
  • VALUEOUT describes the type of the mapper component’s output value.

Each type parameter must implement the Writable interface. Apache Hadoop provides several implementations of this interface. A list of the existing implementations is available in the API documentation of Apache Hadoop.
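
The following snippet is a minimal sketch that shows how the built-in implementations used in this tutorial (LongWritable, Text, and IntWritable) wrap plain Java values. The WritableExample class exists only for illustration and is not part of the example application.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableExample {

    public static void main(String[] args) {
        // Text wraps a UTF-8 string, such as a single word of a processed line.
        Text word = new Text("Watson");

        // IntWritable and LongWritable wrap primitive values, such as word
        // counts and the byte offsets of input lines.
        IntWritable count = new IntWritable(1);
        LongWritable offset = new LongWritable(42L);

        // The wrapped values can be read back by calling toString() and get().
        System.out.println(word.toString() + " " + count.get() + " " + offset.get());
    }
}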

Our mapper processes the contents of the input file one line at a time and produces key-value pairs in which the key is a single word of the processed line and the value is always one. Our implementation of the map() method has the following steps:

  1. Split the given line into words.
  2. Iterate through each word and remove all Unicode characters that are not either letters or numbers.
  3. Create an output key-value pair by calling the write() method of the Mapper.Context class and providing the required parameters.

The source code of the WordMapper class looks as follows:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer lineTokenizer = new StringTokenizer(line);
        while (lineTokenizer.hasMoreTokens()) {
            String cleaned = removeNonLettersOrNumbers(lineTokenizer.nextToken());
            word.set(cleaned);
            context.write(word, new IntWritable(1));
        }
    }

    /**
     * Replaces all Unicode characters that are not either letters or numbers with
     * an empty string.
     * @param original  The original string.
     * @return  A string that contains only letters and numbers.
     */
    private String removeNonLettersOrNumbers(String original) {
        return original.replaceAll("[^\\p{L}\\p{N}]", "");
    }
}

Creating the Reducer Component

A reducer is a component that removes the unwanted intermediate values and passes forward only the relevant key-value pairs. We can implement our reducer by extending the Reducer&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt; class and overriding its reduce() method. The type parameters of the Reducer class are described in the following:

  • KEYIN describes the type of the key that is provided as an input to the reducer. The value of this type parameter must match with the KEYOUT type parameter of the used mapper.
  • VALUEIN describes the type of the value that is provided as an input to the reducer component. The value of this type parameter must match with the VALUEOUT type parameter of the used mapper.
  • KEYOUT describes the type of the output key of the reducer component.
  • VALUEOUT describes the type of the output value of the reducer component.

Our reducer processes each key-value pair produced by our mapper and creates a key-value pair that contains the answer to our question. We can implement the reduce() method by following these steps:

  1. Verify that the input key contains the wanted word.
  2. If the key contains the wanted word, count how many times the word was found.
  3. Create a new output key-value pair by calling the write() method of the Reducer.Context class and providing the required parameters.

The source code of the WordReducer class is given in the following:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    protected static final String TARGET_WORD = "Watson";

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        if (containsTargetWord(key)) {
            int wordCount = 0;
            for (IntWritable value: values) {
                wordCount += value.get();
            }
            context.write(key, new IntWritable(wordCount));
        }
    }

    private boolean containsTargetWord(Text key) {
        return key.toString().equals(TARGET_WORD);
    }
}
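
If we want to verify the behavior of our mapper and reducer without starting an Apache Hadoop instance, one option is to write unit tests with the Apache MRUnit library. The following test class is only a sketch: it assumes that the MRUnit test dependency (with the artifact variant that matches our Hadoop version) and JUnit have been added to our POM file, and it is not part of the original example application.

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void mapperEmitsOneForEachWord() throws Exception {
        // One input line should produce one key-value pair per cleaned word.
        MapDriver.newMapDriver(new WordMapper())
                .withInput(new LongWritable(0), new Text("Watson, come here!"))
                .withOutput(new Text("Watson"), new IntWritable(1))
                .withOutput(new Text("come"), new IntWritable(1))
                .withOutput(new Text("here"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void reducerSumsCountsOfTargetWord() throws Exception {
        // The reducer should sum the counts of the target word "Watson".
        ReduceDriver.newReduceDriver(new WordReducer())
                .withInput(new Text("Watson"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("Watson"), new IntWritable(2))
                .runTest();
    }
}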

Configuring the Application Context

Because the version of Spring Data Apache Hadoop used in this tutorial does not support Java configuration, we have to configure the application context of our application by using XML. We can do this by following these steps:

  1. Create a properties file that contains the values of configuration properties.
  2. Configure a property placeholder that fetches the values of configuration properties from the created property file.
  3. Configure Apache Hadoop.
  4. Configure the executed Hadoop job.
  5. Configure the job runner that runs the created Hadoop job.

Creating the Properties File

Our properties file contains the values of our configuration parameters. We can create this file by following these steps:

  1. Specify the value of the fs.default.name property. The value of this property must match with the configuration of our Apache Hadoop instance.
  2. Specify the value of the mapred.job.tracker property. The value of this property must match with the configuration of our Apache Hadoop instance.
  3. Specify the value of the input.path property.
  4. Specify the value of the output.path property.

The contents of the application.properties file look as follows:

fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001

input.path=/input/
output.path=/output/

Configuring the Property Placeholder

We can configure the needed property placeholder by adding the following element to the applicationContext.xml file:

<context:property-placeholder location="classpath:application.properties" />

Configuring Apache Hadoop

We can use the configuration namespace element for providing configuration parameters to Apache Hadoop. In order to execute our job by using our Apache Hadoop instance, we have to configure the default file system and the JobTracker. We can configure the default file system and the JobTracker by adding the following element to the applicationContext.xml file:

<hdp:configuration>
    fs.default.name=${fs.default.name}
    mapred.job.tracker=${mapred.job.tracker}
</hdp:configuration>

Configuring the Hadoop Job

We can configure our Hadoop job by following these steps:

  1. Configure the input path that contains the input files of the job.
  2. Configure the output path of the job.
  3. Configure the name of the main class.
  4. Configure the name of the mapper class.
  5. Configure the name of the reducer class.

Note: If the configured output path exists, the execution of the Hadoop job fails. This is a safety mechanism that ensures that the results of a MapReduce job cannot be overwritten accidentally.

We have to add the following job declaration to our application context configuration file:

<hdp:job id="wordCountJob"
			input-path="${input.path}"
			output-path="${output.path}"
			jar-by-class="net.petrikainulainen.spring.data.apachehadoop.Main"
			mapper="net.petrikainulainen.spring.data.apachehadoop.WordMapper"
			reducer="net.petrikainulainen.spring.data.apachehadoop.WordReducer"/>

Configuring the Job Runner

The job runner is responsible for executing the jobs after the application context has been loaded. We can configure our job runner by following these steps:

  1. Configure the job runner.
  2. Configure the executed jobs.
  3. Configure the job runner to run the configured jobs when it is started.

The declaration of our job runner bean is given in the following:

<hdp:job-runner id="wordCountJobRunner" job-ref="wordCountJob" run-at-startup="true"/>
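
To make the overall structure of the configuration file easier to see, the following sketch shows how the snippets described above could fit together in a single applicationContext.xml file. The namespace declarations follow the standard Spring and Spring for Apache Hadoop XML schemas; verify them against the Spring Data Apache Hadoop version declared in your POM file.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
                           http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- Fetches the values of the configuration properties from the application.properties file. -->
    <context:property-placeholder location="classpath:application.properties" />

    <!-- Configures the default file system and the JobTracker of the used Apache Hadoop instance. -->
    <hdp:configuration>
        fs.default.name=${fs.default.name}
        mapred.job.tracker=${mapred.job.tracker}
    </hdp:configuration>

    <!-- Configures the executed Hadoop job. -->
    <hdp:job id="wordCountJob"
             input-path="${input.path}"
             output-path="${output.path}"
             jar-by-class="net.petrikainulainen.spring.data.apachehadoop.Main"
             mapper="net.petrikainulainen.spring.data.apachehadoop.WordMapper"
             reducer="net.petrikainulainen.spring.data.apachehadoop.WordReducer"/>

    <!-- Runs the configured job when the application context is loaded. -->
    <hdp:job-runner id="wordCountJobRunner" job-ref="wordCountJob" run-at-startup="true"/>
</beans>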

Loading the Application Context When the Application Starts

We can execute the created Hadoop job by loading the application context when our application is started. We can do this by creating a new ClassPathXmlApplicationContext object and providing the name of our application context configuration file as a constructor parameter. The source code of our Main class is given in the following:

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Main {
    public static void main(String[] arguments) {
        ApplicationContext ctx = new ClassPathXmlApplicationContext("applicationContext.xml");
    }
}
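
If we want the application to release the resources of the application context after the job has finished, we can close the context explicitly. The following variant is only a suggestion; the original Main class is kept as simple as possible.

import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Main {
    public static void main(String[] arguments) {
        // Loading the application context runs the configured job because the
        // run-at-startup attribute of the job runner is set to true.
        ClassPathXmlApplicationContext ctx =
                new ClassPathXmlApplicationContext("applicationContext.xml");

        // Close the application context after the job runner has finished.
        ctx.close();
    }
}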

Running the MapReduce Job

We have now learned how we can create a Hadoop MapReduce job with Spring Data Apache Hadoop. Our next step is to execute the created job. The first thing we have to do is to download The Adventures of Sherlock Holmes. We must download the plain text version of this novel manually since the website of Project Gutenberg is blocking download utilities such as wget.

After we have downloaded the input file, we are ready to run our MapReduce job. We can run the created job by starting our Apache Hadoop instance in a pseudo-distributed mode and following these steps:

  1. Upload our input file to HDFS.
  2. Run our MapReduce job.

Uploading the Input File to HDFS

Our next step is to upload our input file to HDFS. We can do this by running the following command at the command prompt:

hadoop dfs -put pg1661.txt /input/pg1661.txt

We can check that everything went fine by running the following command at the command prompt:

hadoop dfs -ls /input

If the file was uploaded successfully, we should see the following directory listing:

Found 1 items
-rw-r--r--   1 xxxx supergroup     594933 2012-08-05 12:07 /input/pg1661.txt

Running Our MapReduce Job

We have two alternative methods for running our MapReduce job:

  • We can execute the main() method of the Main class from our IDE.
  • We can build a binary distribution of our example project by running the command mvn assembly:assembly at the command prompt. This creates a zip package in the target directory. We can run the created MapReduce job by unzipping this package and using the provided startup scripts, as shown in the sketch below.
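
The following commands are only a sketch of the second approach. They assume that the assembly produces a file called mapreduce-bin.zip that contains a startup script named startup.sh, as in the example project.

# Unpack the binary distribution created by mvn assembly:assembly.
unzip target/mapreduce-bin.zip
cd mapreduce

# Run the MapReduce job by using the provided startup script.
sh startup.sh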

Note: If you are not familiar with the Maven assembly plugin, you might want to read my blog entry that describes how you can create a runnable binary distribution with the Maven assembly plugin.

The outcome of our MapReduce job does not depend on the method that is used to run it. The output of our job should be written to the configured output directory of HDFS.

Note: If the execution of our MapReduce job fails because the output directory exists, we can delete the output directory by running the following command at the command prompt:

hadoop dfs -rmr /output

We can check the output of our job by running the following command at the command prompt:

hadoop dfs -ls /output

This command lists the files found in the /output directory of HDFS. If everything went fine, we should see a directory listing similar to the following:

Found 2 items
-rw-r--r--   3 xxxx supergroup          0 2012-08-05 12:31 /output/_SUCCESS
-rw-r--r--   3 xxxx supergroup         10 2012-08-05 12:31 /output/part-r-00000

Now we will finally find out the answer to our question. We can get the answer by running the following command at the command prompt:

hadoop dfs -cat /output/part-r-00000

If everything went fine, we should see the following output:

Watson	81

We now know that the last name of Dr. Watson is mentioned 81 times in the novel The Adventures of Sherlock Holmes.

What is Next?

My next blog entry about Apache Hadoop describes how we can create a streaming MapReduce job by using Hadoop Streaming and Spring Data Apache Hadoop.

P.S. A fully functional example application described in this blog entry is available on GitHub.


About the Author

Petri Kainulainen is passionate about software development and continuous improvement. He specializes in software development with the Spring Framework and is the author of the Spring Data book.


Comments

  • My background: I know Maven and did some projects at university, but I’m new to Spring and Hadoop, so I needed to take a closer look at the Spring configuration part.

    Here are a few questions and hints which might improve your great article:
    – “…Holmes and find out how many the last name of Sherlock’s…” -> “…Holmes and find out how many ”’times”’ the last name of Sherlock’s…”

    – maybe create the Maven project with the archetype:generate command, and add the dependencies afterwards
    mvn archetype:generate \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DgroupId=com.company.division \
    -DartifactId=appName \
    -Dversion=1.0-SNAPSHOT \
    -Dpackage=com.company.division.appName

    – Creating the Mapper Component: You could add a note that we will configure the location of the classes in the application.xml afterwards + a proposal to create a package for those classes

    – Path for the files applicationContext.xml and application.properties: You could add a note that the path to the applicationContext.xml will be configured in the Main class afterwards and that the convention is that the file goes into /src/main/resources/META-INF/applicationContext.xml
    Links to basic explanations for the bare minimum of an applicationContext skeleton might be useful, including the needed xmlns definitions.

    – funny typo: the “napper class”

    – for the “hadoop dfs –ls /input” command you have used a “–” instead of “-” (I don’t know the english terms) which will give copypasting readers some headache. (same for “hadoop dfs –cat”)

    – A link to instructions for the assembly target would be useful.

    – Add the instruction to execute the command with Maven:
    mvn exec:java -Dexec.mainClass="com.company.division.appName.Main"

    Thank you for the great Tutorial

    Reply
    • Hi Konfusius,

      Thank you for your comment. I appreciate that you took the time to write me a note about these errors and improvement ideas. I fixed the typos and I will also make changes to the article (and code) later.

      Reply
  • Hi Petri,

    Thanks for this very helpful tutorial. I built a sample using Spring-Hadoop based on the steps you suggested. For running the job, I am using the binary distribution mechanism and running the job using the startup script. One thing that I noted was that when running the job in this way, I don’t see the job appearing on the JobTracker user interface. Any ideas why?

    Also, the logs show, “Unable to load native-hadoop library for your platform… using builtin-java classes where applicable”. Will you be able to provide any elaboration on this?

    Thanks again.

    Reply
  • Dear Sir
    It gives me more confidence to work on Hadoop. Please send me more information regarding
    Hadoop.
    regards
    Unmesh

    Reply
    • Hi Unmesh,

      It is nice to know that this blog entry was useful to you. Also, thanks for pointing out that you want to read more about Apache Hadoop.

      Reply
  • Hey Petri,

    Thanks for a great article.

    I have executed your project (got it from GitHub). It executed successfully. I used Gradle to build the project.

    But I was surprised to see no output directory created. I have the input directory and data in HDFS. Can you please help me out? I tried many things, like changing the mapper and reducer. The Hadoop parameters are correct as per my cluster.

    Hoping for your quick response,
    Amar

    Reply
    • Hi Amar,

      I have got a few questions for you:

      • You mentioned that you are using a Hadoop cluster to run the example project. Which Hadoop version are you using?
      • Have you tried updating the Spring Data Apache Hadoop version to 1.0.0.RC1 which is the newest available version?
      • Have you been trying to build the project by using Maven? If you have, did it behave in the same way?

      Unfortunately I don’t have access to a Hadoop cluster but I have got a local installation of Hadoop 1.0.3 which runs in the pseudo-distributed mode. If you could answer to my email and send me your Gradle build script, I can test it in my local environment.

      Reply
  • Hi Petri,

    Really appreciate the quick response. I am using the Hadoop 1.0.3 version only. Yes, I am using Spring Data Hadoop 1.0.0. Also, I did try it using Maven but no luck :(. The surprising thing is that it gets executed successfully without any error. If I remove the hadoop bean and a driver class then it works properly. Am I missing any configuration stuff? My application context is:

    fs.default.name=hdfs://Ubuntu05:54310

    simple job runner

    Configures the reference to the actual Hadoop job.

    I have hard coded some properties. Also tried with a property file. But no luck at all.

    Thanks,
    Amar

    Reply
    • Hi Amar,

      It seems that the configuration of the job runner has changed between 1.0.0.M2 (The Spring Data Apache Hadoop version used in this blog entry) and 1.0.0.RC1. Have you checked out the instructions provided in the Running a Hadoop Job section of the Spring Data Apache Hadoop reference documentation?

      Reply
  • Hi Petri,

    Changing the version to M2 worked. I heartily appreciate your help.

    Thanks,
    Amar

    Reply
    • Amar,

      It is good to hear that you were able to solve your problem. I will update this blog entry and the example application when I have got time to do it.

      Reply
  • Hi,
    I’m trying out spring-data-hadoop and running it as a web application on Tomcat. I configured everything, and when running the job from my servlet I get the following exception:

    SEVERE: PriviledgedActionException as:tomcat7 cause:java.io.IOException: Failed to set permissions of path: /hadoop_ws/mapred/staging/tomcat71391258236/.staging to 0700

    here hadoop.tmp.dir=/hadoop_ws

    I know the tomcat7 user doesn’t have access to the above directory, but I’m not sure how to get past this exception.

    I tried the following, but no luck:
    1. started tomcat as the user that have permission to /hadoop_ws directory. I changed TOMCAT7_USER and TOMCAT7_GROUP
    2. hadoop dfs -chmod -R /hadoop_ws
    3. changed mapreduce.jobtracker.staging.root.dir to different folder and set 777 permission.

    None of the above approaches worked. All the examples I find on the internet either configure the MapReduce jobs in XML as in this post, or the application is a standalone application which runs as the logged-in user.

    Any help highly appreciated.

    Thanks.

    Reply
    • run: hadoop fs -chmod -R 777 /

      Regards,

      JP

      Reply
  • Hello Petri,


    I followed your nice tutorial but with a change.

    I used spring hadoop 1.0.0.RC2.

    I needed a change in the JobRunner definition in the Spring config file. I added

    otherwise no output directory was created.

    I’ll go to your next blog entry.

    Rafa


    Reply
    • Hi Rafa,

      Thank you for your comment.

      As you found out, the configuration of the job runner bean has changed between the 1.0.0.M2 and 1.0.0.RC2 versions of Spring Data Hadoop.

      I have been meaning to update these blog entries, but I have been busy with other activities. Thank you for pointing out that I should get it done as soon as possible.

      Reply
  • I updated the Spring Data Apache Hadoop version to 1.0.0.RC2.

    Reply
  • Hi Petri,

    I am loading data in bulk into HBase with HBase MapReduce. Here I can configure HFileOutputFormat. Is there any way to configure the same with a Spring application context?

    Hoping for your quick response,
    Amar

    Reply
  • Hi Petri,

    Great tutorial!!!
    I followed the steps and it works perfectly locally with startup.sh.
    When I deploy it in a master-slave cluster and run $>hadoop jar mapreduce.jar,
    the job starts and the tasks start on both nodes, but in the map phase I got:

    org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201302251437_0004_m_000001_0: java.lang.RuntimeException: java.lang.ClassNotFoundException: net.petrikainulainen.spring.data.apachehadoop.WordMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

    Any idea?

    Thanks

    Reply
    • Hi Bill,

      thank you for your comment.

      The root cause of your problem is that the mapper class is not found on the classpath. Did you use my example application or did you create your own application?

      I would start by checking that the configuration of your job is correct (check that the mapper and reducer classes are found). Also, if you created your own application, you could try to test my example application and see if it is working. I have tested it by running Hadoop in pseudo-distributed mode and it would be interesting to hear if it works correctly in your environment.

      Reply
      • Hi again Petri,

        Thanks for your reply.

        I am using your example. I have a Linux host with two virtual machines (VMware),
        one as the master node and one as a slave node; the cluster tested successfully.

        My steps below:

        #HOST
        1) Configure application.properties

        fs.default.name=hdfs://master:54310
        input.path=/user/billbravo/gutenberg
        output.path=/user/billbravo/gutenberg/output

        2) Make the assembly (your example, of course)

        $host>mvn assembly:assembly

        3) Run example locally(Successful)

        $host>unzip target/mapreduce-bin.zip
        $host>cd mapreduce
        $host>sh startup.sh

        4) Copy to master node

        $host>scp mapreduce-bin.zip billbravo@master:

        4) Run in the cluster

        $host>ssh billbravo@master
        #MASTER NODE
        $master>unzip mapreduce-bin.zip
        $master>cd mapreduce
        $master>hadoop jar mapreduce.jar

        In this step the error from my previous comment occurs:

        org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201302251437_0004_m_000001_0: java.lang.RuntimeException: java.lang.ClassNotFoundException: net.petrikainulainen.spring.data.apachehadoop.WordMapper
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

        My goal is to understand how to make a Hadoop application (with Spring, of course), run it locally for debugging and then deploy it to a remote cluster like Amazon EMR.

        Thanks again for your attention

        :)


        Reply
        • Hi Bill,

          you are welcome (about my attention). These kinds of puzzles are always nice because I usually learn a lot of things by solving them. :)

          I found the problem. I updated this blog entry and made two changes to my example:

          1. Added JobTracker configuration to both application.properties and applicationContext.xml files.
          2. Added main class configuration to applicationContext.xml file.

          I also updated Spring Data Apache Hadoop to version 1.0.0.RELEASE.

          This should solve your problem.

          Reply
          • Hi Petri,

            It works perfectly!

            I found a workaround: copying mapreduce.jar to the Distributed Cache
            and using the hdp:cache option in applicationContext.xml:

            But your solution is cleaner.

            Greetings

          • Hi Bill,

            It is good to hear that this solved your problem. :)

          • I have the same problems, but I do not know how to fix it. Could you give me more code to describe it? Thank you.

          • Hi,

            The solution to this problem is described in this comment. You can find the relevant files on GitHub:

          • Hi again Petri,

            Thanks for your reply.

            I saw the comment and added the JobTracker configuration to both application.properties and applicationContext.xml.

            application.properties:
            fs.default.name=hdfs://localhost:9100
            mapred.job.tracker=localhost:9101

            applicationContext.xml:

            fs.default.name=${fs.default.name}
            mapred.job.tracker=${mapred.job.tracker}

            In addition, I updated Spring Data Apache Hadoop to version 1.0.1.RELEASE

            # hadoop dfs -mkdir /input
            # hadoop dfs -ls /
            Found 3 items
            drwxr-xr-x - root supergroup 0 2013-10-12 11:40 /hbase
            drwxr-xr-x - root supergroup 0 2013-10-12 12:05 /input
            drwxr-xr-x - root supergroup 0 2013-10-12 11:40 /tmp
            # hadoop dfs -put sample.txt /input/sample.txt
            # hadoop dfs -ls /input
            Found 1 items
            -rw-r--r-- 3 root supergroup 51384 2013-10-12 12:09 /input/sample.txt

            Unfortunately, when I run the program, I get:

            java.lang.RuntimeException: java.lang.ClassNotFoundException: net.petrikainulainen.spring.data.apachehadoop.WordMapper
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
            at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
            at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
            at org.apache.hadoop.mapred.Child.main(Child.java:249)
            Caused by: java.lang.ClassNotFoundException: net.petrikainulainen.spring.data.apachehadoop.WordMapper
            at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
            at java.lang.Class.forName0(Native Method)
            at java.lang.Class.forName(Class.java:249)
            at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
            … 8 more

          • OMG!

            I am sorry. I am running this example in the Eclipse IDE. When I am running this for local files on my disk, it is OK.

            By the way, do you have any example about HBaseTemplate?

          • Good to hear that you were able to solve your problem (eventually). Unfortunately I haven’t written anything about the HBaseTemplate. I should probably pay more attention to this stuff in the future though.

  • Great tutorial. Many thanks.
    Could you please provide some information about configuring org.apache.avro.mapred.AvroJob?
    My mapred job uses AvroInputFormat.
    Any information about configuring AvroJob using Spring Data would be very helpful.

    Thanks

    Reply
    • Thank you for your comment. It is nice to hear that this blog post was useful to you.

      Also, thank you for providing me an idea for a new blog post. I have added your idea to my to-do list and I will probably write something about it after I have finished my Spring Data Solr tutorial (four blog posts left).

      Reply
  • Hi Petri,
    I was following your tutorial step by step. I have Hadoop version 0.20.0 installed, but when running the program it throws a class not found exception for the mapper class.
    Can you please help me? I am new to this technology.

    Reply
    • The requirements of Spring Data Apache Hadoop are described in its reference manual. It contains the following paragraph:

      Spring for Apache Hadoop requires JDK level 6.0 (just like Hadoop) and above, Spring Framework 3.0 (3.2 recommended) and above and Apache Hadoop 0.20.2 (1.0.4 recommended) and above. SHDP supports and is tested daily against various Hadoop distributions, such as Cloudera CDH3 (CD3u5) and CDH4 (CDH4.1u3 MRv1) distributions and Greenplum HD (1.2). Any distro compatible with Apache Hadoop 1.0.x should be supported.

      I am not sure what the difference between 0.20.2 (the minimum supported version) and 0.20.0 (your version) is, but it might not be possible to use Spring Data Apache Hadoop with Hadoop 0.20.0. Did you try to run the example application of this blog post or did you create your own application?

      Reply
  • Hi Petri,
    I have Hadoop 0.20.2 (sorry for the wrong information). I am running this example in the Eclipse IDE. When I am running this for local files on my disk it is working fine; running the same map reduce against HDFS, it throws mapper class not found.
    Check the below link for the log.

    http://stackoverflow.com/questions/15786155/sppring-data-hadoop-map-reduce-job-thrwoing-no-class-found-exception?noredirect=1#comment22445005_15786155

    Reply
    • This happened to me as well and I had to make some changes to my example application. These changes are described in this comment. Compare these files with the configuration files of your project:

      I hope that this solves your problem.

      Reply
        • I have the same files in my project and am using the spring-data-hadoop 1.0.0 release version. Still facing the problem :( Are there any other files I need to change?

        Reply
        • This commit fixed the problem in my example application. As you can see, I made changes only to the files mentioned in my previous comment (and updated the version number of Spring Data Apache Hadoop to the pom.xml).

          Is there any chance that I can see your code? I cannot figure out what is wrong and seeing the code would definitely help.

          Reply
          • Thanks for your inputs.
            I have commented out mapred.job.tracker=${mapred.job.tracker} in applicationContext.xml and now it’s working fine, but I don’t know why this is happening. Do you know the reason?

          • The reason for this might be that when the job tracker is not set, Hadoop will use the default job tracker which could be the local one. In other words, your map reduce job might not be executed in the Hadoop cluster. Check out this forum thread for more details about this.

  • Hello, Petri.

    My current configuration includes one Linux machine with the NameNode, TaskTracker and JobTracker daemons, another Linux machine with a DataNode and a third Windows 7 machine with Eclipse/NetBeans for development.

    At least with this configuration, if you want to run the MapReduce job from the development machine, it is mandatory to include the jar attribute in the applicationContext.xml file:

    Otherwise the classes will not be available in the execution cluster. Hope it helps.

    jv

    Reply
    • Hi Javier,

      Thank you for your comment.

      Unfortunately WordPress decided to remove the XML which you added to your comment (I should probably figure out if I can disable that feature since it is quite annoying).

      Anyway, the configuration of my example application uses the jar-by-class attribute, which did the trick when I last ran into this problem.

      However, your comment made me check whether you can explicitly specify the name of the jar file. I read the schema of the Spring for Apache Hadoop configuration and found out that you can do this by using the jar attribute.

      This information is indeed useful because if the jar file of the map reduce job cannot be resolved when the jar-by-class attribute is used, you can always use the jar attribute for this purpose.

      Again, thanks for pointing this out.

      Reply
  • Hello, interesting read. I am an undergraduate student wanting to leverage these techniques for an application as my final-year BSc IT project. I am expecting a lot of data, so I opted for MongoDB. I’ve been using spring-data-mongo and have been wondering if one could use spring-data-mongo and spring-data-hadoop together. MongoDB has a way to integrate with Hadoop, so how does everything play nicely together? Thank you.

    Reply
    • It is possible to use multiple Spring Data libraries in the same project. I have not personally used spring-data-hadoop and spring-data-mongo in the same project but it should be doable.

      On the other hand, it is kind of hard to say if this makes any sense because I have no idea what kind of an application you are going to create. Could you shed some light on this?

      If you can describe the use cases where you would like to use spring-data-hadoop and spring-data-mongo, I can probably give you a better answer.

      Reply
  • Hi Petri,
    First of all, thank you for the tutorial. I downloaded the example Maven project from GitHub. I am able to run the application, but it gives a ClassNotFoundException for the WordMapper class. I am using Apache Hadoop version 1.2.1 and Spring 3.1.0.
    I am connecting to HDFS from a remote VM. Can you give me any suggestion to solve this problem?

    Reply
  • Hi sir, I am trying to run your above-mentioned example. I am getting the below error and am unable to solve it. Please help me; I tried some other forums and blogs but was unable to solve this.

    INFO – sPathXmlApplicationContext – Refreshing org.springframework.context.support.ClassPathXmlApplicationContext@46ae506e: startup date [Tue Oct 29 11:02:44 IST 2013]; root of context hierarchy
    INFO – XmlBeanDefinitionReader – Loading XML bean definitions from class path resource [applicationContext.xml]
    INFO – urcesPlaceholderConfigurer – Loading properties file from class path resource [application.properties]
    INFO – DefaultListableBeanFactory – Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@69267649: defining beans [org.springframework.context.support.PropertySourcesPlaceholderConfigurer#0,hadoopConfiguration,
    wordCountJob,wordCountJobRunner]; root of factory hierarchy
    INFO – JobRunner – Starting job [wordCountJob]
    WARN – JobClient – No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    INFO – FileInputFormat – Total input paths to process : 1
    WARN – NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    WARN – LoadSnappy – Snappy native library not loaded
    INFO – JobClient – Running job: job_201310291053_0002
    INFO – JobClient – map 0% reduce 0%
    INFO – JobClient – Task Id : attempt_201310291053_0002_m_000000_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: test.WordMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)

    Update: I removed the unnecessary lines so that the comment is a bit cleaner. – Petri

    Reply
    • Hi,

      Did you already try the advice given in this comment?

      Reply
      • hi petri .

        the same code which I tried above is working fine today, but the output of my job was nothing. There are some word matches in the input file.

        a)input file content:

        hadoop shiva rama hql
        java hadoop

        b) my properties file
        fs.default.name=hdfs://localhost:54310
        mapred.job.tracker=localhost:54311

        input.path=/input3/
        output.path=/output7/

        c) my application context was the same as yours.
        d) I am using Apache Hadoop 1.2.1 in pseudo-distributed mode on my system and trying your example using STS (Spring Tool Suite). By running the Main.java class as a Java application, I'm getting the below output in the console:

        INFO – sPathXmlApplicationContext – Refreshing org.springframework.context.support.ClassPathXmlApplicationContext@46ae506e: startup date [Wed Oct 30 12:53:21 IST 2013]; root of context hierarchy
        INFO – XmlBeanDefinitionReader – Loading XML bean definitions from class path resource [applicationContext.xml]
        INFO – urcesPlaceholderConfigurer – Loading properties file from class path resource [application.properties]
        INFO – DefaultListableBeanFactory – Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@69267649: defining beans [org.springframework.context.support.PropertySourcesPlaceholderConfigurer#0,
        hadoopConfiguration,wordCountJob,wordCountJobRunner]; root of factory hierarchy
        INFO – JobRunner – Starting job [wordCountJob]
        INFO – FileInputFormat – Total input paths to process : 1
        WARN – NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
        WARN – LoadSnappy – Snappy native library not loaded
        INFO – JobClient – Running job: job_201310301123_0008
        INFO – JobClient – map 0% reduce 0%
        INFO – JobClient – map 100% reduce 0%
        INFO – JobClient – map 100% reduce 33%
        INFO – JobClient – map 100% reduce 100%
        INFO – JobClient – Job complete: job_201310301123_0008
        INFO – JobClient – Counters: 28
        INFO – JobClient – Job Counters
        INFO – JobClient – Launched reduce tasks=1
        INFO – JobClient – SLOTS_MILLIS_MAPS=6477
        INFO – JobClient – Total time spent by all reduces waiting after reserving slots (ms)=0
        INFO – JobClient – Total time spent by all maps waiting after reserving slots (ms)=0
        INFO – JobClient – Launched map tasks=1
        INFO – JobClient – Data-local map tasks=1
        INFO – JobClient – SLOTS_MILLIS_REDUCES=9332
        INFO – JobClient – File Output Format Counters
        INFO – JobClient – Bytes Written=0
        INFO – JobClient – FileSystemCounters
        INFO – JobClient – FILE_BYTES_READ=76
        INFO – JobClient – HDFS_BYTES_READ=134
        INFO – JobClient – FILE_BYTES_WRITTEN=109536
        INFO – JobClient – File Input Format Counters
        INFO – JobClient – Bytes Read=34
        INFO – JobClient – Map-Reduce Framework
        INFO – JobClient – Map output materialized bytes=76
        INFO – JobClient – Map input records=2
        INFO – JobClient – Reduce shuffle bytes=76
        INFO – JobClient – Spilled Records=12
        INFO – JobClient – Map output bytes=58
        INFO – JobClient – CPU time spent (ms)=1010
        INFO – JobClient – Total committed heap usage (bytes)=163581952
        INFO – JobClient – Combine input records=0
        INFO – JobClient – SPLIT_RAW_BYTES=100
        INFO – JobClient – Reduce input records=6
        INFO – JobClient – Reduce input groups=5
        INFO – JobClient – Combine output records=0
        INFO – JobClient – Physical memory (bytes) snapshot=237084672
        INFO – JobClient – Reduce output records=0
        INFO – JobClient – Virtual memory (bytes) snapshot=1072250880
        INFO – JobClient – Map output records=6
        INFO – JobRunner – Completed job [wordCountJob]

        In my HDFS, I'm getting the output7 directory with _SUCCESS, logs and part-00000 files. The part-00000 file is empty; it should display hadoop 2 as per my input file.

        Please help me out.
        2) Please suggest how to use that "jar attribute" in my applicationContext.xml.

        Reply
        • Hi Petri, got the answer for my problem. Thanks.
          I am making a jar of my project on the local drive and giving that path to the classpath of my project's run configuration, though I'm not using any "jar" attribute in the applicationContext.xml file.

          Reply
          • Hi,

            It is good to hear that you could solve your problem!

  • Hi Petrik,

    I am facing a similar class not found issue when running from Eclipse and submitting jobs to a pseudo-distributed Hadoop machine.

    I tried the options you have provided below, but they do not work for the sample project you have provided:
    Add the jar-by-class attribute to the job configuration
    Use the jar attribute if jar-by-class does not work
    Adding the libs attribute and providing the path of the jar works, but that approach is hard for development because the libs attribute has to be bound in every job bean.

    Why is the jar-by-class attribute not working?
    How are big workflows developed using Spring Hadoop?

    Kindly share your thoughts!

    Reply
    • I have to admit that I haven’t been using Spring Data for Apache Hadoop since I wrote this tutorial, but I remember that the version of Spring Data for Apache Hadoop which I used in this tutorial didn’t support all Hadoop versions. Which Hadoop version are you using?

      Reply
  • Hi,

    I use spring-data-hadoop version 1.0.2.RELEASE
    and use hadoop-core 1.2.1 …
    The problem is an empty output file named part-r-00000.
    The program finished successfully, but when I use the command:
    hadoop dfs -cat /output/part-r-00000
    nothing is listed…

    Please help

    Reply
    • I have to confess that I have no idea what your problem is. I will update the example application of this blog post to use the latest stable version of Spring for Apache Hadoop and see if I run into this issue as well.

      Reply
  • Hi Petri,
    I found this to be one of the easiest places to learn how to integrate Spring and Hadoop. I followed your instructions step by step and everything went without errors.
    But I got INFO: Starting job [wordCountJob] and it never ended, which means the execution was not completed. Why is this so? Could you help me?

    Reply
    • Can you see the job in the job tracker user interface?

      Reply
  • Hi,

    I have a problem executing the project within the Eclipse IDE. I got a noClassFoundError…
    Do I need to add the Eclipse Hadoop plugin or anything else?
    Do I need any extra configuration for Eclipse?
    Note: I can execute the project after making a jar and executing it with "hadoop jar …jar"

    Reply
    • Hi,

      I have solved the problem by adding the below code to the hdp:job tag:

      libs="${LIB_DIR}/mapreduce.jar"

      But I still don’t know why it couldn’t execute without the above code.
      Can you help?
      Thanks…

      Reply
      • Did you import the project into Eclipse by using m2e (how to import Maven project into Eclipse)? I haven’t used Eclipse in seven years, so I am not sure how the Maven integration is working at the moment.

        However, it seems that for some reason Eclipse didn’t add the dependencies of the project to the classpath. I assume that if the project is imported as a Maven project, the Maven integration of Eclipse should take care of this.

        Reply
        • Yes, I imported the project into Eclipse by using m2e, but the same problem occurred in the IntelliJ IDE.
          You are right that the libraries cannot be added to the classpath.

          By the way, I want to debug the project in Eclipse. Any help?
          Thanks so much

          Reply
  • I am successfully connected in multi-node mode and have also uploaded files to HDFS (only the master’s files).
    How do I upload the files on a slave directly to the master HDFS on Ubuntu?

    Reply
  • Hi Petri,
    I am trying to run the word count program using spring-hadoop.
    I am using hadoop-2.0.0-cdh4.2.0 MR1 in standalone mode.
    While executing, it results in the following exception:
    Server IPC version 7 cannot communicate with client version 4.

    I am unable to connect to HDFS. Please help me out.

    Reply
    • Hi,

      The example application of this blog post uses Spring for Apache Hadoop 1.0.0. This version doesn’t support Apache Hadoop 2.0.0 (See this comment for more details about this).

      You might want to update Spring for Apache Hadoop to version 1.0.2. It should support Apache Hadoop 2.0.0 (I haven’t tried it myself).

      Reply
  • Hi, I followed the above steps but I got this type of error:

    WARNING: Failed to scan JAR [file:/B:/Software/STS-SpringToolSuite/springsource/vfabric-tc-server-developer-2.9.5.SR1/Spring-VM/wtpwebapps/Hadoopfitness/WEB-INF/lib/core-3.1.1.jar] from WEB-INF/lib
    java.util.zip.ZipException: invalid CEN header (bad signature)

    Reply
    • The jar file called core-3.1.1.jar is probably corrupted in some way. This StackOverflow answer provides some clues which might help you to solve this issue.

      The location of the jar file suggests that you have downloaded it manually. Is there some reason why you didn’t get it by using Maven?

      Reply
  • Hi,
    I’m struggling with getting this thing working and have already put it on Stack Overflow. Can you check and help me figure out what’s wrong:

    http://stackoverflow.com/questions/18396151/classnotfoundexception-after-job-submission

    Reply
    • Hi,

      Have you tried following the advice given in this comment?

      Reply
      • Hi Petri,

        Yes, I have already done what has been mentioned in the comment – I think the issue is with the Mapper and Reducer classes not becoming available to the Hadoop framework, but I’m not able to figure it out!
        Can you have a look at the Stack Overflow thread I have mentioned (even a bounty hasn’t helped me out :( )?

        Reply
      • Hi Petri,

        Just to clarify – I tried adding jar-by-class="com.hadoop.basics.Main" and creating independent mapper and reducer classes (not updated in the Stack Overflow thread where I have put the code with inner classes) but still the same error.

        Regards,
        KA

        Reply
        • Did you add the job tracker configuration to your application context configuration file?

          When I was investigating the problem of the person who originally asked the same question, I noticed that adding the jar-by-class attribute didn’t solve the problem. I had to configure the job tracker as well.

          You can configure the job tracker by adding the following snippet to your application context configuration file (you can replace the property placeholders with actual values as well):

          
          <hdp:configuration>
              fs.default.name=${fs.default.name}
              mapred.job.tracker=${mapred.job.tracker}
          </hdp:configuration>
          
          
          Reply
          • Hi Petri,

            Yeah, you are right about the ‘jar-by-class’ not solving the problem, but I have already configured the job tracker in the applicationContext.xml as follows:

            Yet, I’m getting the ClassNotFoundException for the mapper and reducer classes.
            Can you spare some time and have a look at the issue detailed at:

            http://stackoverflow.com/questions/18396151/classnotfoundexception-after-job-submission/25177178?noredirect=1#comment39202613_25177178

          • It seems that there are a few differences between your code and my code:

            • Your reducer and mapper classes are inner classes of the main class (and mine are not). Have you tried moving those components away from the main class?
            • Are you trying to run your job from your IDE? If I remember correctly, I have never tried to run this job from my IDE (so I have no idea if it will work).
