Book Review: Scaling Big Data with Hadoop and Solr

Disclosure: I have written a book which was published by Packt Publishing, and I received a free review copy of this book.

Scaling Big Data with Hadoop and Solr by Hrishikesh Karambelkar is Packt Publishing's latest book about Big Data.

I had high hopes for this one because its description promises that:

  1. It is a step-by-step guide that helps you build high performance search engines with Apache Hadoop and Solr.
  2. You can understand the book without any prior experience with Apache Hadoop and Solr.

Let's find out if this book keeps these promises.

What Is Found Between the Covers?

The book is divided into five chapters and three appendices, which are described below.

The first chapter describes the problems which are solved by Big Data. It gives a short introduction to Apache Hadoop and its ecosystem. It also helps you to install and configure Apache Hadoop, and has a section which talks about its administration tools.

The first chapter is solid, and it gives a really good description of the Hadoop Distributed File System (HDFS). Also, its description of the map-reduce algorithm is one of the best I have ever seen.
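
If you have not met the idea before, the following is a minimal sketch of the map-reduce pattern as a word count. It is not taken from the book: it uses plain Python instead of Hadoop, it runs in a single process, and all function names are hypothetical ones chosen for illustration.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for text in documents:
        for word in text.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key to get the final counts."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in order on an in-memory list of strings."""
    return reduce_phase(shuffle_phase(map_phase(documents)))
```

In a real cluster the map and reduce steps would run in parallel on different machines and the grouping step would move data between them. The sketch only shows the shape of the computation.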

Chapter two gives an overview of the architecture of Apache Solr, and describes how you can install and configure Apache Solr.

This chapter does a good job of explaining the different request handlers, but unfortunately the description of the Solr schema is a bit vague. It reads a bit like a reference manual, which might be a problem if you don't have any experience with Solr.

The third chapter describes the problems which Solr can solve on its own and identifies the benefits of distributed search. It introduces different data processing workflows, and describes the advantages and disadvantages of each workflow. This chapter ends by describing the tools which can be used to implement distributed search with Apache Solr.

The third chapter has a very good start, but its end raises more questions than it answers. To be honest, it feels a bit confusing because it doesn’t answer the question:

How can I use these tools?

Chapter four describes how you can index data by using Big Data technologies. It starts by describing NoSQL databases and the CAP theorem. Then it gives an introduction to the concept of distributed search. It also describes how you can integrate Hadoop, Solr, and HBase by using Lily. The chapter ends by describing how you can divide your Solr index into multiple shards by using SolrCloud and ZooKeeper.

This chapter was a good read but it has two problems:

  • The description of Lily's installation process was a bit vague. For example, I have no idea where I should copy its jar files.
  • It assumes that you don’t run into problems. I understand that it is impossible to cover all exceptional situations in a book. However, it could have provided answers to the most common problems or at least pointed out resources which are useful if you run into problems.

The fifth chapter concentrates on optimizing the performance of Apache Solr. It describes how you can optimize your schema, Solr index, and search runtime. It also provides tips for improving the performance of the Java EE container which runs your Solr instance, and introduces different ways to monitor the performance of your setup.

In my opinion, this chapter is the best chapter of the book. It provides concrete advice which you can put to use right away.

Appendix A describes two different use cases for a Big Data based search function. The selected use cases are good and I think that the author argued his case very well.

Appendix B describes how you can configure your Solr instance when you are implementing one of the use cases mentioned in Appendix A. I enjoyed reading this section of the book and I learned some new tricks as well.

Appendix C describes how you can add data to a Solr index by using the tools described in chapter 3. Although I was happy to finally see some code, I was surprised to see that the code samples weren’t explained properly. This makes it pretty hard to understand them if you don't have any experience with these tools.

So, What Is the Verdict?

I have mixed feelings about this book.

It is clear that the author is an expert in this field, and he explains these complex topics in an understandable way. This book gives a good overview of the subject, but it concentrates primarily on theory.

Although the theory is presented in a clear way, the book offers very little advice on putting this theory into practice. This was a disappointment to me because this book is advertised as a step-by-step guide.

In other words, Scaling Big Data with Hadoop and Solr gives a good introduction to the subject, but be prepared to search for more information from other sources.

1 comment
  • Steve Aug 16, 2016 @ 22:18

    You're right, a step-by-step guide should not be only theory.
