
Cloudera Search Installation Guide

Cloudera, Inc.
220 Portage Avenue
Palo Alto, CA 94306
info@cloudera.com
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Important Notice
© 2010-2013 Cloudera, Inc. All rights reserved.
Cloudera, the Cloudera logo, Cloudera Impala, Impala, and any other product or service names or
slogans contained in this document, except as otherwise disclaimed, are trademarks of Cloudera and its
suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior
written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other
trademarks, registered trademarks, product names and company names or logos mentioned in this
document are the property of their respective owners. Reference to any products, services, processes or
other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute
or imply endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights
under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval
system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise), or for any purpose, without the express written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Cloudera, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
The information in this document is subject to change without notice. Cloudera shall not be liable for
any damages resulting from technical errors or omissions which may be present in this document, or
from use of this document.
Version: Cloudera Search Beta, version 0.9.0
Date: June 4, 2013
Contents

ABOUT THIS GUIDE ......................................................... 1
GUIDELINES FOR DEPLOYING CLOUDERA SEARCH ................................. 1
   THE IMPORTANCE OF USE CASE DEFINITION ................................. 1
CLOUDERA SEARCH REQUIREMENTS ............................................. 3
   CDH REQUIREMENT ....................................................... 3
   OPERATING SYSTEMS ..................................................... 4
   JDK ................................................................... 5
   PORTS USED BY CLOUDERA SEARCH ......................................... 5
      Ports Used by Cloudera Search ...................................... 5
INSTALLING CLOUDERA SEARCH ............................................... 5
   CHOOSING WHERE TO DEPLOY THE CLOUDERA SEARCH PROCESSES ............... 6
   CLOUDERA SEARCH INSTALLATION APPROACHES .............................. 6
      Installing from Packages ........................................... 6
   BEFORE YOU BEGIN INSTALLING CLOUDERA SEARCH MANUALLY ................. 7
   INSTALLING SOLR PACKAGES .............................................. 7
   DEPLOYING CLOUDERA SEARCH IN SOLRCLOUD MODE .......................... 8
      Installing and Starting ZooKeeper Server ........................... 8
      Initializing Solr for SolrCloud Mode ............................... 9
      Configuring Solr for use with HDFS ................................. 9
      Creating the /solr Directory in HDFS .............................. 11
      Initializing ZooKeeper Namespace .................................. 11
      Starting Solr in SolrCloud Mode ................................... 11
      Administering Solr with the solrctl Tool .......................... 11
      Runtime Solr Configuration ........................................ 12
      Creating your first Solr Collection ............................... 13
      Adding Another Collection with Replication ........................ 14
   INSTALLING FLUME SOLR SINK FOR USE WITH CLOUDERA SEARCH ............. 14
   INSTALLING MAPREDUCE TOOLS FOR USE WITH CLOUDERA SEARCH ............. 15
UPGRADING CLOUDERA SEARCH ............................................... 15
   CONTENTS ............................................................. 15
   UPGRADING CLOUDERA SEARCH FROM SOLRCLOUD MODE ....................... 16
   UPGRADING CLOUDERA SEARCH FROM NON-SOLRCLOUD MODE ................... 16
INSTALLING AND USING HUE WITH CLOUDERA SEARCH .......................... 17
   IMPORTING COLLECTIONS ................................................ 17
   USER UI .............................................................. 18
      Customization UI .................................................. 18
      Deploying Hue Search .............................................. 19
      Updating Hue Search ............................................... 20
      Hue Search Twitter Demo ........................................... 20
About this Guide

This guide explains how to install Cloudera Search powered by Solr. This guide also explains how to
install, start, and use supporting tools and services such as the ZooKeeper Server, MapReduce tools for
use with Cloudera Search, and Flume Solr Sink.
Cloudera Search documentation also includes:
Cloudera Search User Guide

Guidelines for Deploying Cloudera Search


This section outlines some of the items and choices to consider when deploying Cloudera
Search. Use the following information as a guide to help you form and implement solutions for
your particular use cases, rather than as a list of firm recommendations. Note that there is a tradeoff
between effort and results: until you have an example application, much of this remains
theoretical, since you cannot necessarily predict which factors will matter most until the use cases
and data are better understood.

The importance of use case definition


It is important to define the use cases as early as possible. The same Solr index can have drastically
different hardware requirements, usually memory, depending on the queries that are performed.
For example, the memory requirements for faceting vary with the number of unique terms
in the field being faceted on. Suppose you want to use faceting on a field that has 10 unique values.
Because only 10 counting buckets are required no matter how many documents are in the index,
the memory overhead is almost nonexistent in this example. But suppose the same index has a unique
timestamp for every entry and you want to facet on that field with a range-type query. This would
require one counting bucket per document in the index. If there are 500 million documents, then
faceting across 10 such fields would increase the RAM requirements significantly.
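As a rough back-of-the-envelope check, the bucket arithmetic above can be sketched in shell. The 4-bytes-per-bucket figure is an illustrative assumption, not a measured Solr constant:

```shell
# Illustrative arithmetic only: assumes one 4-byte counting bucket per
# document when faceting on a field that is unique per document.
docs=500000000        # documents in the index
fields=10             # unique-per-document fields being faceted
bytes_per_bucket=4
total=$((docs * fields * bytes_per_bucket))
echo "approx $((total / 1024 / 1024 / 1024)) GiB just for counting buckets"
```

Even under these simplified assumptions, the counting buckets alone reach tens of gigabytes, which is why such facets dominate memory sizing.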
For this reason, use cases and some characterization of the data must be known before you can
estimate the hardware requirements. The important parameters to consider are:
Number of documents. For Cloudera Search, it's almost always the case that sharding is
required.
Approximate word count for each potential field.
What information is stored in the Solr index (that is, returned with the search results) and what
is used only for searching.
Foreign language support.
o How many different languages appear in your data?
o What percentage of documents are in each language?

o Is language-specific searching to be supported? The issue is whether accent folding and
storing the text in a single field is sufficient.
o What language families are going to be searched? You can, for instance, combine all
Western European languages into a single field, but combining English and Chinese into
a single field does not work well. Sometimes accents alter the meaning of a
word, and accent folding loses that distinction.
Faceting requirements
o Be wary of faceting on fields that have many unique terms (for example, timestamps or
free-text fields). Faceting on a field with many (more than 10,000) unique values
is usually not useful. Make sure that any requirement to facet on such fields is
necessary.
o Types of facets. You can facet on queries as well as on field values. Faceting on queries is
often useful for dates (for example, "in the last day," "in the last week," and so on).
Using a bare NOW (Solr Date Math) is almost always inefficient. Facet-by-query is not
expensive memory-wise, since the number of counting buckets is limited by the
number of queries specified, no matter how many unique values are in the underlying
field.
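As an illustration of facet-by-query for dates, the following sketch builds such a request; the collection name, the timestamp field, and the URL are hypothetical, and rounding with /DAY avoids the bare-NOW inefficiency noted above:

```shell
# Hypothetical collection and field names; rounding to /DAY keeps the
# facet queries stable (and therefore cacheable), unlike a bare NOW.
base='http://localhost:8983/solr/collection1/select'
params='q=*:*&rows=0&facet=true'
params="$params&facet.query=timestamp:[NOW/DAY-1DAY TO NOW/DAY]"
params="$params&facet.query=timestamp:[NOW/DAY-7DAYS TO NOW/DAY]"
echo "curl '$base?$params'"
```

The script only prints the request; run the printed curl command against a live Solr server to get counts for the "last day" and "last week" buckets.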
Sorting requirements
o Sorting requires one int per document (maxDoc) and takes up significant memory.
Additionally, sorting on strings requires storing each unique string value.

Will there be an advanced search capability? If so, what will it look like? Can Cloudera
Search count on its users being more motivated than e-commerce users? There are significant design
decisions that need to be made depending on how motivated the users are. That is:
o Can users be expected to take some time to learn the system? "Advanced"
screens are usually intimidating to e-commerce users but may be the best choice when
users can be expected to invest time in learning them.
o How patient are your users? In data-mining scenarios, users may be willing to wait multiple
seconds for search results. Of course, you don't want users to wait any longer than
necessary, but there is another set of design decisions related to reasonable response
times.
o How many simultaneous users?
Update requirements. An update in Solr refers both to adding new documents and changing
existing documents.
o Loading new documents.
Bulk. Are there use cases where the index has to be rebuilt from scratch, or will
there be only an initial load?


Incremental. What is the rate of new documents coming into the system?
o Updating documents. Can you characterize the expected number of modifications to
existing documents?
o How much latency is acceptable between the time a document is added to Solr
and the time it is available for search?

Security requirements. Solr has no built-in security options. In Solr, document-level security is
usually best accomplished by indexing one or more authorization tokens along with the
document. The number of authorization tokens applied to a document is largely irrelevant;
thousands are reasonable, although such large numbers are usually a nightmare to administer.
The number of authorization tokens associated with a particular user should be much smaller;
100 or so is a good straw-man upper limit. The reason is that security at this level is
usually enforced by appending an fq clause to the query, and putting thousands of tokens in an
fq clause is expensive.
o There is a post filter (also known as a no-cache filter) that can help with access schemes
that can't use the token-indexing approach. Post filters are not cached and are applied
only after all the less expensive filters.
o If grouping and faceting need not accurately reflect the true document counts, some
shortcuts can be taken. For example, ACL filtering is notoriously expensive in some
systems, sometimes requiring database access. If accurate faceting is required, you
cannot stop processing partway through the list and still reflect accurate facets.
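A minimal sketch of the token-based fq approach described above; the auth_token field name and the token values are hypothetical, and in practice the filter should be built server-side from the authenticated user's entitlements:

```shell
# Hypothetical auth tokens granted to the current user; the filter query
# restricts results to documents indexed with at least one of them.
tokens="engineering finance admin"
fq="auth_token:(${tokens// / OR })"   # bash: replace each space with " OR "
echo "fq=$fq"
```

The resulting fq=auth_token:(engineering OR finance OR admin) parameter is appended to every query for that user, which is why keeping per-user token counts small matters.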
Required query rate, usually measured in queries per second.
o Note that you must size the machines to give a reasonable response rate for a single
user. It's possible to put so much strain on a machine that the target hardware cannot
satisfy even a few users, in which case re-sharding is necessary.
o Short of needing to re-shard, increasing Solr's QPS rate is usually a matter of adding more
replicas to each shard.
o Very large numbers of shards can exhibit the "laggard" issue: as the number of shards
increases, so does the probability that one of them will be anomalously slow. The QPS rate
generally falls, though very slowly, as the number of shards climbs into the hundreds.

Cloudera Search Requirements


CDH Requirement
Cloudera Search requires CDH 4.3 or later. For more information, see CDH4 Documentation.


Operating Systems
Cloudera Search provides packages for Red-Hat-compatible, SLES, Ubuntu, and Debian systems as
described below.

Operating System                                    Version                                     Packages

Red Hat compatible
  Red Hat Enterprise Linux (RHEL)                   5.7                                         64-bit
                                                    6.2                                         64-bit, 32-bit
  CentOS                                            5.7                                         64-bit
                                                    6.2                                         64-bit, 32-bit
  Oracle Linux with Unbreakable Enterprise Kernel   5.6                                         64-bit

SLES
  SUSE Linux Enterprise Server (SLES)               11 with Service Pack 1 or later             64-bit

Ubuntu/Debian
  Ubuntu                                            Lucid (10.04) - Long-Term Support (LTS)     64-bit
                                                    Precise (12.04) - Long-Term Support (LTS)   64-bit
  Debian                                            Squeeze (6.0.3)                             64-bit

Notes
For production environments, 64-bit packages are recommended. Except as noted above,
Cloudera Search provides only 64-bit packages.
Cloudera has received reports that our RPMs work well on Fedora, but we have not tested
this.
If you are using an operating system that is not supported by Cloudera's packages, you can
also download source tarballs from Downloads.

JDK
Cloudera Search requires Oracle JDK 1.6. Cloudera recommends version 1.6.0_31. The minimum
supported version is 1.6.0_8. See Java Development Kit Installation for more information.

Ports Used by Cloudera Search


Cloudera Search uses the ports listed in the table below. Before you deploy Cloudera Search, make sure
these ports are open on each system. The table reflects the current default settings, which are defined
in the solr defaults file located in /etc/default/solr.

Ports Used by Cloudera Search

Component        Service             Port   Protocol   Access Requirement   Comment

Cloudera Search  Solr search/update  8983   http       External             All Solr-specific actions, update/query.
                                                                            Defined in /etc/default/solr.
Cloudera Search  Solr (admin)        8984   http       Internal             Administrative use.
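Before deploying, a quick way to probe these ports on a host is bash's /dev/tcp pseudo-device. This is a sketch only: a refused connection simply means nothing is listening on that port yet.

```shell
# Probe the default Cloudera Search ports on this host (bash-only /dev/tcp).
for port in 8983 8984; do
  if (exec 3<>"/dev/tcp/localhost/$port") 2>/dev/null; then
    echo "port $port: something is listening"
  else
    echo "port $port: nothing listening"
  fi
done
```

Run it on each system after starting the Solr server to confirm the expected listeners are up, and substitute the remote hostname for localhost to check firewall rules between hosts.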

Installing Cloudera Search


Review Cloudera Search Requirements before getting started.
Install Cloudera's repository: before using the instructions in this guide to install or upgrade
Cloudera Search from packages, install Cloudera's yum, zypper/YaST or apt repository, and
install or upgrade CDH4 and make sure it is functioning correctly. For instructions, see CDH4
Installation and the instructions for Upgrading from CDH3 to CDH4 or Upgrading from an Earlier
CDH4 Release.

Note
Non-SolrCloud mode has been deprecated and is no longer supported.

Cloudera Search provides the following packages:

Package Name    Description

solr            Solr/SolrCloud.
solr-server     Platform-specific service script for starting, stopping, or restarting Solr.
solr-doc        Cloudera Search documentation.
solr-mapreduce  Tools to index documents using MapReduce.
flume-ng-solr   Flume Solr Sink.
search          Examples, Contrib, and Utility code and data.

Choosing where to Deploy the Cloudera Search Processes


You can colocate a Cloudera Search server (solr-server package) with a Hadoop TaskTracker (MRv1) and
a DataNode. When colocating with TaskTrackers, be sure that the resources of the machine are not
oversubscribed. It's safest to start with a small number of MapReduce slots and increase them gradually.
For instructions describing how and where to install solr-mapreduce, see Installing MapReduce Tools for
use with Cloudera Search. For information about flume-ng-solr, see Installing Flume Solr Sink for use
with Cloudera Search. For information about the search package, see the Using Cloudera Search section
in the Cloudera Search Tutorial topic in the Cloudera Search User Guide.

Cloudera Search Installation Approaches


Cloudera Search currently supports installation using either packages or Cloudera Manager. For
more information, see:
The Cloudera Manager Installation Guide for information on installing Search using Cloudera
Manager.
Ways To Install CDH4 for details on installing from packages. This page also describes how to
install CDH4 using Cloudera Manager.

Installing from Packages


To install and deploy Cloudera Search, follow the directions in the sections that follow, and be sure to
review the Guidelines for Deploying Cloudera Search and the Cloudera Search Tutorial in the Cloudera
Search User Guide.


Before You Begin Installing Cloudera Search Manually


Review the requirements described in Cloudera Search Requirements. The installation instructions
assume that the sudo command is configured on the hosts where you are installing Cloudera Search. If
sudo is not configured, use the root user (superuser) to configure Cloudera Search.

Important
Running services: When starting, stopping, and restarting CDH components, always use the
service(8) command rather than running /etc/init.d scripts directly. This is
important because service sets the current working directory to the root directory (/)
and removes environment variables except LANG and TERM. This creates a predictable
environment in which to administer the service. If you use /etc/init.d scripts directly,
any environment variables continue to be applied, potentially producing unexpected
results. If you install CDH from packages, service is installed as part of the Linux Standard
Base (LSB).

Installing Solr Packages


For Cloudera Internal Dev Use Only
Cloudera Manager can be used to install and manage the CDH cluster (HDFS, MapReduce, Flume, and
so on); however, be sure to specify the CDH 4.2 repo. Other repos (4.1, 4.3, and so on) will not work
properly. During Cloudera Manager installation, use a custom repo. If you are using CloudCat, be sure
the CDH cluster is using CDH version 4.2.
To get access to the nightly build, run the following (adjusting for your version of RHEL/CentOS):

curl http://repos.jenkins.sf.cloudera.com/solr-beta-nightly/redhat/5/x86_64/search/cloudera-search.repo \
  | sudo tee /etc/yum.repos.d/cloudera-search.repo
sudo yum clean all

Before you start


If the Cloudera Search server is already running, stop it before continuing:
sudo service solr-server stop

To install Cloudera Search On Red Hat-compatible systems:

$ sudo yum install solr-server


To install Cloudera Search on Ubuntu and Debian systems:

$ sudo apt-get install solr-server

To install Cloudera Search on SLES systems:

$ sudo zypper install solr-server

See also Deploying Cloudera Search in SolrCloud Mode.

To list the installed files on Red Hat and SLES systems:

$ rpm -ql solr-server solr

To list the installed files on Ubuntu and Debian systems:

$ dpkg -L solr-server solr

You can see that the Cloudera Search packages have been configured to conform to the Linux Filesystem
Hierarchy Standard. (To learn more, run man hier.)
You are now ready to enable the server daemons you want to use with Hadoop. You can also enable
Java-based client access by adding the JAR files in /usr/lib/solr/ and /usr/lib/solr/lib/ to
your Java classpath.
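For example, a client classpath could be assembled from those two directories as follows; the client class name is hypothetical, and on a machine without the packages installed the globs simply remain unexpanded:

```shell
# Join all Solr JARs (and bundled dependencies) into one classpath string.
CP=$(printf '%s:' /usr/lib/solr/*.jar /usr/lib/solr/lib/*.jar)
CP=${CP%:}   # strip the trailing colon
echo "java -cp $CP com.example.SolrClient   # hypothetical client class"
```

The printf-with-glob idiom avoids word-splitting problems that a plain echo/tr pipeline can hit with unusual file names.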

Deploying Cloudera Search in SolrCloud Mode


SolrCloud allows you to partition your data set into multiple indexes and processes while simplifying
management via ZooKeeper. In essence, you run a cluster of coordinating Solr servers rather than a
single Solr server.

Before you start


This section assumes that you have already completed the process described in Installing Solr Packages.
You are now about to distribute the processes across multiple hosts; see Choosing where to Deploy the
Cloudera Search Processes.

Installing and Starting ZooKeeper Server


SolrCloud mode uses a ZooKeeper service as a highly available, central location for cluster management.
For a small cluster, running a ZooKeeper node colocated with the NameNode is recommended. For
larger clusters, contact Cloudera Support for configuration help.


Install and start the ZooKeeper service by running the commands shown in the "Installing the ZooKeeper
Server Package and Starting ZooKeeper on a Single Server" section of Installing the ZooKeeper Packages.

Initializing Solr for SolrCloud Mode


Once your ZooKeeper service is running, you need to configure each Solr node with the ZooKeeper
quorum address:
Configure the ZooKeeper quorum address in /etc/default/solr. Edit the following property to
configure the nodes with the address of the ZooKeeper service. Do this on every Solr Server host:

SOLR_ZK_ENSEMBLE=<zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr
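A small sanity check of the quorum string before writing it out can catch a missing chroot suffix; the hostnames in this sketch are illustrative, and the /solr chroot is the convention used throughout this guide:

```shell
# Hypothetical quorum; this guide's examples expect a trailing /solr chroot.
SOLR_ZK_ENSEMBLE="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr"
case "$SOLR_ZK_ENSEMBLE" in
  */solr) echo "ok: chroot present" ;;
  *)      echo "warning: missing /solr chroot" ;;
esac
```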

Configuring Solr for use with HDFS


To set up Solr for use with your established HDFS service, perform the following configurations:
1. Configure the HDFS URI for Solr to use as a backing store in /etc/default/solr. Edit the
following property to configure the location of Solr index data in HDFS. Do this on every Solr
Server host:

SOLR_HDFS_HOME=hdfs://namenodehost:8020/solr

Be sure to replace namenodehost with the hostname of your HDFS NameNode (as specified by
fs.default.name or fs.defaultFS in your conf/core-site.xml file); you may also need
to change the port number from the default (8020). On an HA-enabled cluster, you will need to
ensure that the HDFS URI you use reflects the designated nameservice utilized by your cluster.
This value should be reflected in fs.default.name; instead of a hostname, you would see
hdfs://nameservice1 or something similar.

2. In some cases, such as for configuring Solr to work with HDFS High Availability (HA), you may
want to configure Solr's HDFS client. You can do this by setting the HDFS configuration directory
in /etc/default/solr. Locate the appropriate HDFS configuration directory on each node,
and edit the following property with the absolute path to this directory. Do this on every Solr
Server host:

SOLR_HDFS_CONFIG=/etc/hadoop/conf

Be sure to replace the path with the correct directory containing the proper HDFS configuration
files, core-site.xml and hdfs-site.xml.
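A quick check that the directory named in SOLR_HDFS_CONFIG actually contains those two files might look like this; the path is the example value from above, so adjust it to your node's layout:

```shell
# Verify the HDFS client configuration directory holds the expected files.
dir=/etc/hadoop/conf
for f in core-site.xml hdfs-site.xml; do
  if [ -r "$dir/$f" ]; then echo "found $f"; else echo "missing $f"; fi
done
```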


Configuring Solr for use with Secure HDFS


For information on setting up a secure CDH cluster, see the CDH4 Security Guide. In addition to the
above steps for Configuring Solr for use with HDFS, you will need to perform the following additional
steps if security is enabled:
1. Create the Kerberos Principals and Keytab Files
For every node in your cluster:
a. Create the solr principal using either kadmin or kadmin.local (see Create and Deploy the
Kerberos Principals and Keytab Files for information on which to use).

kadmin: addprinc -randkey solr/fully.qualified.domain.name@YOUR-REALM.COM

b. Create the solr keytab:

kadmin: xst -norandkey -k solr.keytab solr/fully.qualified.domain.name

2. Deploy the Kerberos Keytab Files
On every node in your cluster:
a. Copy or move the keytab files to a directory that Solr can access, such as /etc/solr/conf.

$ sudo mv solr.keytab /etc/solr/conf/

b. Make sure that the solr.keytab file is readable only by the solr user:

$ sudo chown solr:hadoop /etc/solr/conf/solr.keytab
$ sudo chmod 400 /etc/solr/conf/solr.keytab
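The resulting mode can then be verified with stat; shown here against a scratch file so the expected output is unambiguous, but on a real node you would point it at /etc/solr/conf/solr.keytab:

```shell
# Demonstrate the expected permission check on a scratch file (Linux stat).
keytab=$(mktemp)
chmod 400 "$keytab"
stat -c '%a' "$keytab"   # a correctly locked-down keytab prints: 400
rm -f "$keytab"
```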

3. Add Kerberos-related settings to /etc/default/solr on every node in your cluster, substituting
appropriate values:

SOLR_KERBEROS_ENABLED=true
SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
SOLR_KERBEROS_PRINCIPAL=solr/fully.qualified.domain.name@YOUR-REALM.COM


Creating the /solr Directory in HDFS


Before starting the Cloudera Search server, you need to create the /solr directory in HDFS. The
Cloudera Search master runs as solr:solr so it does not have the required permissions to create a
top-level directory.
To create the /solr directory in HDFS:

$ sudo -u hdfs hadoop fs -mkdir /solr
$ sudo -u hdfs hadoop fs -chown solr /solr

Initializing ZooKeeper Namespace


Before starting the Cloudera Search server, you need to create the solr namespace in ZooKeeper:

$ solrctl init

WARNING
solrctl init also accepts a --force option. solrctl init --force clears the Solr data in
ZooKeeper and interferes with any running nodes. If you want to clear the Solr data from ZooKeeper
and start over, be sure to stop the cluster first.

Starting Solr in SolrCloud Mode


To start the cluster, start Solr Server on each node:

$ sudo service solr-server restart

After you have started the Cloudera Search Server, the Solr server should be up and running. You can
verify that all daemons are running using the jps tool from the Oracle JDK, which you can obtain from
the Java SE Downloads page. If you are running a pseudo-distributed HDFS installation and a Solr search
installation on one machine, jps will show the following output:

$ sudo jps -lm


31407 sun.tools.jps.Jps -lm
31236 org.apache.catalina.startup.Bootstrap start

Administering Solr with the solrctl Tool


Cloudera Search comes with a command-line utility, solrctl, for performing administrative operations
on configuration bundles and Solr collections. To display help information about solrctl, run it with
--help as the only option on the command line.


$ solrctl --help

usage: /usr/bin/solrctl [options] command [command-arg] [command [command-arg]] ...

Options:
    --solr solr_uri
    --zk   zk_ensemble
    --help
    --quiet

Commands:
    init        [--force]

    instancedir [--generate path]
                [--create name path]
                [--update name path]
                [--get name path]
                [--delete name]
                [--list]

    collection  [--create name -s <numShards>
                    [-c <collection.configName>]
                    [-r <replicationFactor>]
                    [-m <maxShardsPerNode>]
                    [-n <createNodeSet>]]
                [--delete name]
                [--reload name]
                [--stat name]
                [--deletedocs name]
                [--list]

    core        [--create name [-p name=value]...]
                [--reload name]
                [--unload name]
                [--status name]

Runtime Solr Configuration


In order to start using Solr for indexing data, you must configure a collection to hold the index. At a
minimum, a collection configuration requires two files: solrconfig.xml and schema.xml (plus
whatever helper files may be referenced from these two). The solrconfig.xml file contains all of the
Solr settings for a given collection, and the schema.xml file specifies the schema that Solr uses
when indexing documents. For more details on how to configure the schema for your data set, see
http://wiki.apache.org/solr/SchemaXml.


WARNING
If Cloudera Manager is managing the cluster, the --zk option must be specified appropriately.

solrctl --zk <zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr ...

Configuration files for a collection are managed as part of the instance directory. To generate a skeleton
of the instance directory run:

$ solrctl instancedir --generate $HOME/solr_configs

You can customize it by directly editing the solrconfig.xml and schema.xml files that have been
created in $HOME/solr_configs/conf.
These configuration files are compatible with the standard Solr tutorial example documents.
Once you are satisfied with the configuration, you can make it available for Solr to use via the following
command that will upload the content of the entire instance directory to ZooKeeper:

$ solrctl instancedir --create collection1 $HOME/solr_configs

You may also use the solrctl tool to verify that your instance directory uploaded successfully and is
available via ZooKeeper:

$ solrctl instancedir --list

which should return a list of instance directory names. For example, "collection1" in this case.

Important
If you are familiar with Apache Solr, you may be tempted to configure a collection directly in solr
home: /var/lib/solr. While this is possible, it is discouraged and the use of solrctl is
recommended instead.

Creating your first Solr Collection


By default, the Solr server comes up with no collections. Make sure that you create your first collection
using the instancedir that you provided to Solr in the previous steps, using the same collection name.
(numOfShards is the number of SolrCloud shards across which you want to partition the collection. The
number of shards cannot exceed the total number of Solr servers in your SolrCloud cluster.)

$ solrctl collection --create collection1 -s <numOfShards>


You should be able to navigate to
http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true and verify that the
collection is active. You should also be able to observe the topology of your SolrCloud by navigating to
http://localhost:8983/solr/#/~cloud.

Adding Another Collection with Replication


To support scaling for query load, create a second collection with replication. Having multiple servers
with replicated collections distributes the request load for each shard. Create a collection with one
shard and a replication factor of two. Your cluster must have at least two running servers to support
this configuration, so ensure Cloudera Search is installed on at least two servers before continuing. A
replication factor of two causes two copies of the index files to be stored in two different locations.
1. Generate the config files for the collection:

$ solrctl instancedir --generate $HOME/solr_configs2

2. Upload the instance directory to ZooKeeper:

$ solrctl instancedir --create collection2 $HOME/solr_configs2

3. Create the second collection:

$ solrctl collection --create collection2 -s 1 -r 2

4. Verify that the collection is live and that your one shard is being served by two nodes:
http://localhost:8983/solr/#/~cloud

Installing Flume Solr Sink for use with Cloudera Search


The Flume Solr Sink provides a flexible, scalable, fault-tolerant, transactional, near-real-time (NRT)
system for processing a continuous stream of records into live search indexes. Latency from the time
of data arrival to the time the data appears in search query results is on the order of seconds, and is
tunable.
To install the Flume Solr Sink on Red Hat-compatible systems:

$ sudo yum install flume-ng-solr

To install the Flume Solr Sink on Ubuntu and Debian systems:

$ sudo apt-get install flume-ng-solr


To install the Flume Solr Sink on SLES systems:

$ sudo zypper install flume-ng-solr

For information on using the Flume Solr Sink, see the Flume Near Real-Time Indexing Reference in the
Cloudera Search User Guide.
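As a preview of what the sink looks like in use, a minimal Flume agent configuration might resemble
the fragment below. The agent, channel, and sink names and the morphline file path are illustrative
assumptions; see the Flume Near Real-Time Indexing Reference for the authoritative settings:

```
# Illustrative Flume agent fragment wiring a channel into the Solr sink
agent.sinks = solrSink
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = memoryChannel
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
```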

Installing MapReduce Tools for use with Cloudera Search


Cloudera Search provides the ability to batch index documents using MapReduce jobs. Install the
solr-mapreduce package on nodes where you want to submit a batch indexing job.

To install solr-mapreduce on Red Hat-compatible systems:

$ sudo yum install solr-mapreduce

To install solr-mapreduce on Ubuntu and Debian systems:

$ sudo apt-get install solr-mapreduce

To install solr-mapreduce on SLES systems:

$ sudo zypper install solr-mapreduce

For information on using MapReduce to batch index documents, see the MapReduce Batch Indexing
Reference in the Cloudera Search User Guide.
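Once the package is installed, a batch indexing job is typically launched with hadoop jar. The
invocation below is only a sketch: the jar path reflects where the tool shipped in CDH packages at the
time, but treat both the path and options as assumptions and consult the MapReduce Batch Indexing
Reference for the exact syntax:

```
# Illustrative invocation of the MapReduce indexer tool (jar path is an assumption)
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool --help
```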

Upgrading Cloudera Search

Contents
Upgrading Cloudera Search from SolrCloud mode
Upgrading Cloudera Search from Non-SolrCloud mode
Upgrading Cloudera Search involves stopping Cloudera Search services, using your operating system's
package management tool to upgrade Cloudera Search to the latest version, and then restarting
Cloudera Search services.

Requirements
Before attempting any upgrade, it is extremely important to make backup copies of the following
configuration files:


- /etc/default/solr
- /var/lib/solr/solr.xml
- All collection configurations

Make these copies on every node that is part of the SolrCloud cluster.
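The backup can be scripted. The sketch below copies the files named above into a timestamped
directory; the backup directory layout is an assumption, and collection configurations (which live in
ZooKeeper) must still be fetched separately, for example with solrctl instancedir --get:

```shell
# Copy the local Solr configuration files into a timestamped backup dir.
backup_dir="$HOME/solr-backup-$(date +%Y%m%d)"
mkdir -p "$backup_dir"
for f in /etc/default/solr /var/lib/solr/solr.xml; do
  # -p preserves ownership/timestamps; skip files absent on this node
  [ -e "$f" ] && cp -p "$f" "$backup_dir/" || true
done
echo "backed up to $backup_dir"
```

Run this on every node before upgrading any packages.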

Upgrading Cloudera Search from SolrCloud mode


If you already have a SolrCloud configuration deployed, do the following:
1. Stop the Solr server:

$ sudo service solr-server stop

2. Upgrade the packages. To upgrade the packages, follow the instructions in the "Installing
Cloudera Search" section of the Installing and Using Cloudera Search guide. Do NOT run yum
update.

3. Start the Solr server:

$ sudo service solr-server start

Upgrading Cloudera Search from Non-SolrCloud mode


Non-SolrCloud mode is now fully deprecated and should be avoided. If you previously configured
Cloudera Search in Non-SolrCloud mode, you must migrate your configuration using the following
steps before you install any new packages.
While your previous Non-SolrCloud deployment is running, do the following:
1. List all existing instancedirs:

$ solrctl instancedir --list

2. Make a copy of each core configuration:

$ solrctl instancedir --get <NAME> $HOME/<NAME>

3. Stop the Solr server:

$ sudo service solr-server stop


4. Upgrade the packages. To upgrade the packages, follow the instructions in the "Installing
Cloudera Search" section of the Installing and Using Cloudera Search guide. Do NOT run yum
update.

5. Enable SolrCloud mode by specifying SOLR_ZK_ENSEMBLE in /etc/default/solr.


6. Initialize the SolrCloud state.

$ solrctl init

7. Upload ALL of the existing configurations to SolrCloud.

Important
It is extremely important NOT to start the upgraded Solr server service before completing
this step.

8. For every configuration that you saved, do the following:

$ solrctl instancedir --create <NAME> $HOME/<NAME>

9. Start the Solr server:

$ sudo service solr-server start
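Steps 7 and 8 can be scripted over the directory of configurations saved in step 2. The helper below is a
dry-run sketch that only prints the solrctl command for each saved configuration (the layout of one
saved configuration per subdirectory is an assumption); remove the echo wrapper to run the commands
for real:

```shell
# Print a "solrctl instancedir --create" command per saved config dir.
gen_upload_cmds() {
  local base=$1
  for d in "$base"/*/; do
    [ -d "$d" ] || continue
    echo "solrctl instancedir --create $(basename "$d") $d"
  done
}

# Demo against a throwaway directory tree standing in for the saved configs
mkdir -p /tmp/saved_configs/collection1 /tmp/saved_configs/collection2
gen_upload_cmds /tmp/saved_configs
```

Reviewing the printed commands before executing them is a cheap safeguard during an upgrade.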

Installing and Using Hue with Cloudera Search


Hue includes a Search application that provides a customizable search UI.

Importing Collections
The following screenshot is an example of the collection import feature within Hue.


Generally, import only collections. Importing cores is rarely useful, because a core exposes only a
single shard of the index for querying. See A little about SolrCores and Collections for more
information.

User UI
The following screenshot is an example of the appearance of the Search application that is integrated
with the Hue user interface.

Customization UI
The following screenshot is an example of the appearance of the Search application customization
interface provided in Hue.

Currently, only superusers can access this view.


Deploying Hue Search


You must install and configure Hue before you can use Search with Hue.
1. Follow the instructions for Installing Hue.
2. Use one of the following commands to install Search applications on the Hue machine:

For package installation on RHEL systems:

sudo yum install hue-search

For package installation on SLES systems:

sudo zypper install hue-search

For package installation on Ubuntu or Debian systems:

sudo apt-get install hue-search

For installation using tarballs:

$ cd /usr/share/hue
$ sudo tar -xzvf hue-search-####.tar.gz
$ sudo /usr/share/hue/tools/app_reg/app_reg.py --install /usr/share/hue/apps/search

3. Update the URL for the Solr Server.


In a Cloudera Manager-managed environment:
a. Connect to Cloudera Manager.
b. Select the Hue service.
c. Click Configuration > View and Edit.
d. Search for the word "safety".
e. Add information about your Solr host to Hue Server (Base) / Advanced. For example, if
your hostname was SOLR_HOST, you might add the following:

[search]
## URL of the Solr Server
solr_url=http://SOLR_HOST:8983/solr

In an environment without Cloudera Manager:


Specify the Solr URL in /etc/hue/hue.ini. For example, to use localhost as your Solr host,
you would add the following:

[search]
# URL of the Solr Server; replace 'localhost' if Solr is running on another host
solr_url=http://localhost:8983/solr/

4. (Optional) To view files on HDFS, ensure the correct webhdfs_url is included in hue.ini and
WebHdfs is properly configured as described in Configuring CDH Components for Hue.
5. Restart Hue:

$ sudo /etc/init.d/hue restart

6. Open http://hue-host.com:8888/search/ in your browser.
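The solr_url edit in step 3 can also be made non-interactively. The sed sketch below runs against a
throwaway copy of the file so it can be tried safely; the real file lives at /etc/hue/hue.ini, and the host
name used here is a placeholder:

```shell
# Work on a throwaway copy; for a real edit, point $ini at /etc/hue/hue.ini
ini=/tmp/hue.ini
printf '[search]\nsolr_url=http://localhost:8983/solr/\n' > "$ini"

# Rewrite the solr_url line to point at another Solr host (placeholder name),
# keeping a .bak copy of the original file
sed -i.bak 's|^solr_url=.*|solr_url=http://solr-host.example.com:8983/solr/|' "$ini"
grep '^solr_url=' "$ini"
```

Remember to restart Hue (step 5) after changing the file.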

Updating Hue Search


The process of updating Hue Search involves installing updates and restarting the Hue service.
1. On the Hue machine, update Hue Search:

$ cd /usr/share/hue
$ sudo tar -xzvf hue-search-####.tar.gz
$ sudo /usr/share/hue/tools/app_reg/app_reg.py --install /usr/share/hue/apps/search

2. Restart Hue:

$ sudo /etc/init.d/hue restart

Hue Search Twitter Demo


The demo uses processes similar to those described in the Running Queries section of the Cloudera
Search Tutorial in the Cloudera Search User Guide. The demo illustrates the following features:

- Only regular Solr APIs are used.
- Show facets such as fields, ranges, or dates; sort by time in seconds.
- Result snippet editor and preview, a download function, extra CSS/JS, labels, and field-picking
assist.
- Show multiple collections.
- Show highlighting of search terms.


- Show facet ordering.
- Autocomplete handler using /suggest.

