
Solr

N V Siva Krishna Gontla (MT2011086)
International Institute of Information Technology Bangalore
sivakrishna.gontla@iiitb.org
March 23, 2012

Contents
1 Introduction
2 History
3 Features
4 Installation and Configuration
   4.1 Prerequisites
   4.2 Sources
   4.3 Running Solr
5 Role of schema.xml
   5.1 Datatypes
   5.2 Fields
   5.3 Dynamic Fields
   5.4 Unique Key Field
   5.5 Default Search Field
6 Adding new content and making it searchable
   6.1 Posting a document and indexing it
      6.1.1 Posting and indexing an XML file
      6.1.2 Posting and indexing a PDF file
      6.1.3 Indexing a MySQL database
   6.2 Searching
7 Filtering the search results
   7.1 Filter query parameters
   7.2 Highlighting
   7.3 Enabling Facets on search results
      7.3.1 How to enable facet search?
   7.4 Clustering the search results
      7.4.1 How to cluster the search results
8 Distributed Searching
9 How to include Solr functionality in any Java-based project?
10 Public Websites using Solr
11 References

1 Introduction

Connecting users with the content they need, when they need it, isn't just optional anymore. With the rise of Google and similarly sophisticated search engines, users expect high-quality search results that help them find what they're looking for quickly and easily. Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable[1]. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding[2].

2 History
In 2004, Solr was created by Yonik Seeley at CNET Networks as an in-house project to add search capability to the company website. Yonik Seeley, along with Grant Ingersoll and Erik Hatcher, went on to launch Lucid Imagination, a company providing commercial support, consulting and training for Apache Solr search technologies[2]. In January 2006, CNET Networks decided to openly publish the source code by donating it to the Apache Software Foundation under the Lucene top-level project. Like any new project at the Apache Software Foundation, it entered an incubation period which helped solve organizational, legal, and financial issues[2]. In January 2007, Solr graduated from incubation status and grew steadily, accumulating features and thereby attracting a robust community of users, contributors, and committers. Although quite new as a public project, it was already used for several high-traffic websites[2]. In September 2008, Solr 1.3 was released with many enhancements, including distributed search capabilities and performance improvements among many others[2].

November 2009 saw the release of Solr 1.4. This version introduced enhancements in indexing, searching and faceting, along with many other improvements such as rich document processing (PDF, Word, HTML), search-results clustering based on Carrot2 and improved database integration. The release also featured many additional plugins[2]. In March 2010, the Lucene and Solr projects merged. Separate downloads will continue, but the products are now jointly developed by a single set of committers[2].

3 Features
- A real data schema, with numeric types, dynamic fields and unique keys
- Powerful extensions to the Lucene query language
- Faceted search and filtering
- Geospatial search
- Advanced, configurable text analysis
- Highly configurable and user-extensible caching
- Performance optimizations
- External configuration via XML
- An administration interface
- Monitorable logging
- Fast incremental updates and index replication
- Highly scalable distributed search with a sharded index across multiple hosts
- JSON, XML, CSV/delimited-text, and binary update formats
- Easy ways to pull in data from databases and XML files, from local disk and HTTP sources
- Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika
- Apache UIMA integration for configurable metadata extraction
- Multiple search indices[3]

4 Installation and Configuration

4.1 Prerequisites
Java 1.5 or higher and any web browser.

4.2 Sources

Download Apache Solr from the official Apache Solr website.

4.3 Running Solr

Change the present working directory to /apache-solr/example/ and run the command java -jar start.jar to start Apache Solr.

To check whether Solr is running successfully, visit the URL http://localhost:8983/solr/admin/; if the page opens successfully, Solr is running without any problems.

5 Role of schema.xml

The schema.xml file contains all of the details about which fields our documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields[4].

5.1 Datatypes

The <types> section allows you to define a list of <fieldType> declarations you wish to use in your schema, along with the underlying Solr class that should be used for each type, as well as the default options you want for fields that use that type. Possible datatypes include int, float, long, double, binary, boolean, string and text_general.
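For illustration, a minimal <types> excerpt might look like the following sketch (the class names match the example schema that ships with Solr; the exact set of types in a given schema may differ):

<types>
  <!-- string values are indexed and stored verbatim, without analysis -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
  <!-- integer values backed by Lucene's trie encoding -->
  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
</types>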

5.2 Fields

The <fields> section is where you list the individual <field> declarations you wish to use in your documents. Each <field> has a name that we use to reference it when adding documents or executing searches, and an associated type which identifies the name of the fieldType we wish to use for this field. There are various field options that apply to a field; these can be set in the field type declarations.

default - The default value for this field if none is provided while adding documents.
indexed=true|false - True if this field should be indexed. If (and only if) a field is indexed, then it is searchable, sortable, and facetable.
stored=true|false - True if the value of the field should be retrievable during a search.
compressed=true|false - True if this field should be stored using gzip compression. (This will only apply if the field type is compressable; among the standard field types, only TextField and StrField are.)
multiValued=true|false - True if this field may contain multiple values per document, i.e. if it can appear multiple times in a document.
omitNorms=true|false - Set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
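As a hypothetical excerpt (assuming the string and text_general types from Section 5.1 are declared), the <fields> section could contain:

<fields>
  <!-- every document must carry a unique identifier -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <!-- full-text field used for keyword search -->
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <!-- a document may belong to several categories -->
  <field name="category" type="string" indexed="true" stored="true" multiValued="true"/>
</fields>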

5.3 Dynamic Fields

One of the powerful features of Lucene is that we don't have to pre-define every field when we first create our index. Even though Solr provides strong datatyping for fields, it still preserves that flexibility using Dynamic Fields. Using <dynamicField> declarations, we can create field rules that Solr will use to understand what datatype should be used whenever it is given a field name that is not explicitly defined, but matches a prefix or suffix used in a dynamicField. For example, the following dynamic field declaration tells Solr that whenever it sees a field name ending in _i which is not an explicitly defined field, it should dynamically create an integer field with that name: <dynamicField name="*_i" type="int" indexed="true" stored="true"/>

5.4 Unique Key Field

The <uniqueKey> declaration can be used to inform Solr that there is a field in our index which should be unique for all documents. If a document is added that contains the same value for this field as an existing document, the old document will be deleted. It is not mandatory for a schema to have a uniqueKey field[4].
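For example, with an id field declared as sketched in Section 5.2, the declaration is simply:

<uniqueKey>id</uniqueKey>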

5.5 Default Search Field

The <defaultSearchField> declaration is used by Solr when parsing queries to identify which field name should be searched in queries where an explicit field name has not been used. It is preferable not to use or rely on this setting; instead the request handler or query LocalParams for a search should specify the default field(s) to search on. This setting can be omitted and it is being considered for deprecation[4].
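If the setting is used at all, it names a single field, for example a hypothetical catch-all field named text:

<defaultSearchField>text</defaultSearchField>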

6 Adding new content and making it searchable

This phase consists of two steps: posting a document to Solr and letting Solr index it, and then searching from the Solr admin interface. The subsections below describe how to post different types of documents to Solr.

6.1 Posting a document and indexing it

We can post a document to Solr, and indexing will be done automatically by Solr.

6.1.1 Posting and indexing an XML file

Change the working directory to apache-solr/example/exampledocs and run the command java -jar post.jar filename.xml. Successful execution of this command means that the document has been submitted to Solr and indexed, so that its contents are searchable. In place of the filename we can give the wildcard character *, which tells Solr to index all XML files present in that directory.
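The XML files under exampledocs follow Solr's update message format. A minimal document of this kind (with hypothetical field values that must correspond to fields declared in schema.xml) looks like:

<add>
  <doc>
    <field name="id">DOC1</field>
    <field name="title">A sample document posted with post.jar</field>
    <field name="category">example</field>
  </doc>
</add>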

6.1.2 Posting and indexing a PDF file

Follow the same procedure specified above to index a PDF file, but add extra parameters to the command. The full command is: java -Durl=http://localhost:8983/solr/update/extract?literal.id=a -Dtype=application/pdf -jar post.jar "Data Preprocessing.pdf"

6.1.3 Indexing a MySQL database

The DataImportHandler is used to index a MySQL database. The steps involved are:

1. Add the following <requestHandler> section to solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

2. Copy the database-specific JDBC driver into the apache-solr/example directory.

3. Create a new file named data-config.xml in the /example/solr/conf directory. This file is database specific: it names the JDBC driver, describes which tables to import, and maps table columns to schema fields (if those fields are not present in schema.xml, add them). For example:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/rss" user="root" password="root"/>
  <document name="rss_content">
    <entity name="node" query="select * from rss_content">
      <field column="id" name="id"/>
      <field column="description" name="description"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>

4. Restart the Solr server from the command line.

5. Use the full-import (or) delta-import command to import the database and let Solr index it. A full import imports the entire database and indexes it, whereas a delta import imports only the changes and updates the index accordingly.

Full import command: http://localhost:8983/solr/db/dataimport?command=full-import

Delta import command: http://localhost:8983/solr/db/dataimport?command=delta-import

6.2 Searching

Once a document is posted and indexed, we are able to search for it through the Solr admin interface. Open the URL http://localhost:8983/solr/admin/ and type a keyword in the search box. By default, if we do not specify in which field Solr has to search, it searches in the <defaultSearchField> field specified in schema.xml. To specify the field, type the query in the format fieldname:keyword; Solr will then search for the given keyword in the specified field and display the results.
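Equivalently, searches can be issued directly against the select handler. For example, assuming a title field exists in the schema:

http://localhost:8983/solr/select?q=solr&indent=true
http://localhost:8983/solr/select?q=title:solr&indent=true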

7 Filtering the search results

7.1 Filter query parameters

Along with the main search keyword, extra parameters can be attached to the search query; commonly used ones include fq (filter query), sort, start and rows (for paging), and fl (the list of fields to return)[1].
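For instance, a query that restricts results to a hypothetical category field, sorts by a hypothetical price field and pages through the results might look like:

http://localhost:8983/solr/select?q=ipod&fq=category:electronics&sort=price+asc&start=0&rows=10&fl=id,title,price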

7.2 Highlighting

Solr provides the feature of highlighting the results it fetches based on the search query. In the search URL, specify hl=true to enable highlighting and use hl.fl=<field list> to specify which fields have to be highlighted; otherwise the field specified in <defaultSearchField> in schema.xml is highlighted by default[5].
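A highlighted query over hypothetical title and description fields could therefore be:

http://localhost:8983/solr/select?q=solr&hl=true&hl.fl=title,description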

7.3 Enabling Facets on search results

Faceted search is the dynamic clustering of items or search results into categories that let users drill into search results by any value in any field. Each facet displayed also shows the number of hits within the search that match that category. Users can then drill down by applying specific constraints to the search results. Faceted search is also called faceted browsing, faceted navigation, guided navigation and sometimes parametric search[5].

7.3.1 How to enable facet search?

In solrconfig.xml we specify on which field facets should be enabled, for example: <str name="facet.field">category</str>. Facets are clearly visible through Solr's browse interface at http://localhost:8983/solr/browse
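Facets can also be requested directly on a query (category being a hypothetical field name here):

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category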

7.4 Clustering the search results

Clustering can be considered one of the most important unsupervised learning problems. It deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

7.4.1 How to cluster the search results

Solr uses Carrot2 to cluster the search results. Modify the following entry in solrconfig.xml to specify on which field clustering should be done: <str name="carrot.snippet">description</str>. In this example clustering is done on the description field. Carrot2 internally uses the STC or Lingo algorithm to cluster the search results. For more information visit http://project.carrot2.org/algorithms.html

To enable clustering on search results, change the working directory to /apache-solr/example and start Solr with the clustering feature enabled. This can be done with the following command: java -Dsolr.clustering.enabled=true -jar start.jar
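The example solrconfig.xml shipped with Solr defines a /clustering request handler that is guarded by the solr.clustering.enabled flag; assuming that handler is present, clustered results can then be requested with a URL such as:

http://localhost:8983/solr/clustering?q=solr&rows=10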

8 Distributed Searching

When an index becomes too large to fit on a single system, or when a single query takes too long to execute, an index can be split into multiple shards, and Solr can query and merge results across those shards. Solr has the ability to support distributed searching: specify multiple systems in the shards list, and whenever a search query comes in, all the systems in the list will be searched for the given keyword and the results will be merged[5]. For simple testing, just set up two copies of apache-solr on different ports and start the servers with the standard startup commands given above. To use distributed search, modify your search URL as follows: http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr where the two instances of Solr are running on ports 8983 and 7574 (if they are on the same system specify localhost, otherwise specify their respective IP addresses).

9 How to include Solr functionality in any Java-based project?

Solr ships with the SolrJ API, which can be used easily in any existing Java application. For more information visit http://wiki.apache.org/solr/Solrj
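As a rough sketch of how SolrJ is used (the CommonsHttpSolrServer client class is the one available in the SolrJ releases current at the time of writing and its name varies between versions; the field names below are hypothetical and must exist in schema.xml):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrJExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the Solr instance started in Section 4.3
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Build and post a document; the field names must be defined in schema.xml
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Indexed through SolrJ");
        server.add(doc);
        server.commit();

        // Query for the document that was just indexed
        SolrQuery query = new SolrQuery("title:SolrJ");
        QueryResponse response = server.query(query);
        System.out.println("Found " + response.getResults().getNumFound() + " document(s)");
    }
}

The program assumes Solr is already running on port 8983 as described in Section 4.3.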

10 Public Websites using Solr
- Handmade Kultur is a social platform for the DIY scene. It uses Solr for searching projects, courses and events.
- http://www.jobseeker.com.au is a job search engine powered by Solr 3.5.
- http://www.whitehouse.gov/ uses Solr via Drupal for site search, highlighting and faceting.
- JobHits Jobs search is a Solr-powered job search engine using faceted navigation and collapsing. JobHits has three websites, in the UK, US and Canada.
- FCC.gov is the new FCC website featuring Solr-powered search and faceted navigation.
- Jeeran uses Solr 4.0 to help users search for any place (restaurant, cafe) in the Middle East.
- Comcast / Xfinity uses Solr to power site search and faceted navigation.
- AT&T Interactive uses Solr to run local search at yp.com, the new yellowpages.com.

Many more are listed at http://wiki.apache.org/solr/PublicServers

11 References

1. http://www.ibm.com/developerworks/java/library/j-solr1/
2. http://en.wikipedia.org/wiki/Apache_Solr
3. http://lucene.apache.org/solr/features.html
4. http://wiki.apache.org/solr/SchemaXml
5. http://wiki.apache.org/solr/

