
Integrating Apache Solr with Alfresco WCM for Faceted Search and Navigation of Next-Generation Web Sites

Vagif Jalilov, Rivet Logic

About Rivet Logic


Award-winning professional services focused on:
- Enterprise Content Management
- Web Content Management
- Collaboration and Social Communities

Using Leading Open Source Software

Business Case for Alfresco & Solr


- Large-scale sites
- Need for real-time updates
- Full-text search
- Faceted search

Technical Challenges for Search


Accurately index each page
Solution: Assembly of relevant content to index

Targeted, real-time indexing


Solution: Trigger indexing from publishing mechanism

Possible Index Solutions


Spidering/Crawling
- Follow navigational & cross-links
- Parse HTML and fetch relevant content
- Spider full (or partial) site each time

Real-time Indexing
- Triggered by FSR (File System Receiver) deployment
- Process only the change-set (incremental updates)
- Assemble relevant page content

Typical Web Application


- Source Control: source code & libs, view templates, site navigation, web content
- CMS (Alfresco): binary content

Managed (Riveted) Web Application


- Source Control: source code & libs, (view templates)
- CMS (Alfresco): binary content, web content, site navigation, (view templates)

Page Composition
[Diagram: a page assembled from Metacontent.xml, Pagemetadata.xml, Relatedlinks.xml, Sectionhtml.xml, and Supportingitems.xml; two regions of the page are labeled "dynamic".]

Content Delivery

(http://crafterrivet.org)

Alfresco WCM Lifecycle

Indexing Architecture

Solr Customizations
Custom Solr
- schema.xml
  - Fields (type, indexed/stored)
  - Unique key
- solrconfig.xml
  - dismax-type request handler to define the queried fields
  - ExtractingRequestHandler (for indexing rich-text documents)

Custom Solr Schema


<fields>
  <field name="page_url" type="string" indexed="true" stored="true" required="true"/>
  <field name="page_title" type="text" indexed="true" stored="true"/>
  <field name="page_category" type="string" indexed="true" stored="true"/>
  <field name="page_type" type="string" indexed="true" stored="true"/>
  <field name="page_last_modified" type="date" indexed="true" stored="true"/>
  <field name="page_text" type="text" indexed="true" stored="true"/>
  <field name="page_file_size" type="int" indexed="false" stored="true"/>
</fields>
<uniqueKey>page_url</uniqueKey>

ExtractingRequestHandler
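solrconfig.xml:

<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <str name="fmap.content">page_text</str>
    <str name="fmap.title">page_title</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

schema.xml:

<dynamicField name="ignored_*" type="ignored"/>

SolrJ client:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

// Post the rich-text file to Solr Cell, then commit.
// The required unique key (page_url) must also be supplied,
// e.g. as a literal.page_url request parameter.
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File(filePath));
SolrServer solrServer = new CommonsHttpSolrServer(solrServerUrl);
solrServer.request(up);
solrServer.commit();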

Custom RequestHandler
<!-- DisMaxRequestHandler allows easy searching across multiple fields
     for simple user-entered phrases. Its implementation is now just the
     standard SearchHandler with a default query type of "dismax".
     see http://wiki.apache.org/solr/DisMaxRequestHandler -->
<requestHandler name="solrDemoDismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">
      page_title^5.0 page_text^1.0
    </str>
  </lst>
</requestHandler>

Compilation
- Compiler Engine processes all instructions
- Dispatches to the appropriate Page Type Compiler

Content Deployment & Solr Update

Compiler Instructions
<updates deploy-root="/path/to/content/root">
  ...
  <update>/solutions/security/article.xml</update>
  <delete>/products/widget/top-section.xml</delete>
  ...
</updates>
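A compiler engine has to read this change-set before it can dispatch anything. The following is a minimal sketch of that first step, using only the JDK's DOM parser. The `ChangeSet` class and its field names are hypothetical and do not come from the deck; only the element and attribute names (`updates`, `deploy-root`, `update`, `delete`) are taken from the instruction format above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical holder for one parsed change-set (illustration only).
class ChangeSet {
    final String deployRoot;
    final List<String> updates = new ArrayList<>();
    final List<String> deletes = new ArrayList<>();

    ChangeSet(String deployRoot) {
        this.deployRoot = deployRoot;
    }

    // Parses an <updates> document into separate update and delete path lists.
    static ChangeSet parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            ChangeSet cs = new ChangeSet(root.getAttribute("deploy-root"));
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and other non-element nodes
                }
                String path = n.getTextContent().trim();
                if ("update".equals(n.getNodeName())) {
                    cs.updates.add(path);
                } else if ("delete".equals(n.getNodeName())) {
                    cs.deletes.add(path);
                }
            }
            return cs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid compiler instructions", e);
        }
    }

    public static void main(String[] args) {
        ChangeSet cs = parse("<updates deploy-root=\"/path/to/content/root\">"
                + "<update>/solutions/security/article.xml</update>"
                + "<delete>/products/widget/top-section.xml</delete>"
                + "</updates>");
        System.out.println(cs.deployRoot + " " + cs.updates + " " + cs.deletes);
    }
}
```

Keeping updates and deletes in separate lists lets a later stage recompile the affected pages and remove stale index entries independently.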

Compilation Types
1. Web Pages (HTML)
2. Rich Text (PDF)

Web Page Compilation & Indexing

Indexer Instructions

HTML Indexer Instruction


<?xml version="1.0" encoding="ISO-8859-1"?>
<add>
  <doc>
    <field name="page_url">/solutions/content-mgmt/overview.html</field>
    <field name="page_title">Increase productivity and streamline workflow throughout the enterprise</field>
    <field name="page_description">Commercial enterprises and government agencies face significant challenges as they strive to meet a rapidly growing need to manage thousands ...</field>
    <field name="page_category">Solutions</field>
    <field name="page_type">Web Page</field>
    <field name="page_last_modified">2009-12-18T15:03:57Z</field>
    <field name="page_text">Rivet Logic addresses many of today's workplace challenges with Enterprise Content Management (ECM) solutions that enable organizations to transform traditional content repositories and static intranets into dynamic, collaborative work environments through open source functionality. Through ...</field>
  </doc>
</add>
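Page text often contains characters such as `&` and `<`, so an instruction like the one above is safer to generate with an XML writer than by string concatenation. The sketch below uses the JDK's StAX `XMLStreamWriter`. The `AddDocBuilder` class is hypothetical, and the deck does not say how the original compiler serialized its output.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper that serializes one Solr <add><doc> instruction.
class AddDocBuilder {
    // Writes each map entry as <field name="key">value</field>, escaping the values.
    static String build(Map<String, String> fields) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("add");
            w.writeStartElement("doc");
            for (Map.Entry<String, String> f : fields.entrySet()) {
                w.writeStartElement("field");
                w.writeAttribute("name", f.getKey());
                w.writeCharacters(f.getValue()); // escapes & and < in the text
                w.writeEndElement();
            }
            w.writeEndElement(); // </doc>
            w.writeEndElement(); // </add>
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("page_url", "/solutions/content-mgmt/overview.html");
        fields.put("page_category", "Solutions");
        fields.put("page_type", "Web Page");
        System.out.println(build(fields));
    }
}
```

A `LinkedHashMap` keeps the fields in insertion order, which makes the generated instructions easier to compare by eye.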

Rich Text Compilation & Indexing

Rich Text Indexer Instruction


<?xml version="1.0" encoding="ISO-8859-1"?>
<add>
  <doc>
    <field name="page_file">/docroot/static/about-us/pressreleases/2010/rl_crafter_studio.pdf</field>
    <field name="page_url">/about-us/pressreleases/2010/rl_crafter_studio.pdf</field>
    <field name="page_title">Rivet Logic launches Crafter Studio for user friendly Web content authoring and publishing.</field>
    <field name="page_category">News</field>
    <field name="page_type">Press Release</field>
    <field name="page_last_modified">2007-12-19T08:00:00Z</field>
    <field name="page_file_size">135</field>
  </doc>
</add>
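Solr Cell accepts extra field values alongside the uploaded file as `literal.<fieldname>` request parameters. One plausible way to apply a rich-text instruction is therefore to turn every field except the file path into such a parameter. This is an assumption about the design rather than something the deck states, and the `ExtractParams` class is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mapping from indexer-instruction fields to Solr Cell parameters.
class ExtractParams {
    // Prefixes each field with "literal." so Solr Cell stores it on the extracted document.
    static Map<String, String> toLiteralParams(Map<String, String> fields) {
        Map<String, String> params = new LinkedHashMap<>();
        for (Map.Entry<String, String> f : fields.entrySet()) {
            if ("page_file".equals(f.getKey())) {
                continue; // assumed to be the local file to upload, not an index field
            }
            params.put("literal." + f.getKey(), f.getValue());
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("page_file", "/docroot/static/about-us/pressreleases/2010/rl_crafter_studio.pdf");
        fields.put("page_url", "/about-us/pressreleases/2010/rl_crafter_studio.pdf");
        fields.put("page_type", "Press Release");
        System.out.println(toLiteralParams(fields));
    }
}
```

Each resulting entry could then be set on the update request before it is sent to `/update/extract`.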

Compiler Configuration

<compiler-config>
  <page-types>
    <page-type name="Solution Page" compiler="com.rivetlogic.index.compile.ArticleCompiler">
      <uri-pattern pattern=".*/page-content/solutions/.*(article|page-metadata|meta-content).xml$" />
      <properties>
        <property field="page_type" value="Web Page"/>
        <property field="page_category" value="Solutions"/>
      </properties>
    </page-type>
    <page-type name="Press Release Page" compiler="com.paetec.index.model.compile.PressReleaseCompiler">
      <uri-pattern pattern=".*/press-releases/.*/(press-release|meta-content).xml$" />
      <properties>
        <property field="page_type" value="Press Release"/>
        <property field="page_category" value="News"/>
      </properties>
    </page-type>
  </page-types>
</compiler-config>
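The dispatch step implied by this configuration can be sketched as a first-match lookup over the `uri-pattern` regular expressions. The `PageTypeResolver` class and the sample paths are hypothetical. The sketch also assumes the pattern must match the whole path, which the deck does not specify.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical resolver: maps a changed content path to a configured page type.
class PageTypeResolver {
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    // Registers a page type with its uri-pattern, in configuration order.
    void register(String pageType, String uriPattern) {
        patterns.put(pageType, Pattern.compile(uriPattern));
    }

    // Returns the first page type whose pattern matches the whole path, or null.
    String resolve(String path) {
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            if (e.getValue().matcher(path).matches()) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PageTypeResolver r = new PageTypeResolver();
        r.register("Solution Page",
                ".*/page-content/solutions/.*(article|page-metadata|meta-content).xml$");
        r.register("Press Release Page",
                ".*/press-releases/.*/(press-release|meta-content).xml$");
        // Sample paths are made up for illustration.
        System.out.println(r.resolve("/site/page-content/solutions/security/article.xml"));
        System.out.println(r.resolve("/site/press-releases/2010/press-release.xml"));
    }
}
```

A path that matches no pattern resolves to `null`, which a caller could treat as "nothing to index".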

Search UI
- Full-text search
- Faceted search on category & type
- Pagination or search result clustering
- Keyword highlighting in search results
- Track user queries
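Most of these features map onto standard Solr request parameters: `facet` and `facet.field` for facets, `fq` for narrowing by a selected facet value, `hl` and `hl.fl` for highlighting, and `start` and `rows` for pagination. The sketch below builds such a query URL. The `SearchUrl` class, the base URL, and the choice to pass `defType` and `qf` explicitly (rather than rely on the handler defaults) are assumptions made for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical builder for a faceted, highlighted, paginated Solr query URL.
class SearchUrl {
    // Query fields and boosts as configured in the deck's dismax handler.
    static final String QF = "page_title^5.0 page_text^1.0";

    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // page is 1-based; category may be null when no facet value is selected.
    static String build(String solrBase, String userQuery, String category, int page, int rows) {
        StringBuilder url = new StringBuilder(solrBase);
        url.append("/select?q=").append(enc(userQuery));
        url.append("&defType=dismax&qf=").append(enc(QF));
        url.append("&facet=true&facet.field=page_category&facet.field=page_type");
        url.append("&hl=true&hl.fl=page_text");
        if (category != null) {
            // Filter query restricts results to the chosen facet value.
            url.append("&fq=").append(enc("page_category:\"" + category + "\""));
        }
        url.append("&start=").append((page - 1) * rows).append("&rows=").append(rows);
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr", "content management", "Solutions", 2, 10));
    }
}
```

The sketch does not escape quotes inside the category value, so a real implementation would need to handle facet values that contain them.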

Search Results Page

Clustered Results

Summary
Requirements:
- Real-time updates
- Full editorial control
- Faceted search

Solution:
- Alfresco CMS
- Alfresco plugin for Solr indexing
- Compile updates & index
- Serve in UI (full-text search + facets)

Q&A
Thank you for attending :-) Questions, comments

Appendix

Search Model/API
