Deployment Guide
September 2009
Google Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94043
www.google.com
21 September 2009
Google, the Google logo, Google Search Appliance, GSA, the Google Mini, Google Site Search, and GSS are trademarks,
registered trademarks, or service marks of Google Inc. All other trademarks are the property of their respective owners.
Use of any Google solution is governed by the license agreement included in your original contract. Any intellectual property
rights relating to the Google services are and shall remain the exclusive property of Google, Inc. and/or its subsidiaries
(“Google”). You may not attempt to decipher, decompile, or develop source code for any Google product or service offering,
or knowingly allow others to do so.
Google documentation may not be sold, resold, licensed or sublicensed and may not be transferred without the prior written
consent of Google. Your right to copy this manual is limited by copyright law. Making copies, adaptations, or compilation works,
without prior written authorization of Google is prohibited by law and constitutes a punishable violation of the law. No part of
this manual may be reproduced in whole or in part without the express written consent of Google. Copyright © by Google Inc.
Google provides this publication “as is” without warranty of any kind, either express or implied, including but not limited to the implied
warranties of merchantability or fitness for a particular purpose. Google may revise this publication from time to time without
notice. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions; therefore, this
statement may not apply to you.
Chapter 1: Introduction....................................................................................... 5
Welcome to the Google Search Appliance............................................................ 5
About this guide..................................................................................................... 6
Disclaimer for Third-Party Product Configurations ................................................ 8
High availability architecture................................................................................ 65
Disaster recovery deployment architecture ......................................................... 69
Integrated architectures....................................................................................... 71
Security solutions ................................................................................................ 75
Federated architecture ........................................................................................ 79
Unlike many enterprise applications, the Google Search Appliance is designed to be
self-sufficient: hardware, software, networking, storage, and security support are built in, and
can be easily supplemented with additional capabilities.
This document outlines several considerations for successfully deploying Google Search
Appliances to meet the document capacity, scalability, and redundancy needs of an
enterprise.
Great value
Because the Google Search Appliance is self-contained, it delivers core search capabilities
out of the box with no additional hardware required. However, you can supplement the search
appliance with off-box capabilities to deliver universal search at a compelling price. Ongoing
operating costs are lowered by substantially reducing the effort needed to administer and
maintain a search solution, delivering powerful, intuitive search at a low Total Cost of
Ownership (TCO).
Easy integration
The Google Search Appliance seamlessly integrates with existing information technology (IT)
infrastructures through industry standards and best practices. Custom integration can be
delivered through open standards, such as Security Assertion Markup Language (SAML) for
Single Sign-On (SSO) and heterogeneous security, and well-documented, standard
Application Programming Interfaces (APIs).
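One of those well-documented APIs is the appliance's XML search protocol, which lets any HTTP client issue queries and consume XML results. The sketch below builds a search-protocol URL; the parameter names come from the documented protocol, but the hostname, front end, and collection names are placeholders for your own environment.

```python
from urllib.parse import urlencode

def build_search_url(appliance_host, query, front_end="default_frontend",
                     num_results=10, site="default_collection"):
    """Build a query URL for the appliance's XML search protocol.

    The q, site, client, output, and num parameters are part of the
    documented search protocol; appliance_host is a placeholder.
    """
    params = {
        "q": query,                  # the user's query terms
        "site": site,                # collection(s) to search
        "client": front_end,         # front end that formats results
        "output": "xml_no_dtd",      # raw XML results, no DTD reference
        "num": num_results,          # number of results to return
    }
    return "http://%s/search?%s" % (appliance_host, urlencode(params))

# A custom integration (or portal) would fetch this URL with any HTTP client.
url = build_search_url("search.example.com", "refund policy")
```

Because the interface is plain HTTP plus XML, the same URL pattern works from portals, OneBox modules, or server-side integration code.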
Constant innovation
Innovation is the hallmark of Google Enterprise. The Google Search Appliance takes
advantage of the innovations tested on Google.com and proven by hundreds of millions of
users worldwide. In addition to regular software releases, you can add innovations to a search
solution from Google Enterprise Labs or by harnessing the power of Google’s cloud
capabilities to deliver core search capability.
The Google Search Appliance’s flexible architecture and open technologies enable you to
deploy it rapidly. Once deployed, the search appliance offers increased value by unlocking
more of the value in your business’s information assets through continuous innovation,
incorporation of additional content, and rich user functionality.
This guide assumes basic knowledge of the Google Search Appliance. However, this guide is
not a technical “how-to” document. For in-depth information, visit Google’s rich and
comprehensive public search appliance documentation at http://code.google.com/apis/
searchappliance/documentation/index.html.
A search solution can be deployed as a traditional monolithic project or by using agile, even
extreme project methodologies. Whatever the project methodology, there are guiding
principles that have been used in most successful search implementations. This document
discusses these guiding principles, giving you the information you need to plan your
deployment with the right phases or micro-phases.
In this guide, you can also find comprehensive information about the following topics:
• How to plan deployment phases to achieve quick wins, while delivering ongoing value
This guide also provides useful information for other technical and managerial personnel who
are involved in making decisions about IT infrastructure for your company.
Although Google recommends that you read this entire guide, you don’t have to. Depending
on your organization’s infrastructure, your goals, and your own experience, you can use this
guide as a reference and read just the sections that are applicable to you.
Resources that complement this guide
For a detailed list of the resources that this guide refers to, see “Other Resources” on
page 125.
To send comments about this guide, email search-deployment-guide@google.com.
In your message, be sure to tell us the specific section to which your comment applies.
Thanks!
Google does not provide technical support for configuring servers or other third-party products
outside of the Google Search Appliance, nor does Google support solution design activities. In
the event of a non-Google Search Appliance issue, you should contact your IT systems
administrator. GOOGLE ACCEPTS NO RESPONSIBILITY FOR THIRD-PARTY PRODUCTS.
Please consult a product’s web site for the latest configuration and support information. You
might also contact Google Solutions Providers for consulting services and options.
To make the most out of your search deployment, you need to understand how users in your
organization will use search. You also need to understand the content and processes that will
benefit from search, and the architecture that will support it.
This chapter presents issues and questions that will help you understand your users, your
content and processes, and the architecture that will support search.
The information that you gather as you address the issues listed in this chapter helps you to
define your deployment architecture and project plan.
For a simple deployment, you might gather information in a single meeting. For more complex
deployments, you might use a series of workshops and surveys.
Understanding your users
The success of your deployment hinges on how much your users use the search solution and
how effectively they do so.
The Google Search Appliance delivers powerful search capabilities out of the box, including a
search experience that the vast majority of your users are already familiar with from
Google.com. However, you can substantially enhance the user appeal and overall richness of
the search experience by understanding your users and what they will be trying to do with
search.
To understand your users and their search needs, consider the following questions.
How many users do you have and where are they?
• Are users internal, external, or both?
What will your users be using the search appliance for?
• It’s not just a search capability – what benefit will your users get?
• What does the search experience need to provide for users to regard it as successful?
As part of this activity, get an understanding of what your index capacity needs will be. For
information about this topic, see “Sizing the index” on page 45.
What are your content sources?
• Typical content sources that are often incorporated into a search deployment include:
• Intranet sites
• Your company website(s)
• File systems and shared drives
• Content Management Systems (CMS), such as
Documentum
• Record/Document Management Systems (RMS/DMS)
• Portals or collaboration sites, such as SharePoint
• Archives
• Databases
• Line Of Business (LOB) applications
• Other structured data
What are the details about each content source?
• For each content source, identify:
• How the content can be accessed
• Roughly how many documents it contains
• Whether the content is:
- Structured, for example, customer records
- Unstructured, for example, a Word document
- Both, for example, a customer letter
(unstructured) in an RMS (structured)
• Whether the content is secured
• How content is secured
• Who uses it (or who you want to use it)
• How important it is
• How frequently it changes
• What kind of publishing process (if any) governs
content revisions
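One way to keep track of these per-source details is a simple inventory that you can later sort when phasing deployment. The field names, example sources, and figures below are illustrative assumptions, not prescriptions.

```python
# Each entry records the details listed above for one content source.
# All names and numbers here are hypothetical examples.
sources = [
    {"name": "Intranet", "access": "web crawl", "docs": 200_000,
     "structured": False, "secured": True, "importance": 3, "change": "daily"},
    {"name": "File shares", "access": "SMB crawl", "docs": 1_500_000,
     "structured": False, "secured": True, "importance": 2, "change": "weekly"},
    {"name": "Product catalog", "access": "database feed", "docs": 50_000,
     "structured": True, "secured": False, "importance": 3, "change": "monthly"},
]

def phase_order(sources):
    """Suggest a deployment order: most important sources first, and
    smaller corpora first within the same importance (quicker wins)."""
    return sorted(sources, key=lambda s: (-s["importance"], s["docs"]))

for s in phase_order(sources):
    print(s["name"], s["docs"])
```

Even a rough inventory like this makes the later phasing and sizing discussions concrete.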
For example, think how much faster a call center employee could answer questions about a
refund policy for a purchased product if she can simply search the policy database—and bring
up the purchase order in the same rich search window.
In many cases, you might discover that processes also produce information that you want to
make available through search. Also, it may be valuable to have visibility over in-flight
business processes, such as being able to search currently open cases in a support queue.
So you might want to enable the search appliance to crawl this information or otherwise
integrate it with the search appliance.
Think about your physical network design, and where the content is located—both
geographically and from a network design perspective. Also, think about your requirements for
security. Security architecture is particularly important for internal deployments of the search
appliance, and requires planning.
What are your physical systems?
• Are the content systems located on fast Ethernet switches?
• What are the peak usage times for each content system – daily, weekly, monthly and/or quarterly?
• Will the search appliance be located on a part of the network that requires access through a firewall or proxy to get to the content?
What is the security infrastructure surrounding your content?
• Do you have a single security mechanism for all content, or do you have a “heterogeneous” authentication/authorization environment?
• Will users require several identities/passwords to access all protected content, or is there a single sign-on solution in place?
• Do you have Active Directory (AD)? What version? Is Active Directory installed in Native Mode or Mixed Mode?
• Do you have NTLM v2?
A successful search solution is conceptually very simple: help users find the information they
are looking for. Make search fast, make it easy, and make it relevant.
The Google Search Appliance takes care of the speed, ease and relevance. But you need to
plan and execute the project to take full advantage of the power of the search appliance. Key
to this approach is remaining focused on short delivery cycles and structuring work around
this.
Every deployment of a Google search solution is unique. You might be providing search
across SharePoint content and extending core search with purchase orders from SAP. Or you
might be providing search of the hundreds of thousands of documents that businesses tend to
accumulate over time, bringing them together with policy documents, and the contact details
of the people who wrote them.
Although each deployment has different content sources, security requirements, and user
needs, there are core planning activities with fundamental guiding principles that apply to all
search deployments. This chapter focuses on the following core planning activities:
Capturing requirements
As you capture requirements, group them into related sets that you can prioritize and align
with phases of work. In general, focus on the following areas:
User requirements
Understand what is important to make the deployment successful from the user perspective.
In general, user requirements focus on:
Usability
For users, search should not be a chore. Defining usability requirements can help ensure that
users find your search solution intuitive and effective.
• What are the usability features that really make the search solution resonate with users?
In general, meet usability requirements as early in the release cycle as possible because
these are not typically tied to content sources and they can get users excited about the search
solution.
As you identify breadth and depth requirements, consider the following issues:
• Where possible, the largest groups and the users experiencing the most frustration
today should be brought on first.
• Using search appliance front ends, you can present a different look and feel and
different content to various users, based on their needs. For information about front
ends, see “Using the search appliance’s front ends” on page 97.
• What are they trying to find now, but are frustrated that they can’t?
As you identify communication and feedback requirements, consider the following issues:
• In addition to adding new content and exciting new features, it’s important to make
sure to tell your users about them to keep them excited about the product, and get
kudos on your successes.
• Because most of your users already know how to use Google search technology,
training needs typically are minimal, but make sure your users know that they can now
search enterprise content with the same ease as they search the internet at home.
• User feedback is one of the best measures of success. Consider conducting periodic
surveys with user groups. See the sample search satisfaction survey on page 121.
• Also consider providing a feedback link for users.
Scenarios that encompass content and security can range in complexity from completely
unsecured public website pages to complex integration with an Enterprise Resource Planning
(ERP) system such as SAP or PeopleSoft, and everything in between.
Plan your end-state architecture in the early phases, but also phase in both content and
security. In other words, don’t delay delivering a great search experience to your users
because you want to index every last scrap of content or implement a security framework they
won’t need until later.
Content
In general, analyze all potential repositories of organizational information. Although the
Google Search Appliance excels at providing powerful, fast, and relevant search across
unstructured content, you should not exclude structured content, such as your data
warehouse, transactional systems, and so on.
It is important to understand how content sources relate to each other, as this will help you
define how to phase deployment of content. For example, content from a case management
system may be supplemented effectively with content from a product catalog, enabling users
to see not only product information, but also the types of problems and issues that users
encounter when using the products.
The following table lists various types of structured and unstructured content sources and
considerations that can help you define how to phase its deployment.
(Table columns: Content source, Structured/Unstructured, Complexity (L/M/H), Consideration.)
Security
Security can be the area of greatest complexity in a search deployment. As you analyze
content, understand if it is secured, and if so, how it is secured (forms-protected, cookies,
protected by application-level security, and so on).
For comprehensive information about the search appliance and security, see “Managing
Search for Controlled-Access Content” at http://code.google.com/apis/searchappliance/
documentation/60/secure_search/secure_search_overview.html.
The Google Search Appliance can make use of standard security protocols, such as NTLM or
forms-based security.
Understanding all the security permutations will help you plan for content acquisition. For
example, security might have an impact on web and file system crawl that you need to plan for,
such as configuring a proxy or ensuring your Windows file systems have CIFS enabled to
support SMB crawling.
More complex security might require alternative means of content acquisition, such as feeds
or connectors.
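Feeds are XML documents pushed to the appliance's feeder port (19900 by default) as described in the feeds protocol documentation. The sketch below assembles a minimal feed; the structure follows the documented gsafeed format, while the data source name and record URL are placeholders.

```python
def build_feed(datasource, records, feedtype="incremental"):
    """Build a minimal XML feed in the appliance's feeds-protocol format.

    Each record is a (url, mimetype) pair. A real feed is POSTed to
    http://<appliance>:19900/xmlfeed; the names used here are examples.
    """
    recs = "\n".join(
        '    <record url="%s" mimetype="%s" action="add"/>' % (url, mime)
        for url, mime in records
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<gsafeed>\n"
        "  <header>\n"
        "    <datasource>%s</datasource>\n"
        "    <feedtype>%s</feedtype>\n"
        "  </header>\n"
        "  <group>\n%s\n  </group>\n"
        "</gsafeed>\n" % (datasource, feedtype, recs)
    )

feed = build_feed("hr_system",
                  [("http://hr.example.com/policy/1", "text/html")])
```

A connector or scheduled script typically generates feeds like this from the source system's own change log, so only new or modified documents are pushed.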
When serving secured content, the Google Search Appliance first checks that the user is
entitled to see relevant results. If the user is not entitled to view a document, it does not appear
in the result set.
Of course, you can always choose to make results public and apply no security at serve time.
In many cases, search can initially be deployed unsecured, with security added as more
content is acquired. Public search (such as an externally facing internet site) is typically
deployed this way.
• Purchase pre-built providers (see the Google Enterprise Solution Marketplace for
examples)
With the release of version 6.0, the Google Search Appliance also supports definition of policy
access control lists (ACL), so that authorization checks can be performed against documents
using early binding. Policy ACLs not only enhance performance, but give you more options for
managing security. This new capability also gives you options to phase your secure search
deployment. For information about policy ACLs, see “Access control list caching” on page 59.
For more information about secure serve, see “Serving secure content” on page 57.
Search performance depends on a number of factors, including:
• Security requirements
• Content type
• Corpus size
• Additional search functions used (for example, query expansion or metadata filtering)
As a rule, if there are specific performance requirements, you should conduct a performance
test early in the deployment to determine changes that may need to be made to the solution
architecture.
Although the Google Search Appliance itself cannot be modified, changes you can incorporate
into your planned deployment include:
• Deploying a reverse proxy to cache where possible for common searches. This change is
beneficial only for public (non-secured) content searches.
• Minimizing network traffic between the Google Search Appliance and content sources.
Although this change mostly has an impact on crawl, reduced latency will improve
performance of late-binding authorization.
• Deploying additional search appliances to spread the load. This change reduces the
demand on any single search appliance and helps ensure that capacity is not a
constraining factor.
See “Architecting for scale and performance” on page 48 for further discussion of
performance-driven search architecture.
Performance requirements should also take crawling and indexing into consideration. Search
appliance indexing adds load to your content systems. If there are specific times of the day in
which the content systems must not be affected, then you need to understand this so that you
can configure search appliance host load schedules accordingly. Furthermore, if the content
system is sufficiently strained, or is particularly slow, you might consider content feeds as an
alternative.
Scalability
Scalability requirements typically revolve around number of queries per second (QPS) or
queries per minute (QPM). As with performance, the QPS that the solution supports depends
on the security requirements, content type, query type, network performance, and a host of
other factors.
While search solutions can be designed to support hundreds of queries per second, in
practice, this is not usually required. The kind of scalability requirements needed from a
search solution are substantially different from those of a transactional system.
For more details about designing a search solution for increased scalability, see “Architecting
for scale and performance” on page 48.
For information about the number of concurrent connections that the Google Search
Appliance can accept, see “Designing a Search Solution” at http://code.google.com/apis/
searchappliance/documentation/52/troubleshooting/Designing_Search_Solution.html#Queueing.
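A rough way to relate QPS to a concurrent-connection limit is Little's law: sustained concurrency = query rate x average response time. The sketch below applies it; the per-appliance connection limit shown is an assumed figure, so check "Designing a Search Solution" for the actual value for your model and software version.

```python
import math

def appliances_needed(peak_qps, avg_latency_s, max_concurrent=30):
    """Estimate how many appliances a query load requires.

    By Little's law, sustained concurrency equals arrival rate times
    time in system. max_concurrent is an assumed per-appliance
    connection limit, not an official figure.
    """
    concurrency = peak_qps * avg_latency_s
    return max(1, math.ceil(concurrency / max_concurrent))

# 50 QPS at 0.6 s average latency -> 30 concurrent requests -> 1 appliance
print(appliances_needed(50, 0.6))
```

Note how latency dominates: halving average response time (for example, by caching public searches) doubles the QPS a single appliance can sustain at the same concurrency.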
Reporting requirements typically cover:
• The analytical technology to be used (for example, Google Analytics, Advanced Search
Reporting, or some other third-party tool)
• Other reporting types that may be required (for example, administration events)
Make sure that you understand the business processes that will use these reports. For
example, you should understand the use cases for your reporting requirements and make sure
that the reporting strategy will deliver on them.
Identifying phases
Most search deployments fall into one of the following categories, listed from simplest to most
complex:
A search deployment typically targets quick wins to deliver a rich search experience to users
rapidly, with incremental, iterative delivery of additional value over the life of the search
deployment.
Deployment phases
The key to successful search deployments is to deliver early and deliver often. Don’t try to do
everything at once. Your users will benefit from getting access to the content they want as
early as possible. Delivering early means quick wins that can help drive support with your
stakeholders and generate excitement and visibility with your users.
• Content sources
• Security
• User groups
• Usability features
Each phase should include an evaluation task, where you explicitly evaluate user satisfaction
and feature requests. As always, evaluate feature requests, including the risks associated with
implementing—and not implementing—them.
In general, since each phase is of relatively short duration, you can use most delivery
methodologies, ranging from Agile to Life Cycle.
This section discusses how you can structure your deliverables and project plans to broaden
the search footprint and increase use of your search solution. Each delivery moves your
deployment further along the value curve.
Where to start
The Google Search Appliance is designed to be rapidly deployed over core content sources.
Leveraging open standards and protocols allows rapid integration of content from a variety of
sources and implementation of rich usability features, such as Search-as-you-Type,
user-added results, and dynamic results clusters.
Phases can be as short as a week or two or as long as a month. Google recommends that you
structure your program of work to aim for shorter phases, with rapid delivery of iterative
functionality, content, or user groups.
In many cases, a single rapid delivery phase is all that is required. However, even when your
deployment is part of a longer running, comprehensive program of work delivering universal
search across all your enterprise assets, you should still structure your phases to deliver quick
wins.
Before you commence your search deployment, complete the following core tasks, so that
your search deployment specialist can get your search appliance up and running as quickly as
possible.
Early development
Delivery items listed in the following table are typically relatively quick and easy to deliver.
Consider them as candidates for early development. Many of these could be considered
mandatory—a custom front end, for example, no matter how simple, should always be a part of
the core delivery.
• Intranet
• Extranet
• Website
• Wiki
• Web-enabled knowledge bases (for example,
Lotus Notes)
Incremental releases
Delivery items listed in the following table are candidates for incremental release. Consider
these items and schedule their deployment according to priority (typically based on volume of
content, and business criticality), and level of effort.
In many cases, you can accelerate delivery by using third-party tools (such as connectors) and
certified Google Enterprise partners, who are experienced in Google Search Appliance
integration issues. Some of these delivery items (for example, customized advanced search)
might require some user feedback before full implementation.
In some cases, items are structured data sources that require analysis before understanding
how best to integrate into the search experience (for example, Business Intelligence
platforms).
(Table columns: Complexity and Duration.)
The times in this table are guidelines only and will vary, based on your environment and
requirements. Google recommends that you perform an analysis to determine the work effort
specific to your deployment.
In addition to the work effort, you need to allow enough time to acquire content. Strive for
having as much content in the index as possible from targeted content sources. This is not to
say that you should wait until you get every possible content source into your search solution,
but rather that you should have in the index all the content from the systems you are
incorporating in the current release.
• Network performance
• Server performance
• Host load
• Content type
Google recommends running some tests early in the project life cycle to determine content
acquisition speed. Use this information to help you plan accordingly.
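Once you have measured acquisition speed, a back-of-the-envelope estimate of crawl duration helps the plan. The sketch below does the arithmetic; the document counts and rates are placeholders to replace with your own measurements.

```python
def crawl_hours(doc_count, docs_per_second, crawl_window_fraction=1.0):
    """Estimate wall-clock hours to acquire a corpus.

    crawl_window_fraction models host load schedules: 0.5 means the
    appliance is allowed to crawl this source only half of each day.
    """
    seconds = doc_count / (docs_per_second * crawl_window_fraction)
    return seconds / 3600.0

# Hypothetical example: 1M documents at a measured 10 docs/s, with
# crawling permitted only 50% of the day.
print(round(crawl_hours(1_000_000, 10, 0.5), 1))
```

An estimate like this quickly shows whether a large source can realistically be acquired within a phase, or whether feeds or a relaxed host load schedule are needed.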
• Security tests (authentication and authorization is working for all secured systems)
However, as with any enterprise solution, there are some tasks that should be carried out
regularly. These are discussed in “Post Deployment” on page 83. You need to plan your
resourcing to manage these tasks, as the operational team responsible for business-as-usual
(BAU) operations may not be the same as the team who deployed the solution.
Perform tasks in preparation for transition to BAU as described in the following sections:
• Creating and managing administrator and manager roles on the search appliance
• Any processes around additional technologies (for example, OneBox modules, SAML
providers, and so on)
• Migrating code assets and configurations from your development environment to your
production environment
• Remote access details for chosen methods (SSH configuration and routing, support call,
and so on)
• License information
This preparation allows for efficient use of Google Enterprise Support, should you need it.
You can also output logs to a syslog server to leverage third-party log processing tools that
you might already have in use.
Configure Monitoring
Establish a method for monitoring your Google Search Appliance. You can use SNMP, or
some of the monitoring tools discussed in “Designing a Search Solution,” at http://
code.google.com/apis/searchappliance/documentation/60/troubleshooting/
Designing_Search_Solution.html#Monitoring.
You could also monitor your search appliance by using a custom solution. Anything that allows
you to monitor your search appliance actively will give you additional confidence and stability
in your deployment, and will allow you to identify problems early.
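A custom monitor can be as simple as issuing a known query on a schedule and checking that the XML response parses and contains results. The sketch below checks a response body offline; in practice you would fetch the body from the appliance's search URL. The GSP root and RES results element are part of the documented XML output format, while the sample response is fabricated for illustration.

```python
import xml.etree.ElementTree as ET

def response_healthy(xml_text):
    """Return True if a search response parses and reports results.

    GSP is the root element of the appliance's XML results format and
    RES wraps the result set; an empty or malformed body fails the check.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == "GSP" and root.find("RES") is not None

# A monitor would fetch the body from the appliance and alert on False.
sample = '<GSP VER="3.2"><RES SN="1" EN="1"><R N="1"/></RES></GSP>'
print(response_healthy(sample))
```

Wiring a check like this into an existing monitoring framework gives you an end-to-end probe of serving, not just a ping of the appliance's network interface.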
For example, when a new policy document or product is launched, KeyMatches relating to the
old version may need to be updated. Both the BAU team and the appropriate business owners
need to be aware of these updates.
The project scenarios in this chapter illustrate how a successful deployment might be
executed. These scenarios include:
• Internal search over intranet, file system, SharePoint, and Notes, described on page 38
• The deployment team is familiar with the Google Search Appliance. If required, a certified
Google partner can help.
• There are no significant problems in the deployment environment. All environments differ,
and yours may have unforeseen complexity.
The time lines and project plans used in this document, while examples, should not be taken
as reference plans. Your own time lines might reflect greater complexity. When you plan a
deployment project, take specific business or technical requirements into consideration.
Always include contingencies in your plans.
Basic search on a public website
In the use case for this project scenario, Alpha Inc. is deploying search over a public-facing
website containing a massive amount of information about products sold in their retail stores.
Most of the content is public, but there is also protected content in a secured members section
for customers who have purchased a product and registered it. While all users search for
public, product content, members might also search for protected content, such as support
information.
Scenario summary
Project plan
The following figure shows a generalized Gantt chart for deploying basic search on a public
website.
Enhancements
The initial deployment should also be followed by a set of rapid enhancements with short
delivery cycles. Enhancements include:
• Product OneBox module—retrieves product pricing and availability directly from supply
chain system. When a user searches for “gadget,” they will also get the price and availability
of gadgets in real time.
• Store locator OneBox module—for logged in users, this OneBox module could retrieve
information about stores within 10 miles of them and display it by means of Google Maps.
The following figure shows a generalized Gantt chart for enhancement phases for deploying
basic search on a public website.
Although all the users are employed by CorpCom, not all of them have access to all the
information on the various sites in their corporate domain. For example, making Human
Resources (HR) information accessible through search is desirable; therefore, securing
personal information is important.
Scenario summary
Key requirements • Index all pages that are web accessible.
• Index foreign language content.
• Seamless sign-on.
• A standard search page where employees go to search
for information.
• Secure content must only be accessible to users
authorized to see it.
Enhancements
The initial deployment should also be followed by a set of rapid enhancements with short
delivery cycles. Enhancements include:
The following figure shows a generalized Gantt chart for enhancement phases for deploying
basic internal search.
Enhancement phases
Internal search over intranet, file system,
SharePoint, and Notes
In the use case for this project scenario, Cybertron Appliance Inc. houses different data
corpora that are served from different servers on their corporate network. These data silos are
accessed by way of different data management applications, such as SharePoint and Lotus
Notes databases, as well as secure file shares.
Having to go to different applications to find information has become tedious and time
consuming for their employees. Moreover, the lost productivity caused by repeatedly switching
between disjoint systems and by ineffective existing search tools has started to show up on
their bottom line.
Scenario summary
Key requirements • Index each individual data silo keeping content secure.
• Create standard default UI for data access.
• Create custom interfaces for internal and external users.
• Secure content must only be accessible to users
authorized to see it.
• Deployment must result in a measurable business
benefit.
Project plan
The following figure shows a generalized Gantt chart for deploying internal search over
intranet, file system, SharePoint, and Notes.
Internal search over intranet, file system, SharePoint, and Notes project plan
Enhancements
The following figure shows a generalized Gantt chart for enhancement phases for deploying
internal search over intranet, file system, SharePoint, and Notes.
Enhancement phases
Their internal HR system is a large database repository, and various commercial and
custom applications allow users to gain access to data through different access
methods. Directory information (for example, contact details, manager, and direct reports),
performance reports, and salary information are stored on this system.
Key requirements • Index each individual data silo, keeping content secure.
• Create standard default UI for data access.
• Create custom interfaces for different groups in the
organization.
• Secure content must only be accessible to users
authorized to see it.
Chosen approach • Deploy to initial pilot group prior to full rollout by means
of a corporate portal.
• Crawl and serve secure content (for example, HR or
Salary information) using LDAP.
• Manage security at the application level.
• Initially index a selected cross-section of data holdings,
with additional documents to be added later.
• Present results directly from the search appliance by
using the default XSLT style sheet.
• Due to the diversity of content and sources, use a phased
approach for deployment.
• Intranet sites and file share will be in initial
deployment.
• Database feeds for Oracle HR system will follow.
• CMS systems and related portals will be next.
• A survey of corporate applications that house and
serve data will be conducted and a determination
will be made on which will be accessed for search.
Possible architectures • Federated high-availability deployment architecture with
disaster recovery capability—search integration of
disparate data stores, with indexes replicated across
different departments/groups while ensuring virtually
24/7 uptime so that productivity is not lost.
• Implementation of integration architectures:
• Content and Metadata feeds for CMS
• Custom connector for database search
• Implement Kerberos security to limit access to secure
information—all users have network accounts, which
complements integrated authentication and authorization
with Kerberos.
• Implement SAML SPI Deployment or policy ACL
deployment to handle diverse security or poorly
performing systems.
Project plan
The following figure shows a generalized Gantt chart for deploying internal search including
CMS, database, and corporate application assets.
Internal search including CMS, database, and corporate application assets project plan
Enhancements
Multiple short, iterative enhancement phases deliver incremental functionality, bringing new
content to your users and creating opportunities to increase visibility and drive uptake with
new users. The following figures show Gantt charts for the enhancement phases.
You can begin by including unsecured database content by means of a database feed, crawl
any additional content that still needs to be acquired, and then release to the primary user
groups.
Now that your users are searching across their information, the next phase is to rapidly build a
method to feed content from your CMS to the Google Search Appliance.
And finally, you can begin to consume content from your corporate applications, in short,
phased migrations. These can be planned and repeated as needed to deliver true Universal
Search. Note that phases may have longer durations where security integration is required.
44 Google Search Appliance Deployment Guide
Chapter 5: Deployment Architecture
This chapter discusses the following technical and architectural considerations for planning
your deployment:
For examples of architectures that address common deployment scenarios, see “Deployment
Scenarios” on page 61.
Scoping index capacity needs
Google Search Appliance models are:
• GB-9009—can index up to 30 million documents out of the box. For larger deployments,
multiple GB-9009 appliances can be linked together to search hundreds of millions or
even billions of documents.
From a sizing perspective, Google recommends that organizations choose a base unit that
meets the current document capacity needs, as well as projected document growth needs for
two years.
However, because upgrading requires a hardware change, if the current document capacity is
close to the physical indexing limits of the GB-7007, Google recommends selecting the GB-9009
to simplify management of the solution over time.
The Google Search Appliance is also designed to operate intelligently up to the license limit
of each model to ensure an optimal user experience. When the license limit is reached, the
search appliance continues to discover relevant documents beyond the limit in an effort to
maintain a servable index of the most relevant documents found in the environment. However,
this creates churn as less relevant documents are removed in favor of more relevant ones. If
your search appliance is nearing its license limit, consider upgrading to a higher document
count.
This process of continual discovery and analysis beyond the license limit provides an
automated and intelligent method of managing the search experience when operating in an
environment where more documents are available than the license limit allows.
However, the search appliance’s automated pruning logic could cause certain critical content
to be excluded from the index to make room for more relevant content. If mission-critical
content exists beyond the license limit, Google recommends expanding the license limit to
ensure that all the relevant content can be indexed and served with additional room to grow.
For a discussion of the choice between upgrading a search appliance and deploying
additional hardware, see “Scale up/scale out” on page 49.
Dynamic scalability
Dynamic scalability is a release 6.0 feature that enables multiple Google Search Appliances to
work together in a federated environment to scale up to as many documents as you wish to
search in a unified manner.
In a dynamic scalability configuration, one search appliance is the primary node and the
others are secondary nodes. The primary search appliance aggregates results from all of the
search appliances in the configuration and serves them to the search user. The primary
search appliance's front end is used for searching all document corpora in the dynamic
scalability configuration.
Common content sources for search include:
• Corporate websites
• Partner extranets
• Portals
• Knowledge bases
• File shares
This process might seem straightforward. However, you might uncover more information
within a given content source that needs to be indexed than you originally anticipated. For
example:
The Google Search Appliance provides capabilities to limit content indexing by implementing
simple content acquisition rules (follow and crawl URLs). Limiting the index scope by adjusting
these rules can be an ongoing discovery process that needs to be taken into consideration,
especially when the content sources targeted for search are not well-maintained or are
managed in a decentralized fashion.
The search appliance provides detailed logs on each document that has been indexed and
also provides summary information on document types and sizes. Crawl Diagnostics features
allow administrators to fine tune follow and crawl URLs to ensure that the most relevant
content is being indexed and served at any time.
Deployment Architecture 47
Determining how to index
Once you have identified content sources to index, you need to take the method of indexing
each one into consideration. The Google Search Appliance can use several methods to
acquire content for indexing, including:
• Web crawl
• Database synchronization
Determining the most effective method of indexing depends on the content sources that need
to be indexed.
For example, corporate websites, partner extranets, wikis, corporate intranet sites, and
informational portals can often be easily indexed by using the search appliance's crawling
technology. The crawl process issues HTTP requests or follows links to locate content on a
website or file system. To configure crawling, an administrator follows a simple process of
defining URL rules in the search appliance's simple-to-use web-based Admin Console.
For comprehensive information about crawl, see “Administering Crawl for Web and File Share
Content” at http://code.google.com/apis/searchappliance/documentation/60/admin_crawl/
Introduction.html.
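As a rough illustration of how such URL rules behave, the following Python sketch models follow and do-not-crawl patterns as simple prefix matches. The hostnames and patterns here are invented, and real Admin Console patterns support richer syntax than plain prefixes.

```python
# Simplified model of follow/do-not-crawl URL rules: a URL is crawled if it
# matches a follow pattern and matches no exclusion pattern.
# These patterns are examples only, not taken from any real configuration.
FOLLOW_PATTERNS = ["http://intranet.example.com/", "http://wiki.example.com/"]
DO_NOT_CRAWL_PATTERNS = ["http://intranet.example.com/archive/"]

def should_crawl(url: str) -> bool:
    """Exclusions take precedence; otherwise any follow match admits the URL."""
    if any(url.startswith(p) for p in DO_NOT_CRAWL_PATTERNS):
        return False
    return any(url.startswith(p) for p in FOLLOW_PATTERNS)
```

Tightening or loosening these patterns is the iterative step described above: Crawl Diagnostics show what was fetched, and the rules are adjusted until only the relevant content is admitted.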
For information about database synchronization, see “Database Crawling and Serving” at
http://code.google.com/apis/searchappliance/documentation/60/database_crawl_serve.html.
For integration with content and document management systems, knowledge bases, and
collaboration tools, such as Microsoft SharePoint, using feeds or a connector might be the
most effective method for indexing content. For more information, see “Feeds” and
“Connectors” on page 55.
For complex deployments where search spans multiple information sources, consult a Google
product specialist or Google Enterprise partner to determine the optimal methods of indexing.
Similarly, organizations might need to support additional users and search loads over time. To
accommodate this type of growth, you need to architect for performance, as described in
“Load balancing” on page 49.
For example, if you require 25 million documents to be indexed, should you use one GB-9009
or three federated GB-7007s? The answer depends on a number of factors, including, but not
limited to, the following issues. These items are in no particular order; the factors
important to your deployment may be completely different from those important to another
deployment.
• How much rack space is available, and are there power restrictions in the data center?
Where rack space or power is limited, Google recommends choosing a more powerful search
appliance model instead of multiple, federated search appliances.
• Is a hot backup required (at increased cost for more servers)? Each hot backup has a fixed
cost, so if you require multiple hot backup servers, the total cost might be greater than
that of a single, larger search appliance. Likewise, a deployment made up of many lower-
capacity servers might be more costly than one larger unit. Investigate this issue before
deciding on the type and number of search appliances in the solution, because total
procurement cost, including production units and hot backups, may vary.
• Are there multiple departmental owners who want to control their own search service? In
some instances, individual content owners prefer to own their own search appliance. If this
is the case, then a dynamic scalability configuration using multiple search appliances
would be the solution.
To read about deployment scenarios that use dynamic scalability, see “Federated
architecture” on page 79.
Load balancing
Load balancing distributes network traffic of a particular type to two or more instances of an
application, dividing the work load between the instances. A load balancer is a software or
hardware application that distributes the network traffic. When you configure two or more
Google Search Appliance systems for load balancing, search queries are distributed between
the two systems.
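The division of work can be pictured with a toy round-robin scheme. This is illustrative Python only; in practice a hardware or software load balancer performs this routing, and the hostnames are placeholders.

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin distribution of search queries across appliances.

    A real load balancer adds health checks, session stickiness, and
    connection management; this sketch shows only the alternation of work.
    """

    def __init__(self, hosts):
        self._cycle = itertools.cycle(hosts)

    def route(self, query: str) -> str:
        # Every query goes to the next appliance in rotation, so the load
        # is divided evenly between the configured instances.
        return next(self._cycle)
```

With two appliances configured, successive queries alternate between them, halving the per-appliance query load at peak times.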
Determining whether a load balancer is required is dependent on a number of considerations,
such as:
• The peak load of queries that the search appliance will receive
A large number of queries per second at peak time, or a very diversely located user base,
generally requires multiple search appliances using load balancing to help serve the results at
an acceptable rate. Load-balanced search appliances also provide a level of redundancy that
is not possible with a single search appliance.
• A single search appliance on a network with no other search appliance for failover or fault
tolerance. This is not a load-balanced configuration.
• A load balancing configuration in which there is a physical connection between the search
appliances and the load balancer and each search appliance is on the same network or
subnet as the load balancer.
• A load balancing configuration in which there is a logical connection to the load balancer
and each search appliance is potentially on different networks or subnets from the load
balancer.
• A failover configuration in which a switch fails over search queries from the search
appliance that normally responds to search queries to a search appliance that does not
normally respond to search queries and is used only for failover. For more information,
see “Failover configurations” on page 51.
Note: In each of the above configurations, each search appliance could be one or more
search appliances in a federated deployment.
Load balancers can be used with virtually any architecture, such as the federated high
availability deployment architecture described on page 80.
Google does not recommend specific load balancers to use with the search appliance. The
configurations described in this document are expected to work with any equipment that
complies with networking RFCs.
To read about deployment scenarios that use load balancing, see “High availability
architecture” on page 65.
For information about load balancing, see “Configuring Search Appliances for Load Balancing
or Failover” at http://code.google.com/apis/searchappliance/documentation/60/configuration/
Configuration.html.
For any application where Google Search Appliances are providing mission-critical search
capabilities, Google recommends a high availability configuration to provide seamless
operation in the event of a system failure.
Failover configurations
Failover configurations typically involve two instances of an application or a particular type of
hardware. The first instance, sometimes called the primary instance, responds to search
queries. If the first instance fails, the second instance, sometimes called the secondary or
standby instance, starts responding to search queries.
One such implementation is a domain name system (DNS) switchover configuration that
provides a redundant “hot spare.” This configuration involves multiple search appliances,
where one is used in production and the second is kept as a hot spare. These search
appliances can be located anywhere, physically or logically.
The DNS switchover can be executed automatically in the event of a failure. It can also be
executed manually, but manual execution typically results in a more extended outage, due to
the need to wait for an operator and, depending on your environment, for DNS changes to
propagate.
Changes are made in DNS to restore the search if the primary search appliance becomes
inaccessible. This setup is only used for redundancy (or failover) and does not provide a
method of load balancing.
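The switchover logic itself is simple, as this hypothetical Python monitoring sketch shows. The probe URL format assumes the appliance's standard /search endpoint, and the hostnames are placeholders, not a recommended tool.

```python
import urllib.request

def appliance_healthy(base_url: str, probe_query: str = "test",
                      timeout: int = 5) -> bool:
    """Probe an appliance with a real search; an HTTP 200 counts as healthy."""
    url = f"{base_url}/search?q={probe_query}&output=xml_no_dtd"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, HTTP error
        return False

def choose_serving_host(primary_ok: bool, primary: str, standby: str) -> str:
    """Pick the host the search DNS record should point at: the primary
    while it is healthy, otherwise the hot spare."""
    return primary if primary_ok else standby
```

A monitoring job would run the probe on a schedule and update the DNS record when the choice changes; because this is failover only, no query load ever reaches the standby while the primary is healthy.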
To read about deployment scenarios that use failover, see “High availability architecture” on
page 65.
For more information about failover, see “Configuring Search Appliances for Load Balancing
or Failover” at http://code.google.com/apis/searchappliance/documentation/60/configuration/
Configuration.html.
Some organizations, however, need to scale beyond one search appliance. In these cases,
multiple search appliances can be deployed in parallel to scale linearly with query volume.
• An active/active setup, in which two appliances are set up and serving results concurrently
• An active/passive failover setup for fault tolerance, in which two search appliances are set
up, with one serving results and the other to be used only in the event of a failure on the
primary search appliance
Not all content can be accessed and discovered by crawling. To make sure that this content is
in your index and searchable, you might need to use the following integration technologies:
OneBox modules
The name "OneBox" refers to the search box that provides access to information from many
sources. OneBox also refers to the formatted output that appears in response to specific query
keywords. OneBox modules are a powerful tool at your disposal for increasing the breadth of
content in your search deployment.
The following figure shows the OneBox module that appears when a user searches for
“finance.”
OneBox modules enable a Google Search Appliance to integrate with third-party systems in
real time. They supplement Google’s algorithmic search with purpose-built, targeted data
retrieval, and they enable the search appliance to display this information to users in the
same context as their algorithm-driven search results.
2. Configuring the search appliance so that it is aware of the service and knows when to call
it.
When the search appliance receives a query that the OneBox can help with, it passes the
query to the OneBox service provider, which extracts the information from a third-party system
and returns it to the search appliance as XML.
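The shape of that exchange can be sketched in Python. The element names below are a simplified rendering of a OneBox results payload, not the authoritative schema; consult the OneBox developer documentation for the exact format a provider must return.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def onebox_response(provider: str, results: list) -> bytes:
    """Build a OneBox-style XML payload for the search appliance.

    Simplified/assumed element names; a real provider must follow the
    schema in the OneBox developer documentation.
    """
    root = Element("OneBoxResults")
    SubElement(root, "resultCode").text = "success"  # tells the GSA the call worked
    SubElement(root, "provider").text = provider
    for r in results:
        item = SubElement(root, "MODULE_RESULT")
        SubElement(item, "U").text = r["url"]      # link target for the result
        SubElement(item, "Title").text = r["title"]
    return tostring(root)
```

A provider service would run this on each matching query, pulling `results` from the third-party system in real time and returning the XML to the appliance, which renders it above the algorithmic results.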
To read about a deployment scenario that uses OneBox modules, see “OneBox integration”
on page 71.
For more information about OneBox modules, see the following documents:
Feeds
Feeds are used to push data into or delete data from the index on a Google Search Appliance.
To push content to the search appliance, you require a feed and a feed client:
• A feed is an XML document that tells the search appliance about the contents that you
want to push.
• A feed client is the application or web page that pushes the feed to a feeder process on
the search appliance.
There are three types of feeds:
• Web feeds
• Content feeds
• Metadata-and-URL feeds
Web feeds
A web feed provides the search appliance with a list of URLs and possibly some metadata.
Web feeds might be used in the following cases:
• A list of URLs pulled from a database, fed to the search appliance so that it continues
to crawl them
• URLs pushed to the search appliance from an HTTP-accessible CMS when they are published
• Any list of URLs whose content you want recrawled periodically but don’t want to enter in
the Admin Console as start URLs
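A minimal web feed can be assembled programmatically. The sketch below follows the gsafeed structure described in the Feeds Protocol Developer's Guide; the datasource name and URLs are invented, and a production feed also carries the gsafeed DOCTYPE declaration, which is omitted here.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def build_web_feed(datasource: str, urls: list) -> bytes:
    """Build a minimal web feed: a list of URLs for the appliance to crawl.

    Structure per the Feeds Protocol (gsafeed > header > datasource/feedtype,
    then group > record); the DOCTYPE required by real feeds is omitted.
    """
    root = Element("gsafeed")
    header = SubElement(root, "header")
    SubElement(header, "datasource").text = datasource
    SubElement(header, "feedtype").text = "web"  # URLs only; content is crawled
    group = SubElement(root, "group")
    for url in urls:
        # action="add" asks the appliance to (re)crawl this URL
        SubElement(group, "record", url=url, mimetype="text/html", action="add")
    return tostring(root, xml_declaration=True, encoding="UTF-8")
```

A feed client would then POST this document to the appliance's feeder process (typically port 19900 at /xmlfeed, per the Feeds Protocol Developer's Guide) as a multipart form upload.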
Content feeds
A content feed provides the search appliance with both URLs and their content. Content feeds
can be full or incremental. Content feeds can be used in the following cases:
To read about deployment scenarios that use feeds, see “Feeds integration” on page 72.
Metadata-and-URL feeds
Metadata-and-URL feeds can be used to provide additional metadata to the Google Search
Appliance. This metadata can be used for searching and for filtering search results. This type
of feed is commonly used in the following cases:
For more information about feeds, see the “Feeds Protocol Developer’s Guide” at http://
code.google.com/apis/searchappliance/documentation/60/feedsguide.html.
Connectors enable the Google Search Appliance to search and serve documents stored in
non-web repositories such as enterprise content management (ECM) systems. Connectors
are installed on a host running Apache Tomcat. A Google Search Appliance that uses
connectors can perform fast, unified, secure search across multiple systems and document
repositories.
Connectors typically also handle serve-time authentication and authorization for the
repositories to which they connect.
Connectors implement an open source set of interfaces. This means that in addition to the
four out-of-the-box connectors (listed in the following table), you can extend the reach of
your deployment with custom connectors for whatever content source you need.
To read about deployment scenarios that use connectors, see “Connector integration” on
page 73.
For more information about individual connectors, see the documents listed in the following
table.
IBM FileNet — “Configuring the Google Enterprise Connector for FileNet (3.5)” at
http://code.google.com/apis/searchappliance/documentation/connectors/200/connector_admin/filenet35_connector.html
and “Configuring the Google Enterprise Connector for FileNet (4.0)” at
http://code.google.com/apis/searchappliance/documentation/connectors/200/connector_admin/filenet4_connector.html
Open Text Livelink — “Configuring the Google Enterprise Connector for Open Text Livelink” at
http://code.google.com/apis/searchappliance/documentation/connectors/200/connector_admin/livelink_connector.html
A single Google Search Appliance can easily serve public and secure content together, or as
separate collections, and can handle a mix of different authorization schemes. A typical
intranet deployment has a mix of public and secure content on many different servers.
For comprehensive information about the search appliance and security, see “Managing
Search for Controlled-Access Content” at http://code.google.com/apis/searchappliance/
documentation/60/secure_search/secure_search_overview.html.
Alternatively, secure content can be flagged as "public" and it will be included in search results
for all users—but this content may not be accessible when the user clicks a link in the search
results.
The most common way to authenticate users is against LDAP (or Active Directory). In the
simplest case, a user is prompted for her username and password the first time she searches
on the Google Search Appliance. This authentication establishes a secure session on the
search appliance itself, and the user is not prompted again while the session is active.
Once a user's identity is established, the search appliance can use it to determine which
resources she has access to.
The search appliance executes the search, generates a result set, and then, for any results
that are flagged as private, performs authorization checks. Authorization is checked for one
page of results at a time (usually up to 10 or 20 results), not the entire result set.
In this scenario, the search appliance performs the authorization check by issuing a request
for the document with the user's credentials. The document isn't retrieved, but the search
appliance checks the HTTP response code and, if it is valid, allows the document to be
presented in the results.
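That per-page, per-result check can be modeled as follows. This is a simplified sketch: the `check` callback stands in for the credentialed HEAD request, and the result dictionaries are invented.

```python
def authorized(status_code: int) -> bool:
    """Interpret the response to a credentialed request for a result URL.

    Any 2xx means the user may see the result; anything else (401, 403,
    and so on) hides it from the result page.
    """
    return 200 <= status_code < 300

def filter_page(results: list, check=lambda r: True) -> list:
    """Authorization runs per page of results (usually 10-20), not over the
    whole result set: only private results on this page are checked."""
    return [r for r in results if not r.get("private") or check(r)]
```

Public results pass through untouched, so the cost of the authorization round trips is bounded by the page size rather than by the total number of matches.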
Other access-control mechanisms
In addition to NTLM and HTTP Basic, the Google Search Appliance can also work with the
following access-control mechanisms:
When the search appliance is configured to use IWA/Kerberos authentication, it checks the
user's session ticket against a Kerberos Key Distribution Center (KDC) before displaying
secure search results to a user. For Windows servers, the domain controller acts as the KDC
for IWA/Kerberos authentication.
• If a user has a valid ticket, he can see secure search results without having to log in again.
• If a user does not have a valid ticket, or the search appliance is unable to perform
Kerberos authentication, the search appliance prompts the user for his credentials using
HTTP Basic or NTLM HTTP.
SAML SPI
The SAML Authentication and Authorization Service Provider Interfaces (SPIs) enable a
Google Search Appliance to communicate with an existing access-control infrastructure by
means of standard Security Assertion Markup Language (SAML) messages. The
Authorization SPI is also required to support X.509 certificate authentication during serve.
To read about a deployment scenario that uses the SAML SPI, see “SAML SPI deployment”
on page 76.
For more information on search appliance configuration for use with these SPIs, see “The
SAML Authentication and Authorization Service Provider Interface (SPI)” at http://
code.google.com/apis/searchappliance/documentation/60/secure_search/
secure_search_crwlsrv.html#the_saml_authentication_and_authorization_service_provider_interfa
ce_spi_.
To learn more about the Google SAML Bridge for Windows, see “Enabling Windows
Integrated Authentication” at http://code.google.com/apis/searchappliance/documentation/50/
admin/wia.html.
Policy ACLs typically store the results that would have occurred if the search appliance
initiated a HEAD request to verify authorization. However, policy ACLs can also be used to
override the decision that would have been returned by a HEAD request.
For example, suppose a policy ACL rule permits a group to see all documents at a URL, but
the source repository (that is, the HEAD request) applies a more fine-grained rule under
which only some members of the group can view the documents. With the policy ACL rule in
place, everyone in the group sees the search results, but only those with access rights can
open the links.
Policy ACLs can be an effective way to improve serving of results by carrying out authorization
checks more effectively. However, when making the decision to use policy ACLs, take into
account that you will need to manage synchronization to ensure that the latest security policies
are pushed to the search appliance.
You will also need a method for the search appliance to understand groups and user
identifiers. If you do not have an LDAP server configured to provide this information, you
need to push it to the search appliance by means of GData feeds.
Policy ACLs require that you use an authentication method to establish the identity of the user
or group that you specify in the policy ACL rules.
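A policy ACL lookup reduces to matching a result URL against rules and testing the authenticated user's identity and groups, roughly as follows. The URL prefixes and principal names are invented for illustration.

```python
from typing import Optional

def acl_allows(policy: dict, url: str, user: str, groups: set) -> Optional[bool]:
    """Return the policy ACL verdict for a URL, or None if no rule matches.

    policy maps a URL prefix to the set of users/groups allowed under it.
    None means 'no stored decision': fall back to a per-document
    authorization check against the source repository.
    """
    for prefix, principals in policy.items():
        if url.startswith(prefix):
            # Allowed if the user or any of the user's groups is listed.
            return user in principals or bool(groups & principals)
    return None
```

Because these verdicts are precomputed, serving avoids a round trip to the repository for matched URLs; the synchronization burden noted above is the price of keeping `policy` current.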
To read about a deployment scenario that uses policy ACLs, see “Policy ACLs deployment” on
page 77.
For more information on policy ACLs and secure search, see “Policy Access Control Lists” at
http://code.google.com/apis/searchappliance/documentation/60/secure_search/
secure_search_crwlsrv.html#PolicyAccessControlLists.
Enhancing technologies
The Google Search Appliance delivers core capabilities in a self-contained appliance model.
However, these capabilities can be supplemented by additional, non-core technologies.
• JavaScript
• Images
• OneBox modules
• Connectors
In some cases, off-box technologies may be used to work around non-compliant content
systems, or to enhance or enrich content. This approach typically takes the form of having the
Google Search Appliance crawl through a custom proxy, and might include:
Third-party tools/connectors
Integration with some content systems is typically achieved through connectors or feeds. In
many cases, it is quicker, easier, and possibly cheaper to pay a third party for a pre-built
connector. These connectors typically take the form of either implementations of the Google
Connector API or a feed/SAML SPI combination. If you need integration with a third-party
system, make a buy-versus-build evaluation when prebuilt solutions are available. You can
find these solutions at the Google Solutions Marketplace at http://www.google.com/enterprise/
marketplace/.
This chapter contains descriptions of architectures that address the following common
deployment scenarios:
Staging/Development environment
In most environments, it is advantageous to test changes in a separate environment before
releasing them to end users. As with any type of server or application, a small change to a
configuration can have unintended consequences, so a proper testing strategy and staging
environment are recommended.
A development environment for the Google Search Appliance simply means replicating the
production environment to provide a separate area for testing configuration changes and new
enhancements. The development environment should include access to the same content
types and sources as the production environment, but it may include a restricted or reduced
set of documents.
A common setup includes a non-production search appliance that does not serve results to
most end-users. All configuration changes, updates, and enhancements are tested on this
search appliance, and then pushed to the production search appliance(s) when ready.
Staging/Development environment
This is a recommended deployment architecture for all deployments. Where multiple search
appliances are deployed in production using features such as index replication or
federation, Google recommends that this also be reflected in development.
Where possible, avoid hard-coded naming conventions that might complicate migrating
configurations. For example, dev_collection or test_frontend would need to be renamed when
moving to production.
Simple architecture
In the simplest deployment scenario, a single Google Search Appliance can function by itself
to provide search results directly to end users. While a single server does provide some
redundancy (RAID, dual power supply), there are still many points of potential failure.
Google recommends deploying a single search appliance only when downtime or service
interruptions can be tolerated. A small company or departmental implementation may not have
a critical need for 99.9% uptime, but for mission-critical search applications where many
people depend on the availability of search, Google recommends an architecture that offers a
greater degree of operational continuity.
Simple architecture
This architecture is appropriate for non-critical systems. It is simple, inexpensive, and easy to
configure and maintain. However, it lacks redundancy, has multiple points of failure, and is not
consistent with best practices for critical systems.
Deployment Scenarios 63
Search as a web service
In this scenario, the Google Search Appliance is used as a search service. The search
appliance delivers its results as XML to a web server that directs the user experience. This
scenario is particularly important for public websites that employ page templates and inherited
stylesheets.
In this architecture, users never interact directly with the search appliance. Instead,
their searches are intercepted by a component on another website, such as a servlet or
custom control, proxied to the search appliance, and transformed into HTML on the web
server.
In this architecture, a primary website maintains control over the search experience (such as
stylesheets, page templates and inherited characteristics). Multiple searches can be executed
on a single results page (such as apple.com or reuters.com). However, secure search is
considerably more complex, due to the intermediary web server.
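A sketch of the proxy's two core steps: building the XML request URL and parsing the appliance's results. The `output=xml_no_dtd` parameter and the `<R>`/`<U>`/`<T>`/`<S>` result elements follow the appliance's XML search output format, but the base URL and frontend name are placeholders.

```python
import urllib.parse
from xml.etree.ElementTree import fromstring

def gsa_search_url(base: str, query: str, frontend: str = "default_frontend") -> str:
    """Build a search request asking the appliance for raw XML results."""
    params = urllib.parse.urlencode(
        {"q": query, "output": "xml_no_dtd", "client": frontend})
    return f"{base}/search?{params}"

def parse_results(xml_text: str) -> list:
    """Extract url/title/snippet from each <R> result element so the web
    server can render them in its own page templates and stylesheets."""
    root = fromstring(xml_text)
    return [{"url": r.findtext("U"), "title": r.findtext("T"),
             "snippet": r.findtext("S")} for r in root.iter("R")]
```

The intermediary component fetches `gsa_search_url(...)`, feeds the body to `parse_results`, and merges the items into its own HTML, which is how the primary website keeps full control of the search experience.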
High availability is a standard architecture for customer- or partner-facing websites where
search plays an integral role in the overall user experience, and for large-scale enterprise
deployments.
In this architecture, at least two Google Search Appliances are online and available to serve
results in case one unit fails. The number of search appliances required depends on the query
volume:
• If one search appliance can handle peak query volume, only one production search
appliance and one hot backup are required.
• If two search appliances are required to handle peak query volume, then two production
search appliances are required along with one hot backup. This configuration is known
as N+1 redundancy.
The Google Search Appliance itself has no utility to manage failover or load balancing—an
external load balancer handles this function.
Each search appliance needs to crawl and acquire content independently. Index replication, a
beta feature in Google Search Appliance release 6.0, can also be used to keep multiple search
appliances in sync. For information about index replication, see “Configuring Distributed Crawl
and Index Replication” at http://code.google.com/apis/searchappliance/documentation/60/
dist_crawl/dist_crawl.html.
The other alternative is to export the configuration from the master search appliance and
import it into the secondary servers. This process can be automated by using the
Administrative API or the Google Search Appliance admin toolkit.
For information about the Administrative API, see Google Search Appliance development
documentation at http://code.google.com/apis/searchappliance/documentation/60/index.html. For
information about the Google Search Appliance admin toolkit, see http://code.google.com/p/
gsa-admin-toolkit/.
For more information about high availability and load balancing, see “Architecting for reliability”
on page 50.
In most applications, the load balancer acts as a frontend and forwards requests to the
backend search appliances.
If users are going to be searching against secure content, configure the load balancer to
handle persistent (“sticky”) sessions. Otherwise, users may be prompted to re-authenticate.
Persistent sessions shouldn’t be required if authentication is handled by SSO or Integrated
Windows Authentication/Kerberos, or if there is only public content. Sticky sessions also
ensure consistent search result pagination across the session.
Configure the load balancer to run health checks on the backend search appliances. A simple
ping test can monitor network connectivity but fails to detect application-level failures. An
ideal health check queries the backend search appliances with a real search term and checks
the response.
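A health check along these lines can be sketched as a small script that the load balancer or a monitoring host runs periodically. The search parameters below use the appliance's default front end and collection names; substitute your own host, term, and names, and treat this as a starting point rather than a finished monitor.

```python
import urllib.request
from urllib.parse import urlencode

def health_check_url(appliance_host, term):
    """Build a real search query URL for an appliance health check.

    The client/site values are the appliance defaults; substitute your
    own front end and collection names.
    """
    params = urlencode({
        "q": term,                  # a term known to exist in the index
        "output": "xml_no_dtd",     # machine-readable results
        "client": "default_frontend",
        "site": "default_collection",
    })
    return "http://%s/search?%s" % (appliance_host, params)

def is_healthy(appliance_host, term, timeout=5):
    """Return True if the appliance answers the query with results XML."""
    url = health_check_url(appliance_host, term)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            # A serving appliance returns HTTP 200 and a <GSP> results document.
            return resp.status == 200 and b"<GSP" in body
    except OSError:
        return False
```

A ping would pass even when the serving process is down; checking for an actual results document catches that case.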
Deployment Scenarios 67
Typical load-balanced deployment
In a typical load-balanced configuration, two Google Search Appliances are physically
connected to a hardware load balancer or are located physically downstream. This setup is
used for increasing serving capacity. Both search appliances need to perform crawls, unless
index replication is used.
Also, the load balancer itself is a single point of hardware failure. If the load balancer fails,
physical access is required to restore search service, because either the IP address of the
search appliances must be changed, or the load balancer must be repaired.
However, this deployment requires a special load balancer that supports balancing or proxying
traffic to external virtual IPs. It also requires more complex ACLs, because rules for
additional IPs must be created. In addition, query traffic between the load balancer and the
switch is doubled.
Google Search Appliances can be deployed into such an architecture to provide the same
level of disaster recovery for search capabilities. Essentially, the high availability architecture
described on page 65 is deployed at the primary datacenter. The same configuration can be
mirrored in a redundant datacenter where the search appliances are configured to crawl and
index the same content within their respective datacenters.
This model parallels much of how existing systems and servers would be mirrored between
primary and backup datacenters for global redundancy. In the event of a disaster, this
configuration relies on the existing failover mechanism to divert traffic to the backup
datacenter where the search appliances are online and ready to respond to requests.
OneBox integration
One of the simplest to implement and powerful forms of search integration is the OneBox
module, because it enables retrieval of structured, current information whenever a user
searches.
Typically, you deploy this integration as a lightweight Java servlet, an active server page
(ASP), or a module in a scripting language such as Python or PHP. The integration is deployed
to a web server or virtual server, and extracts data from the content source as queries are
received.
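As an illustrative sketch, a minimal OneBox provider using only Python's standard library might look like the following. The lookup function is a stand-in for your real data source, and the response elements are a simplified rendering of the OneBox results schema; check the OneBox developer documentation for the exact format and the provider URL configuration.

```python
from wsgiref.simple_server import make_server
from urllib.parse import parse_qs
from xml.sax.saxutils import escape

def lookup_results(query):
    """Hypothetical data-source lookup; replace with a real query against
    your CMS, database, or application."""
    return [{"url": "http://example.com/doc1",
             "title": "Document about %s" % query}]

def onebox_xml(query):
    """Render results in a simplified form of the OneBox results schema."""
    rows = []
    for r in lookup_results(query)[:8]:  # a OneBox shows at most 8 results
        rows.append("<MODULE_RESULT><U>%s</U><Title>%s</Title></MODULE_RESULT>"
                    % (escape(r["url"]), escape(r["title"])))
    return ("<OneBoxResults><resultCode>success</resultCode>%s</OneBoxResults>"
            % "".join(rows))

def app(environ, start_response):
    # The search appliance forwards the user's query as a request parameter.
    query = parse_qs(environ.get("QUERY_STRING", "")).get("query", [""])[0]
    body = onebox_xml(query).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/xml")])
    return [body]

def main():
    # Host the provider (port 8080 is an assumption); point the OneBox
    # module definition's provider URL at this server.
    make_server("", 8080, app).serve_forever()
```

Because the provider only has to answer one HTTP request per query, the same logic ports directly to a Java servlet or a PHP page.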
OneBox architecture
OneBox integration is a powerful solution for adding incremental data in a rapid deployment
cycle. OneBox modules can be used to supplement algorithmic search results with non-
algorithmic data. OneBox modules are quick to deliver and deploy.
However, OneBox modules need to perform well by returning search results in under three
seconds. (If it takes more than three seconds, the OneBox module does not appear with the
search results.) Also, OneBox modules are not appropriate for large volumes of data (more
than eight results).
Feeds integration
Feeds can be pushed to the Google Search Appliance to enrich content with metadata or get
content into the index that the search appliance cannot discover through crawling. Most
commonly, a feed pushes the following types of content:
The feed server extracts data from a CMS and other applications, formats it into XML, and
feeds it into the search appliance, where it is added to the index and made searchable. A feed
can also be used to provide a list of URLs for public-facing content that is served from behind
JavaScript.
Where the feed is a metadata-and-URL feed, the search appliance still needs to be able to
crawl and access the content. If the search appliance cannot do this, use a content feed instead.
Feeds are an effective means of integration and of increasing the breadth of the index. The feeds
API is powerful and can handle high volumes of content.
However, if you are using feeds to achieve more rapid acquisition of content, consider using
aggregation design patterns that group documents into small batches, rather than pushing
high volumes of individual documents.
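The batching pattern can be sketched as follows. Each batch of URLs is wrapped in a single gsafeed document and posted to the appliance's feed port (19900 by default); the host name and data-source name here are assumptions to adapt, and the record structure follows the feeds protocol documentation.

```python
import urllib.request
from urllib.parse import urlencode
from xml.sax.saxutils import escape

def build_feed(datasource, urls):
    """Build a metadata-and-URL feed document for one batch of URLs."""
    records = "".join(
        '<record url="%s" mimetype="text/html" action="add"/>'
        % escape(url, {'"': "&quot;"})
        for url in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            "<gsafeed><header>"
            "<datasource>%s</datasource>"
            "<feedtype>metadata-and-url</feedtype>"
            "</header><group>%s</group></gsafeed>" % (datasource, records))

def batches(items, size):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def push_feeds(appliance_host, datasource, urls, batch_size=100):
    """POST each batch to the appliance's feed port (19900 by default)."""
    for batch in batches(urls, batch_size):
        form = urlencode({
            "feedtype": "metadata-and-url",
            "datasource": datasource,
            "data": build_feed(datasource, batch),
        }).encode("utf-8")
        urllib.request.urlopen(
            "http://%s:19900/xmlfeed" % appliance_host, form)
```

Grouping a hundred records per feed keeps the number of HTTP round trips and feed-processing cycles low compared with pushing each document individually.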
Connector integration
Connectors enable indexing and query-time connections between a Google Search Appliance
and non-web repositories, such as Enterprise Content Management (ECM) systems. A
connector instance traverses a document repository and feeds document data to the search
appliance for indexing. At query time, connectors forward authentication credentials and
authorization requests to the repository.
The key implementation detail in the connector architecture is the connector manager. The
connector manager simply provides an environment for the connectors to run within. The
connector manager also saves various configuration and state parameters for each connector
instance. A single connector manager can manage multiple connectors for multiple search
appliances.
The connector manager does not run on the Google Search Appliance; it must run on a
separate server. The connector manager, however, is fairly lightweight, and it is not usually
necessary to host it on a dedicated server.
Connector architecture
A typical connector server can be deployed on a lightly provisioned server. 2GB of RAM and
20GB of disk should be sufficient in most cases, although you should evaluate your own
specific requirements. The connectors can also be deployed on a virtual machine (VM).
The Google Search Appliance is configured with appropriate credentials to crawl and acquire
only the content you want to serve. When a user searches secured content, her user
credentials are checked to see if she is authorized to view content at serve time—ensuring
that users never see content that they are not entitled to view.
The following figure illustrates secure HTTP and file server content architecture.
A secure solution is relatively straightforward to configure. However, depending on your
authentication methods, the solution might require users to enter multiple sets of credentials.
A SAML provider is deployed and the Google Search Appliance is configured to be aware of it.
Content is crawled where possible and fed to the search appliance by standard content feeds
where content is not accessible.
When a user searches secure content, the user is authenticated through the SAML SPI. The
SAML provider is responsible for obtaining all necessary identities and authorizing the user
against each content source to make sure that only authorized content is displayed.
If the user searches content in a system for which the SAML provider does not handle
security, the search appliance reverts to one of the other security protocols (HTTP
Basic, NTLM, connector authorization, and so on).
Policy ACLs enable the search appliance to perform security checks more efficiently. This
method is particularly useful when either the network or your content repositories are slow and
may not support real-time authorization.
Access permissions are fed to the search appliance using the Policy ACL API either at crawl
time, or as needed.
The Google Search Appliance uses the policy ACLs where they are defined. Where they are
not defined, then one of the other security methods is used, or if content is not secured, it is
served unsecured.
The following figure illustrates public and secure policy ACLs architecture.
Public and secure policy ACLs architecture
For more information, see “Access control list caching” on page 59.
Results can be federated together from multiple nodes into one result set. It is more difficult to
provide redundancy for individual nodes in this scenario, because the federation mechanisms
currently do not provide any way to deal with failover on an individual per-node basis.
Typically, this will be achieved by configuring an identical failover deployment for high
availability. Index replication cannot be used with federation.
For more information, see “Architecting for scale and performance” on page 48.
Federated high availability deployment architecture
In many cases, the architecture and implementation of a Google Search Appliance and search
solution is simple. However, an implementation can also become much more complex as you
begin to use different combinations, such as federated, high-availability installations in
geographically diverse data centers. This complexity can come in the form of multiple
federated search appliances located in multiple locations around the globe, indexing content
from multiple repositories.
This more complex architecture shows the use of an application layer for presentation and
federated search appliances, using connectors and feeds, replicated in two global data
centers for disaster recovery.
Complex architecture
This example ties together many of the concepts from the previous examples to create a
redundant system with global scope. In this case, a global company has most of its content
based in North America. At the core are eight federated GB-9009s. By federating multiple
Google Search Appliances, administration can be split across multiple administrators or
departments within the company. A smaller network of three Google Search Appliances
indexes European content, and a load balancer splits traffic between the two sites.
• The eight North American Google Search Appliances could be split into two clusters of
four search appliances each, and then load balanced for capacity or redundancy.
• The European Google Search Appliances could be federated together with the North
American search appliances.
For more information, see “Architecting for scale and performance” on page 48.
82 Google Search Appliance Deployment Guide
Chapter 7: Post Deployment
Because the search solution is a core business system, you need to ensure processes are in
place for appropriate maintenance and management of it. After you successfully deploy your
search solution, transition it to Business As Usual (BAU). Because the Google search solution
is flexible and standards-based, post deployment can continue to be a period of evolutionary
growth and refinement.
The following sections discuss post-deployment best practices for a Google search solution:
Update planning
Google releases regular software updates to the Google Search Appliance about twice a year.
You are entitled to deploy any updates throughout your support term. When a new update is
released, consider updating your search appliance.
Google will notify you of any major release (such as the 6.0.0 release), but check the Google
Enterprise Support site (http://support.google.com/enterprise, password required) regularly for
release information. You can also contact your Google representative or Google Enterprise
partner to discuss updating your search appliance.
This section contains information about the following best practices for update planning:
This section also contains information about major releases (on page 85) and update releases
(on page 86).
Software release versions
Software release version numbering for the Google Search Appliance follows a consistent
format, as shown in the following example:
5.2.0.G32-P1
This can be read as software version 5, point release 2, update release G32, VM Patch 1.
Update releases are discussed in more detail on page 86.
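If you script around updates, the version string can be parsed mechanically. The pattern below assumes the format shown above stays consistent, with the -P patch suffix optional; it is a convenience sketch, not an official parsing rule.

```python
import re

# Matches strings such as 5.2.0.G32-P1 or 6.0.0.G14 (no patch suffix).
VERSION_RE = re.compile(
    r"^(?P<major>\d+)\.(?P<minor>\d+)\.(?P<point>\d+)"
    r"\.G(?P<update>\d+)(?:-P(?P<patch>\d+))?$")

def parse_gsa_version(text):
    """Split a release string into major, minor, point, G update, and patch."""
    m = VERSION_RE.match(text.strip())
    if not m:
        raise ValueError("unrecognized version string: %r" % text)
    return {name: (int(value) if value is not None else None)
            for name, value in m.groupdict().items()}
```

A parsed form like this makes it simple to compare the running release against the latest G release before planning an update.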
To find out what version your search appliance is running, use the Google Search Appliance
Version Manager at:
http://<your-search-appliance>:9941
To access the Version Manager, you need to log in as the admin user (no other user, even
one configured as an administrator, can access the Version Manager).
In general, Google recommends that you deploy the latest G release for your software version.
The release notes can help you to understand what search appliance behavior may have
changed and what testing you may want to carry out. Use the release notes to help you
determine if there are any specific challenges that an update can help you resolve. Also, look
for any open issues that you need to be aware of and plan for.
You can find the release notes for all current and recent software versions at https://
support.google.com/enterprise/doc/gsa/00/update_index_page.html (password required).
Read the update instructions and understand the sequence of events. The update instructions
provide the steps for executing the update, but you need to plan appropriately to manage the
business impact.
You can find the update instructions for all current and recent software versions at https://
support.google.com/enterprise/doc/gsa/00/update_index_page.html (password required).
You also need to remember the key phrase for full configuration files. Placing this phrase in
the check-in comments in your version control system is a useful practice.
Major releases
The process of updating your search appliance is simple. It consists of the following tasks:
1. Downloading the update binaries from the Google Enterprise Support site.
2. Uploading the update binaries to the search appliance by using the Version Manager.
Plan and execute the update as you would for any enterprise application.
Because the binaries are large, you might want to stage them on a local server so that the
Google Search Appliance can access them without leaving the LAN. It is important to use this
approach if the search appliance does not have external access. Always check the MD5
hashes before uploading binaries to your search appliance.
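The hash check before staging can be scripted; in the sketch below, the file path and the expected hash are placeholders you would take from the support site download page.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in chunks so that
    multi-gigabyte update binaries do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_binary(path, expected_md5):
    """Return True only if the staged binary matches the published hash."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Running this against the locally staged copy confirms the download was not corrupted before you spend time uploading it to the appliance.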
This process is not usually problematic, but factor it into your migration plans. The main impact
of this scenario is that it can add a couple of hours to the update time and it may mean that
you cannot take advantage of index migration, if available.
Update production
Typically, major updates require rebuilding the index to take advantage of new features. This
means that you need to factor this process into your plans. As of version 6.0, the Google
Search Appliance enables migration of the index also, so that the search appliance does not
need to recrawl content to get it back into the index. Check the release notes and update
instructions to see if this feature is available in the software version to which you are updating
your search appliance.
Post Deployment 85
If you plan to rebuild your index by crawling, include time in your update schedule and crawl
schedule to allow the search appliance to re-acquire content without placing excessive load on
your content servers. This may prolong the update process, so if possible, you should try to
use index migration.
Update sequence
The following steps outline the sequence for updating Google Search Appliances:
1. If you have a hot backup search appliance, update this first and execute regression tests
there.
2. Once the hot backup is updated, you can revert to serving users from your backup
server(s) while you update your production server(s).
3. Update your production servers either one at a time, or in parallel, so long as you have
alternate serving capabilities available. Alternatively, if you have a scheduled maintenance
window during which outages are planned, you could make use of it.
To handle updates without disrupting full capacity, consider having an additional node in
production in conjunction with your Google Search Appliance or load-balanced search
appliances.
• Test a new installation and if you migrated the index, test the index migration.
• Acquire content in the upgraded index while allowing the old index to continue serving.
This is a powerful tool for migrating with minimal disruption.
Update releases
In addition to the regular primary releases, Google releases smaller update releases. Update
releases are marked as G releases, such as release 6.0.0 G32. Check the support site at
https://support.google.com on a regular basis (once every month or two) for update releases.
Update releases are low-impact and aimed at delivering specific enhancements or addressing
specific issues. As a rule of thumb, you should consider ensuring that you are at the latest G
release available.
As usual before updating, read the release notes and update instructions.
The Google Search Appliance is licensed for either two or three years of use with full support.
Renewing a search appliance is not usually a complex process, but it does require planning. In
most cases, the process is similar to updating, described in “Update planning” on page 83.
However, the search appliance needs to re-acquire content and you need to plan for
deployment of physical search appliances.
However, the GB-5005 (4–10 million documents) and the GB-8008 (15–30 million
documents) have been replaced by the following new, more powerful units, with greatly
reduced form factors:
• GB-7007—The GB-5005 has been replaced by the GB-7007, which is a 2U unit. You will
no longer require special power configurations, such as the 15 Amp power supply.
• GB-9009—The GB-8008 has been replaced by the GB-9009, which consists of two units:
a 2U appliance and a 3U storage module, totaling 5U. Each node requires a power
supply, but you will no longer require special power configurations, such as the 15 Amp
power supply.
These units ship pre-configured in their own mobile rack unit. For search appliance physical
specifications, see “Planning for Search Appliance Installation” at http://code.google.com/apis/
searchappliance/documentation/60/planning/planning.html.
Google recommends procuring your search appliances far enough in advance to acquire
content before the planned renewal date. Be sure to consider all content acquisition methods,
including:
• Web crawl
• Database synchronization
If you are using connectors, you may need to run parallel deployments for a short period to
ensure that the content is acquired by the new search appliance without affecting the existing
deployment. You need to plan sufficient infrastructure for running parallel deployments.
DNS configurations • Ensure that host names will resolve to their new IP
addresses correctly and that you understand how long it
will take DNS changes to propagate. This is particularly
important for public-facing search being served directly
by the search appliance, where DNS management is
much less predictable.
Disaster recovery/hot backup • Review, and if need be, update your scripts and
processes to ensure that failover will be smooth and
business continuity is achieved. If possible, test
disaster recovery failover shortly after renewal.
Execute cutover
You should execute cutover to the new search appliances during a time of limited user activity.
It is recommended that you communicate actively with users. You should let them know:
To smooth migration of content, explore using index replication, introduced in software release
6.0.
Enterprise Support might require you to update your search appliance to a more recent
software release if it is on an older release. Another reason that support might ask you to
update your search appliance is so that you can take advantage of bug fixes that have been
implemented in more recent releases. Updating to a standard release enables you to get bug
fixes without having to deal with various patch releases.
Google Enterprise support engineers provide support and troubleshooting for core Google
products (the Google Search Appliance, connectors, and so on).
On occasion, Google Support Engineers require remote access to your search appliance to
troubleshoot issues.
When you contact Enterprise Support, provide the following information to help resolve your
issue:
• Remote access details for chosen methods (SSH configuration and routing, support call,
and so on)
• License information
• Detailed description of the problem, including error messages, screenshots, actions taken,
and so on
Premium support
You can purchase premium support from Google. Premium support entitles you to 24/7 pager
support, and improved service-level agreements (SLAs). Premium support also includes a
secondary search appliance that must be deployed with the same configuration as the
production search appliance.
Disconnected support
When providing disconnected support, Google support does not have remote access to the
search appliance. You can purchase disconnected support, if required, with approval from
Google support. It is recommended that you explore all other support options before pursuing
this option.
Additional support
Google Enterprise Support does not support broader deployment issues, such as custom
development supplementing the Google Search Appliance. You can purchase this type of
support from certified Google Enterprise partners in the Google Solutions Marketplace at
http://www.google.com/enterprise/marketplace/.
• Understand the business value and criticality of your search application. It is much easier
to assign a business value and priority to search if you know how it is benefiting users.
• Understand what your users are searching for and whether they are finding it effectively.
Insights that you gain will help you understand which features to use, and how to use
them. Giving your users a great search experience increases user satisfaction and
therefore the overall success of the solution.
• “Using core capabilities to help users find content more efficiently,” which follows
The Google Search Appliance provides an analytics feature, advanced search reporting
(ASR), that captures detailed information about user search and navigation activity. ASR can
be activated easily through the search appliance's web-based Admin Console. Analytical
information can then be extracted from the search appliance and imported into your existing
analytics tool, or you can process the data using scripts that you can download and
customize from Google Enterprise Labs.
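As one hypothetical processing sketch, suppose the exported click data has been normalized into rows of query, clicked_url, and rank. That column layout is an assumption for illustration, not the actual ASR export format; adapt the reader to whatever your export or analytics tool produces.

```python
import csv
import io
from collections import Counter, defaultdict

def summarize_clicks(csv_text):
    """Aggregate per-query click counts and the average clicked rank from
    a hypothetical (query, clicked_url, rank) export."""
    counts = Counter()
    ranks = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["query"]] += 1
        ranks[row["query"]].append(int(row["rank"]))
    return {q: {"clicks": counts[q],
                "avg_rank": sum(ranks[q]) / len(ranks[q])}
            for q in counts}
```

A report like this surfaces exactly the patterns discussed later in this chapter, such as queries where most users click a low-ranked result, which are candidates for KeyMatches or result biasing.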
• How many pages a user clicked through
• Which result on the page they clicked on and where they went.
Whatever your solution, Google highly recommends that you provide a rich analytics
capability, regularly examine the data to refine your search deployment, and identify ways to
add additional value.
For information about advanced search reporting, see “Gathering Information about the
Search Experience” at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_improving_search.html#gather.
However, after examining your users’ search behavior, you notice that 90% of searches for
“widget” are immediately followed by a second search for “gadget.” Similarly 50% of users
searching for “vacation” click on the fifth link—to your policy database.
Based on these observations, there are two immediate actions you might take to increase user
effectiveness:
• Activate query expansion and upload your own synonyms list, including an expansion that
equates widget and gadget, so that a search for “widget” automatically becomes a search
for “gadget.”
As a result of this enhancement, 90% of users running a search for widget or gadget will
find search twice as effective. Using query expansion and adding your lexicon to the
Google Search Appliance is a quick way to increase search effectiveness immediately.
For information about using query expansion and KeyMatches, see “Creating the Search
Experience: Best Practices” at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_improving_search.html.
Because you know that the corporate wiki is where your most useful content is, you can create
a result biasing profile that moves it higher in search results. By doing this, you ensure that the
corporate wiki appears in results where users can most quickly find it.
For information about creating result biasing profiles, see “Using Result Biasing to Influence
Result Ranking” at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_improving_search.html#h1resbias.
You can also understand what content is important by using analytics to gather information
about user clicks. If your organization has an existing analytics solution in place, it may be
possible to use this solution to provide analytical insight into the user search experience. In
many cases, integration with a third-party analytical solution requires some effort to get
search-specific reporting, but there is substantial value that can be derived from the data.
Chapter 8: Putting the User First
The success of your deployment depends not only on the breadth and depth of search, but
also on how satisfying and effective the search experience is for users. There are many things
you can do to drive user satisfaction and increase use of the search solution. The following
sections discuss tools for enhancing the search experience:
Presentation methods
There are two primary methods of delivering the search experience to your users:
Choose an appropriate method for your users based on the outcomes you are trying to
achieve and technical requirements.
Google Search Appliance presentation layer
The Google Search Appliance uses an XSLT stylesheet for its presentation layer. Using this
built-in presentation layer has several advantages:
• All presentation is rendered on-box and delivered directly to the user. The search appliance
does not require any additional hardware to manage presentation.
• Built-in user features (such as query suggestions, dynamic result clusters, and so on) can
be enabled and delivered to users as simply as selecting a checkbox.
However, there are some limitations—most notably that highly sophisticated, interactive or
JavaScript-rich user interfaces are more challenging to deliver, primarily due to the declarative
nature of XSLT and security restrictions that prevent uploading of content to the search
appliance. If the search experience is implemented using the built-in presentation layer, all
JavaScript must be embedded directly into the output HTML pages, which may lead to
browser inefficiencies.
For information about using the Google Search Appliance presentation layer, see “Creating
the Search Experience” at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_understanding.html.
Alternatively, using an application presentation layer has several advantages:
• Presentation can take full advantage of the flexibility and richness of modern programming
languages, such as Java, Python, .NET or even Flash to provide an extremely rich and
interactive UI.
• Removing the rendering of content from the search appliance also removes the
processing required by the search appliance.
• Additional resources (such as style sheets, JavaScript files, images, and so on) can be
hosted on a separate server and delivered to client browsers as included resources,
improving perceived performance to users.
• Security can be managed at the application level by allowing the application to determine
the collections and front-ends a user is able to see.
For a diagram illustrating use of an application presentation layer, see “Search as a web
service” on page 64.
For information about search results in XML, see “XML Output” at http://code.google.com/apis/
searchappliance/documentation/60/xml_reference.html#results_xml.
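As a sketch of the application-layer approach, the code below parses the appliance's XML output into simple (url, title) pairs that a front-end template could render. The GSP/RES/R/U/T element names follow the XML output format; verify the details, and any extra fields you need, against the XML reference for your software version.

```python
import xml.etree.ElementTree as ET

def parse_results(xml_text):
    """Extract (url, title) pairs from appliance XML results."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.findall("./RES/R"):
        url = r.findtext("U", default="")
        title = r.findtext("T", default="")
        results.append((url, title))
    return results

# A trimmed example of the results document an appliance returns when the
# search request includes output=xml_no_dtd.
SAMPLE = """<GSP VER="3.2">
  <RES SN="1" EN="2">
    <R N="1"><U>http://intranet/a.html</U><T>Travel policy</T></R>
    <R N="2"><U>http://intranet/b.html</U><T>Expense form</T></R>
  </RES>
</GSP>"""
```

An application layer would fetch this document over HTTP from the appliance's /search URL and render the pairs with its own templates, keeping all presentation logic off the appliance.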
For more information about front ends, see “Managing the Search Experience” at
http://code.google.com/apis/searchappliance/documentation/60/admin_searchexp/
ce_understanding.html#h1manexp.
• Product documentation
• Support requests
For example, a marketing or public relations department might want a visually rich, interactive
UI that enables them to search for previous communications, video, audio, and images. On the
other hand, IT support might want a fast, light UI that enables them to search for technical
content quickly.
To meet the different user interface needs of each department, a search appliance could have
two different front ends. To meet the different content needs of each department, a search
appliance could have multiple collections. Collections could be used to segment the index in
ways that serve the different departments.
If both departments need to search the same content, filtering, enrichment, and biasing
profiles can be used to provide a different set of results for each. While public-facing product
documentation is of primary interest to the marketing department, this content may be of
secondary interest to support, who should be able to find it, but as a secondary priority to
current support tickets.
Using front ends and collections together effectively can substantially improve the search
experience for all users through a powerful and flexible range of deployment options.
For more information, see “Using Collections with Front Ends” at http://code.google.com/apis/
searchappliance/documentation/60/admin_searchexp/ce_understanding.html#h2coll.
For example, Alpha Inc. is releasing AlphaLyon 3.0, a new version of its flagship product.
The company wants to ensure that when users search for AlphaLyon, information about
release 3.0 appears among the top search results.
The Google Search Appliance offers several enrichment features. The following table lists
some of these features.
Dynamic result clusters • Dynamic result clusters show different topics for a
specific search term. These topics enable users to focus
on areas of interest while ignoring irrelevant information.
When a user clicks on any of the topics, the search
appliance returns a new, narrower set of results.
Result biasing • Result biasing enables you to influence the way that the
search appliance ranks a result, based on URL,
document date, or metadata in or associated with the
result. You can use result biasing to increase or
decrease the scores of specified sources, or types of
sources, in the search index. These settings can affect
the order of the search results; giving different user
groups different biasing profiles provides a customized
search experience.
• As the user types “AlphaLyon” in the search box, query suggestions cause the search
query to auto-complete before the user finishes typing it. Alternative terms that narrow the
search, including “AlphaLyon 3.0,” also appear in a menu below the search box.
• A KeyMatch for AlphaLyon 3.0 appears at the top of the search results, proclaiming “New
Release! AlphaLyon 3.0 Documentation” that guides the user to the documentation for the
new release.
• Dynamic result clusters cause dynamically formed subcategories based on the results
of the search to appear along with algorithmic results. Each subcategory groups similar
documents together. For AlphaLyon, such categories might include “AlphaLyon 3.0
product information,” “AlphaLyon 3.0 documentation,” and “AlphaLyon support options.”
Instead of reading through all search results, users can browse a subcategory.
• Result biasing causes documents about AlphaLyon 3.0 to appear higher in the algorithmic
search results than documents about earlier versions.
Because Alpha Inc. enables user-added results, their users have the capability of adding
search results for key words. For example, a user adds a result for “AlphaLyon 3.0 Installation
Guide” that appears on the results page when anyone searches using the keyword
“AlphaLyon.”
Alpha Inc. has also enabled alerts, so users can monitor topics, such as AlphaLyon 3.0, and
receive search results about them in emails.
For comprehensive information about all Google Search Appliance enhancement features,
see “Creating the Search Experience” at http://code.google.com/apis/searchappliance/
documentation/60/admin_searchexp/ce_understanding.html.
Google Enterprise Labs features are usually pre-built, ready to go, and easy to deploy.
Many of the Google Enterprise Labs experimental features eventually graduate to the search
appliance, and become part of the core product. For example, query suggestions, dynamic
result clusters, and user-added results all started on Google Enterprise Labs but have now
been incorporated into the core on-board capability.
Although experimental features are not supported by Google, certified Google Enterprise
partners are experienced with these capabilities and are able to help implement them. You can
find a Google Enterprise partner at the Google Enterprise partner directory at http://
www.google.com/enterprise/gep/directory.html.
One of the best ways to innovate is by capturing user feedback on what users like and don't like
about the search solution, as well as understanding how they are using it. User feedback is
critical to a successful deployment. To deliver value, you must not only deliver a great search
experience, but also have users actively using it. There are several ways to gather feedback:
• Implicit feedback
• A feedback link
• A user survey
Implicit feedback
By activating advanced search reporting, or another analytical capability, you can
automatically see what your users are doing, where they are succeeding, and how you can
help them be more effective. However, it’s important not only to capture this data, but also to
use it.
For information about advanced search reporting, see “Gathering Information about the
Search Experience at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_improving_search.html#gather.
Feedback link
Make it easy for users to provide feedback by providing a link or an email address for
submitting their comments.
User survey
A user survey is a great tool to analyze how satisfying users find your search solution. Surveys
should be sent out regularly, and after each phase in your deployment, so that you can iterate
rapidly, and continue to delight your users. Appendix C, “Enterprise Search Satisfaction
Survey,” contains a sample user survey.
This appendix presents some best practices in the following major areas of search appliance
deployment:
• Crawl
• Feeds
• Index reset
• Collections
• Serving
• Security
• Ongoing administration
Use dual power sources
The GB-7007 and GB-9009 models ship with redundant power supplies. Even if your site does
not have dual power sources, it is beneficial to use both power supplies. At a minimum, each
power supply should be attached to a different circuit, and to separate UPSs if possible.
Crawl
When deploying a search appliance in a complex environment for the first time, it is best to
focus on the largest or most important content repositories, rather than trying to index
everything.
If the search appliance is putting too much load on your servers (crawling too aggressively),
you can change the default host load or add rules for specific hosts or time periods. Some
examples of when you would set specific host loads are:
• Limit crawl speed for "slow" hosts or hosts on slow network connections
• Crawl a new server quickly (and then drop the load down when complete)
Regular expressions are costly and can affect crawl and index speed. If you are using regular
expressions, you should optimize them for efficiency. For example, regexp:pdf$ is better than
regexp:pdf because the crawler only needs to check the end of the URL.
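As a rough illustration of the anchoring principle in Python's re module (GSA crawl patterns use their own regexp: prefix syntax, but the matching behavior is analogous; the URLs are placeholders):

```python
import re

# An anchored pattern only has to test the end of the string; an
# unanchored one may scan the whole URL for a match anywhere.
anchored = re.compile(r"pdf$")
unanchored = re.compile(r"pdf")

url = "http://intranet.example.com/docs/report.pdf"
print(bool(anchored.search(url)))    # True
print(bool(unanchored.search(url)))  # True

# The unanchored form also matches URLs that merely contain "pdf":
other = "http://intranet.example.com/pdf-tools/index.html"
print(bool(anchored.search(other)))    # False
print(bool(unanchored.search(other)))  # True
```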
In most cases, the pattern definitions have little impact on crawl performance, but take care
when dealing with:
• A large number of patterns, or very complex patterns matched against very long URLs
In general:
• Group similar patterns together when possible: param=(foo|bar|goo) is better than three
separate patterns
Feeds
For near real-time indexing, use feeds. For example, a publication company might need to
ensure that all content is searchable as soon as it is published. A content feed or metadata
and URL feed might be the most effective way to get this content into the index.
There are a number of cases where feeds can enhance search deployment. For a detailed
discussion of use cases, see “Feeds” on page 54.
Index reset
One scenario where an index reset may be warranted is if you have a lot of unlinked content
that is still in the index. In some cases, links are removed from web pages but the destination
content is still available. The Google Search Appliance continues to crawl these "orphan"
pages because it knows about them and they will not be removed from the index unless a 404
is returned (or they are otherwise excluded).
If you need to reset the index, export all of the URLs in the index before performing a reset.
Collections
The master index of the Google Search Appliance can be segmented into multiple collections.
Collections are useful for enabling users to narrow their searches to specific content areas.
Collections can also be used to provide segmented search results.
• You want to segment content into "Engineering," "Sales," "Marketing," "HR," and "All" and
to enable users to select which collection they wish to search.
• You want to provide the option for corporate users to search over public website content in
addition to corporate content.
Use dedicated service user identities to crawl protected content. Do not simply use the
administrator's identity or an arbitrary user ID, as crawling might fail if that user
changes their password or leaves the organization.
Serving
In most situations, you should enable query expansion. Although query expansion can have a
positive impact on search result relevancy and quality, it is disabled by default.
Creating your own query expansion dictionaries is a great way to provide synonyms for
acronyms, jargon, and company-specific terms.
Page Layout Helper • Use this option when you don't need to do much
customization to the default stylesheet. You can add
your own logo, change the header and footer, and
adjust basic results options.
XSLT stylesheet • Use this option when you want to serve formatted results
directly from the search appliance and apply your own
stylesheet. This enables you to customize every aspect
of the results pages. Also useful if you want to return
your own XML schema, RSS, or JSON.
Security
Don't mix public and internal content on a public-facing machine. Even though it may be
possible to index internal (intranet) and external (website) content with the same Google
Search Appliance, keep that data separate if the search appliance can be accessed publicly.
Ongoing administration
While the Google Search Appliance does not typically require a large team to manage the
deployment, you need to carry out some regular administration tasks, including:
• The Google Search Appliance provides the ability to send query logs to an external
syslog server. This can be especially useful if you have several search appliances in a
load-balanced configuration and wish to aggregate the logs in one central place.
Administration Tips:
• Monitor your document count. If you are approaching your index limit, consider upgrading
to include new valuable content.
• Look for unexpected document volumes from specific repositories—this may indicate
unexpected behavior, such as multiple URLs for the same document when a session ID
or similar is appended.
This appendix presents some technical solutions for common challenges in the following
major areas of search appliance deployment:
• Document relevancy
• Other areas
Google does not provide technical support for configuring servers or other third-party products
outside of the Google Search Appliance, nor does Google support solution design activities. In
the event of a non-Google Search Appliance issue, you should contact your IT systems
administrator. GOOGLE ACCEPTS NO RESPONSIBILITY FOR THIRD-PARTY
PRODUCTS. Please consult a product’s web site for the latest configuration and support
information. You might also contact Google Solutions Providers for consulting services and
options.
• Check to see if the repository supports access by means of HTTP (or HTTPS). If so, then
index it using standard HTTP start URLs.
• Check to see if there is a partner who has a connector for that particular repository that
you can use to index the content.
• If there are APIs available you could write a connector, or use them to extract the content,
generate a feed, and push the content into the Google Search Appliance.
I need the Google Search Appliance to crawl my Portal, but the cookie is strange and
doesn't conform to the RFCs—how do I crawl?
• The Google Search Appliance is designed to support internet standards known as RFCs.
When a content source does not follow the RFCs, you will need to manage the
non-standards-based implementation with supplemental technologies.
How can I explicitly specify the file types to be crawled rather than exclude what I do
not want to be crawled?
• This is a typical requirement in the case where file shares are to be indexed. Do this by
deleting everything from the "Do-not-crawl patterns" field and adding a regular expression to
the crawl patterns that looks something like this:
regexpIgnoreCase:^http://host\\.domain\\.com/folder/.*(\\.doc$|\\.xls$|\\.ppt$|\\.docx$|\\.xlsx$|\\.pptx$|\\.rtf$|\\.pdf$|\\.txt$|\\.htm$|\\.html$|/$)
Within the parentheses you can explicitly specify the file types to be crawled, separated by
the pipe sign. Remember that the sub-string /$ is mandatory in order to traverse the
directories.
(Note that this may not work if the content is streamed by means of an application so that
the file extension is no longer part of the URL.)
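As a sanity check, a pattern like the one above can be tried against sample URLs before entering it in the Admin Console. This Python sketch approximates the GSA pattern (re.IGNORECASE stands in for regexpIgnoreCase:; host.domain.com and /folder/ are placeholders):

```python
import re

# Approximate Python form of the crawl pattern above: a host/folder
# prefix, then any path ending in one of the listed extensions or a
# trailing slash (so directory pages can be traversed).
pattern = re.compile(
    r"^http://host\.domain\.com/folder/"
    r".*(\.doc$|\.xls$|\.ppt$|\.docx$|\.xlsx$|\.pptx$|"
    r"\.rtf$|\.pdf$|\.txt$|\.htm$|\.html$|/$)",
    re.IGNORECASE,
)

print(bool(pattern.search("http://host.domain.com/folder/a/report.PDF")))  # True
print(bool(pattern.search("http://host.domain.com/folder/a/")))            # True
print(bool(pattern.search("http://host.domain.com/folder/a/setup.exe")))   # False
```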
How can I have the Google Search Appliance index and serve public emails and
messages from MS Exchange 2003?
• You can index all the MS Exchange content with a Google Search Appliance out-of-the-
box if you have Outlook Web Access (OWA) enabled. With OWA, all emails and contacts
become accessible by means of HTTP, and everything is protected by HTTP Basic
authentication by default (other options are possible). This means that if you set up the
crawl patterns and the crawler access appropriately, you can get everything into the index
and serve it either with an AuthZ check by means of a HEAD request or by setting up
group policies.
• The full set of instructions on how to do this can be found at: http://docs.google.com/
View?id=dd6k8c37_41gkc6dwfj
I need to add URLs to be crawled to my Google Search Appliance dynamically. How can
I do this?
• While you can feed URLs into the Google Search Appliance, they must already exist in
the follow and crawl patterns. Therefore, in order to add them to the follow and crawl
patterns dynamically, use the Google Search Appliance Admin API; then you can either
use the Admin API to add them to the start URLs as well, or create a web feed to push
them into the search appliance.
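A web feed of this kind can be sketched in Python. The gsafeed XML structure and the port 19900 /xmlfeed endpoint are the documented feed interface; the hostname and URLs are placeholders, and a production script (such as pushfeed_client from the gsa-admin-toolkit) handles details this sketch omits:

```python
import urllib.parse
import urllib.request

# Build a metadata-and-url web feed for a list of URLs.
def build_web_feed(urls):
    records = "\n".join(
        '    <record url="%s" mimetype="text/html"/>' % u for u in urls
    )
    return """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
  <header>
    <datasource>web</datasource>
    <feedtype>metadata-and-url</feedtype>
  </header>
  <group>
%s
  </group>
</gsafeed>""" % records

# Push the feed to the appliance's feed port (hostname is a
# placeholder; error handling omitted in this sketch).
def push_feed(host, feed_xml):
    data = urllib.parse.urlencode({
        "feedtype": "metadata-and-url",
        "datasource": "web",
        "data": feed_xml,
    }).encode("utf-8")
    req = urllib.request.Request("http://%s:19900/xmlfeed" % host, data=data)
    return urllib.request.urlopen(req)

feed = build_web_feed(["http://intranet.example.com/new-page.html"])
print(feed)
```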
I am not sure whether my forms authentication protected site can be crawled without
any problems. How can I find out?
• Check whether the login procedure conforms to the usual HTTP standard:
1. Log in to your web site and copy the URL of one of your forms-authentication-protected
documents.
2. Close the browser and/or make sure you are really logged out.
3. Paste the URL into your browser in order to re-open the document.
The browser should redirect you to the login page, since you are not yet logged in. If
your server responds to requests from an unauthenticated user with "HTTP/1.x 302 Moved
Temporarily" and a redirect specified in the header field "Location: <the URL of the log
in page>", it behaves in a standards-conformant way. In this case the Google Search
Appliance will be able to get access to your protected documents. If your server responds
with "HTTP/1.x 200 OK" and displays the login page (or uses any other non-standards-
conformant way to display the login form), you need to find another way.
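This conformance check can be automated with a small script. The sketch below classifies the server's response the same way the steps describe; the status codes and the Location header are standard HTTP, while the host and path would be your own:

```python
import http.client

# Classify how a server answers an unauthenticated request for a
# protected document: a redirect to the login page is the
# standards-conformant behavior the crawler can work with.
def classify_login_response(status, location):
    if status in (301, 302, 303, 307) and location:
        return "conformant"
    if status == 200:
        return "non-conformant"
    return "unknown"

# Fetch one protected URL without following redirects (host and
# path are placeholders for your own site).
def check_site(host, path):
    conn = http.client.HTTPConnection(host)
    conn.request("GET", path)
    resp = conn.getresponse()
    return classify_login_response(resp.status, resp.getheader("Location"))

print(classify_login_response(302, "http://sso.example.com/login"))  # conformant
print(classify_login_response(200, None))                            # non-conformant
```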
• Check to ensure that the login page does not use JavaScript. The Google Search Appliance
forms authentication wizard can tolerate forms with some basic JavaScript, such as scripts
that perform range checks prior to submission. Such code is normally okay as long as the
form submission itself is not implemented JavaScript-style (for example, by means of a
"javascript:" URL). Also be careful when an onSubmit() function is used, because the form
submission behavior will be different for the wizard; in that case, create a login page that
does not use JavaScript.
Also make sure that the forms page, if it uses JavaScript, does not alter or add parameters
before submitting them. If it does, these parameters will need to be adapted into the
non-JavaScript version of the login page.
Likewise, any hidden parameters will also need to be incorporated in order to allow the
Google Search Appliance to successfully log in and access the website.
• When JavaScript code cannot be easily removed, it is possible to work around this using
additional tools, such as the Firefox add-on Firebug. Use it to intercept the request and
manipulate the objects. This does not always work, but in cases where particular static
fields are to be added, it should.
For example, some applications may prefix the username with an internal code before it is
submitted; because the prefix is static in most cases, the add-on is the easiest way to move
the wizard forward without needing a non-JavaScript form.
• Also, internal sites sometimes use SSL, but either without a valid certificate or without
configuring the CA in the Google Search Appliance. In this case, you can try to use plain HTTP
(provided that this is still supported and allowed). During search, the style sheet can be
customized so that the protocol of the results is converted from http to https.
I have documents that are larger than 30M in size. How can I get these indexed?
1. Convert these documents to text (there are a number of freeware and shareware
applications available to do this).
2. Wrap the extracted text in an HTML document.
3. Apply a meta tag that has a unique name and has a value of the original file location.
This HTML document can then be indexed or fed in to the Google Search Appliance, thus
allowing the textual content to be searched.
This will allow the content to be searched, and the original document to be accessed by
means of the results list. Be aware, however, that relevancy depends among other factors
on text formatting, so this solution might affect relevancy.
I am trying to get the Google Search Appliance to crawl a URL contained within
JavaScript but the crawler won't pick it up. How can I get it?
• Use anything other than crawling, such as a web feed, to make up for the site coverage
deficiency caused by the use of JavaScript.
How can the Google Search Appliance index a personalized portal? What about a portal
that allows both guest users and registered users to use the same URLs?
• A dynamic application is in most cases template-based. Add googleon/googleoff tags to
avoid indexing redundant and/or contextual info such as the header, footer, left nav, top
nav, right panel, and so on.
• Any personalized content fragment (such as greeting messages or a message inbox portlet)
should be excluded from indexing, either by means of do-not-crawl patterns or
googleon/googleoff tags.
• If a URL is served to both guests and members with different behavior, the application
should accommodate the crawler's need to differentiate those versions. Ideally, the
crawler could start off with an extra parameter such as "&as=guest" or "&as=member,"
and the application should preserve this parameter throughout the application. A collection
should be generated based on the extra parameter in the URL patterns, and the front end
style sheet should strip it out when rendering results. (For security reasons, this extra
parameter should only be processed by the application if the requests come from known
Google Search Appliance IP addresses.)
I use a CMS system that is easier to crawl. But the content is published to a different
production system, which is not suitable for crawl (or not allowed due to load issues).
What are the things that I need to consider?
• URL conversion.
• Different security mechanisms. Google Search Appliance assumes that the security used
for crawl would also be used for authorization. Try to use policy ACL to work around this
issue.
I need to apply metadata to URLs that the Google Search Appliance is crawling before it
is indexed. How can I do this?
• Use a proxy when crawling and apply the metadata, based on programmatic rules, to the
data before passing it through to the Google Search Appliance.
• Web-enabling the file server will allow you to index the content. If you have a good web
server, such as Apache httpd, which can be configured for strong security, there shouldn't
be any security concerns. It can also be configured so that only the Google Search
Appliance's IP address can access it, making it completely inaccessible to any other
machine.
I have Novell Netware that lacks CIFS and web-enabled support. How can I integrate
the Google Search Appliance with it, by means of a connector or some other
mechanism?
• By utilizing code that uses the Novell Java libraries to check permissions against
eDirectory (which has a concept called "effective rights"), you can crawl over CIFS-
enabled drives. If you query for this on a per-document basis, you can get permissions. An
administrator will need to set it up, and there is a bit of trial and error in getting the
permissions right, because effective rights come from both the directory and the parent
container.
• You can also use the instructions on How to Index and Serve Novell Netware File Servers
with a Google Search Appliance which can be found at: http://docs.google.com/
View?id=dd6k8c37_42ch8twqcg.
I want the Google Search Appliance to index content from Oracle Content Server/
Stellent. How can I accomplish this?
• By using GoogleOn and GoogleOff tags you can prevent all, or portions, of a web page
from being indexed. The full use of these tags can be found at: http://code.google.com/apis/
searchappliance/documentation/60/admin_crawl/Preparing.html#pagepart.
• The following URL patterns will include the top three subdirectories on the site
www.mysite.com:
regexp:www\\.mysite\\.com/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*/[^/]*$
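To see how these depth-limiting patterns behave, they can be tried in Python (the regexp: prefix is dropped and the escaping adapted to Python raw strings; www.mysite.com is the placeholder host from above):

```python
import re

# Python versions of the three patterns above; [^/]* matches one
# path segment (no slash), so each pattern pins URLs to an exact
# depth, and together they cover depths one through three.
depth_patterns = [
    re.compile(r"www\.mysite\.com/[^/]*$"),
    re.compile(r"www\.mysite\.com/[^/]*/[^/]*$"),
    re.compile(r"www\.mysite\.com/[^/]*/[^/]*/[^/]*$"),
]

def within_three_levels(url):
    return any(p.search(url) for p in depth_patterns)

print(within_three_levels("http://www.mysite.com/a/b/c"))    # True
print(within_three_levels("http://www.mysite.com/a/b/c/d"))  # False
```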
• This problem primarily has an impact on serving. Crawling executes in the background, so
while this has an impact on the speed of content acquisition, it does not have an impact on
user experience. To improve performance during search, you could use policy ACLs and
early binding to allow the Google Search Appliance to manage authorization in a
performance-optimized way.
When the built-in UI is served through secure HTTP (for example, access=[a|s]), and the
interface has customized page elements, for example, a logo - served from a non-
secure HTTP source, web browsers will usually display a warning to alert that secure
and non-secure page components are being displayed every time the page is loaded. Is
there any way to suppress the warning?
• Either the images have to be served from a secure server, or the browser's options have
to be set to suppress the warning. The latter would require a change to every user's
browser and is not advisable.
We require a unified search, across multiple secure repositories, on one Google Search
Appliance. How can we implement this with silent authentication or single sign-on?
I have a specific page (or pages) indexed into my Google Search Appliance that I would
like to remove. How can I accomplish this?
• If you want this page visible in other front ends, then you can force a front end to ignore it,
by adding this URL to the Remove URLs tab for that specific front end.
• If you would like to completely remove this document from the Google Search Appliance's
index, then you can use a delete feed. For more information on creating a feed which will
delete content, see the appropriate section in the Google Search Appliance Feeds Guide:
http://code.google.com/apis/searchappliance/documentation/60/feedsguide.html#removing_url
• You can use the remove or recrawl URL tool in the Google Search Appliance Admin
Toolkit (http://code.google.com/p/gsa-admin-toolkit/).
When indexing by means of SMB, directory pages get indexed and can appear in the
results. How can I make sure that these pages are not shown in the search results?
• Put ./$ as an exclude pattern for a collection, and directory pages will not be part of the
collection.
• You can use SMB to crawl the content. The only real issue to watch out for is the fact that
the Mac OS won't initiate the SMB processes until someone initiates a connection.
Document relevancy
How can the Google Search Appliance sort the results by criteria other than relevancy
and date?
• It is exactly the purpose of a search engine to sort the search results by relevancy;
anything else is rather the output of a database query. Unlike Google web search,
however, the Google Search Appliance can also sort results by date.
If you need to sort the results by some other numeric value, you can repurpose the
date-sort feature. To do so, convert the value to an ISO 8601 date (YYYY-MM-DD)
and insert it into a meta tag in your document. The lowest value must not map to a date
earlier than January 1, 1970. Then set up the respective name of the meta tag in the
"Document Dates" section of the Admin Console. The Google Search Appliance considers
the value of this meta tag to be the document date and can sort by this value.
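A minimal sketch of this conversion in Python, assuming the numeric value is a non-negative integer and the meta tag name (here "sortvalue") is one you choose yourself and configure under "Document Dates":

```python
from datetime import date, timedelta

# Map a non-negative integer onto an ISO-8601 date at or after
# 1970-01-01, as the date-sort trick above requires.
def value_to_sort_date(value):
    return (date(1970, 1, 1) + timedelta(days=int(value))).isoformat()

# "sortvalue" is a hypothetical meta tag name; use whatever name
# you configure in the Admin Console.
print('<meta name="sortvalue" content="%s">' % value_to_sort_date(12345))
```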
I want to promote a URL to the top of the results. How can I do this?
• Use KeyMatches.
• Create a result biasing policy which increases the relevancy of documents based upon the
URL. Attach this policy to the appropriate front end.
How can I increase the relevancy, in the search results, of more recent documents?
• Create a result biasing policy which increases the relevancy of documents based upon the
date that they were last modified. Then, attach this policy to the appropriate front end.
How can I modify the relevancy of specific URLs, either increasing or decreasing it?
• Specify rescoring for results that exactly match specific URL prefixes
• Influence results rankings programmatically for an unlimited number of URL prefixes
• A front end only reloads itself into memory every 15 minutes (or even longer).
Therefore, in order to force a reload of the front end, you must use the parameter
proxyreload=1 in the query URL at least once after the style sheet has been modified. This
parameter should only be used for a refresh during development, and not in production, as
it will negatively impact the performance of the Google Search Appliance.
How can I give developer access to the front end so that they can make changes
without being able to affect my KeyMatches, and so on?
• You can create two front ends, using some naming convention. For example, use the one
called "my_frontend" to manage KeyMatches, related queries, filters, remove URLs, and
OneBoxes (collectively known as "client"). Then create another one called
"my_frontend_ss" to manage the user interface (or output as it is denoted in the Admin
Console), which is referred to as "proxystylesheet".
• Give the UI developer access to "my_frontend_ss" only so they can update their style
sheet there.
• Retain control over "my_frontend," where the user's search experience is managed by a
non-UI developer.
• If you want to make use of most configurations in a front end for different user interfaces,
while you want to have different options for query expansion policies and/or result biasing
policies, do not create multiple front ends for this. Use "entqr" and "entsp" instead.
Other areas
I don't want documents with credit card numbers or SSN (or some other pattern) to be
returned in a search. How can I ensure this?
1. Export all of the URLs in the index.
2. For each URL, run a third-party program to make sure they are of good quality (that is, no
bad words, no sensitive information).
or
• You can have the Google Search Appliance crawl through a proxy, and have the proxy
block content that matches specific patterns.
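The pattern-checking step could be sketched as follows. The SSN and card regular expressions are illustrative only, not production-grade detectors:

```python
import re

# Illustrative detectors for US SSNs and 16-digit card numbers; a
# real deployment would use a more robust DLP check.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")

def contains_sensitive_data(text):
    return bool(SSN.search(text) or CARD.search(text))

print(contains_sensitive_data("Employee SSN: 123-45-6789"))    # True
print(contains_sensitive_data("Quarterly revenue was $4.2M"))  # False
```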
When using federation (dynamic scalability) between two or more Google Search
Appliances, do I require 'real' signed certificates?
• While the federation between Google Search Appliances can be done using the self-
signed certificates, we recommend that customers do not use them, but rather, use their
own 'real' signed certificates.
How can I see the XML that the Google Search Appliance is sending back before it gets
transformed?
• For results remove the proxystylesheet parameter and value. For example:
• http://gsahost.domain.com/
search?q=query&btnG=Google+Search&access=p&client=default_frontend&output=
xml_no_dtd&sort=date:D:L:d1&entqr=0&oe=UTF-8&ie=UTF-
8&ud=1&site=default_collection
• For dynamic results clustering, you can directly query the Google Search Appliance for the
XML output. For example:
• http://gsahost.domain.com/
cluster?q=query&site=default_collection&client=default_frontend&coutput=xml
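Such raw-XML request URLs can also be assembled programmatically. This sketch builds a results request without the proxystylesheet parameter (gsahost.domain.com is the placeholder host from the examples above):

```python
from urllib.parse import urlencode

# Assemble a results request that returns raw XML: omitting the
# proxystylesheet parameter means no stylesheet transformation runs.
params = {
    "q": "query",
    "client": "default_frontend",
    "site": "default_collection",
    "output": "xml_no_dtd",
    "access": "p",
}
url = "http://gsahost.domain.com/search?" + urlencode(params)
print(url)
```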
How can I troubleshoot my Google Search Appliance because something isn't working
as expected?
• The gsa-admin-toolkit package (http://code.google.com/p/gsa-admin-toolkit/) includes
numerous monitoring scripts, reverse proxies, admin scripts, and so on.
• If you absolutely need this, use a custom parameter to indicate a language choice
(such as "en", "fr", or "es") for the search interface. The application should
receive that language preference and convert it into an Accept-Language request header
in the request to the Google Search Appliance.
• Create a simple HTML page, that calls a back-end program that uses the Admin Console
API to generate and export reports.
• Sync the Google Search Appliances logs with an external syslog service and create your
own reports.
How can I integrate the Google Search Appliance into a non-web application?
• The Google Search Appliance will accept HTTP requests, and can return XML (or other
formats after having been transformed by means of an XSLT). The returned results can
then be parsed by an application, written in the language of your choosing, and then used
for whatever purpose the application requires.
• Say PR manages a collection "corp_cnt," marketing manages a second collection
"mktn_cnt," and engineering manages a third collection "engr_cnt." There are two user
groups: one needs "corp_cnt" and "mktn_cnt," and the other needs "corp_cnt" and
"engr_cnt." In this case, it is better not to create two collections for these two user groups,
because there are three distinct owners of this content. So, create the three collections as
above. When searching, use "site=corp_cnt|mktn_cnt" and "site=corp_cnt|engr_cnt"
respectively.
• Engineering
• Finance
• Human Resources
• Sales
• Marketing
• Research
3. How often does your result show up in the top 10 (first page)?
• Never
4. How often does your result show up as the first result?
• Never
5. How often do you click on one of the Recommended Links (the shaded key matches at
the very top of the results)?
• Sometimes
• Never
• Excellent
• Sufficient
• Unacceptable
8. Which content sources would you like to see indexed (added to the search results)?
• _________________________________
• _________________________________
• _________________________________
• _________________________________
• _________________________________
• _________________________________
9. Have you ever had documents that you knew existed but couldn't find them with
search?
• Yes
• No
• It's alright
____________________________________________________
____________________________________________________
____________________________________________________
http://groups.google.com/group/Google-Search-Appliance
http://www.google.com/enterprise/marketplace/
https://support.google.com/enterprise/terms
http://code.google.com/apis/searchappliance/documentation/index.html
http://www.learngsa.com
http://code.google.com/apis/searchappliance/documentation/remote_access/remote_access.html
Google Search Appliance Deployment Guide