GSA Deployment Guide

Google Search Appliance
Deployment Guide
September 2009
Google Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94043
www.google.com
21 September 2009
© Copyright 2009 Google Inc. All rights reserved.
Google, the Google logo, Google Search Appliance, GSA, the Google Mini, Google Site Search, and GSS are trademarks,
registered trademarks, or service marks of Google Inc. All other trademarks are the property of their respective owners.
Use of any Google solution is governed by the license agreement included in your original contract. Any intellectual property
rights relating to the Google services are and shall remain the exclusive property of Google, Inc. and/or its subsidiaries
(“Google”). You may not attempt to decipher, decompile, or develop source code for any Google product or service offering,
or knowingly allow others to do so.
Google documentation may not be sold, resold, licensed or sublicensed and may not be transferred without the prior written
consent of Google. Your right to copy this manual is limited by copyright law. Making copies, adaptations, or compilation works,
without prior written authorization of Google is prohibited by law and constitutes a punishable violation of the law. No part of
this manual may be reproduced in whole or in part without the express written consent of Google. Copyright © by Google Inc.
Google provides this publication “as is” without warranty of any either express or implied, including but not limited to the implied
warranties of merchantability or fitness for a particular purpose. Google may revise this publication from time to time without
notice. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions; therefore, this
statement may not apply to you.
2 Google Search Appliance Deployment Guide

Contents
Chapter 1: Introduction....................................................................................... 5
Welcome to the Google Search Appliance............................................................ 5
About this guide..................................................................................................... 6
Disclaimer for Third-Party Product Configurations ................................................ 8
Chapter 2: Understanding Your Deployment.................................................... 9

Understanding your users ................................................................................... 10
Understanding your content ................................................................................ 11
Understanding your business processes............................................................. 12
Understanding your architecture ......................................................................... 12
Chapter 3: Planning for Successful Deployment ........................................... 15

Capturing requirements ....................................................................................... 16
Identifying phases................................................................................................ 22
Defining success criteria...................................................................................... 28
Transitioning to business as usual ...................................................................... 28
Chapter 4: Project Scenarios ........................................................................... 31

Basic search on a public website ........................................................................ 32
Basic internal search ........................................................................................... 35
Internal search over intranet, file system, SharePoint, and Notes....................... 38
Internal search including CMS, database, corporate application assets ............. 40
Chapter 5: Deployment Architecture ............................................................... 45

Sizing the index ................................................................................................... 45
Architecting for scale and performance ............................................................... 48
Architecting for reliability...................................................................................... 50
Architecting for reach........................................................................................... 52
Architecting for security ....................................................................................... 56
Enhancing technologies ...................................................................................... 60
Chapter 6: Deployment Scenarios ................................................................... 61

Staging/Development environment ..................................................................... 62
Simple architecture.............................................................................................. 63
Search as a web service ..................................................................................... 64
Contents 3
High availability architecture................................................................................ 65
Disaster recovery deployment architecture ......................................................... 69
Integrated architectures....................................................................................... 71
Security solutions ................................................................................................ 75
Federated architecture ........................................................................................ 79
Chapter 7: Post Deployment ............................................................................ 83

Update planning .................................................................................................. 83
Planning for renewal............................................................................................ 87
Optimizing support............................................................................................... 89
Using reports to enhance the search experience ................................................ 91
Chapter 8: Putting the User First ..................................................................... 95

Presentation methods.......................................................................................... 95
Enrichment features ............................................................................................ 99
Google Enterprise Labs..................................................................................... 100
User feedback ................................................................................................... 101
Appendix A: Best Practices............................................................................ 103

Datacenter and installation ................................................................................ 103
Crawl ................................................................................................................. 104
Feeds................................................................................................................. 105
Index reset......................................................................................................... 106
Collections ......................................................................................................... 106
Serving .............................................................................................................. 106
Front end stylesheets ........................................................................................ 108
Security.............................................................................................................. 108
Ongoing administration...................................................................................... 108
Appendix B: Technical Solutions for Common Challenges ........................ 111

Crawling and indexing content .......................................................................... 111
Security and serving secure content ................................................................. 116
Document relevancy.......................................................................................... 117
Interfaces and front end customization.............................................................. 118
Other areas........................................................................................................ 118
Appendix C: Enterprise Search Satisfaction Survey ................................... 121
Appendix D: Other Resources ....................................................................... 125

Introduction Chapter 1
Welcome to the Google Search Appliance

The Google Search Appliance is a full-featured enterprise search solution that brings Google’s
award-winning search technology to the enterprise. The Google Search Appliance provides
high levels of search relevancy, scalability, and redundancy to meet the ever-growing,
mission-critical information access demands of any organization.
Unlike many enterprise applications, the Google Search Appliance is designed to be self
sufficient: hardware, software, networking, storage, and security support are built in, and can
be easily supplemented with additional capabilities.
This document outlines several considerations for successfully deploying Google Search
Appliances to meet the document capacity, scalability, and redundancy needs of an
enterprise.
Right for your business

The Google Search Appliance delivers the same powerful search algorithms as Google.com
from a self-contained appliance. Your users will get the same great relevance and experience
searching your company’s information in the office as they get searching the web at home.
Great value
Because the Google Search Appliance is self-contained, it delivers core search capabilities
out of the box with no additional hardware required. However, you can supplement the search
appliance with off-box capabilities to deliver universal search at a compelling price. Ongoing
operating cost is lowered by reducing the effort to administer and maintain a search solution
substantially, delivering powerful, intuitive search at a low, compelling Total Cost of Ownership
(TCO).
5
Easy integration
The Google Search Appliance seamlessly integrates with existing information technology (IT)
infrastructures through industry standards and best practices. Custom integration can be
delivered through open standards, such as Security Assertion Markup Language (SAML) for
Single Sign-On (SSO) and heterogeneous security, and well-documented, standard
Application Programming Interfaces (APIs).
Constant innovation
Innovation is the hallmark of Google Enterprise. The Google Search Appliance takes
advantage of the innovations tested on Google.com and proven by hundreds of millions of
users worldwide. In addition to regular software releases, you can add innovations to a search
solution from Google Enterprise Labs or by harnessing the power of Google’s cloud
capabilities to deliver core search capability.
Continuous increase in ROI

The Google Search Appliance delivers immediate return on investment (ROI), increasing
rapidly with short deployment cycles.
The Google Search Appliance’s flexible architecture and open technologies enable you to
deploy it rapidly. Once deployed, the search appliance offers increased value by unlocking
more of the value in your business’s information assets through continuous innovation,
incorporation of additional content, and rich user functionality.
About this guide

This guide provides an overview of deploying a search solution using the enterprise-class
Google Search Appliance. The focus of this guide is on best practices and proven approaches
to architecture and deployment methodologies.
This guide assumes basic knowledge of the Google Search Appliance. However, this guide is
not a technical “how-to” document. For in-depth information, visit Google’s rich and
comprehensive public search appliance documentation at http://code.google.com/apis/
searchappliance/documentation/index.html.
A search solution can be deployed as a traditional monolithic project or by using agile, even
extreme project methodologies. Whatever the project methodology, there are guiding
principles that have been used in most successful search implementations. This document
discusses these guiding principles, giving you the information you need to plan your
deployment with the right phases or micro-phases.

What’s in this guide
This guide focuses on deployment best practices. There are several components to this,
including:
• Approaches for planning and executing a Google Search Appliance deployment
• Architectural best practices
• Techniques to increase adoption and user satisfaction
In this guide, you can also find comprehensive information about the following topics:
• The foundations of a successful deployment
• Ensuring your deployment is optimized for support
• Designing an architecture to meet your technical and business requirements
• How to plan deployment phases to achieve quick wins, while delivering ongoing value
• Supplementing the search appliance with enriching technologies
• Enabling core features to maximize value
Who this guide is for

This guide is primarily for IT administrators and project managers who plan and manage a
deployment of the Google Search Appliance, as well as certified Google Enterprise partners
who assist customers in deploying their search appliances.
This guide also provides useful information for other technical and managerial personnel who
are involved in making decisions about IT infrastructure for your company.
How to use this guide

Use this guide as a starting point to help plan and manage your Google Search Appliance
deployment. The concepts, instructions, and advice in this guide are intended to provide
general information only. Because organizations have a wide variety of IT infrastructures, the
methods you ultimately use to set up and manage your search deployment might differ from
those described in this guide.
Although Google recommends that you read this entire guide, you don’t have to. Depending
on your organization’s infrastructure, your goals, and your own experience, you can use this
guide as a reference and read just the sections that are applicable to you.
It is recommended that implementation of a search solution proceed with the support of a

Google Enterprise partner. Your Google representative will be able to recommend one, or you
can find them yourself in the Google Solutions Market Place at http://www.google.com/
enterprise/marketplace/. Partners may use their own methodologies or enhance the contents of
this guide based on their experiences in the field.
Introduction 7
Resources that complement this guide
For a detailed list of the resources that this guide refers to, see “Other Resources” on
page 125.
Where to find the latest version of this guide

Google continually enhances its products and services, so the content of this guide will
change from time to time. To ensure you have the most up-to-date version of this guide, visit
www.learngsa.com.
How to provide comments about this guide

Google values your feedback. If you have comments about this guide or suggestions for its
improvement, please send an email message to:
search-deployment-guide@google.com
In your message, be sure to tell us the specific section to which your comment applies.
Thanks!
Disclaimer for Third-Party Product Configurations

Parts of this guide describe how Google products work with diverse customer environments
and configurations that Google recommends. These guidelines are designed to work with
common environments and deployment scenarios, and should be adapted to your
environment. Any changes to your environment, including installation of the Google Search
Appliance and related technologies, should be made in conjunction with the oversight and
approval of your IT teams.
Google does not provide technical support for configuring servers or other third-party products
outside of the Google Search Appliance, nor does Google support solution design activities. In
the event of a non-Google Search Appliance issue, you should contact your IT systems
administrator. GOOGLE ACCEPTS NO RESPONSIBILITY FOR THIRD-PARTY PRODUCTS.
Please consult a product’s web site for the latest configuration and support information. You
might also contact Google Solutions Providers for consulting services and options.

Understanding Your Deployment Chapter 2
To make the most out of your search deployment, you need to understand how users in your
organization will use search. You also need to understand the content and processes that will
benefit from search, and the architecture that will support it.
This chapter presents issues and questions that will help you in:
• Understanding your users, as described page 10
• Understanding your content, as described on page 11
• Understanding your business processes, as described on page 12
• Understanding your architecture, as described on page 12
The information that you gather as you address the issues listed in this chapter helps you to
define your deployment architecture and project plan.
For a simple deployment, you might gather information in a single meeting. For more complex
deployments, you might use a series of workshops and surveys.
9
Understanding your users
The success of your deployment hinges on how much your users use the search solution and
how effectively they do so.
The Google Search Appliance delivers powerful search capabilities out of the box, including a
search experience that the vast majority of your users are already familiar with from
Google.com. However, you can substantially enhance the user appeal and overall richness of
the search experience by understanding your users and what they will be trying to do with
search.
To understand your users and their search needs, consider the following questions.
How many users do you have • Are users internal, external, or both?
and where are they?
What will your users be using • It’s not just a search capability – what benefit will your
the search appliance for? users get?
• What does the search experience need to provide for
users to regard it as successful?
Are there different groups of • Do they require specific search capabilities?

users?
• Do different user communities need different content or
a different search experience?
• How important is search speed to each group?
• What is the relative sophistication level of users in each
group?
How will the users typically • Through a portal?

access search?
• Through a dedicated search page?
• Integrated with Google Desktop search so that localized
content and enterprise content are brought together?
How can advanced search • Do you need mobile functionality?

capabilities facilitate your users’
daily tasks? • Do you need to consider resource logistics?

Understanding your content
You have set a goal of delivering more relevant information to your users’ desktops by using
the Google Search Appliance. To achieve this goal, you need to identify what content your
users will need. As you identify content, consider a variety of sources—and remember, not
everything needs to be included on “day one.”
As part of this activity, get an understanding of what your index capacity needs will be. For
information about this topic, see “Sizing the index” on page 45.
To understand your content, consider the following questions.
What are your content sources? • Typical content sources that are often incorporated into a
search deployment include:
• Intranet sites
• Your company website(s)
• File systems and shared drives
• Content Management Systems (CMS), such as
Documentum
• Record/Document Management Systems (RMS/
DMS)
• Portals or collaboration sites, such as SharePoint
• Archives
• Databases
• Line Of Business (LOB) applications
• Other structured data
What are the details about each • For each content source, identify:
content source?
• How the content can be accessed
• Roughly how many documents it contains
• Whether the content is:
- Structured, for example, customer records
- Unstructured, for example, a Word document
- Both, for example, a customer letter
(unstructured) in an RMS (structured)
• Whether the content is secured
• How content is secured
• Who uses it (or who you want to use it)
• How important it is
• How frequently it changes
• What kind of publishing process (if any) governs
content revisions
Understanding Your Deployment 11

Understanding your business processes
Business processes rarely exist in isolation. As you think about all the information that you will
be putting at your users’ fingertips, identify the processes where being able to find rich sets of
information will enhance or streamline processes and how these processes relate to one
another.
For example, think how much faster a call center employee could answer questions about a
refund policy for a purchased product if she can simply search the policy database—and bring
up the purchase order in the same rich search window.
In many cases, you might discover that processes also produce information that you want to
make available through search. Also, it may be valuable to have visibility over in-flight
business processes, such as being able to search currently open cases in a support queue.
So you might want to enable the search appliance to crawl this information or otherwise
integrate it with the search appliance.
Understanding your architecture

Just as your processes work together to orchestrate the business of getting things done, your
IT architecture components work together to deliver all of the various solutions. The success
of a search implementation project also requires a thorough understanding of how it will
interrelate with the other systems in your IT ecosystem.
Think about your physical network design, and where the content is located—both
geographically and from a network design perspective. Also, think about your requirements for
security. Security architecture is particularly important for internal deployments of the search
appliance, and requires planning.

To understand your architecture, consider the following questions.
What are your physical • Are the content systems located on fast Ethernet
systems? switches?
• What are the peak usage times for each content system
– daily, weekly, monthly and/or quarterly?
• Will the search appliance be located on a part of the
network that requires access through a firewall or proxy
to get to the content?
What is the security • Do you have a single security mechanism for all content,
infrastructure surrounding you or do you have a “heterogeneous” authentication/
content? authorization environment?
• Will users require several identities/passwords to access
all protected content, or is there a single sign-on solution
in place?
• Do you have Active Directory (AD)? What version? Is
Active Directory installed in Native Mode or Mixed
Mode?
• Do you have NTLM v2?
Understanding Your Deployment 13

Planning for Successful Deployment Chapter 3
A successful search solution is conceptually very simple: help users find the information they
are looking for. Make search fast, make it easy, and make it relevant.
The Google Search Appliance takes care of the speed, ease and relevance. But you need to
plan and execute the project to take full advantage of the power of the search appliance. Key
to this approach is remaining focused on short delivery cycles and structuring work around
this.
Every deployment of a Google search solution is unique. You might be providing search
across SharePoint content and extending core search with purchase orders from SAP. Or you
might be providing search of the hundreds of thousands of documents that businesses tend to
accumulate over time, bringing them together with policy documents, and the contact details
of the people who wrote them.
Although each deployment has different content sources, security requirements, and user
needs, there are core planning activities with fundamental guiding principles that apply to all
search deployments. This chapter focuses on the following core planning activities:
• Capturing requirements, described on page 16
• Identifying phases, described on page 22
• Defining success criteria, described on page 28
• Transitioning to business as usual, described on page 28
For scenario-based example deployment programs, see Chapter 4, “Project Scenarios.”
15
Capturing requirements
As you capture requirements, group them into related sets that you can prioritize and align
with phases of work. In general, focus on the following areas:
• User requirements, described in the following section
• Content and security requirements, described on page 18
• Performance and scalability requirements, described on page 20
• Administration and reporting requirements, described on page 22
User requirements
Understand what is important to make the deployment successful from the user perspective.
In general, user requirements focus on:
• Usability, described in the following section
• Breadth and depth, described on page 16
• Communication and feedback, described on page 17
Usability
For users, search should not be a chore. Defining usability requirements can help ensure that
users find your search solution intuitive and effective.
As you identify usability requirements, consider the following issues:
• What are the usability features that really make the search solution resonate with users?
• The Google Search Appliance offers many simple-to-implement, on-box usability

features, such as Query Suggestions. For more information about this and other
usability features see “Enrichment features” on page 99.
• What speed requirements do users have?
• AJAX style technologies can dramatically enhance perceived responsiveness and

performance, while providing a richer search experience.
In general, meet usability requirements as early in the release cycle as possible because
these are not typically tied to content sources and they can get users excited about the search
solution.
Breadth and depth

A search solution needs to meet the demands of your user community. Defining breadth and
depth requirements help ensure that you have covered a wide enough user group while
providing the right content for them.
As you identify breadth and depth requirements, consider the following issues:

• What do the user groups look like?
• Where possible, the largest groups and the users experiencing the most frustration
today should be brought on first.
• Using search appliance front ends, you can present a different look and feel and
different content to various users, based on their needs. For information about front
ends, see “Using the search appliance’s front ends” on page 97.
• What are they trying to find now, but are frustrated that they can’t?
• This is the content that should be in early phases.

• User onboarding should be aligned with the inclusion of the content they are looking
for. That is, try not to give the users a search capability before including the content
they will be looking for.
• Do some users have more sophisticated search needs?
• Out-of-the-box advanced search can easily be augmented through rich use of

metadata and other core functionality.
• Where this is not required, keep search simple, but functionally rich.
Communication and feedback

User feedback is an effective tool for identifying usability issues. When you solicit feedback,
you let users know that their opinions about the search deployment are important. Defining
communication and feedback requirements ensures that you give users the ability to provide
input to the implementation team on the search deployment—what is working for them and
what is not working.
As you identify communication and feedback requirements, consider the following issues:
• What do you need to communicate to users?
• In addition to adding new content and exciting new features, it’s important to make
sure to tell your users about them to keep them excited about the product, and get
kudos on your successes.
• Because most of your users already know how to use Google search technology,
training needs typically are minimal, but make sure your users know that they can now
search enterprise content with the same ease as they search the internet at home.
• How will you get feedback from users?
• User feedback is one of the best measure of success. Consider conducting periodic
surveys with user groups. See the sample search satisfaction survey on page 121.
• Also consider providing a feedback link for users.
Planning for Successful Deployment 17

Content and security requirements
For most organizations, the following two aspects of a search deployment typically go hand-in-
hand:
• Content, described in the following section
• Security, described on page 19
Scenarios that encompass content and security can range in complexity from completely
unsecured public website pages to complex integration with an Enterprise Resource Planning
(ERP) system such as SAP or PeopleSoft, and everything in between.
Plan your end-state architecture in the early phases, but also phase in both content and
security. In other words, don’t delay delivering a great search experience to your users
because you want to index every last scrap of content or implement a security framework they
won’t need until later.
Content
In general, analyze all potential repositories of organizational information. Although the
Google Search Appliance excels at providing powerful, fast, and relevant search across
unstructured content, you should not exclude structured content, such as your data
warehouse, transactional systems, and so on.
It is important to understand how content sources relate to each other, as this will help you
define how to phase deployment of content. For example, content from a case management
system may be supplemented effectively with content from a product catalog, enabling users
to see not only product information, but also the types of problems and issues that users
encounter when using the products.
The following table lists various types of structured and unstructured content sources and
considerations that can help you define how to phase its deployment.
Unstructured
Structured/
(L/M/H)
Complexity
Content source Consideration
File systems U L • SMB or HTTP
Public websites U L • HTTP
Databases S L • Use database crawl or a feed, or web enable
Intranet U L • Might need to consider security

websites
Staff portal U M-H • Might need to consider security complexities

• Might need to account for non-unique URLs (the
same URL containing different content, based on
user role)

Unstructured
Structured/
(L/M/H)
Complexity
Content source Consideration
CMS U L-M • Might have metadata to leverage

• Need to determine if it can be crawled natively
• Might need to consider security
LOB S L-H • If web enabled, might be able to crawl these

applications (for content sources
example, Lotus
• Might require a connector
Notes)
• Security
Enterprise S M-H • Need to identify core data

applications (for
• Might use feeds or a connector
example, ERPs)
• Security
Other S L-H • Security

transactional
• Might be accessed by custom connector or
systems
OneBox module
Security
Security can be the area of greatest complexity in a search deployment. As you analyze
content, understand if it is secured, and if so, how it is secured (forms-protected, cookies,
protected by application-level security, and so on).
In a search solution, security has two main areas of impact:
• Crawling and content acquisition
• Serving and user authorization
For comprehensive information about the search appliance and security, see “Managing
Search for Controlled-Access Content” at http://code.google.com/apis/searchappliance/
documentation/60/secure_search/secure_search_overview.html.
Crawling and content acquisition
The Google Search Appliance can make use of standard security protocols, such as NTLM or
forms-based security.
Understanding all the security permutations will help you plan for content acquisition. For
example, security might have an impact on web and file system crawl that you need to plan for,
such as configuring a proxy or ensuring your Windows file systems have CIFS enabled to
support SMB crawling.
More complex security might require alternative means of content acquisition, such as feeds
or connectors.

You might find that you need to make small adjustments to your environment that enable the
search appliance to crawl and acquire content without needing to use feeds or connectors. For
example, you might find that to extend the crawl to a new subdomain, you need to modify a
cookie domain as the search appliance crawls content to allow cookies to conform to request
for comments (RFC) specifications. These types of changes are typically small and can be
implemented through a variety of methods.
Occasionally, security considerations require making adjustments to the indexing procedure or

using an alternate content acquisition approach. These circumstances might affect aspects of
the solution design that are not directly related to security architecture, such as feeds and
publishing workflows.
For more information, see “Crawling secure content” on page 57.
Serving and user authorization
When serving secured content, the Google Search Appliance first checks that the user is
entitled to see relevant results. If the user is not entitled to view a document, it does not appear
in the result set.
Of course, you can always choose to make results public and apply no security at serve time.
In many cases, search can initially be deployed unsecured, with security added as more
content is acquired. Public search (such as an externally facing internet site) is typically
deployed this way.
In general, deployments with heterogeneous security requirements can be satisfied by using

the SAML Service Provider Interface (SPI), described in “SAML SPI” on page 58. The SAML
SPI is responsible for managing authentication and authorization checks across diverse
systems and protocols. This capability provides great flexibility in how security will be
implemented. You can:
• Purchase pre-built providers (see the Google Enterprise Solution Marketplace for
examples)
• Build a custom solution
• Use one of many SSO providers so long as they support SAML
With the release of version 6.0, the Google Search Appliance also supports definition of policy
access control lists (ACL), so that authorization checks can be performed against documents
using early binding. Policy ACLs not only enhance performance, but give you more options for
managing security. This new capability also gives you options to phase your secure search
deployment. For information about policy ACLs, see “Access control list caching” on page 59.
For more information about secure serve, see “Serving secure content” on page 57.
Performance and scalability requirements

Non-Functional Requirements (NFR’s) are typically pure technical requirements. In a search
solution, the most common NFRs are:
• Performance, described on page 21
• Scalability, described on page 21

Performance
Performance requirements typically revolve around how fast the solution returns results,
though there may also be requirements around speed of content acquisition. Performance is
typically dependent on a number of factors, including:
• Security requirements
• Content type
• Corpus size
• The type of queries being executed
• Network architecture and performance
• Additional search functions used (for example, query expansion or metadata filtering)
As a rule, if there are specific performance requirements, you should conduct a performance
test early in the deployment to determine changes that may need to be made to the solution
architecture.
Although the Google Search Appliance itself cannot be modified, changes you can incorporate
into your planned deployment include:
• Configuring policy ACLs to improve serve-time security checking.
• Deploying a reverse proxy to cache where possible for common searches. This change is
beneficial only for public (non-secured) content searches.
• Minimizing network traffic between the Google Search Appliance and content sources.
Although this change mostly has an impact on crawl, reduced latency will improve
performance of late-binding authorization.
• Improving perceived performance through responsiveness optimizations (for example,

AJAX). For example, by displaying a progress “spinner” to create the perception of
responsiveness.
• Deploying additional search appliances to spread the load. This change reduces the
demand on any single search appliance and helps ensure that capacity is not a
constraining factor.
See “Architecting for scale and performance” on page 48 for further discussion of
performance-driven search architecture.
Performance requirements should also take crawling and indexing into consideration. Search
appliance indexing adds load to your content systems. If there are specific times of the day in
which the content systems must not be affected, then you need to understand this so that you
can configure search appliance host load schedules accordingly. Furthermore, if the content
system is sufficiently strained, or is particularly slow, you might consider content feeds as an
alternative.
Scalability
Scalability requirements typically revolve around number of queries per second (QPS) or
queries per minute (QPM). As with performance, the QPS that the solution supports depends
on the security requirements, content type, query type, network performance, and a host of
other factors.

Google recommends that where scalability requirements exist, you first re-ratify the
requirements—ideally with metrics derived on current searches. In many cases, the scalability
requirement is not as high as first stated.
While search solutions can be designed to support hundreds of queries per second, in
practice, this is not usually required. The kind of scalability requirements needed from a
search solution are substantially different from those of a transactional system.
For more details about designing a search solution for increased scalability, see “Architecting
for scale and performance” on page 48.
For information about the number of concurrent connections that the Google Search
Appliance can accept, see “Designing a Search Solution” at http://code.google.com/apis/
searchappliance/documentation/52/troubleshooting/Designing_Search_Solution.html#Queueing.
Administration and reporting requirements

Reporting is an important part of any enterprise application, and search is no exception. In
addition to the search reporting requirements described in “Using reports to enhance the
search experience” on page 91, identify other reporting requirements. In particular, pay
attention to Non Functional Requirements (NFR’s). Requirements to consider include:
• The analytical technology to be used (for example, Google Analytics, Advanced Search
Reporting, or some other third-party tool)
• Reporting frequency and distribution
• Other reporting types that may be required (for example, administration events)
Make sure that you understand the business processes that will use these reports. For
example, you should understand the use cases for your reporting requirements and make sure
that the reporting strategy will deliver on them.
Identifying phases
Most search deployments fall into one of the following categories, listed from simplest to most
complex:
• Specialized deployments focused on delivering a familiar, powerful search experience to

customers of an organization’s public or secured externally-facing information.
• Stand-alone search deployments focused on providing general productivity gains to

enterprises and making better use of information assets.
• Search deployments driven by a compelling event or larger deployment, such as

implementing a new portal, delivering an Enterprise Content Management (ECM) system,
or launching a new Information Architecture project.
A search deployment typically targets quick wins to deliver a rich search experience to users
rapidly, with incremental, iterative delivery of additional value over the life of the search
deployment.

Business value is derived from the breadth of content over which the search capability is
delivered and the usability and effectiveness of the search experience, as illustrated in the
following figure.
Deployment phases
The key to successful search deployments is to deliver early and deliver often. Don’t try to do
everything at once. Your users will benefit from getting access to the content they want as
early as possible. Delivering early means quick wins that can help drive support with your
stakeholders and generate excitement and visibility with your users.
Phase scope could be defined in terms of:
• Content sources
• Security
• User groups
• Usability features
Each phase should include an evaluation task, where you explicitly evaluate user satisfaction,
and feature requests. As always, evaluate feature requests, including risks associated with
implementing—and not implementing.
In general, since each phase is of relatively short duration, you can use most delivery
methodologies, ranging from Agile to Life Cycle.

If you are using a more classical variety of development methodology, keep in mind that the
development phases of a Google search project are relatively short. In this case, you need to
make adjustments so you can effectively deliver a quality search experience in a flexible
manner. Many of the technologies that you will use (Extensible Stylesheet Language
Transformations—XSLT—stylesheets, OneBox modules, and so on) can be quickly
implemented and rapidly adjusted. You need to have flexibility in your approach to prototype
rapidly and iterate on deliverables.
This section discusses how you can structure your deliverables and project plans to broaden
the search footprint and increase use of your search solution. Each delivery moves your
deployment further along the value curve.
Where to start
The Google Search Appliance is designed to be rapidly deployed over core content sources.
Leveraging open standards and protocols allows rapid integration of content from a variety of
sources and implementation of rich usability features, such as Search-as-you Type, user-
added results, and dynamic results clusters.
Phases can be as short as a week or two or as long as a month. Google recommends that you
structure your program of work to aim for shorter phases, with rapid delivery of iterative
functionality, content, or user groups.
In many cases, a single rapid delivery phase is all that is required. However, even when your
deployment is part of a longer running, comprehensive program of work delivering universal
search across all your enterprise assets, you should still structure your phases to deliver quick
wins.
Before you commence your search deployment, complete the following core tasks, so that
your search deployment specialist can get your search appliance up and running as quickly as
possible.
Before you start • Rack the search appliance.

• Configure network settings.
• Inventory your content sources (including document
count).
• Inventory your security systems.
• Configure your network to allow the search appliance
access to all content sources, and if required, restrict
access to secure areas.
• Create any user ID’s needed by the search appliance to
crawl content.

Phasing your activities
You might consider phasing your activities as described in the following sections:
• “Early development” on page 25
• “Incremental releases” on page 26
• “Advanced delivery” on page 27
Early development
Delivery items listed in the following table are typically relatively quick and easy to deliver.
Consider them as candidates for early development. Many of these could be considered
mandatory—a custom front end for example, no matter how simple, should always be a part of
the core delivery.
Candidates for early development
Delivery Item Type Complexity
Basic HTTP crawl: Content sources Low
• Intranet
• Extranet
• Website
• Wiki
• Web-enabled knowledge bases (for example,
Lotus Notes)
File system crawl: Content sources Low
• SMB crawl of shared drives

• HTTP crawl of HTTP-enabled drives
SharePoint sites Content sources Low to medium
Basic OneBox modules (for example, Content sources Low to medium

PeopleFinder)
Lightweight Directory Access Protocol (LDAP) Security Low

authentication
Kerberos integration Security Medium
Query suggestions/Search-as-you-Type Usability Low to medium*
User-added results Usability Low
Custom front end Usability Low to medium
Advanced search reporting Usability Low

Candidates for early development
Primary system users User groups Low
Business owners User groups Low
*Depending on whether you use out-of-the box features or a customized implementation.
Complexity may vary depending on your infrastructure and environmental configuration.
Incremental releases
Delivery items listed in the following table are candidates for incremental release. Consider
these items and schedule their deployment according to priority (typically based on volume of
content, and business criticality), and level of effort.
In many cases, you can accelerate delivery by using third-party tools (such as connectors) and
certified Google Enterprise partners, who are experienced in Google Search Appliance
integration issues. Some of these delivery items (for example, customized advanced search)
might require some user feedback before full implementation.
Candidates for incremental releases
Portal content Content source Medium to high
Non-web-enabled knowledge bases (for example Content source Medium

many Lotus Notes Databases)
Content Management Systems Content source Low to high
Custom OneBox modules (may be secured) Content source Low to high
Custom application content Content source Low to high
Additional connectors (FileNet, Livelink, Content source Low to medium

Documentum)
Customized advanced search Usability Low
Advanced usability features (for example, AJAX- Usability Low to medium

driven user interface)
Cross-language translation Usability Low to medium
Additional users dependent on new content User groups Low to medium

Advanced delivery
Delivery items listed in the following table are candidates for advanced delivery. Advanced
Delivery candidates might require more time or effort to implement or they might not be
required at all. If these items are part of your search deployment, you can implement them in
parallel with other deployment tasks. This way, you can get users up and running with core
content immediately.
In some cases, items are structured data sources that require analysis before understanding
how best to integrate into the search experience (for example, Business Intelligence
platforms).
Candidates for advanced delivery
Advanced security (including Policy ACLs) Security Medium to high
SAML SPI (single sign-on) provider Security Medium to high
Record management systems Content sources Medium to high
ERP systems (SAP, Oracle, PeopleSoft) Content sources Medium to high
CRM systems (for example, Siebel) Content sources Medium to high
Data warehousing/BI platforms Content sources Medium to high
Other Line of Business systems Content sources Medium to high
How long should phases be?

In general, phases should last anywhere from a few days to a few weeks. Although work
efforts vary and require specific estimates, the duration to complete tasks can be derived from
complexity, as listed in the following table.
Complexity Duration
Low 2-8 hours
Medium 1-5 days
High 1-4 weeks
The times in this table are guidelines only and will vary, based on your environment and
requirements. Google recommends that you perform an analysis to determine the work effort
specific to your deployment.
In addition to the work effort, you need to allow enough time to acquire content. Strive for
having as much content in the index as possible from targeted content sources. This is not to
say that you should wait until you get every possible content source into your search solution,
but rather that you should have in the index all the content from the systems you are
incorporating in the current release.

It is challenging to predict how quickly the Google Search Appliance will acquire content, as
the rate of acquisition is dependent on a number of factors, including:
• Network performance
• Server performance
• Host load
• Content type
Google recommends running some tests early in the project life cycle to determine content
acquisition speed. Use this information to help you plan accordingly.
Defining success criteria

Before you commence your project delivery, define what constitutes your success criteria, so
that you have a clearly defined set of acceptance criteria. Typical success criteria for search
deployments include:
• User-executed assessments of relevance (for example, user ratings)
• Security tests (authentication and authorization is working for all secured systems)
• Breadth of content (volume of content is in the index—for example, 95% of content in a

system)
• Breadth of roll out (percentage of users activated)
Transitioning to business as usual

Business As Usual (BAU) tasks are the regular tasks that are carried out to operationalize the
delivered solution. Unlike other search solutions, the Google Search Appliance does not
require constant adjusting and tuning of the algorithm, nor a dedicated team to do so.
However, as with any enterprise solution, there are some tasks that should be carried out
regularly. These are discussed in “Post Deployment” on page 83. You need to plan your
resourcing to manage these tasks, as the operational team who will be responsible for BAU
may not be the same as the team who deployed.
Perform tasks in preparation for transition to BAU as described in the following sections:
• “Document standard operating procedures” on page 29
• “Document your support arrangements” on page 29
• “Prepare your analytic solution” on page 29
• “Configure Monitoring” on page 30
• “Export and back up your configurations” on page 30
• “Transition user enrichments” on page 30

Document standard operating procedures
When you deploy search to production, you introduce some new operational processes.
Document these processes and transition them to the BAU team. Typical processes include:
• Troubleshooting your Google Search Appliance and environment
• Raising a support ticket (see https://support.google.com, password required)
• Executing an emergency failover to a hot standby
• Creating and managing administrator and manager roles on the search appliance
• Managing KeyMatches, related queries, and query expansion synonyms
• Any processes around additional technologies (for example, OneBoxes modules, SAML
providers, and so on)
• Migrating code assets and configurations from your development environment to your
production environment
Document your support arrangements

Make sure your BAU team knows the support arrangements for your Google Search
Appliance. When contacting Google Enterprise Support, the BAU team should have the
following information available:
• Your search appliance IDs
• Your login details to https://support.google.com
• Remote access details for chosen methods (SSH configuration and routing, support call,
and so on)
• License information
• Google/Partner support contact information

In many cases your Google Enterprise partner will be a very effective contact for resolving
challenges.
This preparation allows for efficient use of Google Enterprise Support, should you need it.
For details, see “Optimizing support” on page 89.
Prepare your analytic solution

Configure advanced search reporting or another analytical engine, so that user searches and
behavior can be reported on and analyzed. The analysis can be used to enrich and enhance
the search experience.
You can also output logs to a syslog server to leverage third-party log processing tools that
you might already have in use.

For information about advanced search reporting, see “Gathering Information about the
Search Experience” at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_improving_search.html#gather.
Configure Monitoring
Establish a method for monitoring your Google Search Appliance. You can use SNMP, or
some of the monitoring tools discussed in “Designing a Search Solution,” at http://
code.google.com/apis/searchappliance/documentation/60/troubleshooting/
Designing_Search_Solution.html#Monitoring.
You could also monitor your search appliance by using a custom solution. Anything that allows
you to monitor your search appliance actively will give you additional confidence and stability
in your deployment, and will allow you to identify problems early.
Export and back up your configurations

Export configurations from all your Google Search Appliances, as well as any code from other
assets. Store them in a version control system.
Transition user enrichments

Make sure that the BAU team is familiar with the user enrichments made by the business,
such as KeyMatches, related queries, and so on. These user enrichments need to be adjusted
over time as the required effects change.
For example, when a new policy document or product is launched, KeyMatches relating to the
old version may need to be updated. The BAU team needs to be aware of them, and the
appropriate business owners.

Project Scenarios Chapter 4
The project scenarios in this chapter illustrate how a successful deployment might be
executed. These scenarios include:
• Basic search on a public website, described on page 32
• Basic internal search, described on page 35
• Internal search over intranet, file system, SharePoint, and Notes, described on page 38
• Internal search including CMS, database, corporate application assets, described on

page 40
Each of these scenarios is based on the following assumptions:
• The deployment team is familiar with the Google Search Appliance. If required, a certified
Google partner can help.
• There are no significant problems in the deployment environment. All environments differ,
and yours may have unforeseen complexity.
The time lines and project plans used in this document, while examples, should not be taken
as reference plans. Your own time lines might reflect greater complexity. When you plan a
deployment project, take specific business or technical requirements into consideration.
Always include contingencies in your plans.
31
Basic search on a public website
In the use case for this project scenario, Alpha inc. is deploying search over a public-facing
website containing a massive amount of information about products sold in their retail stores.
Most of the content is public, but there is also protected content in a secured members section
for customers who have purchased a product and registered it. While all users search for
public, product content, members might also search for protected content, such as support
information.
Scenario summary
Content sources • General corporate information

• Product Catalog (published to the site as navigable
product pages)
• Support documents (frequently asked questions (FAQs)
and .pdf files containing product information)
Key requirements • Index all core content.

• User interface (UI) must be standardized, and
maintained to be consistent with look and feel changes
to the site.
• Secure “members” content must only be accessible to
users authorized to see it.
• Understand user activity to increase stickiness and
“conversion” rate.
Key decisions • Present results directly from the search appliance or by

means of a web application presentation layer?
• Manage security by using the search appliance or by
means of an application?

Chosen approach • Present results from a web application, using existing
templates.
• The search appliance will delivers result to the web
application in XML form. The web application will
process the results and render them seamlessly on
the website.
• Many organizations will render results directly from
the search appliance by using XSLT.
• Crawl content by using a forms authentication rule for
secured content.
• Create four searchable collections:
• All public content
• Products
• Members-only content (support material)
• All content
• Manage security at the application level:
• Mark secured content as public.
• The application will search only against collections
containing documents that a user is entitled to see.
• Users who are not logged in will search only against
public content.
• Users who are logged in will search by default
against both public and members-only content.
• If desired, users will be able to choose specialized
product search or support information search by means
of basic user interface (UI) widgets, such as check boxes
or drop down lists.
• Use Google Analytics to track not only search activities,
but also broader site usage.
Possible architectures • High availability load balanced architecture—potential

large volume of searches requires the system be able to
handle excessive load.
• Implement LDAP security, or provide application-level
authorization to search to limit access to secure
information so that site members can access authorized
information after authenticating themselves.
Project Scenarios 33
Project plan
The following figure shows a generalized gantt chart for deploying basic search on a public
website.
Basic search on a public website project plan
Enhancements
The initial deployment should also be followed by a set of rapid enhancements with short
delivery cycles. Enhancements include:
• Product OneBox module—retrieves product pricing and availability directly from supply
chain system. When a user searches for “gadget,” they will also get the price, and
availability of gadgets in real time.
• Store locator OneBox module—for logged in users, this OneBox module could retrieve
information about stores within 10 miles of them and display it by means of Google Maps.
• Related product OneBox module—retrieves supplemental product information based on

data mining from the company’s business intelligence (BI) platform to help drive additional
sales.
The following figure shows a generalized gantt chart for enhancement phases for deploying
basic search on a public website.

Enhancement phases
Basic internal search

In the use case for this project scenario, CorpCom LLC has a fairly extensive internal web
presence that extends out to different parts of the globe. They want to consolidate the
searching of all their internal websites and pages to one place so their employees will not have
to go to different websites to search for information.
Although all the users are employed with CorpCom, not all of them have access to all the
information on the various sites in their corporate domain. For example, Human Resources
(HR) information access is desirable by means of the search, therefore securing personal
information is important.
Scenario summary
Content sources • Corporate file shares

• Internal U.S. web pages
• Foreign web pages, with different country content under
different folders. For example:
• http://intranet.corpcom.com/us/content for U.S.
• http://intranet.corpcom.com/fr/content for France
• HR information
Key requirements • Index all pages that are web accessible.
• Index foreign language content.
• Seamless sign-on.
• A standard search page where employees go to search
for information.
• Secure content must only be accessible to users
authorized to see it.

• Present foreign language content directly from the
search appliance or by means of a web application
presentation layer?
• Use Advanced Search Reporting for on-board analytical
capabilities?
Chosen approach • Present results directly from the search appliance by

using existing templates.
• Since the majority of the users will be searching English
language content solely, foreign language content will be
identified as a later enhancement to be implemented by
using different front ends.
• Some languages, such as Scandinavian, will be
deployed with Language Bundles (a release 6.0
feature).
• Crawl content using LDAP authentication for secured
HR content.
• Foreign content will be segmented in separate
collections, based on language type.
Possible architectures • Federated architecture—Search integration of disparate

data stores with the need of replicating indexes across
different departments/groups.
• Implement Kerberos security to limit access to secure
information—all users will have network accounts which
complements using integrated authentication and
authorization with Kerberos.

Project plan
The following figure shows a generalized gantt chart for deploying basic internal search.
Basic internal search project plan
Enhancements
The initial deployment should also be followed by a set of rapid enhancements with short
delivery cycles. Enhancements include:
• Collection--crawl foreign language sites and store them in a separate collection.
• XSLT template--develop a foreign language presentation XSLT template.
• HR OneBox module--retrieves directory information for employees.
basic internal search.
Enhancement phases
Internal search over intranet, file system,
SharePoint, and Notes
In the use case for this project scenario, Cybertron Appliance Inc. houses different data
corpora that are being served up on different servers on their corporate network. These data
silos are accessed by way of different data management applications such as SharePoint,
Lotus Notes Databases, as well as secure file shares.
Having to go to different applications to find information has become tedious and very time
consuming for their employees. Not only that, the loss in productivity in trying to locate a
particular piece of information has started to show up on their bottom line because of the
repetitive switching between disjoint systems to search for information and ineffective existing
search tools.
Scenario summary
Content sources • Secure file shares

• SharePoint portal data used to host internal sites
• Lotus Notes Domino data in business databases
Key requirements • Index each individual data silo keeping content secure.
• Create standard default UI for data access.
• Create custom interfaces for internal and external users.
• Deployment must result in a measurable business
benefit.

• Provide documentation so that each group can
incorporate search functionality into their own custom
application/website or use only the search appliance’s
default search and results pages?
• Integrate a connector to access Lotus Notes or web
enable databases?
• Use Google SharePoint connector?
• Crawl SMB file share?

Chosen approach • Conduct a short study to capture time spent on existing
platforms, in parallel with deployment.
• Conduct a post-deployment evaluation of the new
solution to evaluate its effectiveness.
• Implement analytics for ongoing evaluation of
effectiveness.
• Present results directly from the search appliance by
using customized front ends for different data stores.
• For SharePoint users the Google Search Box for
SharePoint will be used.
• Implement SharePoint connector early to access data in
SharePoint portal.
• Crawl SharePoint and secure file share content using
Kerberos Authentication.
• Use Google Search Box for SharePoint to deliver
integrated search experience.
• Index Lotus Notes database using a connector.
Possible architectures • Federated architecture—Search integration of disparate

data stores with the need of replicating indexes across
different departments/groups.
• Implementation of integration architectures:
• SharePoint Connector for SharePoint Portal
• Content and Metadata feeds for Domino DB
• Implement SAML SPI Architecture for heterogeneous
security.
Project plan
The following figure shows a generalized gantt chart for deploying internal search over
intranet, file system, SharePoint, and Notes.
Internal search over intranet, file system, SharePoint, and Notes project plan
Enhancements
internal search over intranet, file system, SharePoint, and Notes.
Enhancement phases
Internal search including CMS, database, corporate

application assets
In the use case for this project scenario, Sisyphus Work Force is a large labor organization
that employs people from around the world. Their information is scattered around various
kinds of systems that do not communicate readily with each other. They also have a
substantial internal corporate web presence with content that is served by a CMS.
Their internal HR system is a large database repository and they have various commercial and
custom applications that allow users to gain access to data based on different access
methods. Directory information (for example contact details, manager and direct reports),
performance reports, and salary information are stored on this system.

Scenario summary
Content sources • Intranet websites

• Shared filed system
• Oracle database
• CMS
• Custom application
Key requirements • Index each individual data silo, keeping content secure.
• Create standard default UI for data access.
• Create custom interfaces for different groups in the
organization.
Key decisions • Manage security by using the search appliance or by

means of an application using Active Directory?
• Present results by using the search appliance default
front end or by means of a secondary application?
• Use a phased approach for system deployment?
Chosen approach • Deploy to initial pilot group prior to full rollout by means
of a corporate portal.
• Crawl and serve secure content (for example, HR or
Salary information) using LDAP.
• Manage security at the application level.
• Initially index a selected cross-section of data holding
with additional documents to be added later.
• Present results directly from the search appliance by
using the default XSLT style sheet.
• Due to the diversity of content and source, use a phased
approach for deployment.
• Intranet sites and file share will be in initial
deployment.
• Database feeds for Oracle HR system will follow.
• CMS systems and related portals will be next.
• A survey of corporate applications that house and
serve data will be conducted and a determination
will be made on which will be accessed for search.
Possible architectures • Federated high availability deployment architecture with
disaster recovery capability—Search integration of
disparate data stores with the need of replicating
indexes across different departments/groups while
ensuing virtual 24/7 up-time so that productivity is not
lost.
• Implementation of integration architectures:
• Content and Metadata feeds for CMS
• Custom connector for database search
• Implement SAML SPI Deployment or policy ACL
deployment to handle diverse security or poorly
performing systems.
Project plan
The following figure shows a generalized gantt chart for deploying internal search including
CMS, database corporate application assets.
Internal search including CMS, database corporate application assets project plan
Enhancements
Multiple short iterative enhancement phases deliver incremental functionality, delivering new
content to your users, and allowing opportunities to increase visibility and drive uptake with
new users. The following figures show gantt charts for enhancement phases.

Enhancement phases project plan A
You can begin by including unsecured database content by means of a database feed, crawl
any additional content that still needs to be acquired, and then release to the primary user
groups.
Enhancement phases project plan B
Now that your users are searching across their information, the next phase is to rapidly build a
method to feed content from your CMS to the Google Search Appliance.
Enhancement phases project plan C
And finally, you can begin to consume content from your corporate applications, in short,
phased migrations. These can be planned and repeated as needed to deliver true Universal
Search. Note that phases may have longer durations where security integration is required.
Deployment Architecture Chapter 5
This chapter discusses the following technical and architectural considerations for planning
your deployment:
• Sizing the index, described on page 45
• Architecting for scale and performance, described on page 48
• Architecting for reliability, described on page 50
• Architecting for reach, described on page 52
• Architecting for security, described on page 56
• Enhancing technologies, described on page 60
For examples of architectures that address common deployment scenarios, see “Deployment
Scenarios” on page 61.
Sizing the index

The process of sizing your Google Search Appliance index consists of the following activities:
• Scoping index capacity needs, described on page 46
• Determining what to index, described on page 47
• Determining how to index, described on page 48
45
Scoping index capacity needs
Google Search Appliance models are:
• GB-7007—can index up to 10 million documents
• GB-9009—can index up to 30 million documents out of the box. For larger deployments,
multiple GB-9009 appliances can be linked together to search hundreds of millions or
even billions of documents.
From a sizing perspective, Google recommends that organizations choose a base unit that
meets the current document capacity needs, as well as projected document growth needs for
two years.
Google Search Appliances are designed to be flexible, so if a model upgrade is required at

any time, there is a seamless migration plan that allows for the installation and transition to the
new unit with no service downtime.
However, because upgrading requires a hardware change, if the current document capacity is
close to the physical indexing limits of the GB-7007, Google recommends selecting the GB-
9009 to simplify management of the solution over time.
The Google Search Appliance is also designed to operate intelligently up to the license limits
within each respective model to ensure the most optimal user experience. When the license
limit is reached in a given model, the search appliance continues to discover relevant
documents outside the license limit in an effort to maintain a servable index of the most
relevant documents found in the environment. However, this will create churn while less
relevant documents are removed in favor of more relevant ones. If your search appliance is
nearing its license limit, consider upgrading to a higher document count.
This process of continual discovery and analysis beyond the license limit provides an
automated and intelligent method of managing the search experience when operating in an
environment where more documents are available than the license limit allows.
However, the search appliance’s automated pruning logic could cause certain critical content
to be excluded from the index to make room for more relevant content. If mission-critical
content exists beyond the license limit, Google recommends expanding the license limit to
ensure that all the relevant content can be indexed and served with additional room to grow.
For a discussion of the choice between a upgrading search appliance upgrade and deploying
additional hardware, see “Scale up/scale out” on page 49.
Dynamic scalability
Dynamic scalability is a release 6.0 feature that enables multiple Google Search Appliances to
work together in a federated environment to scale up to as many documents as you wish to
search in a unified manner.
In a dynamic scalability configuration, one search appliance is the primary node and the
others are secondary nodes. The primary search appliance aggregates results from all of the
search appliances in the configuration and serves them to the search user. The primary
search appliance's front end is used for searching all document corpora in the dynamic
scalability configuration.

Any model of the Google Search Appliance running software version 6.0 or later can be
configured to participate in a dynamic scalability configuration. The configuration may include
different search appliance models, provided they are all running the same software version.
For information about dynamic scalability, see “Configuring Dynamic Scalability” at

http://code.google.com/apis/searchappliance/documentation/60/dynamic_scale/
dynamic_scale.html.
Determining what to index

Determining what to index is a process of identifying content sources, which typically include:
• Corporate websites
• Partner extranets
• Corporate intranet sites
• Portals
• Knowledge bases
• Content and Document Management Systems
• File shares
This process might seem straightforward. However, you might uncover more information
within a given content source that needs to be indexed than you originally anticipated. For
example:
• A file system might be much deeper than expected
• A Content or Document Management System might contain more documents than

originally anticipated
• A website might have more pages than expected
The Google Search Appliance provides capabilities to limit content indexing by implementing
simple content acquisition rules (follow and crawl URLs). Limiting the index scope by adjusting
these rules can be an ongoing discovery process that needs to be taken into consideration,
especially when the content sources targeted for search are not well-maintained or are
managed in a decentralized fashion.
The search appliance provides detailed logs on each document that has been indexed and
also provides summary information on document types and sizes. Crawl Diagnostics features
allow administrators to fine tune follow and crawl URLs to ensure that the most relevant
content is being indexed and served at any time.
Deployment Architecture 47
Determining how to index
Once you have identified content sources to index, you need to take the method of indexing
each one into consideration. The Google Search Appliance can use several methods to
acquire content for indexing, including:
• Web crawl
• Database synchronization
• Feeds—both content and Metadata and URL
• Connectors to third-party applications
Determining the most effective method of indexing depends on the content sources that need
to be indexed.
For example, corporate websites, partner extranets, wikis, corporate intranet sites, and
informational portals can often be easily indexed by using the search appliance's crawling
technology. The crawl process issues HTTP requests or follows links to locate content on a
website or file system. To configure crawling, an administrator follows a simple process of
defining URL rules in the search appliance's simple-to-use web-based Admin Console.
For comprehensive information about crawl, see “Administering Crawl for Web and File Share
Content at http://code.google.com/apis/searchappliance/documentation/60/admin_crawl/
Introduction.html.
For information about database synchronization, see “Database Crawling and Serving” at
http://code.google.com/apis/searchappliance/documentation/60/database_crawl_serve.html.
For integration into Content and Document Management Systems, knowledge bases, and
collaboration tools, such as Microsoft SharePoint, using feeds or a connector might be the
most effective method for indexing content. For more information, see “Feeds” and
“Connectors” on page 55.
For complex deployments where search spans multiple information sources, consult a Google
product specialist or Google Enterprise partner to determine the optimal methods of indexing.
Architecting for scale and performance

Almost every organization continually produces content. Indexing additional content adds
value to your search solution deployment. To accommodate additional content in a search
appliance deployment, you need to architect for scale as described in the following section,
“Scale up/scale out.”
Similarly, organizations might need to support additional users and search loads over time. To
accommodate this type of growth, you need to architect for performance, as described in
“Load balancing” on page 49.

Scale up/scale out
Based upon the amount of content to be indexed, consider whether to scale up or scale out.
To scale up is to upgrade to a larger capacity model of the Google Search Appliance. To scale
out is to configure two or more search appliance models in a federated environment, as
described in “Dynamic scalability” on page 46.
For example, if you require 25 million documents to be indexed, should you use one GB-9009
or three federated GB-7007s? The answer to this question is dependent on a number of
factors, including, but not limited to the following issues. These items are in no particular order.
The factors important to your deployment may be completely different from those important to
another deployment.
• How much rack space is available or are there power restrictions in the data center?
• For limited rack space or power restrictions, Google recommends that you choose a more
powerful search appliance model instead of multiple, federated search appliances.
• Is a hot backup required (increased cost for more servers)? Each hot backup has a fixed
cost, so if you require multiple hot backup servers, the cost might be greater than one
individual, but larger, search appliance. Alternately, a deployment made up of many lower
capacity servers might be more costly than a single larger one. Therefore, investigate this
issue before deciding on the type and number of search appliances that will be used in the
solution, as total procurement cost, including production and hot backups may be subject
to change.
• Are there multiple secure repositories that have to be indexed?
• Are there multiple departmental owners who want to control their own search service? In
some instances, individual content owners prefer to own their own search appliance. If this
is the case, then a dynamic scalability configuration using multiple search appliances
would be the solution.
• Deployment of additional search appliances may enable you to execute crawling in

parallel, increasing acquisition and renewal of content.
• Deployment of multiple nodes offers trade-offs to be considered. While multiple

appliances may require additional management and support, this also means in many
cases, a single node may be taken offline without disrupting the rest of the search
deployment. This is not typically the case with a single node deployment.In general, if a
single search appliance can provide for the corpus size, use one search appliance, if
possible.
To read about deployment scenarios that use dynamic scalability, see “Federated
architecture” on page 79.
Load balancing
Load balancing distributes network traffic of a particular type to two or more instances of an
application, dividing the work load between the instances. A load balancer is a software or
hardware application that distributes the network traffic. When you configure two or more
Google Search Appliance systems for load balancing, search queries are distributed between
the two systems.
Determining whether a load balancer is required is dependent on a number of considerations,
such as:
• The peak load of queries that the search appliance will receive
• The number of users who are going to be using the system
• Where the users are located
A large number of queries per second at peak time, or a very diversely located user base,
generally requires multiple search appliances using load balancing to help serve the results at
an acceptable rate. Load-balanced search appliances also provide a level of redundancy that
is not possible with a single search appliance.
You can setup search appliances in the following configurations:
• A single search appliance on a network with no other search appliance for failover or fault
tolerance. This is not a load-balanced configuration.
• A load balancing configuration in which there is a physical connection between the search
appliances and the load balancer and each search appliance is on the same network or
subnet as the load balancer.
• A load balancing configuration in which there is a logical connection to the load balancer
and each search appliance is potentially on different networks or subnets from the load
balancer.
• A failover configuration in which a switch fails over search queries from the search
appliance that normally responds to search queries to a search appliance that does not
normally respond to search queries and is used only for failover. For more information,
see “Failover configurations” on page 51.
Note: In each of the above configurations, each search appliance could be one or more
search appliance in a federated deployment.
Load balancers can be used with virtually any architecture, such as Federated high availability
deployment architecture, described on page 80.
Google does not recommend specific load balancers to use with the search appliance. The
configurations described in this document are expected to work with any equipment that
complies with networking RFCs.
To read about deployment scenarios that use load balancing, see “High availability
architecture” on page 65.
For information about load balancing, see “Configuring Search Appliances for Load Balancing
or Failover” at http://code.google.com/apis/searchappliance/documentation/60/configuration/
Configuration.html.
Architecting for reliability

The Google Search Appliance often provides mission-critical search capabilities for
applications, such as:
• Public-facing retail websites

• Web-based application services
• Supplier and partner extranets
• Customer service center applications
For any application where Google Search Appliances are providing mission-critical search
capabilities, Google recommends a high availability configuration to provide seamless
operation in the event of a system failure.
High availability is the deployment of multiple search appliances in a configuration where if

one appliance fails, a secondary appliance (or appliances) is able to fail over and resume
service seamlessly with minimal, if any, disruption.
Failover configurations
Failover configurations typically involve two instances of an application or a particular type of
hardware. The first instance, sometimes called the primary instance, responds to search
queries. If the first instance fails, the second instance, sometimes called the secondary or
standby instance, starts responding to search queries.
One such implementation is a domain name system (DNS) switchover configuration that
provides a redundant "hot spare.” This configuration involves multiple search appliances,
where one is used in production and the second one is used as a hot spare. These search
appliances can be located anywhere, physically or logically.
The DNS switchover may be automatically executed in the event of a failure. This switchover
can be executed manually, but it typically results in a more extended outage, due to the need
to wait for manual execution, and, depending on your environment, time for DNS changes to
propagate.
Changes are made in DNS to restore the search if the primary search appliance becomes
inaccessible. This setup is only used for redundancy (or failover) and does not provide a
method of load balancing.
To read about deployment scenarios that use failover, see “High availability architecture” on
page 65.
For more information about failover, see “Configuring Search Appliances for Load Balancing
or Failover” at http://code.google.com/apis/searchappliance/documentation/60/configuration/
Configuration.html.
Active/Active vs. active/passive failover

Once you determine the appropriate Google Search Appliance model based on index capacity
needs, you can begin to scope your query throughput needs. Each model of the Google
Search Appliance has been tested to meet a query volume that is suitable for most internal
corporate network search requirements, as well as most public-facing, site-search
requirements.
However, some organizations have needs to scale beyond one search appliance. For these
instances, multiple search appliances can be deployed in parallel to scale in a linear fashion
for query volume.
By determining the number of queries, you can determine if you require:
• An active/active setup, in which two appliances are set up and serving results concurrently
• An active/passive failover setup for fault tolerance, in which two search appliances are set
up, with one serving results and the other to be used only in the event of a failure on the
primary search appliance
Architecting for reach

A great search experience depends on a number of factors, with a powerful relevance
algorithm as one of the most important. The Google Search Appliance takes care of
relevance, so you can focus on another of the most important factors: making sure your
search solution can reach and search over all your organization’s most relevant content,
wherever it is.
Not all content can be accessed and discovered by crawling. To make sure that this content is
in your index and searchable, you might need to use the following integration technologies:
• OneBox modules, described in the following section
• Feeds, described on page 54
• Connectors, described on page 55
OneBox modules
The name "OneBox" refers to the search box that provides access to information from many
sources. OneBox also refers to the formatted output that appears in response to specific query
keywords. OneBox modules are a powerful tool at your disposal for increasing the breadth of
content in your search deployment.
The following figure shows the OneBox module that appears when a user searches for
“finance.”

OneBox module
OneBox Modules enable a Google Search Appliance to integrate with third-party systems in
real time. OneBox modules supplement Google’s powerful algorithmic search with purpose-
built, targeted data retrieval. They enable the search appliance to display this information to
users in the same context as their algorithm-driven search results.
Implementing a OneBox module consists of the following tasks:
1. Building and deploying a lightweight web service
2. Configuring the search appliance so that it is aware of the service and knows when to call
to it.
When the search appliance receives a query that the OneBox can help with, it passes the
query to the OneBox service provider, which extracts the information from a third-party system
and returns it to the search appliance as XML.
To read about a deployment scenario that uses OneBox modules, see “OneBox integration”
on page 71.
For more information about OneBox modules, see the following documents:
• “Google OneBox for Enterprise Developer’s Guide” at http://code.google.com/apis/

searchappliance/documentation/60/oneboxguide.html
• “Google OneBox for Enterprise Design Principles” at http://code.google.com/apis/

searchappliance/documentation/60/oneboxstyle.html
Feeds
Feeds are used to push data into or delete data from the index on a Google Search Appliance.
To push content to the search appliance, you require a feed and a feed client:
• A feed is an XML document that tells the search appliance about the contents that you
want to push.
• A feed client is the application or web page that pushes the feed to a feeder process on
the search appliance.
There are three types of feeds:
• Web feeds
• Content feeds
• Metadata-and-URL feeds
Web feeds
A web feed provides the search appliance with a list of URLs and possibly some metadata.
Web feeds might be used in the following cases:
• A list of URLs pulled from a database that can be fed into the search appliance in order to
continue to crawl them
• Pushing URLs to the search appliance, from a CMS that is HTTP accessible, when they
are published
• Any list of URLs that you want the content recrawled periodically but don’t want to enter
them in the administration console as start URLs
Content feeds
A content feed provides the search appliance with both URLs and their content. Content feeds
can be full or incremental. Content feeds can be used in the following cases:
• Indexing database content
• Indexing content that comes from a proprietary system
• Indexing content coming from a Google Search Appliance connector
To read about deployment scenarios that use feeds, see “Feeds integration” on page 72.
Metadata-and-URL feeds
Metadata-and-URL feeds can be used to provide additional metadata to the Google Search
Appliance. This metadata can be used for searching and for filtering search results. This type
of feed is commonly used in the following cases:
• Providing incremental URLs to the search appliance for crawl
• Augmenting crawled content with metadata from an underlying CMS
For more information about feeds, see the “Feeds Protocol Developer’s Guide” at http://
code.google.com/apis/searchappliance/documentation/60/feedsguide.html.

Connectors
To discover documents in a document repository through an enterprise connector, the Google
Search Appliance does not crawl the content. Instead, it uses a process called traversal, in
which the connector issues queries to a repository to retrieve document data to feed to the
Google Search Appliance for indexing.
Connectors enable the Google Search Appliance to search and serve documents stored in
non-web repositories such as enterprise content management (ECM) systems. Connectors
are installed on a host running Apache Tomcat. A Google Search Appliance that uses
connectors can perform fast, unified, secure search across multiple systems and document
repositories.
Connectors typically also handle serve-time authentication and authorization for the
repositories to which they connect.
Connectors implement an Open Source set of interfaces This means that in addition to the
four out-of-the-box connectors (listed in the following table), you can extend the reach of your
deployment with custom connectors for whatever content source you need.
To read about deployment scenarios that use connectors, see “Connector integration” on
page 73.
To understand concepts and architecture of connectors, read “Introducing Connectors” at http:/

/code.google.com/apis/searchappliance/documentation/connectors/200/connector_admin/
admin_connector.html. The document describes manual steps for deploying and configuring
the connector manager and contains other information common to all connectors.
For more information about individual connectors, see the documents listed in the following
table.
Connector Type Documentation
Microsoft “Introducing the Google Enterprise Connector for Microsoft

SharePoint SharePoint” at http://code.google.com/apis/searchappliance/
documentation/connectors/200/connector_admin/
sharepoint_connector.html
IBM FileNet “Configuring the Google Enterprise Connector for FileNet (3.5)” at
http://code.google.com/apis/searchappliance/documentation/connectors/
200/connector_admin/filenet35_connector.html and
“Configuring the Google Enterprise Connector for FileNet (4.0)” at
http://code.google.com/apis/searchappliance/documentation/connectors/
200/connector_admin/filenet4_connector.html
EMC “Configuring the Google Enterprise Connector for EMC Documentum”

Documentum at http://code.google.com/apis/searchappliance/documentation/
connectors/200/connector_admin/DCTM-connector.html
Open Text “Configuring the Google Enterprise Connector for Open Text Livelink”
Livelink at http://code.google.com/apis/searchappliance/documentation/
connectors/200/connector_admin/livelink_connector.html
Salesforce Google Search Appliance connector for Salesforce page at

(Google http://code.google.com/p/google-enterprise-connector-salesforce/
Enterprise Labs
feature)
Architecting for security

The ability of the Google Search Appliance to index secure content and respect document-
level permissions at serve-time is one of its biggest strengths. The search appliance can work
with many common security mechanisms, the simplest of which are fileshares and web
servers using NTLM or HTTP basic authentication.
A single Google Search Appliance can easily serve public and secure content together, or as
separate collections, and can handle a mix of different authorization schemes. A typical
intranet deployment has a mix of public and secure content on many different servers.
The following sections describe major security considerations:
• “Crawling secure content” on page 57
• “Serving secure content” on page 57
• “Authenticating the user” on page 57
• “Authorizing the user” on page 57
For comprehensive information about the search appliance and security, see “Managing
Search for Controlled-Access Content” at http://code.google.com/apis/searchappliance/
documentation/60/secure_search/secure_search_overview.html.

Crawling secure content
To access secure content for indexing, the crawler needs to be configured with crawler access
rules. These rules contain the username and password to use and a pattern that defines
where this username and password should be applied. When the search appliance indexes
documents that match these rules, it flags the documents as “private.” Consequently, these
documents require end-user authorization before they are returned in search results.
Alternatively, secure content can be flagged as "public" and it will be included in search results
for all users—but this content may not be accessible when the user clicks a link in the search
results.
Serving secure content

Before the Google Search Appliance serves any results flagged as private in a search result
set, it performs the following checks:
• Authenticating the user
• Authorizing the user
Authenticating the user

Authentication is the process of verifying the identity of a user, a system, or a service.
The most common way to authenticate users is against LDAP (or Active Directory). In the
simplest case, a user is prompted for her username and password the first time she searches
on the Google Search Appliance. This authentication establishes a secure session on the
search appliance itself, and the user is not prompted again while the session is active.
Authorizing the user

Authorization is the process that determines whether an authenticated user, system, or service
has permission to perform a task.
Once a user's identity is established, the search appliance can use it to determine which
resources she has access to.
The search appliance executes the search, generates a result set, and then, for any results
that are flagged as private, performs authorization checks. Authorization is checked for one
page of results at a time (usually up to 10 or 20 results), not the entire result set.
In this scenario, the search appliance performs the authorization check by issuing a request
for the document with the user's credentials. The document isn't retrieved, but the search
appliance checks the HTTP response code and, if it is valid, allows the document to be
presented in the results.
Other access-control mechanisms
In addition to NTLM and HTTP Basic, the Google Search Appliance can also work with the
following access-control mechanisms:
• IWA (Integrated Windows Authentication) / Kerberos authentication
• HTML forms-based authentication
• X.509 digital certificates
Integrated Windows Authentication (IWA)/ Kerberos authentication

IWA/Kerberos is a network authentication protocol that enables client and server applications
to perform mutual authentication for the duration of a user's login session. The search
appliance can use IWA/Kerberos authentication to confirm a user's right to view controlled-
access documents. The search appliance only performs this check during secure serve for
content on HTTP servers. IWA/Kerberos is not supported for crawling content.
When the search appliance is configured to use IWA/Kerberos authentication, it checks the
user's session ticket against a Kerberos Key Distribution Center (KDC) before displaying
secure search results to a user. For Windows servers, the domain controller acts as the KDC
for IWA/Kerberos authentication.
• If a user has a valid ticket, he can see secure search results without having to log in again.
• If a user does not have a valid ticket, or the search appliance is unable to perform
Kerberos authentication, the search appliance prompts the user for his credentials using
HTTP Basic or NTLM HTTP.
HTML forms-based authentication

HTML forms-based authentication, sometime called cookie-based authentication, uses a
cookie (or series of cookies) to authorize users. Forms authentication is often used by single
sign-on (SSO) systems, but it is also used frequently on public websites to allow access to
members-only sections of a site.
X.509 digital certificates

The search appliance uses digital certificates when communicating with web browsers and
servers over HTTPS (HTTP secure). The search appliance also supports the use of digital
certificates to perform X.509 certificate authentication to verify a user's identity before serving
secure results.
SAML SPI
The SAML Authentication and Authorization Service Provider Interfaces (SPIs) enable a
Google Search Appliance to communicate with an existing access-control infrastructure by
means of standard Security Assertion Markup Language (SAML) messages. The
Authorization SPI is also required to support X.509 certificate authentication during serve.

SAML is used for serve-time authentication and authorization only. Crawling still needs to be
configured as described in “Crawling secure content” on page 57 or secure content can be
acquired using feeds or connectors.
To read about a deployment scenario that uses the SAML SPI, see “SAML SPI deployment”
on page 76.
For more information on search appliance configuration for use with these SPIs, see “The
SAML Authentication and Authorization Service Provider Interface (SPI)” at http://
code.google.com/apis/searchappliance/documentation/60/secure_search/
secure_search_crwlsrv.html#the_saml_authentication_and_authorization_service_provider_interfa
ce_spi_.
To learn more about the Google SAML Bridge for Windows, see “Enabling Windows
Integrated Authentication” at http://code.google.com/apis/searchappliance/documentation/50/
admin/wia.html.
Access control list caching

A policy access control list (ACL) provides information to the search appliance about which
users or groups have access to a specific URL or pattern, covering a group of URLs. By
specifying policy ACLs on a search appliance, you can enhance search appliance
performance. Policy ACLs speed up the process of authorization and reduce the load on the
authorization servers that occurs from performing HEAD requests to a remote authorization
server.
Policy ACLs typically store the results that would have occurred if the search appliance
initiated a HEAD request to verify authorization. However, policy ACLs can also be used to
override the decision that would have been returned by a HEAD request.
For example, if you put in a policy ACL rule that permits a group to see all documents at a
URL, but at the source repository (that is, the HEAD request), there's a more fine-grained rule
where only some in the group can view documents, then the behavior with the policy ACL rule
is that everyone can see the search results, but only those who have access rights can click
the links.
Policy ACLs can be an effective way to improve serving of results by carrying out authorization
checks more effectively. However, when making the decision to use policy ACLs, take into
account that you will need to manage synchronization to ensure that the latest security policies
are pushed to the search appliance.
You will also need to have a method for the search appliance to understand groups and user
identifiers. If you do not have an LDAP server configured to provide this information, then you
need to push it to search appliance by means of GData feeds.
Policy ACLs require that you use an authentication method to establish the identity of the user
or group that you specify in the policy ACL rules.
To read about a deployment scenario that uses policy ACLs, see “Policy ACLs deployment” on
page 77.
For more information on policy ACLs and secure search, see “Policy Access Control Lists” at
http://code.google.com/apis/searchappliance/documentation/60/secure_search/
secure_search_crwlsrv.html#PolicyAccessControlLists.
Enhancing technologies
The Google Search Appliance delivers core capabilities in a self contained appliance model.
However, these capabilities can be supplemented by additional, non-core technologies.
Off-box capabilities include any technologies that deliver supplemental resources or

functionality that enhance the core search capabilities. Examples include:
• Cascading Stylesheets (CSS)
• JavaScript
• Images
• OneBox modules
• Front end proxying (application-level presentation)
• Connectors
In some cases, off-box technologies may be used to work around non-compliant content
systems, or to enhance or enrich content. This approach typically takes the form of having the
Google Search Appliance crawl through a custom proxy, and might include:
• Content enrichment—Adding metadata or content based on rules to content being

acquired.
• Content filtering—Removing content according to business rules. For example, blocking

documents that contain social security numbers.
• Rewriting headers or cookies to work around non-conforming content sources.
Experimental features from Google Enterprise Labs

Google Enterprise Labs (http://www.google.com/enterprise/labs/) is where Google makes
experimental or early release features available. These features are not supported by Google,
but might be used to enrich and enhance the search experience. For more information, see
“Google Enterprise Labs” on page 100.
Third-Party tools/connectors
Integration with some content systems is typically achieved through connectors or feeds. In
many cases, it is quicker, easier, and may be cheaper to pay a third party for a pre-built
connector. These connectors typically take the form of either implementations of the Google
Connector API, or a feed/SAML SPI combination. If you need integration to a third party
system, you should make a buy vs. build evaluation if prebuilt solutions are available. You can
find these solutions at the Google Solutions Market Place at http://www.google.com/enterprise/
marketplace/.

Deployment Scenarios Chapter 6
This chapter contains descriptions of architectures that address the following common
deployment scenarios:
• Staging/Development environment, described on page 62
• Simple architecture, described on page 63
• Search as a web service, described on page 64
• High availability, described on page 65
• Disaster recovery deployment architecture, described on page 69
• Integrated architectures, described on page 71
• Federated architectures, described on page 79
These deployment scenarios can be merged as needed to achieve complex deployment

requirements.
61
Staging/Development environment
In most environments, it is advantageous to test changes in a separate environment before
releasing changes to end users. As with any type of server or application, a small change to a
configuration could have unintended consequences, so a proper testing strategy and staging
environment is recommended.
A development environment for the Google Search Appliance simply means replicating the
production environment to provide a separate area for testing configuration changes and new
enhancements. The development environment should include access to the same content
types and sources as the production environment, but it may include a restricted or reduced
set of documents.
A common setup includes a non-production search appliance that does not serve results to
most end-users. All configuration changes, updates, and enhancements are tested on this
search appliance, and then pushed to the production search appliance(s) when ready.
The following figure illustrates a staging/development environment.
Staging/Development environment
This is a recommended deployment architecture for all deployments. Where multiple search
appliances are deployed in production using features such as index replication or federation,
Google recommends that this is also reflected in development.
Where possible, avoid using hard-coded naming conventions that might complicate migrating
configurations. For example, dev_collection or test_frontend would need to be renamed to
move to production.

The Google Search Appliance Admin APIs, introduced in software release 6.0, enable you to
export and import fine-grained configurations (for example, individual KeyMatches) or coarse-
grained configurations (the entire configuration for a search appliance). For information about
the Admin APIs, see the Administrative API Developer Guides at http://code.google.com/apis/
searchappliance/documentation/60/.
Simple architecture
In the simplest deployment scenario, a single Google Search Appliance can function by itself
to provide search results directly to end users. While a single server does provide some
redundancy (RAID, dual power supply), there are still many points of potential failure.
Google recommends deploying a single search appliance only when downtime or service
interruptions can be tolerated. A small company or departmental implementation may not have
a critical need for 99.9% uptime, but for mission-critical search applications where many
people depend on the availability of search, Google recommends an architecture that offers a
greater degree of operational continuity.
The following figure illustrates simple architecture.
Simple architecture
This architecture is appropriate for non-critical systems. It is simple, inexpensive, and easy to
configure and maintain. However, it lacks redundancy, has multiple points of failure, and is not
consistent with best practices for critical systems.
Deployment Scenarios 63
Search as a web service
In this scenario, the Google Search Appliance is used as a search service. The search
appliance delivers its results as XML to a web server that directs the user experience. This
scenario is particularly important for public websites that employ page templates and inherited
stylesheets.
In this architecture, the users never interact directly with the search appliance. Instead, their
searches are intercepted by a component on another website, such as a servlet or custom
control, proxied to the search appliance, and transformed into HTML in the web server.
In this architecture, a primary website maintains control over the search experience (such as
stylesheets, page templates and inherited characteristics). Multiple searches can be executed
on a single results page (such as apple.com or reuters.com). However, secure search is
considerably more complex, due to the intermediary web server.
The following figure illustrates web services architecture.

Web services architecture
High availability architecture

The previous scenarios focus on simplicity and usability. But when availability of the search
service is critical, redundancy must be added to the architecture. The goal of high availability
(HA) architecture is to eliminate single points of failure and increase operational continuity. By
adding redundancy to the system, you can minimize downtime and its impact on end-users.
High availability is a standard architecture for customer- or partner-facing websites where
search plays an integral role in the overall user experience, and for large-scale enterprise
deployments.
In this architecture, at least two Google Search Appliances are online and available to serve
results in case one unit fails. The number of search appliances required depends on the query
volume:
• If one search appliance can handle peak query volume, only one production search
appliance and one hot backup are required.
• If two search appliances are required to handle peak query volume, then two productions
search appliances would be required along with one hot backup. This configuration is
known as N+1 Redundancy.
The Google Search Appliance itself has no utility to manage failover or load balancing—an
external load balancer handles this function.
Each search appliance needs to crawl and acquire content independently. Index replication, a
Google Search Appliance, release 6.0 beta feature can also be used to keep multiple search
appliances in sync. For information about index replication, see “Configuring Distributed Crawl
and Index Replication” at http://code.google.com/apis/searchappliance/documentation/60/
dist_crawl/dist_crawl.html.
The other alternative is to export the configuration from the master search appliance and
import it into the secondary servers. This process can be automated by using the
Administrative API, or the Google Search Appliance admin toolkit.
For information about the Administrative API, see Google Search Appliance development
documentation at http://code.google.com/apis/searchappliance/documentation/60/index.html. For
information about the Google Search Appliance admin toolkit, see http://code.google.com/p/
gsa-admin-toolkit/.
For more information about high availability and load balancing, see “Architecting for reliability”
on page 50.
The following figure illustrates high availability architecture.

High availability architecture
High availability architecture offers scalable, distributed search processing, if required.

Increased reliability conforms to best practices for critical deployments. However, this
architecture involves increased cost of hardware, and deployment complexity.
Load balanced environment

There are a number of configurations for a load balanced-environment. Two of the most
common are described in the following sections:
• “Typical load-balanced deployment” on page 68
• “Recommended load-balanced deployment” on page 68
In most applications, the load balancer acts as a frontend and forwards requests to the
backend search appliances.
If users are going to be searching against secure content, configure the load balancer to
handle persistent ("sticky") sessions. Otherwise, the users may be prompted to re-
authenticate. Persistent sessions shouldn't be required if authentication is handled by SSO,
Integrated Windows Authentication/Kerberos, or if there is only public content. This also
ensures consistent search result pagination across the session.
To run health checks on the backend servers, configure the load balancer. A simple ping test
can monitor network connectivity, but fails to detect application-level failures. An ideal health
check is to query the backend search appliances with a real search term and check the
response.
Typical load-balanced deployment
In a typical load-balanced configuration, two Google Search Appliances are physically
connected to a hardware load balancer or are located physically downstream. This setup is
used for increasing serving capacity. Both search appliances need to perform crawls, unless
index replication is used.
The following figure illustrates a typical load-balanced deployment.
Typical load balancing
Typical load-balanced deployment offers ease of configuration. However, there may be

network saturation during simultaneous crawls: crawl traffic for both search appliances must
go through the load balancer. For this reason, it is best to avoid simultaneous crawls.
Also, with a load balancer, there is a single point of hardware (load balancer) failure. Upon
hardware (load balancer) failure, physical access is required to restore search service,
because either the IP address of the search appliances must be changed, or the load balancer
must be fixed.
Recommended load-balanced deployment

In the recommended load-balanced configuration, two Google Search Appliances are logically
downstream from the hardware load balancer, but potentially on different networks. This setup
is used for increasing serving capacity. Both search appliances need to perform crawls, unless
index replication is used.
The following figure illustrates a recommended load-balanced deployment.

Recommended load balancing
In the recommended load-balanced deployment, there is no single point of failure (pre-existing

network excluded). If the load balancer fails, the IP of the search can be changed. Also,
physical access is not necessary to restore search, and there is no network saturation during
crawls.
However, this deployment requires a special load balancer, which supports balancing or traffic
proxying to external virtual IPs. It also requires more complex ACLs, because rules for
additional IPs must be created. And network traffic due to queries is doubled between the load
balancer and the switch.
Disaster recovery deployment architecture

While the deployment scenario described in “High availability architecture” on page 65 meets
failover and redundancy needs within a given datacenter, it does not address disaster
recovery requirements that many organizations have defined for their mission-critical
applications. To meet these needs, multiple geographically distributed datacenters are in
service to provide shorter response times and higher redundancy and data safety in a
regional, national, or global disaster recovery scenario.
Google Search Appliances can be deployed into such an architecture to provide the same
level of disaster recovery for search capabilities. Essentially, the high availability architecture
described on page 65 is deployed at the primary datacenter. The same configuration can be
mirrored in a redundant datacenter where the search appliances are configured to crawl and
index the same content within their respective datacenters.
This model parallels much of how existing systems and servers would be mirrored between
primary and backup datacenters for global redundancy. In the event of a disaster, this
configuration relies on the existing failover mechanism to divert traffic to the backup
datacenter where the search appliances are online and ready to respond to requests.
The following figure illustrates disaster recovery architecture.
Disaster recovery architecture

Integrated architectures
Delivering universal search across all your enterprise assets typically requires integration to
acquire content from diverse sources that may support heterogeneous access protocols. This
section discusses the following common integrated architectures:
• OneBox integration, described on page 71
• Feeds integration, described on page 72
• Connector integration, described on page 73
These integration architectures can be combined as needed.
OneBox integration
One of the simplest to implement and powerful forms of search integration is the OneBox
module, because it enables retrieval of structured, current information whenever a user
searches.
Typically, you deploy this integration as a lightweight Java Servlet, an active server page
(ASP), or module scripting language, such as Python or PHP. This integration is deployed to a
web server or virtual server, and extracts data from the content source as queries are
received.
The following figure illustrates OneBox architecture.
OneBox architecture
OneBox integration is a powerful solution for adding incremental data in a rapid deployment
cycle. OneBox modules can be used to supplement algorithmic search results with non-
algorithmic data. OneBox modules are quick to deliver and deploy.
However, OneBox modules need to perform well by returning search results in under three
seconds. (If it takes more than three seconds, the OneBox module does not appear with the
search results.) Also, OneBox modules are not appropriate for large volumes of data (more
than eight results).
For more information, see “OneBox modules” on page 52.
Feeds integration
Feeds can be pushed to the Google Search Appliance to enrich content with metadata or get
content into the index that the search appliance cannot discover through crawling. Most
commonly, a feed pushes the following types of content:
• Content that is not linked from other documents
• Content that is served by means of JavaScript or some other unsupported protocol
• Content from an application that is not web-enabled
The feed server extracts data from a CMS and other applications, formats it into XML, and
feeds it into the search appliance, where it is added to the index and made searchable. A feed
can also be used to provide a list of URLs for public-facing content that is served from behind
JavaScript.
Where the feed is a metadata-and-URL feed, the search appliance still needs to be able to
crawl and access content, If the search appliance is not able to do this, use a content feed.
The following figure illustrates feeds architecture.

Feeds architecture
Feeds are an effective means of integration and of increasing breadth of scalability. The feeds
API is powerful and can handle high volumes of content.
However, if you are using feeds to achieve more rapid acquisition of content, consider using
aggregation design patterns that group feeds into small batches, rather than high-volume
individual documents.
For more information, see “Feeds” on page 54.
Connector integration
Connectors enable indexing and query-time connections between a Google Search Appliance
and non-web repositories, such as Enterprise Content Management (ECM) systems. A
connector instance traverses a document repository and feeds document data to the search
appliance for indexing. At query time, connectors forward authentication credentials and
authorization requests to the repository.
The key implementation detail in the connector architecture is the connector manager. The
connector manager simply provides an environment for the connectors to run within. The
connector manager also saves various configuration and state parameters for each connector
instance. A single connector manager can manage multiple connectors for multiple search
appliances.
The connector manager does not run on the Google Search Appliance—it must run on a
separate server. The connector manager is fairly lightweight however and it is not usually
necessary to host it on a dedicated server.
The following figure illustrates connector architecture.
Connector architecture
A typical connector server can be deployed on a lightly provisioned server. 2GB of RAM and
20GB of disk should be sufficient in most cases, although you should evaluate your own
specific requirements. The connectors can also be deployed on a virtual machine (VM).
For more information, see “Connectors” on page 55.

Security solutions
In many cases security can be delivered using onboard capabilities such as NTLM, LDAP,
Kerberos, forms authentication, or certificates. An administrator can configure these
capabilities by using the Google Search Appliance’s web-based Admin Console.
Where security requirements are complex or non-standard, a variety of architectures can be

used to deliver appropriate security. These architectures can be combined as needed to
deliver rich and flexible security environments.
The Google Search Appliance is configured with appropriate credentials to crawl and acquire
only the content you want to serve. When a user searches secured content, her user
credentials are checked to see if she is authorized to view content at serve time—ensuring
that users never see content that they are not entitled to view.
The following figure illustrates secure HTTP and file server content architecture.
Secure HTTP and file server content architecture
A secure solution is relatively straightforward to configure. However, depending on your
authentication methods, the solution might require users to enter multiple sets of credentials.
For more information, see “Architecting for security” on page 56.
SAML SPI deployment

In many cases, security requirements can be met using onboard capabilities, but many
enterprises have a diverse, heterogeneous set of security systems, often with non-standard
protocols. This situation is handled by implementing a SAML SPI, which provides a single
security federation point for multiple content sources.
A SAML provider is deployed and the Google Search Appliance is configured to be aware of it.
Content is crawled where possible and fed to the search appliance by standard content feeds
where content is not accessible.
When a user searches secured content, he is authenticated against the SAML SPI. This is
responsible for obtaining all necessary identities and authorizing the user against each content
source to make sure that only authorized content is displayed.
If the user searches content that is not one of the systems for which the SAML provider is the
security provider, the search appliance reverts to one of the other security protocols (HTTP
Basic, NTLM, connector authorization, and so on).
The following figure illustrates heterogeneous architecture.
Heterogeneous security architecture

For more information, see “SAML SPI” on page 58.
Policy ACLs deployment

Late-binding security means authorizing against content sources at serve time. Early-binding
security means using the Policy ACL APIs to pre-load security permissions into the search
appliance. In some cases, late-binding security needs to be supplemented with the flexibility
and speed of early-binding security.
Policy ACLs enable the search appliance to perform security checks more efficiently. This
method is particularly useful when either the network or your content repositories are slow and
may not support real-time authorization.
Access permissions are fed to the search appliance using the Policy ACL API either at crawl
time, or as needed.
The Google Search Appliance uses the policy ACLs where they are defined. Where they are
not defined, then one of the other security methods is used, or if content is not secured, it is
served unsecured.
The following figure illustrates public and secure policy ACLs architecture.
Public and secure policy ACLs architecture
Users need to authenticate to an appropriate identity management system, against which

ACLs are assessed. Although the preceding diagram shows LDAP, it could also be Kerberos,
SAML provider, or X.509 certificates. Forms authentication may not be used as the primary
identification source.
For more information, see “Access control list caching” on page 59.

Federated architecture
In some situations, it makes sense to deploy a federated architecture in which multiple Google
Search Appliances each index separate repositories. This capability, called dynamic
scalability, is useful when you have sets of content that can be grouped based on physical
nodes. Dynamic scalability gives you the ability to scale out (by adding more search
appliances) as well as up (by adding document count to an existing search appliance or
moving to a single more powerful appliance). For more information, see “Dynamic scalability”
on page 46.
Results can be federated together from multiple nodes into one result set. It is more difficult to
provide redundancy for individual nodes in this scenario, because the federation mechanisms
currently do not provide any way to deal with failover on an individual per-node basis.
Typically, this will be achieved by configuring an identical failover deployment for high
availability. Index replication cannot be used with federation.
The following figure illustrates complex federated architecture.
Complex federated architecture
For more information, see “Architecting for scale and performance” on page 48.
Federated high availability deployment architecture
In many cases, the architecture and implementation of a Google Search Appliance and search
solution is simple. However, an implementation can also become much more complex as you
begin to use different combinations, such as federated, high-availability installations in
geographically diverse data centers. This complexity can come in the form of multiple
federated search appliances located in multiple locations around the globe, indexing content
from multiple repositories.
This is a more complex architecture which shows the use of an application layer for
presentation, federated search appliances, using connectors and feeds, replicated in two
global data centers for disaster recovery.
The following figure illustrates a complex architecture.
Complex architecture
This example ties together many of the concepts from the previous examples to create
redundant system with global scope. In this case, there is a global company, with most of their
content based in North America. At the core there are eight federated GB-9009s. By
federating multiple Google Search Appliances, administration can be split across multiple
administrators or departments within the company. A smaller network of three Google Search
Appliances indexes European content, and a load balancer splits traffic between the two sites.

In this case there is a connector manager instance which connects to a Documentum
repository, and a feed server which can push feeds to any of the search appliances.
This architecture leaves room for a number of additional configuration options:
• The eight North American Google Search Appliances could be split into two clusters of
four search appliances each, and then load balanced for capacity or redundancy.
• The European Google Search Appliances could be federated together with the North
American search appliances.
• For additional redundancy, the connector/feed server could be replicated.
For more information, see “Architecting for scale and performance” on page 48.
Post Deployment Chapter 7
Because the search solution is a core business system, you need to ensure processes are in
place for appropriate maintenance and management of it. After you successfully deploy your
search solution, transition it to Business As Usual (BAU). Because the Google search solution
is flexible and standards-based, post deployment can continue to be a period of evolutionary
growth and refinement.
The following sections discuss post-deployment best practices for a Google search solution:
• “Update planning,” which follows
• “Planning for renewal” on page 87
• “Optimizing support” on page 89
• “Using reports to enhance the search experience” on page 91
Update planning
Google releases regular software updates to the Google Search Appliance about twice a year.
You are entitled to deploy any updates throughout your support term. When a new update is
released, consider updating your search appliance.
Google will notify you of any major release (such as the 6.0.0 release), but check the Google
Enterprise Support site (http://support.google.com/enterprise, password required) regularly for
release information. You can also contact your Google representative or Google Enterprise
partner to discuss updating your search appliance.
This section contains information about the following best practices for update planning:
• Read the release notes, described on page 84
• Read the update instructions, described on page 84
• Export and back up your configuration, described on page 85
This section also contains information about major releases (on page 85) and update releases
(on page 86).
83
Software release versions
Software release version numbering for the Google Search Appliance follows a consistent
format, as shown in the following example:
5.2.0.G32-P1
This can be read as software version 5, point release 2, update release G32, VM Patch 1.
Update releases are discussed in more detail on page 86.
To find out what version your search appliance is running, use the Google Search Appliance
Version Manager at:
http://<your-search-appliance>:9941
To access the Version Manager, you need to log in as the admin user (no other user – even
one configured as an administrator can access the Version Manager).
In general, Google recommends that you deploy the latest G release for your software version.
Read the release notes

The release notes contain information about issues that have been resolved, as well as any
open issues. Google periodically revises the release notes, so you should revisit these from
time to time.
The release notes can help you to understand what search appliance behavior may have
changed and what testing you may want to carry out. Use the release notes to help you
determine if there are any specific challenges that an update can help you resolve. Also, look
also for any open issues that you need to be aware of and plan for.
You can find the release notes for all current and recent software versions at https://
support.google.com/enterprise/doc/gsa/00/update_index_page.html (password required).
Read the update instructions

Typically, although not always, a major release requires you to update the Google Search
Appliance with the following two binaries:
• An Operating System (OS) update
• A system update, which delivers the new search functionality
If a release includes two binaries, apply both binaries.
Read the update instructions and understand the sequence of events. The update instructions
provide the steps for executing the update, but you need to plan appropriately to manage the
business impact.
You can find the update instructions for all current and recent software versions at https://
support.google.com/enterprise/doc/gsa/00/update_index_page.html (password required).

Export and back up your configuration
Before updating your search appliance, always perform the following tasks:
1. Export your configurations and front ends to files.
2. Back up your exported configuration files.
3. Place the exported configuration files in a version control system.
You also need to remember the key phrase for full configuration files. Placing this in the check-
in comments in your version control system is a useful practice.
Major releases
The process of updating your search appliance is simple. It consists of the following tasks:
1. Downloading the update binaries from the Google Enterprise Support site.
2. Uploading the update binaries to the search appliance by using the Version Manager.
Plan and execute the update as you would for any enterprise application.
Because the binaries are large, you might want to stage them on a local server so that the
Google Search Appliance can access them without leaving the LAN. It is important to use this
approach if the search appliance does not have external access. Always check the MD5
hashes before uploading binaries to your search appliance.
Understand your update path

If you have not updated from prior versions, it is possible that you may need to do a two-phase
update. For example, because there is no direct update path from version 5.0.4 to 6.0, you
would need to update first to 5.2, and then to 6.0.
This process is not usually problematic, but factor it into your migration plans. The main impact
of this scenario is that it can add a couple of hours to the update time and it may mean that
you cannot take advantage of index migration, if available.
Update development first

Google recommends that you update your development environment first, if you have one.
This approach enables you to test the new version and time the update process. This could
also be a good opportunity to test any new features, so that you can potentially roll them out
with the new release.
Update production
Typically, major updates require rebuilding the index to take advantage of new features. This
means that you need to factor this process into your plans. As of version 6.0, the Google
Search Appliance enables migration of the index also, so that the search appliance does not
need to recrawl content to get it back into the index. Check the release notes and update
instructions to see if this feature is available in the software version to which you are updating
your search appliance.
Post Deployment 85
If you plan to rebuild your index by crawling, include time in your update schedule and crawl
schedule to allow the search appliance to re-acquire content without placing excessive load on
your content servers. This may prolong the update process, so if possible, you should try to
use index migration.
Update sequence
The following steps outline the sequence for updating Google Search Appliances:
1. If you have a hot backup search appliance, update this first and execute regression tests
there.
2. Once the hot backup is updated, you can revert to serving users from your backup
server(s) while you update your production server(s).
3. Update your production servers either one at a time, or in parallel, so long as you have
alternate serving capabilities available. Alternatively, if you have a scheduled maintenance
window during which outages are planned, you could make use of it.
To handle updates without disrupting full capacity, consider having an additional node in
production in conjunction with your Google Search Appliance or load-balanced search
appliances.
Use Test Mode

While you are updating a search appliance, you can put it in Test Mode. This “freezes” the
current index, and continues to serve results from that index while you test against the test
index. The test index continues to crawl and acquire content while you are testing. Test Mode
is a valuable tool that you can use to:
• Test a new installation and if you migrated the index, test the index migration.
• Acquire content in the upgraded index while allowing the old index to continue serving.
This is a powerful tool for migrating with minimal disruption.
Update releases
In addition to the regular primary releases, Google release smaller update releases. Update
releases are marked as G releases, such as release 6.0.0 G32. Check the support site at
https://support.google.com on a regular (once every month or two) for update releases.
Update releases are low-impact and aimed at delivering specific enhancements or addressing
specific issues. As a rule of thumb, you should consider ensuring that you are at the latest G
release available.
As usual before updating, read the release notes and update instructions.

Planning for renewal
Because the Google Search Appliance is a solution delivered on hardware, it needs to be
managed along with your regular hardware and application renewal processes.
The Google Search Appliance is licensed for either two or three years use with full support.
Renewing a search appliance is not usually a complex process, but it does require planning. In
most cases, the process is similar to updating, described in “Update planning” on page 83.
However, the search appliance needs to re-acquire content and you need to plan for
deployment of physical search appliances.
Review physical requirements

In the majority of cases, the Google Search Appliances to be deployed have the same form
factor and power requirements as the installed search appliances.
However, the GB-5005’s (4-10 million documents) and the GB-8008’s (15–30 million
documents) have been replaced by the following new, more powerful units, with greatly
reduced form factors:
• GB-7007—The GB-5005 has been replaced by the GB-7007, which is a 2U unit. You will
no longer require special power configurations, such as the 15 Amp power supply.
• GB-9009—The GB-8008 has been replaced by the GB-9009, which consists of two units:
a 2U appliance, and a 3U storage module totalling 5U. Each node will require a power
supply, but you will no longer require special power configurations, such as the 15 Amp
power supply.
These units ship pre-configured in their own mobile rack unit. For search appliance physical
specifications, see “Planning for Search Appliance Installation” at http://code.google.com/apis/
searchappliance/documentation/60/planning/planning.html.
Review and test versions

When you are preparing to renew a Google Search Appliance, check the software version that
it is running. If you have not kept your existing search appliance up-to-date with software
versions, you may find that the new search appliance comes with a different version. If the
software version on the new search appliance is different, you need to perform the same tasks
that you would for an update. For more information, see “Update planning” on page 83.
Plan for content acquisition

When you deploy your new search appliances, you need to re-acquire content. At this time,
the Google Search Appliance does not support index migration from one search appliance to
another.
Google recommends procuring your search appliances far enough in advance to acquire
content before the planned renewal date. Be sure to consider all content acquisition methods,
including:
Post Deployment 87
• Web crawl
• Database synchronization
• Feeds—both content and Metadata and URL
• Connectors to third-party applications
If you are using connectors, you may need to run parallel deployments for a short period to
ensure that the content is acquired by the new search appliance without affecting the existing
deployment. You need to plan sufficient infrastructure for running parallel deployments.
Review and test your network configurations

Before cutting over to your new search appliance, plan your network configurations, and test
ahead of time, if possible. You should plan and test for the following components.
DNS configurations • Ensure that host names will resolve to their new IP
addresses correctly and that you understand how long it
will take DNS changes to propagate. This is particularly
important for public-facing search being served directly
by the search appliance, where DNS management is
much less predictable.
Load-balancer configurations • If you have a load-balanced deployment, ensure that the

new search appliances can be seamlessly load
balanced into the environment without disruption.
Firewall rules • Ensure that firewalls have been configured appropriately

to allow the renewed appliances access to all content
sources, and resources (such as Active Directory/
LDAP). This will be addressed largely during the content
acquisition phase.
Disaster recovery/hot backup • Review, and if need be, update your scripts and
processes to ensure that failover will be smooth, and
business continuity achieved. If possible, you should test
disaster recover failover shortly after renewal.
Execute cutover
You should execute cutover to the new search appliances during a time of limited user activity.
It is recommended that you communicate actively to users. You should let them know:
• Maintenance is occurring—this should be consistent with your regular maintenance

communications.
• Any new features they may get as part of the re-deployment.
To smooth migration of content, explore using index replication, introduced in software release
6.0.

Optimizing support
The Google Search Appliance comes with full access to support for the duration of the license
period. During the license period, you can access support through the Enterprise Support Site
at https://support.google.com (password required).
Enterprise Support might require you to update your search appliance to a more recent
software release if it is on an older release. Another reason that support might ask you to
update your search appliance is so that you can take advantage of bug fixes that have been
implemented in more recent releases. Updating to a standard release enables you to get bug
fixes without having to deal with various patch releases.
Google Enterprise support engineers provide support and troubleshooting for core Google
products (the Google Search Appliance, connectors, and so on).
On occasion, Google Support Engineers require remote access to your search appliance to
troubleshoot issues.
Accessing the search appliance

Google Enterprise Support uses the methods listed in the following table to access your
search appliance to troubleshoot issues. These methods are available to customers with
Standard Enterprise support.
Under standard support, there are a variety of options open to you:WhWhe

Method Description Notes
SupportCall A process on the search appliance Can be accessed at the following

that opens a secure connection to a URL:
Google server.
http://<appliance-
This requires outbound SSL (port hostname>:8000/
EnterpriseController?actionType
443) access to be configured
=supportCall
through your firewall
Direct SSH A secure shell (SSH) connection It is recommended that you do not
across the internet. leave SSH enabled, but enable it
only for the duration that support is
This requires you to configure your required.
firewall to accept inbound port 22
connections to the appliance from
Google.
You need to enable SSH on the

search appliance from the Admin
Console.
Modem A secure connection using an Although supported, modem is not

analog telephone line. the preferred method, as modem
data rates may limit the speed of
You can use any 56.6 kbps modem, resolution should files need to be
or Google can ship a modem, on uploaded or downloaded to the
request. search appliance.
Post Deployment 89
Method Description Notes
Custom VPN Creates a secure connection to your Must be approved by Google

network by using your standard VPN Enterprise Support.
solution.
GoToAssist Third-party applet that allows a It is recommended that this option
(available secure connection to a Windows be purchased by customers who
only if you PC inside the customer's private have security concerns or
have network, then uses SSH to processes that make allowing
purchased connect to search appliance direct remote access challenging.
Collaborative across the customer's network.
Support)
This method requires a customer
contact to be available on a PC
that can connect to both the
internet, and the appliance.
The PC should have putty or a

similar telnet client installed.
When you contact Enterprise Support, provide the following information to help resolve your
issue:
• Search appliance software version, including G number
• Search appliance ID’s
• Remote access details for chosen methods (SSH configuration and routing, support call,
and so on)
• License information
• Detailed description of the problem, including error messages, screenshots, actions taken,
and so on
Premium support
You can purchase premium support from Google. Premium support entitles you to 24/7 pager
support, and improved service-level agreements (SLAs). Premium support also includes a
secondary search appliance that must be deployed with the same configuration as the
production search appliance.
Disconnected support
When providing disconnected support, Google support does not have remote access to the
search appliance. You can purchase disconnected support, if required, with approval from
Google support. It is recommended that you explore all other support options before pursuing
this option.

Support service guidelines
For full support details, including service level agreements (SLAs), see “Google Search
Appliance Technical Support Service Guidelines” at https://support.google.com/enterprise/doc/
gsa/terms/tssg_gsa_combined_jun09.html (password required).
Additional support
Google Enterprise Support does not support broader deployment issues, such as custom
development supplementing the Google Search Appliance. You can purchase this type of
support from certified Google Enterprise partners in the Google Solutions Marketplace at http:/
/www.google.com/enterprise/marketplace/.
Using reports to enhance the search experience

You can significantly improve the ROI of your search deployment by spending a little time
examining what your users are doing with the search appliance and what kind of search
experience they are having.
By examining this information, you can:
• Understand the business value and criticality of your search application. It is much easier
to assign a business value and priority to search if you know how it is benefiting users.
• Understand what your users are searching for and whether they are finding it effectively.
Insights that you gain will help you understand which features to use, and how to use
them. Giving your users a great search experience increases user satisfaction and
therefore the overall success of the solution.
• Understand what content is important to your users.
Example scenarios are described in the following sections:
• “Using core capabilities to help users find content more efficiently,” which follows
• “Demonstrating ROI and quantifying business value” on page 92
• “Understanding what content is important” on page 93
The Google Search Appliance provides an analytics feature, advanced search reporting
(ASR), that captures detailed information about user search and navigation activity. ASR can
be activated easily through the search appliance web-based Admin Console. Analytical
information can then be extracted from the search appliance and consumed into your existing
analytical tool, or you can process the data using scripts that can be downloaded and
customized from Google Enterprise Labs.
You can use these reports to analyze data such as:
• Searches being executed
• Number of results returned
Post Deployment 91
• How many pages a user clicked through
• Which result on the page they clicked on and where they went.
Whatever your solution, Google highly recommends that you provide a rich analytics
capability, regularly examine the data to refine your search deployment, and identify ways to
add additional value.
Search Experience” at http://code.google.com/apis/searchappliance/documentation/60/
Using core capabilities to help users find content more

efficiently
In this scenario, user feedback is consistently positive and users are able to find the content
they are looking for. They report that they spend substantially less time searching for results.
However, after examining your users’ search behavior, you notice that 90% of searches for
“widget” are immediately followed by a second search for “gadget.” Similarly 50% of users
searching for “vacation” click on the fifth link—to your policy database.
Based on these observations, there are two immediate actions you might take to increase user
effectiveness:
• Activate query expansion and upload your own synonyms list, including an expansion that
equates widget and gadget, so that a search for “widget” automatically becomes a search
for “gadget.”
As a result of this enhancement, 90% of users running a search for widget or gadget will
find search twice as effective. Using query expansion, and adding your lexicon to the
Google Search Appliance is a quick way to increase search effectiveness immediately.
• Create a KeyMatch for “vacation” to return the vacation policy.

This enhancement presents the vacation policy at the top of the search results. For 50% of
users, the result at the top of the page is the one they want to see.
For information about using query expansion and KeyMatches, see “Creating the Search
Experience: Best Practices” at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_improving_search.html.
Demonstrating ROI and quantifying business value

In this scenario, your workforce consists primarily of knowledge workers. By examining your
analytical information, you can see that users typically:
• Execute on average 1.3 searches
• Retrieve result number 3
• Spend on average 45 seconds using search

This is a substantial improvement on the estimated 25% of time spent on search by the
average knowledge worker. (Source: “Hidden Costs of Information Work: A Progress Report,”
published on the IDC website at http://www.idc.com/research/view_lot.jsp?containerId=217936.)
Understanding what content is important

In this scenario, you have a large number of content repositories across a variety of internal
websites and fileshares. By analyzing your data, you see that most users click on content from
your corporate wiki. This immediately tells you where your users find the content most
compelling.
Because you know that the corporate wiki is where your most useful content is, you can create
a result biasing profile that moves it higher in search results. By doing this, you ensure that the
corporate wiki appears in results where users can most quickly find it.
For information about creating result biasing profiles, see “Using Result Biasing to Influence
Result Ranking” at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_improving_search.html#h1resbias.
You can also understand what content is important by using analytics to gather information
about user clicks. If your organization has an existing analytics solution in place, It may be
possible to use this solution to provide analytical insight into the user search experience. In
many cases, integration with a third-party analytical solution requires some effort to get
search-specific reporting, but there is substantial value that can be derived from the data.
Post Deployment 93
Putting the User First Chapter 8
The success of your deployment depends not only on the breadth and depth of search, but
also on how satisfying and effective the search experience is for users. There are many things
you can do to drive user satisfaction and increase use of the search solution. The following
sections discuss tools for enhancing the search experience:
• “Presentation methods,” which follows
• “Enrichment features” on page 99
• “Google Enterprise Labs” on page 100
• “User feedback” on page 101
Presentation methods
There are two primary methods of delivering the search experience to your users:
• Google Search Appliance presentation layer, described on page 96
• Application presentation layer, described on page 96
Choose an appropriate method for your users based on the outcomes you are trying to
achieve and technical requirements.
95
Google Search Appliance presentation layer
The Google Search Appliance uses an XSLT stylesheet for its presentation layer. Using this
built-in presentation layer has several advantages:
• All presentation is rendered on-box and delivered direct to the user. The search appliance
does not require any additional hardware to manage presentation.
• Built-in user features (such as query suggestions, dynamic result clusters, and so on) can
be enabled and delivered to users as simply as selecting a checkbox.
• Relatively sophisticated user experiences can be delivered by means of declarative XSLT

transformations through direct customization of the stylesheet.
However, there are some limitations—most notably that highly sophisticated, interactive or
JavaScript-rich user interfaces are more challenging to deliver, primarily due to the declarative
nature of XSLT and security restrictions that prevent uploading of content to the search
appliance. If the search experience is implemented using the built-in presentation layer, all
JavaScript must be embedded directly into the output HTML pages, which may lead to
browser inefficiencies.
For information about using the Google Search Appliance presentation layer, see “Creating
the Search Experience” at http://code.google.com/apis/searchappliance/documentation/60/
admin_searchexp/ce_understanding.html.
Application presentation layer

The Google Search Appliance can return search results (including metadata) directly in XML
which can be processed and rendered on a separate application server or portal. This method
has several advantages:
• Presentation can take full advantage of the flexibility and richness of modern programming
languages, such as Java, Python, .NET or even Flash to provide an extremely rich and
interactive UI.
• Removing the rendering of content from the search appliance also removes the
processing required by the search appliance.
• Additional resources (such as style sheets, JavaScript files, images, and so on) can be
hosted on a separate server and delivered to client browsers as included resources,
improving perceived performance to users.
• Security can be managed at the application level by allowing the application to determine
the collections and front-ends a user is able to see.
However, using a separate application presentation layer has some disadvantages:
• The deployment architecture is more complicated because of the additional hardware

required.
• It is somewhat more challenging to manage document-level security because credentials

need to be captured at the application layer and forwarded to the search appliance.

• Incorporation of some of the core features may require more effort to bring into the UI. For
example, enabling dynamic result clusters with the search appliance presentation layer is
as simple as clicking a check box in the Admin Console. To enable this feature with an
application presentation layer requires manual implementation in your application.
For a diagram illustrating use of an application presentation layer, see “Search as a web
service” on page 64.
For information about search results in XML, see “XML Output” at http://code.google.com/apis/
searchappliance/documentation/60/xml_reference.html#results_xml.
Using the search appliance’s front ends

The Google Search Appliance feature that enables you to create different search experiences
for users is the front end. A front end is a framework that manages most of the elements of a
single search experience. Front ends can be broadly classed into two categories of
functionality:
• Content filtering and enrichment
• Presentation/look and feel
Content filtering and enrichment

This category contains many of the features described in “Enrichment features” on page 99.
These enhance the user’s search effectiveness significantly. Because many of these result in
changes to the results XML, these features should be used regardless of whether presentation
is delivered on-box or by means of a separate presentation layer, as they can generally be
leveraged at the application tier.
Presentation/look and feel

Changing the look and feel of the UI can be as simple as updating the global settings and
activating the usability features using the search appliance’s Page Layout Helper, or as
sophisticated as extensive customization of the stylesheets and deployment of labs assets.
You may choose to completely bypass this if you are using a custom presentation layer and
incorporate only the features that enhance usability.
For more information about front ends, see “Managing the Search Experience” at
http://code.google.com/apis/searchappliance/documentation/60/admin_searchexp/
ce_understanding.html#h1manexp.
Using collections to manage the user experience

Collections are used to create logical groupings of content within the index. A document can
appear in as many collections as needed, and collections can include all documents, or be as
narrow as a single document. You can enable users to search all collections or restrict them to
a specific one.
Putting the User First 97

Understanding how your users are searching and what they are looking for can help you to
use collections effectively. For example, users in support may need to see all your product
documentation. For them it might be useful to separate results into the following collections:
• Product documentation
• Support requests
• Bugs and issues
These collections could be presented as separate tabs on the same page.
Bringing collections and front ends together

Different user groups may have dramatically different search needs with different
presentations, results filtering rules, and content.
For example, a marketing or public relations department might want a visually rich, interactive
UI that enables them search for previous communications, video, audio and images. On the
other hand, IT support might want a fast, light UI that enables them to search for technical
content quickly.
To meet the different user interface needs of each department, a search appliance could have
two different front ends. To meet the different content needs of each department, a search
appliance could have multiple collections. Collections could be used to segment the index in
ways that serve the different departments.
If both departments need to search the same content, filtering, enrichment, and biasing
profiles can be used to provide a different set of results for each. While public-facing product
documentation is of primary interest to the marketing department, this content may be of
secondary interest to support, who should be able to find it, but as a secondary priority to
current support tickets.
Using front ends and collections together effectively can substantially improve the search
experience for all users through a powerful and flexible range of deployment options.
For more information, see “Using Collections with Front Ends” at http://code.google.com/apis/
searchappliance/documentation/60/admin_searchexp/ce_understanding.html#h2coll.

Enrichment features
Several Google Search Appliance enrichment features enable you to customize search results
and enhance the user experience. By using these features, you can ensure that users get
search results that are appropriate to their interests, roles, departments, locations, languages,
or other characteristics.
For example, Alpha Inc. is releasing a AlphaLyon 3.0, a new software version of their flagship
product. The company want to ensure that when users search for AlphaLyon, information
about release 3.0 appears among the top search results.
The Google Search Appliance offers several enrichment features. The following table lists
some of these features.
Query suggestions • When query suggestions are enabled, search queries

auto-complete and query suggestions, with the most
common searched terms, appear as a user types in the
search box.
KeyMatches • KeyMatches are preferential search results, or

recommended links, that appear at the top of the search
results. A KeyMatch gives an end user an opportunity to
navigate immediately to the recommended document.
This means that the end user spends less time
searching for documents and more time looking at them.
KeyMatches are a great way to promote a specific
document, such as a new product or policy document.
Dynamic result clusters • Dynamic result clusters show different topics for a
specific search term. These topics enable users to focus
on areas of interest while ignoring irrelevant information.
When a user clicks on any of the topics, the search
appliance returns a new, narrower set of results.
Result biasing • Result biasing enables you to influence the way that the
search appliance ranks a result, based on URL,
document date, or metadata in or associated with the
result. You can use result biasing to increase or
decrease the scores of specified sources, or types of
sources, in the search index. These local settings can
affect the order of the search results, and give a different
user groups different biasing profiles to a customized
search experience.
User-added results • User-added results give users the capability to add

search results for certain keywords. User-added results
cause designated documents always to appear on the
results pages for specified keyword searches.
Alerts • Alerts allow users to monitor topics of interest by

receiving search results for these topics in email
messages.

Alpha Inc. uses these search appliance enhancement features to promote AlphaLyon 3.0 with
the following results:
• As the user types “AlphaLyon” in the search box, query suggestions cause the search
query to auto-complete before the user finishes typing it. Alternative terms that narrow the
search, including “AlphaLyon 3.0,” also appear in a menu below the search box.
• A KeyMatch for AlphaLyon 3.0 appears at the top of the search results, proclaiming “New
Release! AlphaLyon 3.0 Documentation” that guides the user to the documentation for the
new release.
• Dynamic result clusters cause dynamically formed subcategories based on the results of
of the search to appear along with algorithmic results. Each subcategory groups similar
documents together. For AlphaLyon, such categories might include “AlphaLyon 3.0
product information,” “AlphaLyon 3.0 documentation,” and “AlphaLyon support options.”
Instead of reading through all search results, users can browse a subcategory.
• Result biasing causes documents about AlphaLyon 3.0 to appear higher in the algorithmic
search results than documents about earlier versions.
Because Alpha Inc. enables user-added results, their users have the capability of adding
search results for key words. For example, a user adds a result for “AlphaLyon 3.0 Installation
Guide” that appears on the results page when anyone searches using the keyword
“AlphaLyon.”
Alpha Inc. has also enabled alerts, so users can monitor topics, such as AlphaLyon 3.0, and
receive search results about them in emails.
For comprehensive information about all Google Search Appliance enhancement features,
see “Creating the Search Experience” at http://code.google.com/apis/searchappliance/
documentation/60/admin_searchexp/ce_understanding.html.
Google Enterprise Labs

Google Enterprise Labs (http://www.google.com/enterprise/labs/) is where Google puts
experimental features that you can incorporate into your search deployment. These features
can usually be deployed on a variety of Google Search Appliance versions and enable you to
enrich your users’ search experience between release cycles.
Google Enterprise Labs features are usually pre-built, ready to go, and easy to deploy.
Many of the Google Enterprise Labs experimental features eventually graduate to the search
appliance, and become part of the core product. For example, query suggestions, dynamic
result clusters, and user-added results all started on Google Enterprise Labs but have now
been incorporated into the core on-board capability.
Although experimental features are not supported by Google, certified Google Enterprise
partners are experienced with these capabilities and are able to help implement them. You can
find a Google Enterprise partner at the Google Enterprise partner directory at http://
www.google.com/enterprise/gep/directory.html.

User feedback
Users are conditioned to having a great search experience with continuous innovation on
Google.com. By using the search appliance, you can deliver a similarly innovative and rich
experience to your users.
One of the best ways to innovate is by capturing user feedback on what they like and don’t like
about the search solution, as well as understanding how they are using it. User feedback is
critical to successful deployment. To deliver value, not only must you deliver a great search
experience, but you need to have users actively using it.
There are several methods for getting user feedback, including:
• Implicit feedback
• A feedback link
• A user survey
Implicit feedback
By activating advanced search reporting, or another analytical capability, you can
automatically see what your users are doing, where they are succeeding, and how you can
help them be more effective. However, it’s important not only to capture this data, but also to
use it.
Search Experience at http://code.google.com/apis/searchappliance/documentation/60/
Feedback link
Make it easy for users to provide feedback by providing a link or an email address for
submitting their comments.
User survey
A user survey is a great tool to analyze how satisfying users find your search solution. Surveys
should be sent out regularly, and after each phase in your deployment, so that you can iterate
rapidly, and continue to delight your users. Appendix C, “Enterprise Search Satisfaction
Survey,” contains a sample user survey.

Best Practices Appendix A
This appendix presents some best practices in the following major areas of search appliance
deployment:
• Datacenter and installation
• Crawl
• Feeds
• Index reset
• Collections
• Serving
• Front end stylesheets
• Security
• Ongoing administration
Datacenter and installation
Use an uninterruptable power supply

The Google Search Appliance's RAID configuration can help mitigate disk failures, but a
power failure can cause data loss which your system may not be able to recover from. While
there is no built-in support for specific UPS units, many UPS devices can execute scripts upon
power failure. By using the Google Search Appliance Admin API or admin toolkit scripts, you
may be able to configure your UPS to initiate a graceful shutdown of the search appliance.
103
Use dual power sources
The GB-7007 and GB-9009 models ship with redundant power supplies. Even if your site does
not have dual power sources, it is beneficial to use both power supplies. At a minimum, each
power supply should be attached to a different circuit, and separate UPSs, if possible.
Use gigabit Ethernet

While the Google Search Appliance's NIC can be plugged into a 10 Mbit or 100 Mbit port,
Google does not recommend using the speed of the network connection in an attempt to slow
down the crawl rate. Instead, adjust the speed of the crawl by using the Crawl and Index >
Host Load Schedule page in the Admin Console.
Crawl
Start with the largest content repositories

The Google Search Appliance's crawler can be an excellent tool for finding content on your
company's internal network. That said, if you specify unconstrained follow and crawl patterns,
the crawler may find all kinds of low value content across hundreds of servers on a typical
Intranet. An example of an unconstrained follow and crawl pattern is mydomain.com. Using
this as a follow and crawl patterns would cause crawling of all protocols and subdomains
without any active control or management.
When deploying a search appliance in a complex environment for the first time, it is best to
focus on the largest or most important content repositories, rather than trying to index
everything.
When to use continuous crawl or scheduled crawl

In most cases, Google recommends using continuous crawl mode for large indexes. Using
scheduled crawl mode is recommended only when the search appliance is crawling a small
number of hosts with limited document counts.
Adjust crawl speed using host load settings

In continuous crawl mode, the speed of the crawl can be adjusted on the Crawl and Index >
Host Load Schedule page in the Admin Console. The default host load of 4.0 may be too
aggressive in some cases. A host load of 4.0 essentially means that the Google Search
Appliance fetches four documents at a time from each server. Use host load to manage
serving capacity on your content hosts actively.

Initial crawl
During the initial crawl, actively monitor crawl diagnostics for content growth. Also, use this
time to manage content acquisition and corpus composition by fine tuning the follow and crawl
patterns. Unattended crawl of a new site might have unforeseen consequences.
If the search appliance is putting too much load on your servers (crawling too aggressively),
you can change the default host load or add rules for specific hosts or time periods. Some
examples of when you would set specific host loads are:
• Limit crawling to off-peak hours
• Limit crawl speed for "slow" hosts or hosts on slow network connections
• Crawl a new server quickly (and then drop the load down when complete)
Carefully choose rules for crawl and do not crawl

For "Follow and Crawl" and "Do Not Crawl" patterns, as well as any other place where URL
patterns can be used, it is best to use the most specific pattern possible.
Regular expressions are costly and can affect crawl and index speed. If you are using regular
expressions, you should optimize them for efficiency. For example, regexp:pdf$ is better than
regexp:pdf because the crawler only needs to check the end of the URL.
In most cases, the pattern definitions have little impact on crawl performance, but take care
when dealing with:
• Large indexes (> 1M docs)
• Large number of patterns, or very complex patterns against very long URLs
In General:
• Use static strings instead of patterns when possible
• Use patterns that match the smallest patterns possible
• Anchor patterns with start (^) or end ($) when possible
• Group similar patterns together when possible: param=(foo|bar|goo) is better than three
separate patterns
Feeds
For near real-time indexing, use feeds. For example, a publication company might need to
ensure that all content is searchable as soon as it is published. A content feed or metadata
and URL feed might be the most effective way to get this content into the index.
There are a number of cases where feeds can enhance search deployment. For a detailed
discussion of use cases, see “Feeds” on page 54.
Best Practices 105

Index reset
Resetting the index removes all content, so everything needs to be crawled again. Resetting
the index is not something you should normally need to do unless you are re-purposing a
Google Search Appliance or are advised to do so by Google Support.
One scenario where an index reset may be warranted is if you have a lot of unlinked content
that is still in the index. In some cases, links are removed from web pages but the destination
content is still available. The Google Search Appliance continues to crawl these "orphan"
pages because it knows about them and they will not be removed from the index unless a 404
is returned (or they are otherwise excluded).
If you need to reset the index, export all of the URLs in the index before performing a reset.
Collections
The master index of the Google Search Appliance can be segmented into multiple collections.
Collections are useful for enabling users to narrow their searches to specific content areas.
Collections can also be used to provide segmented search results.
Some examples of when collections may be useful include:
• You want to segment content into "Engineering," "Sales," "Marketing," "HR," and "All" and
to enable users to select which collection they wish to search.
• You want to provide the option for corporate users to search over public website content in
addition to corporate content.
Use dedicated service user identities to crawl protected content. Do not simply use the
administrator’s identity or an arbitrary user ID, as this might be prone to failure if the user
changes his password or leaves the organization.
Serving
Enabling query expansion

Query expansion automatically expands search terms to include other terms. Query
expansion modifies the actual query, and therefore the returned result set. For example a
search for "light" becomes a search for "light or lighter or lightest or lighting or lights."
In most situations, you should enable query expansion. Although query expansion can have a
good impact on search results relevancy and quality, it is disabled by default.

Stemming dictionaries can be enabled globally by using the Serving > Query Expansion
page in the Admin Console. Enable only the languages you need, then enable query
expansion for each front end by using the Filters tab on the Serving > Front Ends page.
Google usually recommends selecting "Full" query expansion, which applies the active
Google-supplied dictionaries along with your own.
Creating your own query expansion dictionaries is a great way to provide synonyms for
acronyms, jargon, and company-specific terms.
Using query expansion, KeyMatches, related queries, or

OneBox
Query expansion • Should usually be enabled on all front ends.
KeyMatches • Used like advertisements to display specific urls and text

for specific phrases & queries. URLs returned as
KeyMatches are not actually part of the index, and are
not controlled by the crawl pattern rules.
• External: Use KeyMatches to highlight new products or
specific areas of your site.
• Internal: "advertise" new employee benefits, highlight
top-level landing pages.
Related queries • Similar to spelling suggestions.

• Use related queries to guide users to a different query. If
a user searches for "cookies" a related query might
prompt: "You could also try snacks"
OneBox modules • "Federate" a query out to another service, bringing the

results back and including them with the normal search
results.
• Google.com uses OneBoxes to present real-time data
from a variety of sources (weather, stock quotes, flight
status, and so on) and you can use the same technology
on the search appliance.
• A common OneBox is employee directory search.
Best Practices 107

Front end stylesheets
There are three methods of processing search results from the Google Search Appliance for
presentation:
Page Layout Helper • Use this option when you don't need to do much
customization to the default stylesheet. You can add
your own logo, change the header and footer, as well as
basic results options.
XSLT stylesheet • Use this option when you want to serve formatted results
directly from the search appliance and apply your own
stylesheet. This enables you to customize every aspect
of the results pages. Also useful if you want to return
your own XML schema, RSS, or JSON.
XML output • If you exclude the proxystylesheet parameter in the

request URL, the search appliance returns raw XML.
This is useful if you have a multi-tier architecture, and
want to handle the presentation yourself.
Use the method that best suits your needs.
Security
Don't mix public and internal content on a public-facing machine. Even though it may be
possible to index internal (intranet) and external (website) content with the same Google
Search Appliance, keep that data separate if the search appliance can be accessed publicly.
Ongoing administration
While the Google Search Appliance does not typically require a large team to manage the
deployment, you need to carry out some regular administration tasks, including:
• Exporting and backing up configurations regularly, storing the configuration in a version

control system. See “Staging/Development environment” on page 62 for a recommended
configuration.

• Using reporting to optimize your search experience.
• Activate Advanced Search Reporting (ASR).

• Review ASR Reports for opportunities to enhance search effectiveness
• Review reports for missing content (crawl coverage)—you can use crawl diagnostics,
and as of software release 6.0, the Export URLs feature.
• Review reports for common query terms to identify query expansions, KeyMatches,
and related queries that should be added.
• Set up monitoring, as described in “Configure Monitoring” on page 30.
• Logging by means of syslog
• The Google Search Appliance provides the ability to send query logs to an external
syslog server. This can be especially useful if you have sever search appliances in a
load-balanced configuration and wish to aggregate the logs to one central place.
Administration Tips:
• Enable SSH only when needed for support.
• Regularly check your Crawl Diagnostics for crawl errors.
• Regularly check http://support.google.com/enterprise for updates.
• Actively seek new content to include in your index.
• Monitor your document count. If you are approaching your index limit, consider upgrading
to include new valuable content.
• Look for unexpected document volumes from specific repositories—this may indicate
unexpected behavior, such as multiple URLs for the same document if a session ID is
appended or similar.
Best Practices 109

Technical Solutions for Common Challenges Appendix B
This appendix presents some technical solutions for common challenges in the following
major areas of search appliance deployment:
• Crawling and indexing content
• Security and serving secure content
• Document relevancy
• Interfaces and front end customization
• Other areas
Google does not provide technical support for configuring servers or other third-party products
outside of the Google Search Appliance, nor does Google support solution design activities. In
the event of a non-Google Search Appliance issue, you should contact your IT systems
administrator. GOOGLE ACCEPTS NO RESPONSIBILITY FOR THIRD-PARTY
PRODUCTS. Please consult a product’s web site for the latest configuration and support
information. You might also contact Google Solutions Providers for consulting services and
options.
Crawling and indexing content

I am trying to index a repository that Google doesn't supply a connector to? What can I
do?
• Check to see if the repository supports access by means of http (or https)? If so, then
index using standard HTTP start URLs.
• Check to see if there is a partner who has a connector for that particular repository that
you can use to index the content.
• If there are APIs available you could write a connector, or use them to extract the content,
generate a feed, and push the content into the Google Search Appliance.
I need the Google Search Appliance to crawl my Portal, but the cookie is strange and
doesn't conform to the RFCs—how do I crawl?
111
• The Google Search Appliance is designed to support internet standards known as RFCs.
When a content source does not follow RFCs, you will need to manage non-standards
based implementation with supplemental technologies.
A common way to manage non-standard cookies in this context is to deploy a lightweight

proxy, that modifies the cookie as it passes through the proxy. This could be implemented
as an Apache server with a simple Python, Perl, or PHP class configured as a filter.
How can I explicitly specify the file types to be crawled rather than exclude what I do
not want to be crawled?
• This is a typical requirement in the case where file shares are to be indexed. Do this by
deleting everything from the "Do-not-crawl patterns" field and add a regular expression to
the crawl patterns that looks something like this:
regexpIgnoreCase:^http://host\\.domain\\.com/folder/
.*(.doc$|.xls$|.ppt$|.docx$|.xlsx$|.pptx$|.rtf$|.pdf$|.txt$|.htm$|.html$|/$).
Within the brackets you can explicitly specify the file types to be crawled separated with
the pipe sign. Remember that the sub-string /$ is mandatory in order to traverse through
the directories.
(Note that this may not work if the content is streamed by means of an application so that
the file extension is no longer part of the URL.)
How can I have the Google Search Appliance index and serve public emails and
messages from MS Exchange 2003?
• You can index all the MS Exchange content with a Google Search Appliance out-of-the-
box if you have Outlook Web Access (OWA) enabled. With OWA all emails and contacts
becomes accessible by means of HTTP and all is HTTP Basic protected by default (other
options are possible). This means if you set up the crawl patterns and the crawler access
appropriately, you can get everything into the index and serve it either with an AuthZ
check by means of a HEAD request or you can set up group policies respectively.
• The full set of instructions on how do this can be found at: http//docs.google.com/
View?id=dd6k8c37_41gkc6dwfj
I need to add URLs to be crawled to my Google Search Appliance dynamically. How can
I do this?
• While you can feed URLs into the Google Search Appliance, they have to already exist in
the Follow and Crawl Patterns. Therefore, in order to add them to the follow and crawl
patterns dynamically, you will have to use the Google Search Appliance Admin API to do
this, then you can either use the Admin API to add then to the start URLs path, or you can
create a web feed to push them into the search appliance.
I am not sure whether my forms authentication protected site can be crawled without
any problems. How can I find out?
• Check whether the login procedure conforms to the usual http standard:
1. login to your web site and copy the URL of one of your forms authentication protected
documents.
2. Close the browser and/or make sure you are really logged out.
3. Paste the URL into your browser in order to re-open the document.

4. Trace the http headers while you do that (for example, use the Firefox Add-on Live HTTP
headers or for MSIE use ieHTTPHeaders).
The browser should redirect you to the Log-in pages since you are not yet logged in. If
your server responds with "HTTP/1.x 302 Moved Temporarily" and a redirect specified in
the header field "Location: <the URL of the log in page>" to requests from an
unauthenticated user it behaves standard conformant. In this case the Google Search
Appliance will be able to get access to your protected documents. If your server responds
with a "HTTP/1.x 200 OK" and displays the login page (or any other non-standard
conformant way on order to display the login form) you need to find another way.
I am having problems setting up or crawling a website protected using forms based

authentication.
• Check to ensure that the login page does not use JavaScript. Google Search Appliance
forms auth wizard can tolerate forms with some basic JavaScript such as those perform
range check prior to submission. Such JS code normally is okay, as long as the form
submission itself is implemented as JavaScript style, such as "JavaScript:".
Also be careful when onSubmit() function is used as the form submission behavior would
be different for the wizard. If it does, create a login page that does not use JavaScript.
Also, be careful that the forms page, if it uses javascript, does not alter/add parameters
before submitting them. If it does, these will need to be adapted into the non-JavaScript
version of the login page.
As well, any hidden parameters, and so on will also need to be incorporated in order to
allow the Google Search Appliance to successfully login and access the website.
• When JavaScript code cannot be easily removed, it is possible to work around this using
additional tools, such as the Firefox add-on called Firebug. Use this to intercept the
request and manipulate the objects. This does not always work, but in cases when
particular static fields are to be added, it should work.
For example, some application may prefix the username with an internal code before it is
submitted, but the prefix is static in most cases, the add-on would be easiest to move the
wizard forward without needing a non-JS form.
• Also, some times the internal sites use SSL but either without a valid certificate or without
configuring the CA in Google Search Appliance. In this case, we can try to use plain HTTP
(provided that this is still supported and allowed). During search, the style sheet can be
customized so that the protocol of the results is then converted from http to https.
I have documents that are larger than 30M in size. How can I get these indexed?
1. Convert these documents to text (there are a number of freeware and shareware
applications available to do this).
2. Wrap them in HTML.
3. Apply a meta tag that has a unique name and has a value of the original file location.
This HTML document can then be indexed or fed in to the Google Search Appliance, thus
allowing the textual content to be searched.
Technical Solutions for Common Challenges 113

4. Modify the front end stylesheet to point to the original file location as indicated by the meta
tag, if it exists.
This will allow the content to be searched, and the original document to be accessed by
means of the results list. As well, be aware that the relevancy depends among other
factors on text formatting. Therefore, this solution might affect the relevancy.
I am trying to get the Google Search Appliance to crawl a URL contained within
JavaScript but the crawler won't pick it up. How can I get it?
• Use a jump page
• Use a feed
• Use anything other than crawling to make up the site coverage deficiency due to the use
of JavaScript, such as a web feed.
How can the Google Search Appliance index a personalized portal? What about a portal
that allows both the guest users and registered users using the same URLs?
• A dynamic application is in most cases template-based. Add google on/off tag to avoid
indexing redundant and/or contextual info such as header, footer, left nav, top nav, right
panel, and so on.
• Any personalized content fragment (such as greeting messages, message inbox portlet)
should be excluded from being indexed either by means of do-not-crawl patterns and
googleon/googleoff tags.
• If a URL is served for both guest and member with different behavior, the application
should accommodate the crawler's need to differentiate those versions. Ideally, the
crawler could start off with an extra parameter such as "&as=guest" or "&as=member,"
and the application should preserve this link throughout the application. A collection
should be generated based on the extra parameter in the URL patterns. And the front end
style sheet should strip them out when rendering results. (For security reasons, this extra
parameter should only be processed by the application if the requests are from know
Google Search Appliance IP addresses.)
I use a CMS system that is easier to crawl. But the content is published to a different
production system, which is not suitable for crawl (or not allowed due to load issues).
What are the things that I need to consider?
• URL conversion.
• Different security mechanisms. Google Search Appliance assumes that the security used
for crawl would also be used for authorization. Try to use policy ACL to work around this
issue.
I need to apply metadata to URLs that the Google Search Appliance is crawling before it
is indexed. How can I do this?
• Use a proxy when crawling and apply the metadata, based on programmatic rules, to the
data before passing it through to the Google Search Appliance.

How can the Google Search Appliance index content stored on Novell Netware?
• Web-enabling the file server will allow you to index the content. If you have a good web
server, such as Apache httpd, which can be configured for strong security, there shouldn't
be any security concerns. It can also be configured so that only the Google Search
Appliance's IP address can access it, making it completely inaccessible to any other
machine.
I have Novell Netware with lack of CIFS and web-enabled support. How can I integrate
the Google Search Appliance with this, by means of a connector or some other
mechanism?
• By utilizing code, which uses the Novell Java Libraries to check permissions against the e-
Directories (which has a concept called "effective rights"), you can crawl over CIFS
enabled drives. If you query for this on a per-document basis, you can get permissions. An
administrator will need to set it up - and there is a bit of trial and error getting the right
permissions setup because Effective rights comes from both the directory, and the parent
container.
• You can also use the instructions on How to Index and Serve Novell Netware File Servers
with a Google Search Appliance which can be found at: http://docs.google.com/
View?id=dd6k8c37_42ch8twqcg.
I want the Google Search Appliance to index content from Oracle Content Server/
Stellent. How can I accomplish this?
• If OCS/Stellent is used as a basic content management/web publishing system (for public

content), the Google Search Appliance can crawl through HTTP easily. Otherwise
Frontline Logic has a connector, available for a cost, which will enable you to crawl and
index the content securely.
How can I disable the indexing of part of a web page?
• By using GoogleOn and GoogleOff tags you can prevent all, or portions, of a web page
from being indexed. The full use of these tags can be found at: http://code.google.com/apis/
searchappliance/documentation/60/admin_crawl/Preparing.html#pagepart.
How can I limit the crawl to the top n levels of my site?
• The following URL patterns will include the top three subdirectories on the site
www.mysite.com:
regexp:www\\.mysite\\.com/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*/[^/]*$

Security and serving secure content
I need the Google Search Appliance to serve secure content from system x, but system
x is really slow. How can I work around this?
• This problem primarily has an impact on serving. Crawling executes in the background, so
while this has an impact on the speed of content acquisition, it does not have an impact on
user experience. To improve performance during search, you could use policy ACL's and
early binding to allow the Google Search Appliance to manage authorization in a
performance-optimized way.
When the built-in UI is served through secure HTTP (for example, access=[a|s]), and the
interface has customized page elements, for example, a logo - served from a non-
secure HTTP source, web browsers will usually display a warning to alert that secure
and non-secure page components are being displayed every time the page is loaded. Is
there any way to suppress the warning?
• Either the images have to be served from a secure server, or the browser's options will
have to be set to suppress the warning, but that would require a change to every users
browser and it isn't advisable.
We require a unified search, across multiple secure repositories, on one Google Search
Appliance. How can we implement this with silent authentication or single sign-on?
• The Google Search Appliance Valve Security Framework (http://code.google.com/p/gsa-

valve-security-framework/) was designed to answer both of these issues. It exposes a
global authentication capability to the search user and then loads transparently the sets of
credentials that are relevant to each indexed sources. It is a framework that can easily be
extended to support the specifics of new repositories in terms of authentication and
authorization processes.
I have a specific page (or pages) indexed into my Google Search Appliance that I would
like to remove. How can I accomplish this?
• If you want this page visible in other front ends, then you can force a front end to ignore it,
by adding this URL to the Remove URLs tab for that specific front end.
• If you would like to completely remove this document from the Google Search Appliance's
index, then you can use a delete feed. For more information on creating a feed which will
delete content, see the appropriate section in the Google Search Appliance Feeds Guide:
http://code.google.com/apis/searchappliance/documentation/60/feedsguide.html#removing_url
• You can use the remove or recrawl URL tool in the Google Search Appliance Admin
Toolkit (http://code.google.com/p/gsa-admin-toolkit/).
• You can add it to the do not crawl URL's.
• You can remove it from specific collections.
When indexing by means of SMB the directory 'pages' get index and can appear in the
results. How can I make sure that these pages are not shown in the search results?
• Put ./$ as an exclude pattern for a collection and the directory 'pages' will not be part of the
collection.

How can the Google Search Appliance crawl content on the Apple File System?
• You can use SMB to crawl the content. The only real issue to watch out for is the fact that
the Mac OS won't initiate the SMB processes until someone initiates a connection.
How can the Google Search Appliance index a sitemap?
• Using the free tool "GSA Feed Manager" (http://code.google.com/p/gsafeedmanager/) you

can grab sitemaps, and it will feed them directly into one or more Google Search
Appliances.
Document relevancy
How can the Google Search Appliance sort the results by other criteria than relevancy
and date?
• It is exactly the purpose of a search engine to sort the search result by relevancy.
Everything else is rather the output of a data base query. Unlike the Google web search
the Google Search Appliance can sort results also by date.
If you need to sort the results by any other numeric value you have, you can abuse the
date sort feature. To do so convert the value to an ISO-8601 date format (YYYY-MM-DD)
and insert it to a meta tag in your document. The lowest value must not be earlier than
January 1, 1970. Then set up the respective name of the meta tag in the section
"Document Dates" in the admin console. The Google Search Appliance considers the
value of this meta tag as the document date and can sort it by this value.
I want to promote a URL to the top of the results. How can I do this?
• Use KeyMatches.
• Implement Do-It-Yourself-KeyMatches from Google Enterprise Labs (added to the search

appliance in release 6.0 as a beta feature called “user-added results” ).
• Create a result biasing policy which increases the relevancy of documents based upon the
URL. Attach this policy to the appropriate front end.
How can I increase the relevancy, in the search results, of more recent documents?
• Create a result biasing policy which increases the relevancy of documents based upon the
date that they were last modified. Then, attach this policy to the appropriate front end.
How can I modify the relevancy of specific URLs, either increasing or decreasing it?
• By using the Google Search Appliance Ranking Framework (http://code.google.com/apis/

searchappliance/documentation/60/admin_searchexp/
ce_improving_search.html#rescoringframework) you can:
• Specify rescoring for results that exactly match specific URL prefixes
• Influence results rankings programmatically for an unlimited number of URL prefixes

Interfaces and front end customization
I just changed my front end but when I view my results, they still show the old one.
What is wrong?
• A front end will only reload itself into memory every 15 minutes (or even longer).
Therefore, in order to force a reload of the front end, you must use the parameter
proxyreload=1 in the query URL at least once after the style sheet has been modified. This
parameter should only be used for a refresh during development and not in production as
it will negatively impact the performance of the Google Search Appliance.
How can I give developer access to the front end so that they can make changes
without being able to affect my KeyMatches, and so on?
• You can create two front ends, using some naming convention. For example, use the one
called "my_frontend" to manage KeyMatches, related queries, filters, remove URLs, and
OneBoxes (collectively known as "client"). Then create another one called
"my_frontend_ss" to manage the user interface (or output as it is denoted in the Admin
Console), which is referred to as "proxystylesheet".
• Give the UI developer access to "my_frontend_ss" only so they can update their style
sheet there.
• Retain control over "my_frontend" where user's search experience is managed by a non-
UI developer.
• Make modification to the application so it uses two front ends

(client=my_frontend&proxystylesheet=my_frontend_ss). By default, it would be
client=my_frontend&proxystylesheet=my_frontend.
How can I reuse some front end configuration?
• If you want to make use of most configurations in a front end for different user interfaces,
while you want to have different options for query expansion policies and/or result biasing
policies, do not create multiple front ends for this. Use "entqr" and "entsp" instead.
Other areas
I don't want documents with credit card numbers or SSN (or some other pattern) to be
returned in a search. How can I ensure this?
1. Export all crawled URLs.
2. For each URL, run a third-party program to make sure they are of good quality (that is, no
bad words, no sensitive information).
3. Programmatically add bad URLs into do-not-crawl list.
4. Repeat as often as needed.
or

• You can also filter on the search end of things, using OneBox as your regular expression
parser. If a user searches for a forbidden pattern (SSN, and so on), the OneBox triggers.
The OneBox provider is nothing more than an empty responder. The OneBox XSLT has
JavaScript to redirect the user to a standard "don't search for this, please" page. Or you do
something else in XSLT to prevent results from showing up.
• You can have the Google Search Appliance crawl through a proxy, and have the proxy
block content that matches specific patterns.
When using federation (dynamic scalability) between two or more Google Search
Appliances, do I require 'real' signed certificates?
• While the federation between Google Search Appliances can be done using the self-
signed certificates, we recommend that customers do not use them, but rather, use their
own 'real' signed certificates.
How can I see the XML that the Google Search Appliance is sending back before it gets
transformed?
• For results remove the proxystylesheet parameter and value. For example:
• http://gsahost.domain.com/
search?q=query&btnG=Google+Search&access=p&client=default_frontend&output=
xml_no_dtd&sort=date:D:L:d1&entqr=0&oe=UTF-8&ie=UTF-
8&ud=1&site=default_collection
• For dynamic results clustering, you can directly query the Google Search Appliance for the
XML output. For example:
• http://gsahost.domain.com/
cluster?q=query&site=default_collection&client=default_frontend&coutput=xml
How can I troubleshoot my Google Search Appliance because something isn't working
as expected?
• Have a look at the Google Search Appliance Admin Toolkit (http://code.google.com/p/gsa-

admin-toolkit/) which is a package of tools for Google Search Appliance Administrators, or
sniffer (http://code.google.com/p/gsa-http-sniffer/).
• This package includes numerous monitoring scripts, reverse proxies, admin scripts, and
so on.
How can we search across documents in different languages?
• Using the Cross Language Enterprise Search (http://code.google.com/p/cross-lang-ent-

search/) plugin for the Google Search Appliance enables you to convert your query from
one language into one or more different languages and search across multiple different
languages within your corpus of indexed content.
How can I toggle languages for a multi-lingual site?
• If absolutely need this, one should use a custom parameter to indicate a language choice
(such as "en" or "fr" or "es", and so on) for the search interface. The application should
receive that language preference and convert it into Accept-Language request header to
the Google Search Appliance.

How can I give non-administrators rights to run reports from production Google Search
Appliances without providing them with administrative access to the Google Search
Appliance itself?
• Create a simple HTML page, that calls a back-end program that uses the Admin Console
API to generate and export reports.
• Sync the Google Search Appliances logs with an external syslog service and create your
own reports.
• Note that this user will be able to change collection definitions.
How can I integrate the Google Search Appliance into a non-web application?
• The Google Search Appliance will accept HTTP requests, and can return XML (or other
formats after having been transformed by means of an XSLT). The returned results can
then be parsed by an application, written in the language of your choosing, and then used
for whatever purpose the application requires.
How can I reuse a site definition?
• Say PR manages a collection for "corp_cnt," marketing manages a second collection for
"mktn_cnt," engineering manages a third collection for "engr_cnt." There are two user
groups, one need "corp_cnt" and "mktn_cnt," and the other need "corp_cnt" and
"mktn_cnt." In this case, it is better not to create two collections for these two user groups,
because there are three distinct owners of these content. So, create three collections as
above. When search is done, use "site=corp_cnt|mktn_cnt" and "site=corp_cnt|engr_cnt"
separately.

Enterprise Search Satisfaction Survey Appendix C
1. What is your role?
• Engineering
• Finance
• Human Resources
• Sales
• Marketing
• Research
• <customer to fill in>
2. What percent of your time is spent looking for information?
• More than half my time
• A quarter to half my time
• 30 minutes to 2 hours per day
• 10 minutes to 30 minutes per day
• Less than 10 minutes per day
3. How often does your result show up in the top 10 (first page)?
• 100% of the time
• 80% of the time
• 50% of the time
• 20% of the time
• Never
121
4. How often does your result show up as the first result?
• 100% of the time
• 80% of the time
• 50% of the time
• 20% of the time
• Never
5. How often do you click on one of the Recommended Links (the shaded key matches at
the very top of the results)?
• Whenever I see one
• Sometimes
• Never
6. What would you like to see improved?
• Make the results come back faster
• Make the results more relevant
• Add more content
7. How is the query response time?
• Excellent
• Sufficient
• Unacceptable
8. Which content sources would you like to see indexed (added to the search results)?
• _________________________________
• _________________________________
• _________________________________
• _________________________________
• _________________________________
• _________________________________
9. Have you ever had documents that you knew existed but couldn't find them with
search?
• Yes
• No

10. How satisfied are you with your overall search experience?
• I love it—it works great!
• It's alright
• I'm not happy - it could be improved
• It's the worst search ever
11. What would you like to see added?
____________________________________________________
____________________________________________________
____________________________________________________
Enterprise Search Satisfaction Survey 123

Other Resources Appendix D
Google Group/User Forum
http://groups.google.com/group/Google-Search-Appliance
Google Solutions Market Place
http://www.google.com/enterprise/marketplace/
TSSG - Technical Support and Service Guidelines
https://support.google.com/enterprise/terms
Google Enterprise Support site
https://support.google.com (password required)
Google Search Appliance public documentation
http://code.google.com/apis/searchappliance/documentation/index.html
Google Search Appliance training site
http://www.learngsa.com
Remote access documentation
http://code.google.com/apis/searchappliance/documentation/remote_access/remote_access.html
125

GSA Deployment Guide

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

GSA Deployment Guide

Uploaded by

Copyright:

Available Formats

Google Search Appliance

© Copyright 2009 Google Inc. All rights reserved.

2 Google Search Appliance Deployment Guide

Chapter 2: Understanding Your Deployment.................................................... 9

Chapter 3: Planning for Successful Deployment ........................................... 15

Chapter 4: Project Scenarios ........................................................................... 31

Chapter 5: Deployment Architecture ............................................................... 45

Chapter 6: Deployment Scenarios ................................................................... 61

Chapter 7: Post Deployment ............................................................................ 83

Chapter 8: Putting the User First ..................................................................... 95

Appendix A: Best Practices............................................................................ 103

Appendix B: Technical Solutions for Common Challenges ........................ 111

Appendix C: Enterprise Search Satisfaction Survey ................................... 121

Appendix D: Other Resources ....................................................................... 125

4 Google Search Appliance Deployment Guide

Welcome to the Google Search Appliance

Right for your business

Continuous increase in ROI

About this guide

6 Google Search Appliance Deployment Guide

• Approaches for planning and executing a Google Search Appliance deployment

• Architectural best practices

• Techniques to increase adoption and user satisfaction

• The foundations of a successful deployment

• Ensuring your deployment is optimized for support

• Designing an architecture to meet your technical and business requirements

• Supplementing the search appliance with enriching technologies

• Enabling core features to maximize value

Who this guide is for

How to use this guide

It is recommended that implementation of a search solution proceed with the support of a

Where to find the latest version of this guide

How to provide comments about this guide

Disclaimer for Third-Party Product Configurations

8 Google Search Appliance Deployment Guide

• Understanding your users, as described page 10

• Understanding your content, as described on page 11

• Understanding your business processes, as described on page 12

• Understanding your architecture, as described on page 12

Are there different groups of • Do they require specific search capabilities?

How will the users typically • Through a portal?

How can advanced search • Do you need mobile functionality?

10 Google Search Appliance Deployment Guide

To understand your content, consider the following questions.

Understanding Your Deployment 11

Understanding your architecture

12 Google Search Appliance Deployment Guide

Understanding Your Deployment 13

• Capturing requirements, described on page 16

• Identifying phases, described on page 22

• Defining success criteria, described on page 28

• Transitioning to business as usual, described on page 28

For scenario-based example deployment programs, see Chapter 4, “Project Scenarios.”

• User requirements, described in the following section

• Content and security requirements, described on page 18

• Performance and scalability requirements, described on page 20

• Administration and reporting requirements, described on page 22

• Usability, described in the following section

• Breadth and depth, described on page 16

• Communication and feedback, described on page 17

As you identify usability requirements, consider the following issues:

• The Google Search Appliance offers many simple-to-implement, on-box usability

• What speed requirements do users have?