Article information:
To cite this document: Marianne Stowell Bracke, C.C. Miller, Jae Kim, (2008),"Adding value to digitizing with GIS", Library Hi
Tech, Vol. 26 Iss: 2 pp. 201 - 212
Permanent link to this document:
http://dx.doi.org/10.1108/07378830810880315
References: This document contains references to 13 other documents
The current issue and full text archive of this journal is available at
www.emeraldinsight.com/0737-8831.htm
Adding value to digitizing with GIS
C.C. Miller
Purdue University Libraries EAS Library, West Lafayette, Indiana, USA, and
Jae Kim
Introduction
In November 2007 Purdue University's Agronomy Department will celebrate its
centennial. A number of events and programs are planned that will revisit
the history of agronomy, including Purdue's contribution, while promoting some
understanding and awareness of the current state and future of the field. Agronomy
has changed much over the last 100 years through the use of progressively more
sophisticated technology and the increases in the understanding of soil properties.
Purdue Libraries will support the Agronomy celebration in a number of ways, but the
focus of this paper is a project that digitized the 1906 Soil Survey of Tippecanoe
County, Indiana, extracted its contents into full-text and geospatial datasets, and then
built them into a web application designed to approximate but improve upon the way
soil surveys are typically used by soil scientists in their research and field work. The
result is a mashup, of sorts, that pulls together pre-war analog agronomy and
twenty-first century digital techniques and technologies.
This paper will first provide some context for soil surveys, including their
importance to soil scientists and agronomists as well as the nature of the publications
themselves. Following this it will outline the challenges and charges of the project, its
ultimate goal, and the trajectory of the work by Purdue Libraries. This work includes
the process developed and the architecture and functionality of the alpha pre-release of
the online application. Finally, it will discuss future directions for, and
potential impacts of, the project, including some reflection on where this
project fits into novel developments in interdisciplinary librarianship such as GIS.
Soil and soil surveys
Soil is more than the dirt beneath our feet. It is a living membrane around the earth and
features a complex ecosystem of mineral and organic matter, liquids, and gases. A
type of soil and its geographic location are important information to farmers, engineers,
land developers and agronomists, among others, as they plan current and future uses
of the land. The physical, chemical and biological properties of soil vary highly across
the spatial extent of even small regions. Soil also varies downward, through co-reliant
strata that result from changes in the earth over long periods of time. The ability of
soils to tell epic stories about the earth has perhaps never been so important as in the
last century, as this country experienced dramatic shifts in land use from the rapid
growth of cities, the decline of family farming, greater productivity through intense soil
management practices, and the rise of massive commercial agriculture (Richter and
Markewitz, 2001).
Soil surveys are very detailed reports that describe the physical and chemical
properties of the soil over a given area, typically a county. Surveys are usually
published for each county every 40-50 years, the interval at which soils are
re-examined and sampled. As detailed catalogs of the telling elements of soil, soil
surveys are an important tool in understanding the changes in and current status of
our earth and the activities that engage and alter it. Surveying began in the USA in the
last part of the 19th century, at a time when surveying was a very manual and
time-consuming endeavor. Surveyors walked, boated, rode horses or buggies or chuck
wagons or bicycles into the field to gather data with a heavy complement of plane
tables, augers, picks and shovels. New techniques were added over the years, among
them the use of aerial photography for soil mapping, that greatly improved the
accuracy and level of detail and sped up the surveying process itself (History of Soil
Survey, 1999).
Survey publications present information in a variety of ways, including narrative
descriptions of the area and soil types; suggested use and management of soils; and
interpretive tables, aerial photography, and maps depicting the zones within a county
dominated by a given soil type. Over the past century, as soil knowledge has expanded
and surveying technology has improved, soil surveys have become more sophisticated
with longer text and greater numbers of photos and maps. The 1906 Tippecanoe
County Soil Survey, for example, is 40 pages long with a single map. The updated
survey of 1959 is 117 pages with 13 maps. By 1998, the Tippecanoe survey ballooned to
350 pages with 84 maps.
The Natural Resources Conservation Service (NRCS), an agency of the United States
Department of Agriculture (USDA), publishes the surveys and coordinates data
collection across the country. Although the NRCS's goal is to publish all parts of the survey
online, new surveys are being published partly online and partly in print (GPO Listserv,
2007). Online materials have the added benefit of a sophisticated array of web tools to
allow researchers to view modern soil data in numerous ways. Furthermore, data can
be downloaded to desktops or to rugged tablet PCs and used in the field.
Why reclaim historic surveys?
Soil surveys tell stories with multiple timelines. Any one publication catalogs the
properties of the soil in one county for a given time, but taken collectively surveys can
illustrate how glacial action thousands of years ago changed the soil, as great layers of
ice compressed the earth beneath and then displaced and dispersed silt and sand as they
melted. Looking at historical data of even the past 100 years, researchers can see how
and where the soil has changed, a useful tactic in predicting how certain interactions
with the land (farming, civic development, recreation) might affect the soil in the
future. But there is the rub: looking at data from 100 years ago is more difficult than it
sounds. Most work with soil is now done with electronic data, and even if one wanted
to use the 1906 survey in one's work, one would have to work from fragile, brittle analog
materials and, at some stage, do some manner of digitization. The Libraries
felt that the unique combination of traditional text and map data made the soil surveys
especially good candidates for a fully digital resurrection, one that could not only make
the data available to agronomists electronically, but also boost the discovery and
consumption of them by a more general audience.
Because of the continued usefulness of soil data, it was important to extend the life
of this project beyond the anniversary celebration itself. To that end, this project
intends to provide access to, preserve, add value to, and expose to the modern research
workflow a collection of materials that have much to say about soil, land use, human
development, husbandry, and a collective understanding of the land. Unfortunately,
the use of these materials had atrophied as they became increasingly dusty, forgotten
analogs of more modern data. By resurrecting the materials into a rich,
multi-functional web environment rather than one monolithic PDF file that perfectly
duplicated the original, the materials could be introduced into the research workflow
and the document's functionality could be constructed with specific uses in mind.
Methodology: personnel and narrative phase
The Libraries' team for this project is led by the Agricultural Sciences Information
Specialist and the Geographic Information Systems (GIS) Specialist. They are
supported by two of the GIS Specialist's graduate students (map digitization) and
members of the Libraries' Archives and Special Collections faculty and staff (scanning
and metadata). The first task for this team was to learn just how soil surveys are
usually consumed. Conversations with an ad hoc council of Purdue agronomists
revealed that survey documents are typically used map-first. That is, one locates an
area of interest on the map to discover which soil type(s) cover that region, and then,
armed with a given soil type or types, returns to the narrative portion of the
document to investigate those soils. The narrative portion of the document is organized
into chapters by soil class. Each soil in the county gets a chapter in which its
characteristics and suitability for different uses are discussed with occasional asides
for statistics, charts, images and comparisons to other soils. The map, naturally, is
organized spatially. Though there are incidental elements including roads, political
boundaries, and survey grids, the chief attribute of the map is its continuous surface of
soil zones, each of which matches one of the soil classes that form the backbone of the
narrative. In order to link these two document components together, at the very least
that shared element, the soil class, needed to be fully exposed.
For the narrative portion, the process of exposing the soil classes used common
digitization practices: the pages were scanned at 600 dpi and ingested into the
Libraries' institutional repository, a CONTENTdm installation, where they were made
available and accessible via human-written qualified Dublin Core metadata, a
full-text index generated by a round of OCR, and several CONTENTdm gadgets
such as query term highlighting, resolution scaling, and PDF export
(http://earchives.lib.purdue.edu/u?/TippeSoil,126).
Methodology: map phase
Exposing soil classes on the map was a much more involved operation, as the target
unit was not a typeset string of text but a highly-variable geometric shape, drawn by
hand in 1906 and available now on a yellowed, 24" × 30" page. The following section
outlines the process used to extract spatially-aware, classified
soil data from this original analog map sheet.
Methodology: map phase: georeferencing
Although the scan of the text portion of the document was performed using Purdue's
own copy of the Tippecanoe survey, that document's map had been used to the point of
disrepair, and the team kindly thanks the University of Illinois at Urbana-Champaign
for loaning Purdue Libraries its copy of the map for scanning. This scan, at 600 dpi
and 32-bit color, generated an uncompressed .tif of roughly 730 MB. This image was then
loaded into ESRI's ArcInfo GIS for georeferencing. Georeferencing (the process of
taking raw image data and manually fixing it to its proper place in some known
geographic coordinate system) was by far the most time-consuming portion of the
process. It did, however, enable what is arguably the most useful result of the project:
the ability to import these old data into a modern GIS, perfectly overlaying or
overlaid by modern iterations of soil or other data. To complete the georeferencing
process, ArcGIS's Adjust algorithm was used, which combines polynomial
transformations and Triangulated Irregular Network (TIN) interpolation in an
attempt at both global and local accuracy
(http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?id=2698&pid=2689&topicname=Georeferencing_a_raster_dataset).
In total, 330 control points (locations on the soil image that correspond to
identifiable locations on a map with established accuracy) were placed during this
process, resulting in a Root Mean Square (RMS) error (a measure of the difference between
known and unknown locations) of zero. It is useful to reiterate here that accuracy is a more
nebulous concept for soil data than it is for other types such as a GPS tracklog or even
road data. Soils on maps are represented as edged, non-overlapping polygons which
are themselves estimates based on spot samples. Real soil is never so finite, of course,
so there is a certain amount of fuzziness and interpolation built into a soil datum to
begin with which is further exacerbated by the crudity of 1906 instrumentation and
understanding. In other words, getting the 1906 map to line up perfectly with modern
map data is no real favor to the true accuracy of the resulting digital map. This is a
useful self-critique to note, however, as reducing the time spent placing control points
during the rectification process will significantly reduce the amount of time it takes to
process future documents. That said, mapping, surveying, and other soil survey
Figure 1. Close-up of an obvious member of the black classification
could then be filled in more or less automatically using a procedure that will be
discussed later in this paper (see Figure 1).
The presence of the black classes necessarily affected the parameters given to and
guessed by the segmentation process. Segmentation in Definiens Professional is an
iterative, heuristic image optimization algorithm that considers several input
parameters in an attempt to minimize local heterogeneity in favor of identifying
more logical, higher-order homogeneous objects (Definiens Professional, 2006). Scale is
arguably the most important of these parameters, as the growth of any object
("growth" here means the gathering of pixel after pixel until it is determined that the
last pixel of a given class has been reached and the next one belongs to a different
class) stops if it exceeds the scale input parameter. Several digitizations were
attempted with several sets of input parameters before a setting was found that best
captured both the soil zones and the black classes (see Figure 2).
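The role of the scale parameter can be illustrated with a toy region-growing routine. This is a minimal sketch and not the algorithm used by Definiens Professional: the function name `grow_regions`, the 4-neighbour rule, and the use of a grey-value range as the heterogeneity measure are all assumptions made for illustration. It shows only the general behaviour described above, namely that an object stops growing once adding another pixel would exceed the scale threshold.

```python
from collections import deque


def grow_regions(image, scale):
    """Label 4-connected regions whose grey-value range stays within `scale`.

    `image` is a list of rows of integers. A pixel joins the region being
    grown only if the region's (max - min) value would not exceed `scale`.
    Returns the label grid and the number of regions found.
    """
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    next_label = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if labels[r0][c0] is not None:
                continue  # pixel already belongs to an earlier region
            lo = hi = image[r0][c0]
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] is None:
                        v = image[nr][nc]
                        # Growth stops here if the pixel would push the
                        # region's heterogeneity past the scale threshold.
                        if max(hi, v) - min(lo, v) <= scale:
                            lo, hi = min(lo, v), max(hi, v)
                            labels[nr][nc] = next_label
                            queue.append((nr, nc))
            next_label += 1
    return labels, next_label


# Hypothetical 3 x 4 grey-value patch standing in for a piece of the scan.
scan = [
    [10, 11, 12, 200],
    [10, 12, 13, 201],
    [90, 91, 13, 199],
]
_, n_small = grow_regions(scan, scale=5)    # tight threshold: several objects
_, n_large = grow_regions(scan, scale=250)  # loose threshold: one object
```

With the small threshold the patch splits into separate objects, while the large threshold lets everything merge into one, which is the trade-off that had to be tuned by trial.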
The critical step in the segmentation process was classification, wherein samples of
map data were manually added to or subtracted from each temporary soil classification
in order to train the software's interpretation of the map data. Not only was this an
important step toward a more accurate final result (where vector soil polygons equate
very closely to the original image's soil zones), it also provided the most return on
invested time. Taking care and having the patience to train Definiens Pro well at this
stage helped it generate a vector version of the map that had the most efficient
combination of smoothed, encompassing polygons (large, continuous soil zones that
swallow up very small, neighboring patches that are in truth of the same soil type
but which appear slightly differently on the scan due to the vagaries of hand-drawing a
map or from blotches, creases in the paper, etc.) and small, but distinct patches of soil
that are indeed supposed to be separate from larger neighbors. In fact the total
accuracy report for the map following classification was 99.48 percent, with all classes
at or above 97 percent save two at 93 percent and 76 percent, respectively.
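A high total accuracy can coexist with a weak class when that class is small. The sketch below illustrates this with a confusion matrix; the pixel counts are invented for illustration and are not the project's figures, and the per-class measure used here (the fraction of each true class that was assigned correctly) is an assumption about how such a report might be computed.

```python
def accuracy_report(confusion):
    """Return (overall, per_class) accuracy from a square confusion matrix.

    confusion[i][j] counts samples of true class i that were assigned to
    class j. Per-class accuracy is the correctly assigned share of each row.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return correct / total, per_class


# Hypothetical counts for three classes: two large zones and one small one.
counts = [
    [5000, 10, 0],
    [5, 4000, 0],
    [12, 12, 76],
]
overall, per_class = accuracy_report(counts)
```

Here the small third class is right only 76 percent of the time, yet it contributes so few samples that the overall figure stays above 99 percent.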
Figure 2. Close-up of the map image before and after segmentation
Figure 3. Gaps of unknown soil type (areas in white) left by removal of black classes
The topology was established using a cluster tolerance parameter (i.e. how close do
features need to be in order to be included in the rule?) of 25 m and one topological rule.
Too much cluster tolerance and the carefully-segmented polygons get fairly routinely
warped; too little and almost as many gaps between the soil polygons would remain as
before the process started. The one, lonely rule dictated that the soil polygons Must
Not Have Gaps. Validation of the rule generated an inventory of every blank space in
the entire dataset, which counted well into the 400s. Having to fix these gaps by hand
would have legitimately derailed the viability of this project, and this gloomy prospect
warranted an investigation of several ways to automate the process of closing the gaps
between polygons left behind by the removal of the black classes. Ultimately the best
way was to have ArcGIS force fix the errors by filling in gaps with additional
polygons. Where two polygons had not met to share a border, for example, a new, filler,
shape would be inserted to close up the surface. In this way, the highly accurate
borders of the polygons generated out of segmentation would be preserved. However,
the filler polygons had no class, as it made no sense to assign a default soil class to
these new shapes upon their creation since there was no way to predict to which of
their neighboring soil classes the new shapes belonged.
To assign soil classes to these filler objects, the filler shapes were converted into
points, each of which represented a filler object's centroid. These centroids were then
able to act as vessels, in a way, of soil class attributes. A spatial join transferred to each
point the soil class of the polygon to which it was closest. Another spatial join then
transferred that attribute to the filler polygon from which the point was derived.
Assigning the class of the nearest polygon is not a foolproof way to accurately identify
what soil class was actually displaced by the black classes, but a manual spot-check
of those newly-reclassified filler polygons against the original map found that a
great majority were right.
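The centroid-and-nearest-neighbor logic can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the ArcGIS spatial join the team ran: the function names, the dictionary layout, and the class labels "A" and "B" are hypothetical, and polygons are simplified to single rings of (x, y) vertices.

```python
import math


def centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    a = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * a), cy / (3 * a)


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_polygon(p, ring):
    """Distance from p to the nearest edge of the polygon boundary."""
    edges = zip(ring, ring[1:] + ring[:1])
    return min(point_segment_distance(p, a, b) for a, b in edges)


def classify_fillers(fillers, soils):
    """Give each filler polygon the class of the soil polygon nearest its centroid."""
    classes = []
    for ring in fillers:
        c = centroid(ring)
        nearest = min(soils, key=lambda s: distance_to_polygon(c, s["ring"]))
        classes.append(nearest["soil_class"])
    return classes


# Two hypothetical soil polygons separated by a gap, and two filler strips.
soils = [
    {"soil_class": "A", "ring": [(0, 0), (10, 0), (10, 10), (0, 10)]},
    {"soil_class": "B", "ring": [(11, 0), (21, 0), (21, 10), (11, 10)]},
]
fillers = [
    [(10, 0), (10.4, 0), (10.4, 10), (10, 10)],
    [(10.6, 0), (11, 0), (11, 10), (10.6, 10)],
]
filler_classes = classify_fillers(fillers, soils)
```

As in the text, the result is only as good as the nearest-neighbor assumption: a filler whose centroid sits closer to the wrong neighbor is misclassified, which is why a manual check remains useful.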
One last run over the map with a human eye helped to properly classify
misclassified filler polygons and fix small errors or anomalies in the original soil
polygons (mostly border jags where a road or letter had distorted the edge). All
disjointed polygons of the same soil class were then dissolved into whole, but
multipart, polygons. At that point the soil extraction was finished, FGDC metadata
were written, and everything was moved to a PostGIS database (PostgreSQL with
geospatial capabilities) to be drawn out and rendered by a geodata server.
With the data suitably prepared it was time to focus on building up interaction
between the new map layers and the soil survey narrative. MapServer is the rendering
engine for this project, as it is a well-established, open source, relatively easy-to-use
geodata server and it supports a number of open standards and many modern data
formats (http://mapserver.gis.umn.edu/). Although there are a number of
pre-fabricated or templated interfaces that display MapServer output and allow
additional functionalities to be coded in, in order to stay abreast of late developments in
user interfaces to geospatial data in the context of a swift Web 2.0, as well as build very
specific, unusual functionality (e.g. linking from map data to text), an additional layer
of software called ka-Map! was placed between MapServer and the user
(http://ka-map.maptools.org/). ka-Map! is a young, open source JavaScript application programming
interface (API) that tiles and precaches MapServer output and thus gives MapServer
output a faster, more agile interface more akin to Google Maps and Yahoo! Maps, but
potentially with more flexibility. For this project, using ka-Map! promised an improved
user experience and opened up additional development avenues.
Coding by project staff focused on linking the map's soil polygons to the
corresponding sections of the narrative in eArchives, the CONTENTdm-driven module
of the Libraries' institutional repository. Rewriting ka-Map! code to account for this
was not an altogether difficult process; with minor alterations to PHP and JavaScript
functions and some interpretation of the URL parameters used internally by the
CONTENTdm installation it was possible to send map attributes as search terms to
CONTENTdm and let that system behave as though the terms came from its own web
forms. Querying or keyword searching of the map data identifies a soil class or classes
and presents three links. Clicking the first zooms the map to the given soil. The second
link dynamically constructs a URL that calls on the narrative document in eArchives.
CONTENTdm constructs an image from the narrative documents using its own word
highlighting and scalable resolution and that image gets returned to the map interface
for the user to preview. A click on this preview image sends them on to that page of the
document in the native eArchives interface. The third link bypasses a preview and
goes straight from the map interface to the first search hit of the name of the identified
soil class within the narrative document (see Figure 4).
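The link-building step amounts to turning a map attribute into a query string. The project did this in PHP and JavaScript inside ka-Map!; the sketch below uses Python only to illustrate the idea. The endpoint, the parameter names `collection` and `term`, and the soil class name are hypothetical, since the actual CONTENTdm URL parameters are not given here; only the collection alias `TippeSoil` comes from the repository URL cited earlier.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real CONTENTdm search URL is not given in the text.
REPOSITORY_SEARCH = "http://repository.example.edu/search"


def narrative_search_url(soil_class, collection="TippeSoil"):
    """Build a search URL that passes a soil class name as the query term."""
    # Parameter names are placeholders for whatever the repository expects.
    params = {"collection": collection, "term": soil_class}
    return REPOSITORY_SEARCH + "?" + urlencode(params)


# "Example silt loam" stands in for a soil class returned by a map query.
url = narrative_search_url("Example silt loam")
```

Because the repository receives an ordinary search request, it can apply its own term highlighting and page scaling without knowing that the term originated from a map click.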
This alpha preview was presented to an ad hoc council of soil scientists and
agronomists in April 2007. Interface enhancements, on-map metadata display, and
supporting documentation were still underdeveloped at this time, but the reaction from
this group was positive and important to the future development of the project. The
final section of this paper will discuss this reaction and present a kind of prospectus for
work that will extend beyond the November 2007 Agronomy Centennial.
Future directions
Contrary to team expectations, the agronomists did not ask for additional
map-to-narrative functionality or even a link from narrative back to map. Instead,
they asked for:
(1) more map layers, from several eras of soil study and soil mapping technological
development and of germane historic interest; and
(2) a greater assisted ability to connect map data with non-map data.
Figure 4. The result of a map query: preview of the narrative is drawn into the map interface with the first found instance of the soil class
The former was no real surprise. Those familiar with GIS are already aware that one of
the great powers of GIS is the ability to pull together geospatial data from different
eras and different sectors of the information universe. The second request illustrated
the savvy of the agronomists and confirmed a direction for the future of the project that
had been discussed by the team but set aside pending more progress on the application
itself. In advance of the team's meeting with the soil scientists, as an experiment, an
extra layer of map data was included that sat atop the new soil polygon data, which
itself was layered over the original map image. Three dots occupied that top level, each
of which presented a small popup window when hovered over with a mouse. One dot's
popup contained a citation for an article about the soils and vegetation of pre-Euro
settlement in Indiana. The second dot's popup presented several paragraphs of
example dummy text. The third popup contained a movie about soil sampling
procedures created several years earlier by a Purdue Extension program. The three
dots and their accompanying popups had been floated into the map on a stream of
XML data extracted from a wiki, a Content Management System (CMS), and an external
website, respectively. The agronomists' interest in more functionality of this sort
speaks to the heart of what Curran et al. (2006, 2007) and indeed countless others have
addressed: the capabilities of the more modern, mashable web have altered the
expectations and vision of researchers in libraries just as much as they have those of
gaggles of MySpacers and bloggers. These agronomists were seeing that the map, almost as
much as any other document or document metaphor online today, could be a platform
for shared, contributed content and be content at the same time.
The agronomists' interest in contributing to the online app verifies sentiments
within GIS literature that directly or indirectly address that part of the Public
Participation GIS (ppGIS) movement devoted to more implicit, less explicit geospaces
(Talen, 2000; Harris and Weiner, 1998; Sieber, 2004; Miller, 2006). These and other
ppGIS scholars have posited that a more democratic GIS would be one that could
account for and include conceptions, constructions, and uses of space that are perhaps
not explicitly geospatial or are at least not of the Euclidean, mathematical lineage that
is the default target of GIS functionality. In some ways, by building this soil project with a
future in mind that has the map behaving not only as a resurrected library asset but also as
a sounding board for different communities of stakeholders (agronomists in Purdue's
Agricultural Sciences departments, for sure, but also farmers, historians, and land use
analysts), a smattering of ppGIS ideals would be adopted into an environment that also
experiences exclusionism and top-heavy support for GIS, a problem to which Purdue
Libraries has begun to pay attention. Maps are rich, almost always interdisciplinary
documents, and their important earth science data are complemented by information
additionally useful to those with more secular interests in community-building, local
history, archaeology, genealogy, and other campus disciplines. The map mashup
model, easily within the ethic of read/write Web 2.0, works just as well, if not especially
well, in academic environments where the fount of mashable information is richer (and
arguably more esoteric) and includes discipline expertise, other library collections,
external datasets, and a potential user population with highly variable experience with
traditional GIS work. Much of this is fodder for the models of a ppGIS or GIS/2 that
have been written about over the last ten years. The result of the soil survey
application (a kind of floating geobibliography) will be another step toward
legitimizing the mashup as a viable method for providing access to multiple
collections, and, better still, engaging library users.
So work in the immediate future will focus on preparing at least a 1.0-level product
for the November centennial. The following year should see two forks of work on this
project. The more practical of these will be the likely decision to progress to the 1957
Tippecanoe survey, a much more complex document at least in terms of its maps and
the geometries therein. Other work, though, will focus on: adding additional
explicitly-spatial content germane to the area and topic; and adding user-input
methods and widgets that will allow agronomists or other interested parties to add
literature citations, annotations, or their own map layers. This is in anticipation of a
future where more and more library collections will transform into interactable,
mashable complements to one another and to the wider world of data and information.
The richness of maps as texts, their almost inherent interdisciplinariness, and their
potential to become shared space to any number of interested parties, suggest that
library users have plenty to say not only about the maps themselves, but to each other,
through, across, and over geospace and time.
References
Curran, K., Murray, M., Norrby, D.S. and Christian, M. (2006), "Involving the user through
Library 2.0", New Review of Information Networking, Vol. 12 No. 1, pp. 47-59.
Curran, K., Murray, M. and Christian, M. (2007), "Taking the information to the public through
library 2.0", Library Hi Tech, Vol. 25 No. 2, pp. 288-97.
Definiens Professional (2006), Definiens Professional 5 User Guide 2006, Definiens AG, Munich.
GPO Listserv (2007), "Format changes for the soil survey reports", Government Printing Office
Listserv, available at: http://listserv.access.gpo.gov/cgi-bin/wa.exe?A2=ind0704&L=gpo-fdlp-l&P=1537
(accessed July 27, 2007).
Harris, T. and Weiner, D. (1998), "Empowerment, marginalization, and community-integrated
GIS", Cartography and Geographic Information Systems, Vol. 25 No. 2, pp. 67-76.
History of Soil Survey (1999), United States Department of Agriculture Natural Resources
Conservation Office, available at: http://soils.usda.gov/partnerships/ncss/history.html
(accessed July 20, 2007).
Miller, C.C. (2006), "A beast in the field: the Google Maps mashup as GIS/2", Cartographica,
Vol. 41 No. 3, pp. 187-99.
Richter, D.D. Jr and Markewitz, D. (2001), Understanding Soil Change: Soil Sustainability over
Millennia, Centuries, and Decades, Cambridge University Press, Cambridge.
Sieber, R.E. (2004), "Rewiring for a GIS/2", Cartographica, Vol. 39 No. 1, pp. 25-39.
Talen, E. (2000), "Bottom-up GIS: a new tool for individual and group expression in participatory
planning", Journal of the American Planning Association, Vol. 66 No. 3, pp. 279-94.
Tso, B. and Mather, P.M. (2001), Classification Methods for Remotely-Sensed Data, Taylor &
Francis, New York, NY.
Further reading
Bushnell, T.M. (1944), The Story of Indiana Soils: with Descriptions of General Soil Regions and
the Key to Indiana Soils, Purdue University Agricultural Experiment Station, Lafayette,
IN.
Maness, J.M. (2006), "Library 2.0 Theory: Web 2.0 and its implications for libraries", Webology,
Vol. 3 No. 2, available at: www.webology.ir/2006/v3n2/a25.html
Corresponding author
Marianne Stowell Bracke is the corresponding author and can be contacted at: mbracke@purdue.edu