
Digitization Practices in India: Issues and Challenges

V.N. Shukla

C-DAC, NOIDA UNIT


C-DAC Mission

Natural Language Processing and Interfaces
Infrastructure and Support Services
Human Resource Development in Hi-Tech Areas
Special Industrial Applications

Areas of Competence (NOIDA)

Graphical Display System
NLP
E-Governance
Internet on CATV & E-Commerce
Solar Energy System
Security Systems
Embedded System
System Engineering and Consultancy

Digital Library Activities: CDAC Noida

Digital Library Projects
Mega Centre for Digital Library
Mobile Digital Library: Dware Dware Gyan Sampada
Digital Library at President's House
Digital Library at Nagari Pracharini Sabha, Varanasi
Digital Library at Uttaranchal
GyanNidhi: Multilingual Parallel Corpus in Indian Languages
Digital Library at Gujarat Vidyapeeth, Ahmedabad
Digitization of Libraries

Digital Library Mission


To organize the information and make it universally
accessible and useful.

Online Content: billions of web pages
Offline Content: billions of items still unindexed

DL Initiatives
Only ~15% of books are in print.
~85% of books are out of print and/or out of copyright; these books are found only in libraries.
GOAL: Create a comprehensive virtual card catalog of all books in all languages, while respecting publishers' rights.
Source: Google

Digital Libraries

Users

Hyperlinks
Index

Metadata
Search

DL creation &
processes

Traditional Libraries
INDEX

A Typical Library Collection


The value is in the middle

~15%: In-Print
~65% or more: Unclear copyright status
May be in copyright, but not for sale
Rights may have reverted to author
May be in the public domain
Less than 20%**: Public Domain

92% of the world's books are neither generating revenue for the copyright holder nor easily accessible to potential readers.*
*Source: Covey, Denise Troll. "Global Cooperation for Global Access: The Million Book Project"
**OCLC analysis of the Google Books Library Project: http://www.dlib.org/dlib/september05/lavoie/09lavoie.html

A Digital Library (DL) may be seen as

A collection of intellectual creations by human beings through their own language and culture. It also reflects cultural heritage, besides providing an archive and raising many research issues pertaining to Natural Language Processing.

Digital Library?
Sun Microsystems defines a digital library as the electronic extension of the functions users typically perform and the resources they access in a traditional library.
These information resources can be translated into digital form, stored in multimedia repositories, and made available through Web-based services.
According to another definition, digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily available for use by a defined community or set of communities.

What is a Digital Library?

A service? An architecture?
A set of information resources?
A set of tools to locate, search and retrieve information?
Possibly the tools to create such resources and services also fall within the purview of DLs
The digital face of traditional libraries
Includes both digital and traditional collections
The backbone and nervous system of libraries

Digital library vs traditional library

Efficient, high-quality services through collecting, organizing, storing, disseminating, retrieving and preserving information.
Preservation benefits, besides making information retrieval and delivery more convenient.
Online access to historical and cultural documents whose existence is endangered due to physical decay.
Digital libraries necessarily include a strong focus on the management of digital content, just as traditional libraries have long focused on the management of content in physical forms.

Digital Content Management


Most of the digital content being managed consists of human language in various forms: character-coded electronic text, scanned images, printed or handwritten text, or human speech.
Language technology helps in managing digital content.
Learning from past experience also helps in managing content.
The major areas to be exploited are:
Information retrieval,
multimedia,
database,
data mining,
data warehouse,
on-line information repositories,
image processing, hypertext,
World Wide Web and wide area information services (WAIS).

A few advantages of digital libraries

Access anywhere
Reduced delays
Distributed storage, central access
Better cataloguing
Cross references to other documents
Full text search
Protected information source
Wide exploration and exploitation of the information

The information explosion, the wide bandwidth data networks and the potential
of Internet-based technologies - such as the Web - make digital libraries one of
the important application areas of computer science.

Process of Digital Preservation


Book scanning status
No: Reject the book
Yes: continue
Scanned image in TIFF format
S/w to divide even & odd pages
Batch cropping & cleaning
OCR
Conversion to TXT/RTF/HTML
XML meta file creation using Dublin Core Std.
Uploading
Centralized server
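
The workflow above includes a step that creates an XML meta file using the Dublin Core standard. A minimal sketch of what that step might look like follows. It uses Python's standard library, and the wrapper element name `metadata`, the function name `build_dc_record` and the sample values are assumptions for illustration, not the format actually used at the scanning centres.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Return a Dublin Core XML string for one digitized book.

    `fields` maps Dublin Core element names (e.g. "title") to a single
    value or a list of values. The "metadata" wrapper is hypothetical.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative record only; every value here is made up
record = build_dc_record({
    "title": "Sample Book",
    "creator": "Unknown Author",
    "language": "hi",
    "format": "image/tiff",
    "identifier": "book-0001",
})
print(record)
```

Keeping the metadata in a separate XML file, rather than inside the image, means the same record can be uploaded to the central server and indexed without opening the TIFF scans.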

Goals of DL

First-generation digital library
Focused on digitization technology, metadata schemes, data management techniques, and digital preservation.

Second-generation digital library
Exploring new opportunities and developing new competencies.

Third-generation digital library
Focusing instead on fully integrating digital material into the library's collections through a modular systems architecture.

Ingredients for DLs

Hardware
The minimum machinery to do the job

Software
The programs for handling data

Digital Objects
Articles, Conference Papers, Theses

Basic Skills
Things one has to learn

Hardware

Server
You'll need access to a web server
A good PC
Scanners
Flatbed, Auto feed, Back to back
MF
Book Scanner

Software

Open Source Software (OSS)
DSpace, EPrints, Fedora, GSDL

Proprietary software you can't avoid
Image editing and Optical Character Recognition software have to be purchased

Content is King

The information content is more important than the systems used for its storage, management and retrieval

Objects should not be locked in specific DLs or archives

Creating DLs
Six steps

Selecting
Acquiring
Digitization
Creation of Metadata
Organizing
Archiving

Providing Access

Possible Delivery Formats

Pure image formats: TIFF, JPEG
Open encoded formats: XML, HTML, ASCII, and Unicode
Hybrid formats: PDF, DjVu (can contain both image and text)
Proprietary formats: Microsoft Word, WordPerfect
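
The four format families above can be used to sort the files of a collection before delivery. The sketch below is one possible way to do that. The mapping from file extensions to families is an assumption made for illustration, and `delivery_category` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed mapping of common file extensions to the four families above
FORMAT_CATEGORIES = {
    "pure image": {".tif", ".tiff", ".jpg", ".jpeg"},
    "open encoded": {".xml", ".html", ".htm", ".txt"},
    "hybrid": {".pdf", ".djvu"},
    "proprietary": {".doc", ".wpd"},
}


def delivery_category(filename):
    """Return the format family for a file name, or "unknown"."""
    suffix = Path(filename).suffix.lower()
    for category, suffixes in FORMAT_CATEGORIES.items():
        if suffix in suffixes:
            return category
    return "unknown"


for name in ["page_001.TIF", "book.djvu", "chapter.html", "notes.odt"]:
    print(name, "->", delivery_category(name))
```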

Digitization: Issues

Copyright
Access copy and archive copy
File size
Storage media (CD, hard disk)
File format (TIFF, JPEG)
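
File size and storage media are linked by simple arithmetic: an uncompressed scan needs roughly (width in pixels) x (height in pixels) x (bits per pixel) / 8 bytes. The sketch below works through that calculation. The page size (8 x 10 inches), the 300 dpi resolution and the 700 MB CD capacity are assumed values chosen for illustration, and file headers and compression are ignored.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size in bytes of one uncompressed scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def pages_per_medium(capacity_bytes, page_bytes):
    """Number of whole pages that fit on a storage medium."""
    return capacity_bytes // page_bytes


CD_BYTES = 700 * 10**6  # assumed CD capacity

# Assumed page: 8 x 10 inches scanned at 300 dpi
bitonal = uncompressed_scan_bytes(8, 10, 300, 1)   # black and white
colour = uncompressed_scan_bytes(8, 10, 300, 24)   # 24-bit colour

print(bitonal, pages_per_medium(CD_BYTES, bitonal))  # 900000 777
print(colour, pages_per_medium(CD_BYTES, colour))    # 21600000 32
```

The gap between 777 and 32 pages per CD suggests why an archive copy (for example uncompressed TIFF) and a smaller access copy (for example JPEG) are usually kept as separate files.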

Challenges in Digitization

Building digital collections of national importance from existing texts, documents, images . . .

Creating new digital documents & linking them

Subject portals: Selecting and maintaining open source digital resources

Developing / adapting management tools for digital collections

Providing access to digital collections

Challenges (contd.)

Integrating digital & other library collections
incl. integration of OPACs, subscribed e-resources and subject portals

Establishing services for digital libraries
online access & offline support
education & training of users and librarians

Addressing social, legal, policy issues

Challenges in Publishing

Preservation of layout

Searchability of content and metadata

Efficient image compression

Easy browsing of books

Accommodating low-bandwidth users

Multilingual text support

Multipaging
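
Two of the challenges above, searchability of content and multilingual text support, meet in the search index. The sketch below is a minimal inverted index that normalizes Unicode text before indexing. It is an illustration under simple assumptions (whitespace-separated words, all query words must match), not a description of any search engine used by the projects in this talk. The function names and sample documents are hypothetical.

```python
import unicodedata
from collections import defaultdict


def tokenize(text):
    """Normalize text and split it into words.

    Punctuation is replaced by spaces and the text is split on
    whitespace, so combining vowel signs in Indic scripts stay
    attached to their words.
    """
    normalized = unicodedata.normalize("NFC", text).casefold()
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in normalized
    )
    return cleaned.split()


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up sample documents
docs = {
    "b1": "Digital Library of India",
    "b2": "ज्ञान निधि: digital corpus",
}
index = build_index(docs)
print(search(index, "DIGITAL"))
print(search(index, "ज्ञान"))
```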

Digital Library Support in India


Funding
Ministry of Communication & Information Technology (MIT)
Ministry of Human Resource Development (MHRD)
Manuscript Mission of India
Department of Scientific & Industrial Research (DSIRTRP)
All India Council for Technical Education (AICTE)
University Grants Commission (UGC)

Digital Library Initiatives in India

Library Consortium in India


Scholarly Science Journals
Theses & Dissertations
Institutional E-Print Archives
Books (out of copyright)
Manuscripts
Newspapers
Online Courseware
Open Access at Metadata Level
Portal and Gateway Services


Government of India
Min. of C&IT: Universal Digital Library
Min. of Culture: National Manuscript Library

Others
CSIR E-Journals Consortium
INDEST-AICTE Consortium
UGC Infonet Consortium
FORSA Consortium
IIM Libraries Consortium

Participating centers of DLI


PTU-1
PTU-2
PTU-3

Rashtrapati Bhavan
CDAC Noida

ERNET

IIIT-Allahabad

Digital Library of India


CDAC Kolkata

MIDC

Pune University
IIIT-H
State & City Central Library
University of Hyderabad

Goa University

IISc

TTD Tirupati

Sringeri Mutt
IISc, IIAP, PoornaPragya

ASR Melkote

AKCE

Anna University
Kanchi Mutt
SASTRA

Mega Scanning Centres at IIITH, IIITA, CDAC-Noida and CDAC-Kolkata

Digital Library Initiatives in India

Some Examples

Digital Library of India


http://www.dli.ernet.in/

April 20, 2009

Workshop on Institutional Repositories


http://www.ias.ac.in/


http://www.insa.ac.in/


http://medind.nic.in/


Manuscripts

India has the largest collection of manuscripts in the world (approximately 5 million).

India is the repository of an astounding wealth of ancient knowledge belonging to different periods of history, going back thousands of years. Most of this knowledge, belonging to different areas of intellectual activity such as religion, philosophy, science, arts and literature, is preserved in the form of manuscripts. Composed in different Indian languages and scripts, they are preserved in materials such as birch bark, palm leaf, cloth, wood, stone and paper.

The National Manuscript Mission was launched as a five-year programme in February 2003 by the Ministry of Human Resource Development, Govt. of India, to locate and conserve all the manuscripts.

http://namami.nic.in/

Archives of Indian Labour


V.V. Giri National Labour Institute
Heritage of Indian Working Class
Commissions on Labour
Oral History Collections
Trade Union Collections
Regional Collections
Strike Collections
Powered by Greenstone Digital Library
http://www.indialabourarchives.org/

Digital Library Benefits: Individual

Gain access to the holdings of libraries worldwide through automated catalogs. Locate both physical and digitized versions of scholarly articles and books.
Optimize searches: simultaneously search the Internet, commercial databases, and library collections.
Save search results and conduct additional processing to narrow or qualify results.
From search results, click through to access the digitized content or locate additional items of interest.
All of these capabilities are available from the desktop or other Web-enabled device such as a personal digital assistant or cellular telephone.

Conclusion

Digital libraries are redefining the role of libraries in society and the role of librarians and information specialists.

A national-level mechanism is essential to promote and coordinate open access and public domain digital library systems:
Improve awareness of open access
Regular training: tools, processes, standards
Support setting up of working models, services
National Resource Centre for open access publishing

International agencies like UNESCO, ICSU, ICSTI and CODATA need to actively promote and support developing-country initiatives.


Thank You
