
DataFlow VIDaaS Launch Event, Saïd Business School, Oxford University

2 March 2012

The JISC UMF DataFlow Project


http://www.dataflow.ox.ac.uk

Introduction to DataStage
David Shotton (PI, JISC UMF DataFlow Project)

Image BioInformatics Research Group Department of Zoology University of Oxford, UK


http://ibrg.zoo.ox.ac.uk e-mail: david.shotton@zoo.ox.ac.uk
© David Shotton, 2012. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence

And the winning platform is . . .

At Queen Mary College, there is a JISC MRD Project entitled

Sustainable Management of Digital Music Research Data


http://rdm.c4dm.eecs.qmul.ac.uk/

After carefully reviewing several data management systems last December, including Fedora Commons, DataVerse and DSpace, they concluded:

"On paper, DataFlow is a winner: it meets (almost) all our requirements, especially because of DataStage, something other platforms don't offer. DataStage would be particularly appreciated, because it would make the integration of the system in the research workflow much less disruptive. Sadly, the availability of DataFlow software will come too late to be useful for our short project (October 2011 - March 2012)."

Well, now the DataFlow software systems, DataStage and DataBank, are available, and we hope they will meet the needs of many of you here

Why don't researchers publish data?


Three pressures presently prevent researchers from publishing their data

Information overload and pressure of work


With twenty new papers each week, a researcher can never catch up - there is just too much new scientific information being produced now
Have to run to stand still - no time for fringe activities like data curation

Departmental pressure for financial viability, determined by the REF

pressure to win grants and to publish in high impact journals

negligible incentives and academic reward in terms of peer esteem, tenure or promotion for data publication activities

Cognitive overhead and skill barriers to best-practice data management


metadata concepts are foreign to most biomedical researchers
large amount of effort involved in preparing data for publication

[From evidence submitted 5 August 2011 to the Royal Society's "Science as a Public Enterprise" policy study]

Easing the pain of data archiving and publication

Making data management as simple as possible


- the principle of sheer curation (http://en.wikipedia.org/wiki/Sheer_curation)

Create a data management infrastructure that:

works with you rather than against you

accommodates the data management tools with which you are already familiar (e.g. spreadsheets)
provides services that are of immediate benefit in your day-to-day activities (e.g. shared file access)
makes data management, data publication and data archiving activities sufficiently lightweight, intuitive and transparent that they are easily achieved, without imposing a significant cognitive overhead
By achieving this, we can bridge the gap between laboratory and repository

Managing data using a two-tier infrastructure - Tier One: DataStage

Researchers can save files to a secure private DataStage file store
This is purely for their own benefit
Just a file store - does not pose a cognitive overhead (sheer curation)
Requires no software installation on the researchers' computers

Designed for deployment at the research group level, locally or on a cloud


Primary access is as a mapped network drive, Drive D:, on each computer
You save files to DataStage just as you would to your local hard drive
No restrictions or limitations of file type - whatever you normally use

Web access allows users to browse files within DataStage


Advantages over a cheap hard drive from PC World under your desk:
Regular nightly automated backup - no need to remember to do so
Private, shared and collaborative areas, with controlled group access
Additional Web interface to DataStage, using the same user credentials
Can invite overseas colleagues to access your files, via password control

Managing data using a two-tier infrastructure - Spanning the tiers: DataStage to DataBank

The special Web submission interface permits researchers to select and package data files for publication and long-term repository archiving

Easy to do
When the researcher is ready
Minimal metadata requirement, to encourage usage

The selected files are put in a special directory, with optional sub-directories

The files are accompanied by simple metadata, stored as an RDF manifest


It is possible to represent data files stored elsewhere using URIs

useful for large data files that already have stable storage locations

Packaging uses the BagIt file packaging specification from the California Digital Library (https://wiki.ucop.edu/display/Curation/BagIt)
The resulting files are then zipped into a single object for transmission to DataBank, the institutional data repository
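
The packaging step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the DataStage implementation: the function names, directory names and the choice of SHA-256 for the payload manifest are assumptions, and the RDF metadata manifest is omitted for brevity. The layout (a bagit.txt declaration, a data/ payload directory and a checksum manifest) follows the BagIt specification.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy the selected files into a BagIt-style directory layout."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dest = data_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        payload = src.read_bytes()
        dest.write_bytes(payload)
        # One manifest line per payload file: "<checksum>  <path in bag>"
        checksum = hashlib.sha256(payload).hexdigest()
        manifest_lines.append(f"{checksum}  data/{rel.as_posix()}")
    # Bag declaration; 0.97 is an assumed version number for illustration.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def zip_bag(bag_dir: Path, zip_path: Path) -> Path:
    """Zip the whole bag into a single object for transmission."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(p for p in bag_dir.rglob("*") if p.is_file()):
            arcname = Path(bag_dir.name) / path.relative_to(bag_dir)
            zf.write(path, arcname=arcname.as_posix())
    return zip_path


def demo() -> list[str]:
    """Package two hypothetical files and list the archive contents."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "selected"
        (src / "raw").mkdir(parents=True)
        (src / "readme.txt").write_text("example dataset\n", encoding="utf-8")
        (src / "raw" / "counts.csv").write_text("sample,count\na,1\n", encoding="utf-8")
        bag = make_bag(src, root / "dataset-bag")
        archive = zip_bag(bag, root / "dataset-bag.zip")
        with zipfile.ZipFile(archive) as zf:
            return sorted(zf.namelist())


if __name__ == "__main__":
    print(demo())
```

Optional sub-directories in the selected files are preserved under data/, which matches the "special directory, with optional sub-directories" described on this slide.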

Managing data using a two-tier infrastructure - Tier Two: DataBank

DataBank is a scalable data repository designed for institutional deployment
Developed by the Bodleian Library, with a track record in preservation
Cloud-deployable
Easy for a researcher to update a dataset if required
Data packages normally published under a CCZero Open Data Waiver
Confidential data packages can be kept in a separate dark repository
Data packages assigned DOIs, making them citable (for academic credit)
Optional user-defined embargo period to permit journal article publication

Upon receipt of a DataStage data package, DataBank:
unzips the data package to give access to the files
mints a DOI for the data package, and registers it with DataCite
displays the RDF manifest metadata, and enriches it (e.g. with the DOI)
indexes the metadata, and provides a search and browse interface
DataBank is, in actuality, just an interface layer over a generic object store, as Neil will explain later this morning
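
The ingest steps listed on this slide can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: `ToyRepository`, its DOI format, the placeholder prefix and the word index are all hypothetical, and real DOI minting and DataCite registration (which require an external service) are replaced by a simple counter.

```python
import io
import zipfile
from dataclasses import dataclass, field


@dataclass
class ToyRepository:
    """Hypothetical stand-in for a repository; not the DataBank code."""

    doi_prefix: str = "10.5072"  # assumption: placeholder prefix for illustration
    packages: dict = field(default_factory=dict)
    index: dict = field(default_factory=dict)

    def ingest(self, zipped: bytes, metadata: dict) -> str:
        # Step 1: unzip the package to give access to the files.
        with zipfile.ZipFile(io.BytesIO(zipped)) as zf:
            files = {
                name: zf.read(name)
                for name in zf.namelist()
                if not name.endswith("/")
            }
        # Step 2: "mint" an identifier (a counter stands in for a real DOI service).
        doi = f"{self.doi_prefix}/example.{len(self.packages) + 1}"
        # Step 3: enrich the supplied metadata with the new identifier.
        enriched = dict(metadata, identifier=f"doi:{doi}")
        self.packages[doi] = {"files": files, "metadata": enriched}
        # Step 4: index every metadata word so the package can be searched.
        for value in enriched.values():
            for word in str(value).lower().split():
                self.index.setdefault(word, set()).add(doi)
        return doi

    def search(self, word: str) -> list:
        """Return the identifiers of packages whose metadata contain the word."""
        return sorted(self.index.get(word.lower(), set()))


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("bag/data/a.csv", "x,y\n1,2\n")
    repo = ToyRepository()
    new_doi = repo.ingest(buffer.getvalue(), {"title": "Zebrafish imaging data"})
    print(new_doi, repo.search("zebrafish"))
```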

DataFlow software services - summary

[Diagram: Researchers -> DataStage file system -> zipped BagIt data package with RDF metadata manifest -> DataBank repository -> researchers, other users]

The DataStage / DataBank Beta Launch

The DataFlow Project has involved

taking our initial working DataStage and DataBank prototypes

undertaking a complete code review, rewriting where necessary


improving the user interfaces

preparing the software for deployment in two forms


as a Virtual Machine to run in a VMware environment
as a Debian Package to install on the Ubuntu operating system

writing documentation to describe the installation and functionality
can be run locally or on a cloud
installation easy and customizable (e.g. your name & logo)

Beta releases v0.1 of these DataStage and DataBank services are now available

enable research groups and institutions to provide their members with zero-cost data management solutions (apart from hosting costs)

cloud provision can expand and shrink with requirements
no need to build and staff your own local data centre

Acknowledgements

. . . thanks to the JISC UMF for funding

and acknowledgement of the excellent work of my DataFlow colleagues:

Bhavana Ananda, Katherine Fletcher, Graham Klyne (IBRG)


Ian Chard, Neil Jefferies, Anusha Ranganathan (Bodleian Library)
Alex Dutton, Joseph Talbot (OU Computing Service)
Gabriel Hanganu, Sander van der Waal (OSS Watch)

Ross Gardler (Open Directive LLP)


Neil Caithness, Matteo Turilli, David Wallom (Oxford e-Research Centre)
Richard Jones, Ben O'Steen (Cottage Labs)

Stephanie Taylor (Critical Eye Communications)


Matthew Barker, Tom Ellis, Alex Hartwig (Canonical Ltd)

. . . time for a user endorsement

Chris Holland, Department of Zoology

. . . and a DataStage demo


Graham Klyne, architect of the original DataStage prototype


Bhavana Ananda, current DataStage developer

New for Beta Release v0.2, early April 2012

Integration of SWORD v2 repository submission protocol

DataStage data packages can be submitted to any SWORD-compliant repository (e.g. the Dryad Data Repository, www.datadryad.org)
DataBank will be able to ingest data packages from any SWORD client
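
To make the SWORD submission step concrete, here is a hedged sketch of how a client might prepare a SWORD v2 binary deposit request for a zipped data package. The collection URL and file name are hypothetical, the header set reflects a reading of the SWORD v2 profile rather than the DataStage code, and the slides do not say which packaging identifier DataStage declares (SimpleZip is assumed here). The request is only constructed, not sent.

```python
import hashlib
import urllib.request


def build_sword_deposit(
    collection_url: str,
    package_bytes: bytes,
    filename: str,
    in_progress: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a zipped package."""
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        # Checksum so the server can detect a corrupted upload.
        "Content-MD5": hashlib.md5(package_bytes).hexdigest(),
        # Assumption: the package is declared as a plain zip.
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "In-Progress": "true" if in_progress else "false",
    }
    return urllib.request.Request(
        collection_url, data=package_bytes, headers=headers, method="POST"
    )


if __name__ == "__main__":
    # Hypothetical endpoint and payload, for illustration only.
    request = build_sword_deposit(
        "https://repository.example.org/sword/collection/datasets",
        b"PK\x05\x06" + b"\x00" * 18,  # bytes of an empty zip archive
        "dataset-bag.zip",
    )
    print(request.get_method(), request.full_url)
    for name, value in request.header_items():
        print(f"{name}: {value}")
```

Sending the request would additionally need authentication and error handling, which depend on the target repository.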

DataBank, as well as DataStage, will by then have Debian packaging for ease of deployment onto Ubuntu Linux hosts

Re-inclusion of WebDAV, to permit users to read and write via Web access

Deployment will be tested on a wider range of cloud hosting environments


for both VMware virtual machine and Debian package installation
including the Eduserv academic cloud

User interface improvement and additional functionality on the basis of existing plans and user feedback

Leading to a fully-featured release (Version 1.0) in May 2012

DataFlow services summary - adding SWORD

[Diagram: Researchers -> DataStage file system -> zipped BagIt data package with RDF metadata manifest -> SWORD deposit protocol -> DataBank repository -> researchers, other users]

The conventional research data lifecycle

Scholarly publications: conference papers and journal articles

Hypothesis formulation and project design

Publication activities

Research results and conclusions

Institutional repositories

Research plan

Experimentation and data creation


Raw data in research notebooks and live PC files

Data selection and interpretation


Research datasets abandoned on local hard drives or CD-ROMs

The DataFlow-enhanced research data lifecycle


Dissemination

Open data on Web

Scholarly publications: conference papers and journal articles

Hypothesis formulation and project design

Publication activities

DataBank repository

Archived datasets

Preservation

Research plan

Research results and conclusions

Experimentation and data creation

Raw data in research notebooks and live PC files

Data selection and interpretation

DataStage filestore

Private yet sharable

Management

So what have we got in DataStage?


Just a file store, appearing as a mapped drive - easy to use

Customizable access controls to suit different types of groups


Does not require software installation on the user's computer

Uses standard software components found on every client machine
Cross-platform - Windows, Mac or Linux

DataStage server hosted on Ubuntu Linux system

Deployable locally, or on a cloud

FREE, apart from hosting costs

Has Web access, permitting Web apps to be built on top


For example, for data packaging and SWORD repository submission
Other Web apps possible . . .

Can be used for other things than just storing datasets

Wider applications of DataStage

Escaping the Ivory Tower

Applications in commerce

Applications in education

Adding a security app


[Diagram: the DataStage kernel with Data Packaging and Data Security apps inside a security wrapper, depositing via the SWORD deposit protocol to DataBank or another SWORD repository]

Time-stamp each data file using irrevocable method
Encrypt each data file using, for example, the OpenPGP standard
Create a data package of time-stamped encrypted files
Compute the UNF (Universal Numeric Fingerprint) for the data package, so one can later ensure that it has not been altered

Applications:

Experimental data security for patent application - e.g. pharmaceuticals
Secure storage of financial data - many commercial companies
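
The idea of fingerprinting a package so that later alteration can be detected can be sketched as below. Two substitutions are made and should be read as assumptions: a SHA-256 digest stands in for the UNF, and a locally recorded UTC time stands in for an irrevocable time-stamp, which in practice would need a trusted external service. Encryption is not shown, since OpenPGP requires software outside the Python standard library. The function names are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Optional


def fingerprint(package_bytes: bytes) -> str:
    """Return a SHA-256 digest of the package (a stand-in for a UNF)."""
    return hashlib.sha256(package_bytes).hexdigest()


def seal(package_bytes: bytes, now: Optional[datetime] = None) -> dict:
    """Record the fingerprint together with the time of sealing."""
    stamp = now or datetime.now(timezone.utc)
    return {"sha256": fingerprint(package_bytes), "sealed_at": stamp.isoformat()}


def is_unaltered(package_bytes: bytes, record: dict) -> bool:
    """Check that the package still matches the recorded fingerprint."""
    return hmac.compare_digest(fingerprint(package_bytes), record["sha256"])


if __name__ == "__main__":
    package = b"time-stamped, encrypted files would go here"
    record = seal(package)
    print(is_unaltered(package, record))         # unchanged package
    print(is_unaltered(package + b"!", record))  # tampered package
```

For the check to be meaningful, the sealing record has to be stored somewhere the package holder cannot rewrite it.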

Raspberry Pi computer

Designed by David Braben of the Raspberry Pi Foundation in Cambridge
First released on 29 February 2012
Size of a credit card, and cost ~£25 for a configured system
Intended to stimulate the teaching of basic computer science in schools

Raspberry Pi computer schematic

Ethernet port, two USB ports, HDMI monitor socket
700 MHz ARM processor running Linux
Programmable in Python, C, BBC Basic
256 MB RAM (eight times capacity of BBC Micro B)
Storage on SD card (16 GB card costs about £10)
Samba file sharing permits connection to external drives

Pi Store (aka DataStage) for classroom data integration

Pi Store

One Pi Store for each class


A cloud-based data integration solution
Each pupil has a private directory to store stuff

Accessible from school or from home


The teacher has access to all pupils' folders, for example to permit marking homework

DataStage folders

Typically a researcher will use his private folder for daily work

The research group leader can read files in that folder

Files placed in the Shared folder can also be read by other group members, and those placed in the Collaborative folder can be written and read by all
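
The folder rules on this slide can be expressed as a small access check. This is a sketch of the stated rules only, with some assumptions added and labelled in the comments: that the owner can always read and write their own Private and Shared folders, that the group leader counts as a group member, and that anyone outside the group has no access. The names `Area` and `allowed` are hypothetical.

```python
from enum import Enum


class Area(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    COLLABORATIVE = "collaborative"


def allowed(area: Area, action: str, user: str, owner: str,
            leader: str, members: set) -> bool:
    """Return True if `user` may perform `action` in `owner`'s folder."""
    if action not in ("read", "write"):
        raise ValueError(f"unknown action: {action}")
    # Assumption: people outside the research group have no access at all.
    if user != leader and user not in members:
        return False
    # Collaborative folder: written and read by all group members.
    if area is Area.COLLABORATIVE:
        return True
    # Assumption: owners have full access to their own Private and Shared folders.
    if user == owner:
        return True
    if action == "read":
        # Shared: readable by the group; Private: readable by the leader only.
        return area is Area.SHARED or user == leader
    return False


if __name__ == "__main__":
    group = {"alice", "bob", "carol"}  # hypothetical group; carol is the leader
    print(allowed(Area.PRIVATE, "read", "carol", "alice", "carol", group))
    print(allowed(Area.PRIVATE, "read", "bob", "alice", "carol", group))
    print(allowed(Area.SHARED, "read", "bob", "alice", "carol", group))
    print(allowed(Area.COLLABORATIVE, "write", "bob", "alice", "carol", group))
```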

DataStage metadata are limited

Intentionally, DataStage metadata are limited to author, title, identifier, date and description
This is to encourage researchers to submit datasets to their repository, bearing in mind Graham's concept of "curation by addition"
Additional rich metadata can be included in a separate metadata file as part of the entire data package, in XML or RDF format
DataBank can recognize such a file and index the metadata, extracting elements for inclusion in the RDF manifest
Separately from the DataFlow Project, we have been developing a minimal metadata information model for describing a research investigation and the various research outputs (papers, datasets, protocols, workflows, etc.) that may result from the investigation
Tanya Gray has encoded this as an XML model, and can dynamically create from that model a Web form in which to enter such metadata
Such rich metadata can form part of a DataStage data package
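
As an illustration of how small such a manifest can be, the sketch below serialises the five fields named above as RDF/XML. The choice of Dublin Core terms as the vocabulary, the field-to-term mapping and the example package URI are assumptions made for the example; the slides do not state which vocabulary the DataStage manifest actually uses.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Assumed mapping from the five DataStage fields to Dublin Core terms.
FIELD_TO_TERM = {
    "author": "creator",
    "title": "title",
    "identifier": "identifier",
    "date": "date",
    "description": "description",
}


def build_manifest(package_uri: str, metadata: dict) -> str:
    """Serialise the minimal metadata for one data package as RDF/XML."""
    missing = [name for name in FIELD_TO_TERM if name not in metadata]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": package_uri}
    )
    # One property element per metadata field.
    for name, term in FIELD_TO_TERM.items():
        ET.SubElement(description, f"{{{DCTERMS_NS}}}{term}").text = str(metadata[name])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_manifest(
        "http://example.org/datasets/demo",
        {
            "author": "A. Researcher",
            "title": "Example dataset",
            "identifier": "demo-001",
            "date": "2012-03-02",
            "description": "Files selected for deposit",
        },
    ))
```

Richer metadata would travel in a separate file inside the package, as the slide describes, leaving this manifest unchanged.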

MIIDI data model - Minimal information for an Infectious Disease Investigation

The MIIDI input form for Research Investigation information

The MIIDI input form for Journal Article information

MIIRO data model - Minimal information for Investigations and Research Outputs
