Overview of DataFlow

DataFlow VIDaaS Launch Event, Saïd Business School, Oxford University
2 March 2012
The JISC UMF DataFlow Project

http://www.dataflow.ox.ac.uk
Introduction to DataStage
David Shotton (PI, JISC UMF DataFlow Project)
Image BioInformatics Research Group Department of Zoology University of Oxford, UK

http://ibrg.zoo.ox.ac.uk e-mail: david.shotton@zoo.ox.ac.uk
© David Shotton, 2012. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence
And the winning platform is . . .
At Queen Mary College, there is a JISC MRD Project entitled
Sustainable Management of Digital Music Research Data

http://rdm.c4dm.eecs.qmul.ac.uk/
After carefully reviewing several data management systems last December, including Fedora Commons, DataVerse and DSpace, they concluded:

On paper, DataFlow is a winner: it meets (almost) all our requirements, especially because of DataStage, something other platforms don't offer. DataStage would be particularly appreciated, because it would make the integration of the system in the research workflow much less disruptive. Sadly, the availability of DataFlow software will come too late to be useful for our short project (October 2011 - March 2012).
Well, now the DataFlow software systems, DataStage and DataBank, are available, and we hope they will meet the needs of many of you here.
Why don't researchers publish data?

Three pressures presently prevent researchers from publishing their data
Information overload and pressure of work

With twenty new papers each week, a researcher can never catch up - there is just too much new scientific information being produced now
Have to run to stand still - no time for fringe activities like data curation
Departmental pressure for financial viability, determined by the REF
pressure to win grants and to publish in high impact journals
negligible incentives and academic reward in terms of peer esteem, tenure or promotion for data publication activities
Cognitive overhead and skill barriers to best-practice data management

metadata concepts are foreign to most biomedical researchers
large amount of effort involved in preparing data for publication
[From evidence submitted 5 August 2011 to the Royal Society's Science as a Public Enterprise policy study]
Easing the pain of data archiving and publication
Making data management as simple as possible

- the principle of sheer curation (http://en.wikipedia.org/wiki/Sheer_curation)
Create a data management infrastructure that:

works with you rather than against you
accommodates the data management tools with which you are already familiar (e.g. spreadsheets)
provides services that are of immediate benefit in your day-to-day activities (e.g. shared file access)
makes data management, data publication and data archiving activities sufficiently lightweight, intuitive and transparent that they are easily achieved, without imposing a significant cognitive overhead
By achieving this, we can bridge the gap between laboratory and repository
Managing data using a two-tier infrastructure
Tier One: DataStage
Researchers can save files to a secure private DataStage file store
This is purely for their own benefit
Just a file store - does not pose a cognitive overhead - sheer curation
Requires no software installation on the researchers' computers
Designed for deployment at the research group level, locally or on a cloud

Primary access is as a mapped network drive, Drive D:, on each computer
You save files to DataStage just as you would to your local hard drive
No restrictions or limitations of file type - whatever you normally use
Web access allows users to browse files within DataStage

Advantages over a cheap hard drive from PC World under your desk:
Regular nightly automated backup - no need to remember to do so
Private, shared and collaborative areas, with controlled group access
Additional Web interface to DataStage, using the same user credentials
Can invite overseas colleagues to access your files, via password control
Managing data using a two-tier infrastructure
Spanning the tiers: DataStage to DataBank
The special Web submission interface permits researchers to select and package data files for publication and long-term repository archiving

Easy to do
When the researcher is ready
Minimal metadata requirement, to encourage usage
The selected files are put in a special directory, with optional sub-directories
The files are accompanied by simple metadata, stored as an RDF manifest

It is possible to represent data files stored elsewhere using URIs
useful for large data files that already have stable storage locations
Packaging uses the BagIt file packaging specification from the California Digital Library (https://wiki.ucop.edu/display/Curation/BagIt)
The resulting files are then zipped into a single object for transmission to DataBank, the institutional data repository
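The packaging step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the DataStage implementation: the function names, the directory names, the choice of SHA-256 for the payload manifest and the BagIt version string are all assumptions made for the example, and the RDF metadata manifest is left out.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy the selected files into a minimal BagIt layout under bag_dir.

    Layout: bagit.txt, manifest-sha256.txt, and the payload under data/.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dest = data_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        content = src.read_bytes()
        dest.write_bytes(content)
        # One manifest line per payload file: checksum, then path inside the bag.
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel.as_posix()}")
    # Bag declaration; the version string is an assumption for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def zip_bag(bag_dir: Path, zip_path: Path) -> Path:
    """Zip the whole bag into a single object for transmission."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(bag_dir.rglob("*")):
            if path.is_file():
                arcname = Path(bag_dir.name) / path.relative_to(bag_dir)
                zf.write(path, arcname.as_posix())
    return zip_path


def demo():
    """Build a tiny bag from two hypothetical files and list the zip contents."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "to_publish"
        (src / "images").mkdir(parents=True)
        (src / "results.csv").write_text("sample,value\nA,1\n")
        (src / "images" / "cell.txt").write_text("placeholder")
        bag = make_bag(src, root / "dataset-bag")
        archive = zip_bag(bag, root / "dataset-bag.zip")
        with zipfile.ZipFile(archive) as zf:
            return sorted(zf.namelist())
```

Calling demo() returns the paths stored in the zipped package: the bag declaration, the checksum manifest, and the payload files with their sub-directories preserved under data/.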
Managing data using a two-tier infrastructure
Tier Two: DataBank
DataBank is a scalable data repository designed for institutional deployment
Developed by the Bodleian Library, with a track record in preservation
Cloud-deployable
Easy for a researcher to update a revised dataset if required
Data packages normally published under a CCZero Open Data Waiver
Confidential data packages can be kept in a separate dark repository
Data packages assigned DOIs, making them citable (for academic credit)
Optional user-defined embargo period to permit journal article publication
Upon receipt of a DataStage data package, DataBank
unzips the data package to give access to the files
mints a DOI for the data package, and registers it with DataCite
displays the RDF manifest metadata, and enriches it (e.g. with the DOI)
indexes the metadata, and provides a search and browse interface
DataBank is, in actuality, just an interface layer over a generic object store, as Neil will explain later this morning
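The ingest steps listed above can be sketched as a small pipeline. This is an illustration only, not DataBank's code: a JSON file named manifest.json stands in for the RDF manifest so that the example needs nothing beyond the standard library, the DOI is a made-up placeholder rather than one minted and registered with DataCite, and the function name, file names and index structure are all hypothetical.

```python
import json
import tempfile
import zipfile
from pathlib import Path

# Hypothetical placeholder prefix; a real DOI would be minted via DataCite.
PLACEHOLDER_DOI_PREFIX = "10.5072"


def ingest(package_zip: Path, store_dir: Path, index: dict, serial: int) -> dict:
    """Unzip a data package, enrich its metadata with an identifier, index it."""
    # Step 1: unzip the package to give access to the files.
    # (extractall is acceptable here because the sketch assumes trusted input.)
    target = store_dir / package_zip.stem
    with zipfile.ZipFile(package_zip) as zf:
        zf.extractall(target)

    # Step 2: assign an identifier (stand-in for minting and registering a DOI).
    manifest_path = target / "manifest.json"
    metadata = json.loads(manifest_path.read_text(encoding="utf-8"))
    metadata["doi"] = f"{PLACEHOLDER_DOI_PREFIX}/example.{serial}"

    # Step 3: enrich the stored manifest with the identifier.
    manifest_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")

    # Step 4: index the metadata words so packages can be searched.
    for field in ("title", "description", "author"):
        for word in metadata.get(field, "").lower().split():
            index.setdefault(word, set()).add(metadata["doi"])
    return metadata


def demo():
    """Ingest one hypothetical package and search the index for a word."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        package = root / "bag-001.zip"
        with zipfile.ZipFile(package, "w") as zf:
            zf.writestr("manifest.json", json.dumps({
                "title": "Silk tensile tests",
                "author": "A. Researcher",
                "description": "Raw tensile measurements",
            }))
            zf.writestr("data/results.csv", "sample,value\nA,1\n")
        index = {}
        record = ingest(package, root / "store", index, serial=1)
        return record["doi"], sorted(index["tensile"])
```

In this sketch the search index is just a dictionary from words to identifiers; a real repository would use a proper search engine and would keep the manifest as RDF.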
DataFlow software services - summary
[Diagram: Researchers -> DataStage file system -> Zipped BagIt Data Package with RDF metadata manifest -> DataBank repository -> Researchers, other users]
The DataStage / DataBank Beta Launch
The DataFlow Project has involved
taking our initial working DataStage and DataBank prototypes
undertaking a complete code review, rewriting where necessary

improving the user interfaces
preparing the software for deployment in two forms

as a Virtual Machine to run in a VMware environment
as a Debian Package to install on the Ubuntu operating system
writing documentation to describe the installation and functionality
can be run locally or on a cloud
installation easy and customizable (e.g. your name & logo)
Beta releases v0.1 of these DataStage and DataBank services are now available

These enable research groups and institutions to provide their members with zero-cost data management solutions (apart from hosting costs)

cloud provision can expand and shrink with requirements
no need to build and staff your own local data centre
Acknowledgements
. . . thanks to the JISC UMF for funding
and acknowledgement of the excellent work of my DataFlow colleagues:
Bhavana Ananda, Katherine Fletcher, Graham Klyne (IBRG)

Ian Chard, Neil Jefferies, Anusha Ranganathan (Bodleian Library)
Alex Dutton, Joseph Talbot (OU Computing Service)
Gabriel Hanganu, Sander van der Waal (OSS Watch)
Ross Gardler (Open Directive LLP)

Neil Caithness, Matteo Turilli, David Wallom (Oxford e-Research Centre)
Richard Jones, Ben O'Steen (Cottage Labs)
Stephanie Taylor (Critical Eye Communications)

Matthew Barker, Tom Ellis, Alex Hartwig (Canonical Ltd)
. . . time for a user endorsement
Chris Holland, Department of Zoology
. . . and a DataStage demo

Graham Klyne, architect of the original DataStage prototype

Bhavana Ananda, current DataStage developer
New for Beta Release v0.2, early April 2012
Integration of SWORD v2 repository submission protocol
DataStage data packages can be submitted to any SWORD-compliant repository (e.g. the Dryad Data Repository, www.datadryad.org)
DataBank will be able to ingest data packages from any SWORD client
DataBank, as well as DataStage, will by then have Debian packaging for ease of deployment onto Ubuntu Linux hosts
Re-inclusion of WebDAV, to permit users to read and write via Web access
Deployment will be tested on a wider range of cloud hosting environments

for both VMware virtual machine and Debian package installation
including the Eduserv academic cloud
User interface improvement and additional functionality on the basis of existing plans and user feedback
Leading to a fully-featured release (Version 1.0) in May 2012
DataFlow services summary - adding SWORD
[Diagram: Researchers -> DataStage file system -> Zipped BagIt Data Package with RDF metadata manifest -> SWORD deposit protocol -> DataBank repository -> Researchers, other users]
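The SWORD deposit step in the summary above can be sketched as an HTTP POST of the zipped package to a repository collection. The sketch below only builds the request and does not send it. The collection URL is a hypothetical example, the use of the SimpleZip packaging identifier and of a hex MD5 digest are assumptions about what a given repository accepts, and none of this is taken from the DataStage code.

```python
import hashlib
import urllib.request

# Packaging identifier defined by the SWORD v2 profile for plain zip packages;
# whether a given repository expects this or another format is an assumption.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def build_deposit_headers(package: bytes, filename: str,
                          in_progress: bool = False) -> dict:
    """Return the HTTP headers for a SWORD v2 binary deposit of a zip package."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        # Checksum so the server can detect a corrupted transfer
        # (hex digest here; check what the target repository expects).
        "Content-MD5": hashlib.md5(package).hexdigest(),
        "Packaging": SIMPLE_ZIP,
        # "true" tells the server that more content will follow later.
        "In-Progress": "true" if in_progress else "false",
    }


def build_deposit_request(collection_url: str, package: bytes,
                          filename: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a deposit."""
    return urllib.request.Request(
        collection_url,
        data=package,
        headers=build_deposit_headers(package, filename),
        method="POST",
    )


# Hypothetical collection URL, for illustration only.
EXAMPLE_COLLECTION = "https://repository.example.org/sword/collection/datasets"
```

To perform a real deposit one would pass the request to urllib.request.urlopen with suitable authentication, then read the deposit receipt that the server returns.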
The conventional research data lifecycle
[Diagram labels, in cycle order]
Hypothesis formulation and project design
Research plan
Experimentation and data creation
Raw data in research notebooks and live PC files
Data selection and interpretation
Research results and conclusions
Publication activities
Scholarly publications: conference papers and journal articles
Institutional repositories
Research datasets abandoned on local hard drives or CD-ROMs
The DataFlow-enhanced research data lifecycle

[Diagram labels, in cycle order]
Hypothesis formulation and project design
Research plan
Experimentation and data creation
Raw data in research notebooks and live PC files
DataStage filestore: Management (private yet sharable)
Data selection and interpretation
Research results and conclusions
DataBank repository: Preservation (archived datasets)
Publication activities
Scholarly publications: conference papers and journal articles
Dissemination: Open data on Web
So what have we got in DataStage?

Just a file store, appearing as a mapped drive - easy to use
Customizable access controls to suit different types of groups

Does not require software installation on users' computers

Uses standard software components found on every client machine
Cross-platform: Windows, Mac or Linux
DataStage server hosted on Ubuntu Linux system
Deployable locally, or on a cloud
FREE, apart from hosting costs
Has Web access, permitting Web apps to be built on top

For example, for data packaging and SWORD repository submission
Other Web apps possible . . .
Can be used for other things than just storing datasets
Wider applications of DataStage
Escaping the Ivory Tower
Applications in commerce
Applications in education
Adding a security app

[Diagram: DataStage kernel with Data Packaging and Data Security apps inside a Security wrapper, connected via the SWORD deposit protocol to DataBank or other SWORD repository]
Time-stamp each data file using an irrevocable method
Encrypt each data file using, for example, the OpenPGP standard
Create a data package of time-stamped encrypted files
Compute the UNF (Universal Numeric Fingerprint) for the data package, so one can later ensure that it has not been altered
Applications:

Experimental data security for patent application - e.g. pharmaceuticals
Secure storage of financial data - many commercial companies
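The last step of the security app, fingerprinting a package so that later alteration can be detected, can be sketched as follows. Note the substitution: this uses a plain SHA-256 digest over file names and contents as a stand-in, not the UNF algorithm, and it does not cover the time-stamping or OpenPGP encryption steps. The function names and the sample file names are hypothetical.

```python
import hashlib
import hmac


def package_fingerprint(files: dict) -> str:
    """Return a SHA-256 fingerprint of a package.

    `files` maps each relative path in the package to its content as bytes.
    Paths are processed in sorted order so the result does not depend on
    the order in which the files were added.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        content = files[name]
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")                           # separate name from length
        digest.update(len(content).to_bytes(8, "big"))  # unambiguous boundaries
        digest.update(content)
    return digest.hexdigest()


def is_unaltered(files: dict, recorded_fingerprint: str) -> bool:
    """Check a package against the fingerprint recorded when it was created."""
    return hmac.compare_digest(package_fingerprint(files), recorded_fingerprint)


# Hypothetical package of already-encrypted files, for illustration only.
ORIGINAL = {
    "assay-01.csv.gpg": b"encrypted bytes 1",
    "assay-02.csv.gpg": b"encrypted bytes 2",
}
RECORDED = package_fingerprint(ORIGINAL)
```

Storing the recorded fingerprint somewhere the package holder cannot change it is what makes the later check meaningful; the sketch leaves that part out.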
Raspberry Pi computer
Designed by David Braben of the Raspberry Pi Foundation in Cambridge
First released on 29 February 2012
Size of a credit card, and cost ~£25 for a configured system
Intended to stimulate the teaching of basic computer science in schools
Raspberry Pi computer schematic
Ethernet port, two USB ports, HDMI monitor socket
700 MHz ARM processor running Linux
Programmable in Python, C, BBC Basic
256 MB RAM (about eight thousand times the capacity of the BBC Micro Model B)
Storage on SD card (16 GB card costs about £10)
Samba file sharing permits connection to external drives
Pi Store (aka DataStage) for classroom data integration
Pi Store
One Pi Store for each class

A cloud-based data integration solution
Each pupil has a private directory to store stuff
Accessible from school or from home

The teacher has access to all pupils' folders, for example to permit marking homework
DataStage folders
Typically a researcher will use their private folder for daily work
The research group leader can read files in that folder
Files placed in the Shared folder can also be read by other group members, and those placed in the Collaborative folder can be written and read by all
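The folder rules above can be written down as a small access check. This is a sketch of the rules as stated on the slide, not the DataStage access-control code. It assumes that "all" means all members of the research group, that an owner can always read and write their own files, and that people outside the group have no access; the names Area and is_allowed are hypothetical.

```python
from enum import Enum


class Area(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    COLLABORATIVE = "collaborative"


def is_allowed(area: Area, action: str, user: str, owner: str,
               leader: str, members: set) -> bool:
    """Decide whether `user` may "read" or "write" a file owned by `owner`."""
    if action not in ("read", "write"):
        raise ValueError(f"unknown action: {action!r}")
    if user not in members:
        return False  # assumption: no access from outside the group
    if user == owner:
        return True   # assumption: owners have full access to their own files
    if action == "read":
        # Private files are readable by the group leader only;
        # Shared and Collaborative files are readable by every member.
        return area is not Area.PRIVATE or user == leader
    # Only the Collaborative folder is writable by other group members.
    return area is Area.COLLABORATIVE


# Hypothetical research group, for illustration only.
GROUP = {"alice", "bob", "lee"}
LEADER = "lee"
```

For example, with alice as the file owner, lee (the leader) can read but not write alice's private files, while bob can read her shared files and write only to her collaborative ones.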
DataStage metadata are limited
Intentionally, DataStage metadata are limited to author, title, identifier, date and description
This is to encourage researchers to submit datasets to their repository, bearing in mind Graham's concept of curation by addition
Additional rich metadata can be included in a separate metadata file as part of the entire data package, in XML or RDF format
DataBank can recognize such a file and index the metadata, extracting elements for inclusion in the RDF manifest
Separately from the DataFlow Project, we have been developing a minimal metadata information model for describing a research investigation and the various research outputs (papers, datasets, protocols, workflows, etc.) that may result from the investigation
Tanya Gray has encoded this as an XML model, and can dynamically create from that model a Web form in which to enter such metadata
Such rich metadata can form part of a DataStage data package
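The five minimal metadata fields could be serialised to RDF along the following lines. This is a sketch only: the slides do not say which vocabulary the DataStage manifest uses, so the mapping to Dublin Core terms, the Turtle serialisation, the function name to_turtle and the sample values and URI are all assumptions made for illustration.

```python
DCTERMS = "http://purl.org/dc/terms/"

# Assumed mapping from the five DataStage fields to Dublin Core terms.
FIELD_TO_TERM = {
    "author": "creator",
    "title": "title",
    "identifier": "identifier",
    "date": "date",
    "description": "description",
}


def escape_literal(value: str) -> str:
    """Escape a string for use inside a double-quoted Turtle literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def to_turtle(subject_uri: str, metadata: dict) -> str:
    """Serialise the minimal metadata as a Turtle document about subject_uri."""
    missing = [field for field in FIELD_TO_TERM if field not in metadata]
    if missing:
        raise ValueError(f"missing metadata fields: {', '.join(missing)}")
    statements = [
        f'    dcterms:{term} "{escape_literal(metadata[field])}"'
        for field, term in FIELD_TO_TERM.items()
    ]
    return (
        f"@prefix dcterms: <{DCTERMS}> .\n\n"
        f"<{subject_uri}>\n"
        + " ;\n".join(statements)
        + " .\n"
    )


# Hypothetical dataset description, for illustration only.
EXAMPLE = {
    "author": "A. Researcher",
    "title": 'Silk "tensile" tests',
    "identifier": "dataset-001",
    "date": "2012-03-02",
    "description": "Raw tensile measurements",
}
```

Printing to_turtle("http://example.org/datasets/dataset-001", EXAMPLE) gives a prefix declaration followed by one subject with five property statements. Richer metadata would go in the separate metadata file mentioned above rather than in this manifest.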
MIIDI data model - Minimal information for an Infectious Disease Investigation
The MIIDI input form for Research Investigation information
The MIIDI input form for Journal Article information
MIIRO data model - Minimal information for Investigations and Research Outputs
