2 March 2012
Introduction to DataStage
David Shotton (PI, JISC UMF DataFlow Project)
After carefully reviewing several data management systems last December, including Fedora Commons, DataVerse and DSpace, they concluded:
On paper, DataFlow is a winner: it meets (almost) all our requirements, especially because of DataStage, something other platforms don't offer. DataStage would be particularly appreciated, because it would make the integration of the system in the research workflow much less disruptive. Sadly, the availability of DataFlow software will come too late to be useful for our short project (October 2011 to March 2012).
Well, now the DataFlow software systems, DataStage and DataBank, are available, and we hope they will meet the needs of many of you here.
With twenty new papers each week, a researcher can never catch up: there is just too much new scientific information being produced now
Have to run to stand still - no time for fringe activities like data curation
negligible incentives and academic reward in terms of peer esteem, tenure or promotion for data publication activities
metadata concepts are foreign to most biomedical researchers
large amount of effort involved in preparing data for publication
[From evidence submitted 5 August 2011 to the Royal Society's Science as a Public Enterprise policy study]
accommodates the data management tools with which you are already familiar (e.g. spreadsheets)
provides services that are of immediate benefit in your day-to-day activities (e.g. shared file access)
makes data management, data publication and data archiving activities sufficiently lightweight, intuitive and transparent that they are easily achieved, without imposing a significant cognitive overhead
By achieving this, we can bridge the gap between laboratory and repository
Researchers can save files to a secure private DataStage file store
This is purely for their own benefit
Just a file store: does not pose a cognitive overhead (sheer curation)
Requires no software installation on the researchers' computers
Managing data using a two-tier infrastructure
Spanning the tiers: DataStage to DataBank
The special Web submission interface permits researchers to select and package data files for publication and long-term repository archiving
Easy to do
When the researcher is ready
Minimal metadata requirement, to encourage usage
The selected files are put in a special directory, with optional sub-directories
useful for large data files that already have stable storage locations
Packaging uses the BagIt file packaging specification from the California Digital Library (https://wiki.ucop.edu/display/Curation/BagIt)
The resulting files are then zipped into a single object for transmission to DataBank, the institutional data repository
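The packaging step above can be sketched roughly as follows. This is a minimal illustration of the BagIt layout (a `data/` payload directory, a `bagit.txt` declaration and a checksum manifest) followed by zipping, and is not DataStage's own code. The function names `make_bag` and `zip_bag`, the choice of SHA-256 for the manifest, and the declared version 0.97 are assumptions made for the example.

```python
import hashlib
import shutil
import zipfile
from pathlib import Path


def make_bag(source_files, bag_dir):
    """Copy files into bag_dir/data and write the BagIt tag files.

    Hypothetical helper: the name and signature are illustrative only.
    """
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    manifest_lines = []
    for src in map(Path, source_files):
        dest = data_dir / src.name
        shutil.copyfile(src, dest)
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        # Manifest format: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{src.name}")

    # Bag declaration; the version number is an assumption for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def zip_bag(bag_dir, zip_path):
    """Zip the whole bag directory into a single file for transmission."""
    bag_dir = Path(bag_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(bag_dir.rglob("*")):
            if path.is_file():
                # Store entries under the bag's own directory name.
                zf.write(path, path.relative_to(bag_dir.parent).as_posix())
    return Path(zip_path)
```

The receiving repository can recompute each checksum after unzipping and compare it with the manifest to confirm that nothing was damaged in transit.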
DataBank is a scalable data repository designed for institutional deployment
Developed by the Bodleian Library, with a track record in preservation
Cloud-deployable
Easy for the researcher to update a dataset with a revised version if required
Data packages normally published under a CCZero Open Data Waiver
Confidential data packages can be kept in a separate dark repository
Data packages assigned DOIs, making them citable (for academic credit)
Optional user-defined embargo period to permit journal article publication
Upon receipt of a DataStage data package, DataBank:
unzips the data package to give access to the files
mints a DOI for the data package, and registers it with DataCite
displays the RDF manifest metadata, and enriches it (e.g. with the DOI)
indexes the metadata, and provides a search and browse interface
DataBank is, in actuality, just an interface layer over a generic object store, as Neil will explain later this morning.
[Figure: Researchers; DataStage file system; zipped BagIt data package with RDF metadata manifest; DataBank repository; researchers and other users]
as a Virtual Machine to run in a VMWare environment
as a Debian Package to install on the Ubuntu operating system
writing documentation to describe the installation and functionality
can be run locally or on a cloud installation
easy and customizable (e.g. your name & logo)
Beta releases v0.1 of these DataStage and DataBank services are now available
enable research groups and institutions to provide their members with zero-cost data management solutions (apart from hosting costs)
cloud provision can expand and shrink with requirements
no need to build and staff your own local data centre
Acknowledgements
DataStage data packages can be submitted to any SWORD-compliant repository (e.g. the Dryad Data Repository, www.datadryad.org)
DataBank will be able to ingest data packages from any SWORD client
DataBank, as well as DataStage, will by then have Debian packaging for ease of deployment onto Ubuntu Linux hosts
Re-inclusion of WebDAV, to permit users to read and write via Web access
for both VMWare virtual machine and Debian package installation
including the Eduserv academic cloud
User interface improvement and additional functionality on the basis of existing plans and user feedback
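To give a feel for what a SWORD deposit of a zipped data package involves, here is a minimal sketch that only builds the HTTP headers for such a deposit. It assumes the header names and the SimpleZip packaging identifier of the SWORD 2.0 profile. The function name is hypothetical, and the sketch does not describe how DataStage itself performs the submission.

```python
def sword_deposit_headers(filename, in_progress=False):
    """Return HTTP headers for POSTing a zipped package to a SWORD collection.

    Hypothetical helper. Header names assume the SWORD 2.0 profile.
    """
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        # Tells the repository how the payload is packaged.
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        # "true" leaves the deposit open for further additions.
        "In-Progress": "true" if in_progress else "false",
    }
```

The zipped package itself would be sent as the request body of an authenticated POST to the collection URL advertised by the repository's service document.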
[Figure: the research data lifecycle. Labels: Hypothesis formulation and project design; Research plan; Experimentation and data creation; Raw data in research notebooks and live PC files; DataStage filestore (private yet sharable; management); DataBank repository (archived datasets; preservation); Research results and conclusions; Publication activities; Scholarly publications: conference papers and journal articles; Institutional repositories]
Uses standard software components found on every client machine
Cross-platform: Windows, Mac or Linux
For example, for data packaging and SWORD repository submission
Other Web apps possible . . .
Applications in commerce
Applications in education
[Figure: DataStage kernel; Data Packaging; Data Security; SWORD deposit protocol; DataBank or other SWORD repository]
Time-stamp each data file using irrevocable method
Encrypt each data file using, for example, the OpenPGP standard
Create a data package of time-stamped encrypted files
Compute the UNF (Universal Numeric Fingerprint) for the data package, so one can later ensure that it has not been altered
Applications:
Experimental data security for patent application, e.g. pharmaceuticals
Secure storage of financial data - many commercial companies
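The idea behind the fingerprinting step, recording a digest now so that alteration can be detected later, can be illustrated with a simple sketch. Note that this uses a plain SHA-256 digest of the package file as a stand-in: it is not the UNF algorithm, which normalises data values before hashing. The function names are hypothetical.

```python
import hashlib


def package_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Stand-in for a fingerprint; this is NOT the UNF algorithm.
    """
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_unaltered(path, recorded_digest):
    """Compare the file's current digest with the one recorded earlier."""
    return package_digest(path) == recorded_digest
```

The recorded digest would need to be stored separately from the package, for example alongside the time-stamp, so that it cannot be changed together with the data.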
Raspberry Pi computer
Designed by David Braben of the Raspberry Pi Foundation in Cambridge
First released on 29 February 2012
Size of a credit card, and cost ~£25 for a configured system
Intended to stimulate the teaching of basic computer science in schools
Ethernet port, two USB ports, HDMI monitor socket
700 MHz ARM processor running Linux
Programmable in Python, C, BBC Basic
256 MB RAM (about eight thousand times the 32 KB capacity of the BBC Micro Model B)
Storage on SD card (16 GB card costs about £10)
Samba file sharing permits connection to external drives
Pi Store
A cloud-based data integration solution
Each pupil has a private directory to store stuff
DataStage folders
Typically a researcher will use his private folder for daily work
Files placed in the Shared folder can also be read by other group members, and those placed in the Collaborative folder can be written and read by all
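The access rules for the three folders can be summarised in a small sketch. It assumes that the private folder is accessible only to its owner and that "all" means all members of the research group. The names `Folder`, `can_read` and `can_write` are hypothetical and do not come from the DataStage code.

```python
from enum import Enum


class Folder(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    COLLABORATIVE = "collaborative"


def can_read(folder, is_owner, is_group_member):
    """Owner can always read; group members can read Shared and Collaborative."""
    if is_owner:
        return True
    return is_group_member and folder in (Folder.SHARED, Folder.COLLABORATIVE)


def can_write(folder, is_owner, is_group_member):
    """Owner can always write; group members can write only to Collaborative."""
    if is_owner:
        return True
    return is_group_member and folder is Folder.COLLABORATIVE
```

For example, a colleague in the same group can open a file in the owner's Shared folder but cannot change it, whereas in the Collaborative folder both operations are allowed.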
Intentionally, DataStage metadata are limited to author, title, identifier, date and description.
This is to encourage researchers to submit datasets to their repository, bearing in mind Graham's concept of curation by addition.
Additional rich metadata can be included in a separate metadata file as part of the entire data package, in XML or RDF format.
DataBank can recognize such a file and index the metadata, extracting elements for inclusion in the RDF manifest.
Separately from the DataFlow Project, we have been developing a minimal metadata information model for describing a research investigation and the various research outputs (papers, datasets, protocols, workflows, etc.) that may result from the investigation.
Tanya Gray has encoded this as an XML model, and can dynamically create from that model a Web form in which to enter such metadata.
Such rich metadata can form part of a DataStage data package.
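As an illustration of how small the five-field minimal metadata record is, the sketch below serialises author, title, identifier, date and description as RDF/XML. The use of Dublin Core element names is an assumption made for this example, and the actual vocabulary and layout of the DataStage manifest may differ. The function name `build_manifest` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # assumed vocabulary


def build_manifest(about, author, title, identifier, date, description):
    """Return an RDF/XML string describing one data package.

    Hypothetical helper; the field-to-element mapping is an assumption.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about}
    )
    fields = (
        ("creator", author),
        ("title", title),
        ("identifier", identifier),
        ("date", date),
        ("description", description),
    )
    for tag, value in fields:
        ET.SubElement(desc, f"{{{DC_NS}}}{tag}").text = value
    return ET.tostring(root, encoding="unicode")
```

A repository could later enrich such a record, for instance by adding the minted DOI as a further identifier, without the researcher having to supply anything beyond the five fields.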
MIIRO data model - Minimal information for Investigations and Research Outputs