Someone said: “We are drowning in data, but starving for information”. This is particularly true for scientific data. The same happens with business data, but business has had more time to learn: they implemented data architectures, created data warehouses, and used data mining to extract information from their data. So why not study and implement something similar for scientific data? The solution can be to set up a Scientific Data Management architecture.
Scientists normally limit the meaning of Data Management to the mere physical data storage and access layer. But the scope of Scientific Data Management is much broader: it is about meaning and content.
Below I list common problems and opportunities in scientific data access. Then I collect what are considered the parts of a Data Management solution. A list of references and examples of data access and scientific data collections follows.
The paper ends with more implementation-oriented issues: a survey of some scientific data formats, a plan for a possible implementation, and a survey of the available supporting technologies.
Most of the notes and information in this paper have been collected and studied for one specific project. But the ideas collected are generally applicable to the kinds of scientific projects that use the CSCS computational and visualization services.
Note: some links are no longer valid. I try my best to fix them, but do not always succeed.
Typical problems found in current scientific projects include, for example:
But the growing size of scientific data collections brings not only problems, but also many opportunities. One of the biggest opportunities is the possibility of reusing existing data for new studies. One example is provided by the various Virtual Observatory initiatives in Europe and the USA. The idea is summarized below:
Another virtuous effect can be called "discovery by browsing". If the data is well described and the data access method is quite flexible, the user can establish unexpected correlations between data items thus facilitating serendipitous discoveries.
Last, but not least, remember that the data consists not only of bytes, but also of workflow definitions, computation parameters, environment setup and so on.
An important note: the biopharmaceutical industry attaches a more specific meaning to Scientific Data Management. They too have huge data sets to manage, but they must also comply with industry regulations and rigidly enforce intellectual property protection. The second point matters in every field of science, but it is not as vital as in industry. This paper does not touch those specific problems.
The paper Data Management Systems for Scientific Applications is a good survey of topics that should be covered by any Scientific Data Management system. Here I collected a quick list of the most important ones:
A number of Scientific Data Management surveys and links to research groups are available online. Here are some links I have found useful:
This report contains various examples of scientific data management and access issues and solutions. The most important points are: importance of metadata, data quality assurance, time span of data validity, web browser as a broker to access data, persistent user interface interactions, results by e-mail or bulletin board. Another point stresses the importance of discovery by query/browsing the archive.
Terascale computing and large scientific experiments produce enormous quantities of data that require effective and efficient management. The task of managing scientific data is so overwhelming that scientists spend much of their time managing the data by developing special purpose solutions, rather than using their time effectively for scientific investigation and discovery.
The goal of this center is to establish an Enabling Technology Center that will provide a coordinated framework for the unification, development, deployment, and reuse of scientific data management software.
Contains interesting references to metadata harvesting and more general information:
Workshop held by the UK e-Science Centre (October 2002). It is a general overview with some recommendations (like XML usage). Other workshop materials are available. The various presentations and case studies on Data Mining and Visualization are interesting.
In the context of the European Astrophysical Virtual Observatory, this presentation surveys the most important problems faced by a big archive of scientific data and the role of the Grid, and demonstrates some of the potential benefits and discoveries made possible by a good data management system.
Here are some public examples of access to data collections. Almost all use a web browser as the user interface. This choice has various interesting features, such as insulation from the underlying archive technology and a well-known interaction paradigm.
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
Full text search plus similarity, scores, etc.
fMRIDC is a public repository of peer-reviewed fMRI studies and their underlying data.
Full text search on articles.
Gene expression database in the brain of the fruit fly Drosophila.
Small database. Uses thumbnails as a guide to the correct dataset. The results are images or movies. The query is by hierarchy only.
Protein Data Bank is the single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.
It offers full text search. It provides various options for displaying results: an on-line structure browser, images, VRML models. The images and structures seem to be created on the fly.
Extensive search page for the Astronomical Digital Image Library. The search style is quite classical with a limited drill-down capability on the search result.
Contains an explanation of various web interfaces to scientific data. Among others, it points to OMNIweb (below).
Near-Earth Heliosphere Data. From this interface the user can produce scatter plots and regression fits besides data listings.
This archive contains various astronomical databases with a web interface. The data are mainly FITS formatted images and the metadata are those provided by the FITS header.
The SDSS SkyServer - Public Access to the Sloan Digital Sky Server Data
Published as MSR-TR-2001-104 by Jim Gray, Alexander Szalay, Ani Thakar, Peter Z. Kunszt, Tanu Malik, Jordan Raddick, Christopher Stoughton and Jan vandenBerg, November 2001.
The SkyServer web interface permits drilldown from some known stellar objects. The user interface can also be a Java applet, but the main focus is on the big database behind the scenes.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides an application-independent interoperability framework based on metadata harvesting. It defines the connection between Data Providers that expose metadata and Service Providers that use metadata harvested via the OAI-PMH as a basis for building value-added services.
This is oriented to Libraries, but the idea can be extended to ordinary scientific data archives. It is another method to access data collection metadata through programmatic connections.
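As a sketch of what such a programmatic connection looks like, the snippet below builds an OAI-PMH ListRecords request URL in Python. The repository endpoint is invented; the `verb` and `metadataPrefix` parameters come from the OAI-PMH specification itself.

```python
# Build an OAI-PMH harvesting request URL (sketch; endpoint is hypothetical).
from urllib.parse import urlencode

def build_listrecords_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request; oai_dc is the mandatory Dublin Core format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

url = build_listrecords_url("http://archive.example.org/oai")
# url == "http://archive.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"
```

A Service Provider would fetch this URL periodically and parse the returned XML records; the sketch stops at request construction to stay self-contained.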
What do these examples have in common?
To understand the problems related to Scientific Data Management you can look at a related example from the field of Digital Libraries.
They had to manage a lot of small data sets, tracking information related to them, and recording relationships and related knowledge.
To support this effort they devised a standard called Metadata Encoding & Transmission Standard (METS). The METS pages contain presentations about the current problems and proposed solutions. METS has been one of the inspiring projects for my Scientific Data Bag library.
More complete examples of Scientific Data Management systems exist. Here is a small selection:
"SimTracker - using the web to track computer simulation results". Published for the 1999 International Conference on Web-Based Modeling and Simulation, San Francisco, CA. Proceedings available as Simulation Series Vol. 31, Num. 3, from The Society for Computer Simulation.
Ideas: "non-intrusiveness", i.e. only a small change to the usual simulation run method is needed to add metadata collection. It is composed of three parts: a metadata extractor and computation results summarizer, a metadata store, and a dynamic web page generator. SimTracker manages the workflow of a simulation run. There are provisions for adding manual annotations, and templates for the resulting web pages, for the filenames to be tracked, and so on keep it generic.
The Scientific Annotation Middleware system will provide the significant advances in research documentation and data pedigree tracking required for effective management and coordination of complex, collaborative, cross-disciplinary, compute-intensive research.
ARION is a very complete marine study data management environment. It is based on RDF and covers metadata and workflow management. The users of the data must define an ontology that supports their queries.
This commercial product is an example of what Data Management means for the biopharmaceutical industry, with much more emphasis on regulatory compliance and intellectual property protection issues.
The proposed methodology is based on a fundamental paradigm that the end result (visualization) rendered by a data consumer can, in many cases, be produced using a reduced data set that has been distilled or filtered from the original data set.
This is an important topic because the largest data sets cannot be moved to the visualization system.
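A minimal sketch of this distillation idea in Python, using simple striding as the (assumed) reduction method; a real system might filter, average, or resample instead:

```python
# Distill a reduced data set small enough to move to the visualization
# system, by keeping every stride-th sample (simple striding).
def reduce_for_visualization(samples, stride):
    """Return a strided subset of the samples for remote rendering."""
    return samples[::stride]

full = list(range(1_000_000))       # stand-in for a large result set
reduced = reduce_for_visualization(full, 1000)
# len(reduced) == 1000: a thousand points instead of a million
```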
Thus the components and processes defined by a Scientific Data Management system are:
Metadata are data about data: in other words, they record the meaning of your data. First of all, you must define which data about your data are meaningful. This definition is called the metadata schema.
To be useful a metadata schema must be semantically rich and the collected metadata must be quality assured.
It contains a nice example of discovering the importance of metadata.
Metadata standards must be implementation neutral, with explicitly limited extensibility.
Provides an extensive set of links.
The aim of the Reggie Metadata Editor is to enable the easy creation of various forms of metadata with the one flexible program.
The metadata are collected either manually or automatically during the data ingestion phase. One of the automatic methods is the so-called metadata harvester: a program that, like a WWW spider or robot, collects the requested metadata in the background.
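A metadata harvester of this kind can be sketched in a few lines of Python. The fields collected below are invented for illustration; a real harvester would also parse format headers (a FITS or HDF5 header, for example).

```python
# Sketch of a background metadata harvester: walk an archive directory
# tree and collect simple file-level metadata into records.
import os
import time

def harvest(root):
    """Collect path, size, and modification time for every file under root."""
    records = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append({
                "path": path,
                "size_bytes": info.st_size,
                "modified": time.strftime("%Y-%m-%dT%H:%M:%S",
                                          time.gmtime(info.st_mtime)),
            })
    return records
```

The records could then be loaded into a metadata store or exposed through an OAI-PMH-style interface.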
If the metadata are clearly defined a user interface can be automatically extracted from their definitions. An example is the formSIX system.
There is a lot of discussion around the Semantic Web, RDF (the Resource Description Framework), ontologies and so on. But I think that, for now, this is too much for the scientific field. The risk is making the data management implementation so complex that no one will use the system.
It is far better to start by introducing a controlled vocabulary and defining and collecting metadata for your data. A data dictionary written as DTD or XSchema definitions is the absolute minimum required.
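As a sketch of what such a controlled-vocabulary check could look like (a hedged, minimal stand-in for a DTD or XSchema data dictionary), here is a small Python validator; the field names and allowed values are invented:

```python
# Minimal controlled-vocabulary check for metadata records.
# Field names and allowed values are hypothetical examples.
REQUIRED_FIELDS = {"experiment", "instrument", "units"}
ALLOWED_UNITS = {"K", "Pa", "m/s"}   # the controlled vocabulary

def validate_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "units" in record and record["units"] not in ALLOWED_UNITS:
        problems.append(f"unknown unit: {record['units']}")
    return problems
```

Even this small amount of quality assurance at ingestion time prevents the archive from silently accumulating records that no query will ever find.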
Together with the meaning of the data, the method used to compute them must be stored: this is called the workflow. The workflow is also needed, for example, to regenerate derived data instead of storing them.
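A sketch of workflow recording in Python: the command, its parameters, and the environment are written to a JSON side file next to the data, so derived results can be regenerated later. The file layout and field names are my assumptions, not any standard.

```python
# Record the workflow (command, parameters, environment) as a JSON
# side file next to the output data, so derived data can be recomputed.
import json
import platform
import sys
import time

def record_workflow(output_file, command, parameters):
    """Write <output_file>.workflow.json describing how the data was made."""
    sidecar = output_file + ".workflow.json"
    record = {
        "command": command,
        "parameters": parameters,
        "python": sys.version.split()[0],
        "host": platform.node(),
        "run_at": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()),
    }
    with open(sidecar, "w") as f:
        json.dump(record, f, indent=2)
    return sidecar
```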
Data Management issues are obviously independent of the kind and format of the data managed. But some data formats are better suited to supporting metadata and more helpful in reaching the various data management goals.
The most suited formats are the ones based on XML. But they are so new and defined for specific applications that it is better for now to analyze more traditional formats like HDF5 and CGNS.
Nice overview of XML, XML based formats (XSIL, XDMF, CML) and HDF5.
An example of scientific data described using XML.
The Extensible Scientific Interchange Language (XSIL) is a flexible, hierarchical, extensible, transport language for scientific data objects.
The Binary Format Description (BFD) Language is an XML dialect based on XSIL that supports the executable documentation of 'arbitrary' binary and ASCII data sets. Applying a BFD template to a set of files produces an XML output containing the original data in an XML-tagged format that can be interpreted by other programs or subjected to further processing (i.e. using XSLT).
XDF is a common scientific data format based on XML and general mathematical principles that can be used throughout the scientific disciplines. It includes these key features: hierarchical data structures, any dimensional arrays merged with coordinate information, high dimensional tables merged with field information, variable resolution, easy wrapping of existing data, user specified coordinate systems, searchable ASCII meta-data, and extensibility to new features/data formats.
One of the most interesting usages of XDF is the XML-ization of FITS (Flexible Image Transport System) that is used primarily in astronomy.
The Interdisciplinary Computing Environment (ICE) is an effort to provide a common software platform for Scientific Codes in a heterogeneous High Performance Computing environment. This platform includes the eXtensible Data Model and Format (XDMF), a common data hub where HPC codes and tools can efficiently exchange data values and meaning. Another good explanation of the XDMF format is in the paper: " The eXtensible Data Model and Format for Interdisciplinary Computing".
XML provides an essential mechanism for transferring data between services in an application and platform neutral format. However it is not well suited to large datasets with repetitive structures, such as large arrays or tables. Furthermore, many legacy systems and valuable data sets exist that do not use the XML format. The aim of this working group is to define an XML-based language, the Data Format Description Language (DFDL), for describing the structure of binary and character encoded (ASCII/Unicode) files and data streams so that their format, structure, and metadata can be exposed. This effort specifically does not aim to create a generic data representation language. Rather, DFDL endeavors to describe existing formats in an actionable manner that makes the data in its current format accessible through generic mechanisms.
The DFDL description would sit in a (logically) separate file from the data itself. The description would provide a hierarchical description that would structure and semantically label the underlying bits. It would capture: how bits are to be interpreted as parts of low-level data types (ints, floats, strings) how low-level types are assembled into scientifically relevant forms such as arrays how meaning is assigned to these forms through association with variable names and metadata such as units how arrays and the overall structure of the binary file are parameterized based on array dimensions, flags specifying optional file components, etc. Further, if the data file contains highly repetitive structures, such as large arrays or tables, such a description can be very concise.
DFDL is a Global Grid Forum standard activity that builds on BinX and other work to provide a general and extensible platform for describing data formats. The idea is to add a DFDL-formatted file that describes a given legacy file. An application can then read and interpret the legacy file without a format-specific reader; only the DFDL reader is needed.
BinX is used to describe the content, structure and physical layout (endianness, blocksize…) of binary files. It will be a reference implementation for DFDL.
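The same describe-then-interpret idea can be sketched in Python with the standard `struct` module. The record layout below is invented, and BinX/DFDL of course use XML rather than a Python dict for the description, but the principle is the same: a generic reader driven by a separate, declarative layout.

```python
# A declarative description of a binary record layout, interpreted by
# a generic reader (miniature stand-in for the BinX/DFDL approach).
import struct

LAYOUT = {
    "byte_order": "<",                        # little-endian
    "fields": [("id", "i"), ("value", "d")],  # int32, float64
}

def read_record(data, layout):
    """Decode one binary record according to the declarative layout."""
    fmt = layout["byte_order"] + "".join(c for _n, c in layout["fields"])
    values = struct.unpack(fmt, data)
    return dict(zip([n for n, _c in layout["fields"]], values))

blob = struct.pack("<id", 7, 3.5)
# read_record(blob, LAYOUT) -> {"id": 7, "value": 3.5}
```

Changing the layout description is enough to read a differently structured file; no new reader code is needed.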
HDF5 (Hierarchical Data Format) is a general purpose library and file format for storing scientific data. HDF5 can store two primary objects: datasets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects in an HDF5 file. Using these two basic objects, one can create and store almost any kind of scientific data structure, such as images, arrays of vectors, and structured and unstructured grids. You can also mix and match them in HDF5 files according to your needs.
Contains suggestions on data definition (like standard time formats) to enhance the longevity of the file content.
Contains a definition of what an information model is with respect to the physical storage format, and examples of metadata organization and definition.
Issues about storing Earth Observing System (EOS) satellite data with HDF format files.
Covers the data chunking and spatial index issues in HDF files.
The CFD General Notation System (CGNS) consists of a collection of conventions, and software implementing those conventions, for the storage and retrieval of CFD (computational fluid dynamics) data.
The CGNS system is designed to facilitate the exchange of data between sites and applications, and to help stabilize the archiving of aerodynamic data. The data are stored in a compact, binary format and are accessible through a complete and extensible library of functions. The API (Application Program Interface) is platform independent and can be easily implemented in C, C++, Fortran and Fortran90 applications.
Precisely defines the "intellectual content" of CFD-related data, including the organizational structure supporting the data and the conventions adopted to standardize the data exchange process.
Python CGNS access library. It defines also a DTD to validate a CGNS file structure described as XML.
From my limited vantage point, I think the best strategy for implementing a Scientific Data Management solution is an evolutionary one. It is better to start with a small solution for a specific project, gain experience, then implement the next step, and so on.
In this process it is important to push for a generally applicable solution, so the effort can be readily reused and extended to other projects.
To be successful, a Data Management solution should not lose contact with the users' real problems.
|XML||Tree structured data||There are a lot of XML tutorials; some I found interesting: IBM DeveloperWorks. Extensive lists of tools: XML tools by category and XMLSITES.|
|DTD and XSchema||Dictionary||Generate DTD and XSchema from XML file|
|RDF||Resource graphs||RDF Primer and a practical application: Describing and retrieving photos using RDF and HTTP|
|GUI autogeneration||e.g. formSIX||FormSIX creates data entry forms from DTD files.|
|XSLT||Format transformation||XSL Reference and a XSLT processor: XT|
|XML databases||XML native||Xindice|
|RSS||Data deployment||RSS 2.0 reference|
|PHP||Server side dynamic page creation||PHP XML parser functions|
|Java||Client side applets||Java home|
This document forms the basis of discussion for the introduction of Scientific Data Management methods and tools in a specific project. The document will be updated as experience is gained in this project. The goal is to specify and implement something useful that can support and help other CSCS projects.
To implement a Scientific Data Management solution in a specific project the steps to be followed are: