Someone said: “We are drowning in data, but starving for information”. This is particularly true for scientific data. The same happens with business data, but business has had more time to learn: they implemented data architectures, created data warehouses, and used data mining to extract information from their data. So why not study and implement something similar for scientific data? The solution can be to set up a Scientific Data Management architecture.
Scientists normally limit the meaning of Data Management to the mere physical data storage and access layer. But the scope of Scientific Data Management is much broader: it is about meaning and content.
Below I list common problems and opportunities in scientific data access. Then I collect what are considered the parts of a Data Management solution. A list of references and examples of data access and scientific data collections follows.
The paper ends with more implementation-oriented issues: a survey of some scientific data formats, a plan for a possible implementation, and a survey of the available supporting technologies.
Most of the notes and information in this paper have been collected and studied for one specific project. But the ideas collected are generally applicable to the kinds of scientific projects that use the CSCS computational and visualization services.
Note: some links are no longer valid. I try my best to fix them, but do not always succeed.
Typical problems found in current scientific projects include, for example:
But the growing size of scientific data collections brings not only problems, but also many opportunities. One of the biggest opportunities is the possibility of reusing existing data for new studies. One example is provided by the various Virtual Observatory initiatives in Europe and the USA. The idea is summarized below:
Another virtuous effect can be called "discovery by browsing". If the data is well described and the data access method is quite flexible, the user can establish unexpected correlations between data items thus facilitating serendipitous discoveries.
Last, but not least, remember that the data consists not only of bytes, but also of workflow definitions, computation parameters, environment setup and so on.
An important note: the biopharmaceutical industry attaches a more specific meaning to Scientific Data Management. They too have huge data sets to manage, but they must also comply with industry regulations and rigidly enforce intellectual property protection. The second point matters in every field of science, but it is not as vital as in industry. This paper does not touch those specific problems.
The paper Data Management Systems for Scientific Applications is a good survey of topics that should be covered by any Scientific Data Management system. Here I collected a quick list of the most important ones:
A number of Scientific Data Management surveys and links to research groups are available online. Here are some links I have found useful:
This report contains various examples of scientific data management and access issues and solutions. The most important points are: importance of metadata, data quality assurance, time span of data validity, web browser as a broker to access data, persistent user interface interactions, results by e-mail or bulletin board. Another point stresses the importance of discovery by query/browsing the archive.
Terascale computing and large scientific experiments produce enormous quantities of data that require effective and efficient management. The task of managing scientific data is so overwhelming that scientists spend much of their time managing the data by developing special purpose solutions, rather than using their time effectively for scientific investigation and discovery.
The goal of this center is to establish an Enabling Technology Center that will provide a coordinated framework for the unification, development, deployment, and reuse of scientific data management software.
Contains interesting references to metadata harvesting and more general information:
Workshop held by the UK e-Science Centre (October 2002). It is a general overview with some recommendations (like XML usage). Other workshop materials are available. The various presentations and case studies on Data Mining and Visualization are interesting.
In the context of the European Astrophysical Virtual Observatory, this presentation surveys the most important problems faced by a big archive of scientific data and the role of the Grid, and demonstrates some of the potential benefits and discoveries made possible by a good data management system.
Here are some public examples of access to data collections. Almost all use a web browser as the user interface. This choice has various interesting features, such as insulation from the underlying archive technology and a well-known interaction paradigm.
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
Full text search plus similarity, scores, etc.
fMRIDC is a public repository of peer-reviewed fMRI studies and their underlying data.
Full text search on articles.
Gene expression database in the brain of the fruit fly Drosophila.
Small database. Uses thumbnails as a guide to the correct dataset. The results are images or movies. The query is by hierarchy only.
Protein Data Bank is the single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.
It offers full text search. It provides various options for displaying results: an on-line structure browser, images, VRML models. The images and structures seem to be created on the fly.
Extensive search page for the Astronomical Digital Image Library. The search style is quite classical with a limited drill-down capability on the search result.
Contains an explanation of various web interfaces to scientific data. Among others, it points to OMNIweb (below).
Near-Earth Heliosphere Data. From this interface the user can produce scatter plots and regression fits besides data listings.
This archive contains various astronomical databases with a web interface. The data are mainly FITS formatted images and the metadata are those provided by the FITS header.
The SDSS SkyServer - Public Access to the Sloan Digital Sky Server Data
Published as MSR-TR-2001-104 by Jim Gray, Alexander Szalay, Ani Thakar, Peter Z. Kunszt, Tanu Malik, Jordan Raddick, Christopher Stoughton and Jan vandenBerg, November 2001.
The SkyServer web interface permits drilldown from some known stellar objects. The user interface can also be a Java applet, but the main focus is on the big database behind the scenes.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides an application-independent interoperability framework based on metadata harvesting. It defines the connection between Data Providers that expose metadata and Service Providers that use metadata harvested via the OAI-PMH as a basis for building value-added services.
This is oriented to Libraries, but the idea can be extended to ordinary scientific data archives. It is another method to access data collection metadata through programmatic connections.
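As a sketch of what such a programmatic connection looks like, the snippet below builds an OAI-PMH ListRecords request URL in Python. The repository endpoint is invented; the `verb` and `metadataPrefix` parameters come from the OAI-PMH specification itself.

```python
# Build an OAI-PMH harvesting request URL (sketch; endpoint is hypothetical).
from urllib.parse import urlencode

def build_listrecords_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request; oai_dc is the mandatory Dublin Core format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

url = build_listrecords_url("http://archive.example.org/oai")
# url == "http://archive.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"
```

A Service Provider would fetch this URL periodically and parse the returned XML records; the sketch stops at request construction to stay self-contained.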
What do these examples have in common?
To understand the problems related to Scientific Data Management you can look at a related example from the field of Digital Libraries.
They had to manage a lot of small data sets, tracking information related to them, and recording relationships and related knowledge.
To support this effort they devised a standard called Metadata Encoding & Transmission Standard (METS). The METS pages contain presentations about the current problems and proposed solutions. METS has been one of the inspiring projects for my Scientific Data Bag library.
More complete examples of Scientific Data Management systems exist. Here is a small selection:
"SimTracker - using the web to track computer simulation results". Published for the 1999 International Conference on Web-Based Modeling and Simulation, San Francisco, CA. Proceedings available as Simulation Series Vol. 31, Num. 3, from The Society for Computer Simulation.
Ideas: "non-intrusiveness", i.e. only a small change to the usual simulation run method is needed to add metadata collection. It is composed of three parts: a metadata extractor and computation results summarizer, a metadata store, and a dynamic web page generator. SimTracker manages the workflow of a simulation run. There are provisions for adding manual annotations, and templates for the resulting web pages, for the filenames to be tracked, and so on keep it generic.
The Scientific Annotation Middleware system will provide the significant advances in research documentation and data pedigree tracking required for effective management and coordination of complex, collaborative, cross-disciplinary, compute-intensive research.
ARION is a very complete marine study data management environment. It is based on RDF and covers metadata and workflow management. The users of the data must define an ontology that supports their queries.
This commercial product is an example of what Data Management means for the biopharmaceutical industry, with much more emphasis on regulatory compliance and intellectual property protection issues.
The proposed methodology is based on a fundamental paradigm that the end result (visualization) rendered by a data consumer can, in many cases, be produced using a reduced data set that has been distilled or filtered from the original data set.
This is an important topic because the largest data sets cannot be moved to the visualization system.
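A minimal sketch of this distillation idea in Python, using simple striding as the (assumed) reduction method; a real system might filter, average, or resample instead:

```python
# Distill a reduced data set small enough to move to the visualization
# system, by keeping every stride-th sample (simple striding).
def reduce_for_visualization(samples, stride):
    """Return a strided subset of the samples for remote rendering."""
    return samples[::stride]

full = list(range(1_000_000))       # stand-in for a large result set
reduced = reduce_for_visualization(full, 1000)
# len(reduced) == 1000: a thousand points instead of a million
```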
Thus the components and processes defined by a Scientific Data Management system are:
Metadata are data about data: in other words, they record the meaning of your data. First of all, you must define which data about your data are meaningful. This definition is called the metadata schema.
To be useful a metadata schema must be semantically rich and the collected metadata must be quality assured.
It contains a nice example of discovering the importance of metadata.
Metadata standards must be implementation neutral, with explicitly limited extensibility.
Provides an extensive set of links.
The aim of the Reggie Metadata Editor is to enable the easy creation of various forms of metadata with the one flexible program.
The metadata are collected either manually or automatically during the data ingestion phase. One of the automatic methods is the so-called metadata harvester: a program that, like a WWW spider or robot, collects the requested metadata in the background.
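A metadata harvester of this kind can be sketched in a few lines of Python. The fields collected below are invented for illustration; a real harvester would also parse format headers (a FITS or HDF5 header, for example).

```python
# Sketch of a background metadata harvester: walk an archive directory
# tree and collect simple file-level metadata into records.
import os
import time

def harvest(root):
    """Collect path, size, and modification time for every file under root."""
    records = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append({
                "path": path,
                "size_bytes": info.st_size,
                "modified": time.strftime("%Y-%m-%dT%H:%M:%S",
                                          time.gmtime(info.st_mtime)),
            })
    return records
```

The records could then be loaded into a metadata store or exposed through an OAI-PMH-style interface.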
If the metadata are clearly defined a user interface can be automatically extracted from their definitions. An example is the formSIX system.
There is a lot of discussion around the Semantic Web, RDF (the Resource Description Framework), ontologies and so on. But I think that, for now, this is too much for the scientific field. The risk is making the data management implementation so complex that no one will use the system.
It is far better to start by introducing a controlled vocabulary and defining and collecting metadata for your data. A data dictionary written as DTD or XSchema definitions is the absolute minimum required.
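As a sketch of what such a controlled-vocabulary check could look like (a hedged, minimal stand-in for a DTD or XSchema data dictionary), here is a small Python validator; the field names and allowed values are invented:

```python
# Minimal controlled-vocabulary check for metadata records.
# Field names and allowed values are hypothetical examples.
REQUIRED_FIELDS = {"experiment", "instrument", "units"}
ALLOWED_UNITS = {"K", "Pa", "m/s"}   # the controlled vocabulary

def validate_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "units" in record and record["units"] not in ALLOWED_UNITS:
        problems.append(f"unknown unit: {record['units']}")
    return problems
```

Even this small amount of quality assurance at ingestion time prevents the archive from silently accumulating records that no query will ever find.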
Together with the meaning of the data, the method used to compute them must be stored: this is called the workflow. The workflow is also needed, for example, to regenerate derived data instead of storing them.
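A sketch of workflow recording in Python: the command, its parameters, and the environment are written to a JSON side file next to the data, so derived results can be regenerated later. The file layout and field names are my assumptions, not any standard.

```python
# Record the workflow (command, parameters, environment) as a JSON
# side file next to the output data, so derived data can be recomputed.
import json
import platform
import sys
import time

def record_workflow(output_file, command, parameters):
    """Write <output_file>.workflow.json describing how the data was made."""
    sidecar = output_file + ".workflow.json"
    record = {
        "command": command,
        "parameters": parameters,
        "python": sys.version.split()[0],
        "host": platform.node(),
        "run_at": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()),
    }
    with open(sidecar, "w") as f:
        json.dump(record, f, indent=2)
    return sidecar
```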
Data Management issues are obviously independent of the kind and format of the data managed. But some data formats are better suited to supporting metadata and more helpful in reaching the various data management goals.
The most suited formats are the ones based on XML. But they are so new and defined for specific applications that it is better for now to analyze more traditional formats like HDF5 and CGNS.
Nice overview of XML, XML based formats (XSIL, XDMF, CML) and HDF5.
An example of scientific data described using XML.
The Extensible Scientific Interchange Language (XSIL) is a flexible, hierarchical, extensible, transport language for scientific data objects.
The Binary Format Description (BFD) Language is an XML dialect based on XSIL that supports the executable documentation of 'arbitrary' binary and ASCII data sets. Applying a BFD template to a set of files produces an XML output containing the original data in an XML-tagged format that can be interpreted by other programs or subjected to further processing (i.e. using XSLT).
XDF is a common scientific data format based on XML and general mathematical principles that can be used throughout the scientific disciplines. It includes these key features: hierarchical data structures, any dimensional arrays merged with coordinate information, high dimensional tables merged with field information, variable resolution, easy wrapping of existing data, user specified coordinate systems, searchable ASCII meta-data, and extensibility to new features/data formats.
One of the most interesting usages of XDF is the XML-ization of FITS (Flexible Image Transport System) that is used primarily in astronomy.
The Interdisciplinary Computing Environment (ICE) is an effort to provide a common software platform for Scientific Codes in a heterogeneous High Performance Computing environment. This platform includes the eXtensible Data Model and Format (XDMF), a common data hub where HPC codes and tools can efficiently exchange data values and meaning. Another good explanation of the XDMF format is in the paper: " The eXtensible Data Model and Format for Interdisciplinary Computing".
XML provides an essential mechanism for transferring data between services in an application and platform neutral format. However it is not well suited to large datasets with repetitive structures, such as large arrays or tables. Furthermore, many legacy systems and valuable data sets exist that do not use the XML format. The aim of this working group is to define an XML-based language, the Data Format Description Language (DFDL), for describing the structure of binary and character encoded (ASCII/Unicode) files and data streams so that their format, structure, and metadata can be exposed. This effort specifically does not aim to create a generic data representation language. Rather, DFDL endeavors to describe existing formats in an actionable manner that makes the data in its current format accessible through generic mechanisms.
The DFDL description would sit in a (logically) separate file from the data itself. The description would provide a hierarchical description that would structure and semantically label the underlying bits. It would capture: how bits are to be interpreted as parts of low-level data types (ints, floats, strings) how low-level types are assembled into scientifically relevant forms such as arrays how meaning is assigned to these forms through association with variable names and metadata such as units how arrays and the overall structure of the binary file are parameterized based on array dimensions, flags specifying optional file components, etc. Further, if the data file contains highly repetitive structures, such as large arrays or tables, such a description can be very concise.
DFDL is a Global Grid Forum standard activity that builds on BinX and other work to provide a general and extensible platform for describing data formats. The idea is to add a DFDL-formatted file that describes a given legacy file. An application can then read and interpret the legacy file without a format-specific reader; only the DFDL reader is needed.
BinX is used to describe the content, structure and physical layout (endianness, blocksize…) of binary files. It will be a reference implementation for DFDL.
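The same describe-then-interpret idea can be sketched in Python with the standard `struct` module. The record layout below is invented, and BinX/DFDL of course use XML rather than a Python dict for the description, but the principle is the same: a generic reader driven by a separate, declarative layout.

```python
# A declarative description of a binary record layout, interpreted by
# a generic reader (miniature stand-in for the BinX/DFDL approach).
import struct

LAYOUT = {
    "byte_order": "<",                        # little-endian
    "fields": [("id", "i"), ("value", "d")],  # int32, float64
}

def read_record(data, layout):
    """Decode one binary record according to the declarative layout."""
    fmt = layout["byte_order"] + "".join(c for _n, c in layout["fields"])
    values = struct.unpack(fmt, data)
    return dict(zip([n for n, _c in layout["fields"]], values))

blob = struct.pack("<id", 7, 3.5)
# read_record(blob, LAYOUT) -> {"id": 7, "value": 3.5}
```

Changing the layout description is enough to read a differently structured file; no new reader code is needed.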
HDF5 (Hierarchical Data Format) is a general purpose library and file format for storing scientific data. HDF5 can store two primary objects: datasets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects in an HDF5 file. Using these two basic objects, one can create and store almost any kind of scientific data structure, such as images, arrays of vectors, and structured and unstructured grids. You can also mix and match them in HDF5 files according to your needs.
Contains suggestions on data definition (like standard time formats) to enhance the longevity of the file content.
Contains a definition of what an information model is with respect to the physical storage format, and examples of metadata organization and definition.
Issues about storing Earth Observing System (EOS) satellite data with HDF format files.
Covers the data chunking and spatial index issues in HDF files.
The CFD General Notation System (CGNS) consists of a collection of conventions, and software implementing those conventions, for the storage and retrieval of CFD (computational fluid dynamics) data.
The CGNS system is designed to facilitate the exchange of data between sites and applications, and to help stabilize the archiving of aerodynamic data. The data are stored in a compact, binary format and are accessible through a complete and extensible library of functions. The API (Application Program Interface) is platform independent and can be easily implemented in C, C++, Fortran and Fortran90 applications.
Precisely defines the "intellectual content" of CFD-related data, including the organizational structure supporting the data and the conventions adopted to standardize the data exchange process.
Python CGNS access library. It defines also a DTD to validate a CGNS file structure described as XML.
From my limited vantage point, I think the best strategy for implementing a Scientific Data Management solution is an evolutionary one. It is better to start with a small solution for a specific project, gain experience, then implement the next step, and so on.
In this process it is important to push for a generally applicable solution, so the effort can be readily reused and extended to other projects.
To be successful, a Data Management solution should not lose contact with the users' real problems.
|XML||Tree structured data||There are a lot of XML tutorials; some I found interesting: IBM DeveloperWorks. Extensive lists of tools: XML tools by category and XMLSITES.|
|DTD and XSchema||Dictionary||Generate DTD and XSchema from XML file|
|RDF||Resource graphs||RDF Primer and a practical application: Describing and retrieving photos using RDF and HTTP|
|GUI autogeneration||e.g. formSIX||FormSIX creates data entry forms from DTD files.|
|XSLT||Format transformation||XSL Reference and a XSLT processor: XT|
|XML databases||XML native||Xindice|
|RSS||Data deployment||RSS 2.0 reference|
|PHP||Server side dynamic page creation||PHP XML parser functions|
|Java||Client side applets||Java home|
This document forms the basis of discussion for the introduction of Scientific Data Management methods and tools in a specific project. The document will be updated as experience is gained in this project. The goal is to specify and implement something useful that can support and help other CSCS projects.
To implement a Scientific Data Management solution in a specific project the steps to be followed are: