Semantic Data Management

Coordinator: Prof. Dimitris Plexousakis

In recent years we have witnessed a tremendous increase in the amount of semantic data available on the Web: datasets from almost every field of human activity (e.g. Wikipedia, the U.S. Census, the CIA World Factbook, open government sites in the US and the UK, news and entertainment sources, e-science ontologies and data) have been created and published online. Semantic data management refers to a range of techniques for storing, querying, manipulating and integrating data based on its meaning. Current ISL activities in this area focus on three main topics: provenance, large-scale semantic integration, and the management of the evolution of semantically enriched datasets (change detection, evolution management, inconsistency handling).

Provenance refers to the origin of data: where and how a piece of data was obtained. It is critical in applications such as trust assessment, access control, data cleaning and data/information quality. ISL activities in the domain of provenance for Semantic Web data target the following research directions (an illustrative sketch follows the list):

  • development of annotation and abstract provenance models for Semantic Web Data
  • efficient storage and indexing schemes for the provenance of Semantic Web Data
  • provenance manipulation languages, including languages for querying provenance-enriched datasets
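
To make the annotation-based direction concrete, the following is a minimal sketch in Python (using the rdflib library) of how provenance can be attached to Semantic Web data via named graphs and then queried with SPARQL. The URIs, the example fact and the use of the W3C PROV vocabulary are illustrative assumptions, not ISL's actual models or tools.

    # Minimal sketch: named graphs as a simple annotation-based provenance model.
    # All URIs and data below are invented for illustration.
    from rdflib import ConjunctiveGraph, Literal, Namespace

    EX = Namespace("http://example.org/")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    ds = ConjunctiveGraph()

    # Store a fact inside a named graph; the graph IRI acts as the provenance handle.
    g = ds.get_context(EX.census_graph)
    g.add((EX.Heraklion, EX.population, Literal(177064)))

    # Attach provenance metadata to the named graph itself.
    ds.add((EX.census_graph, PROV.wasDerivedFrom, EX.USCensus))

    # Query the provenance-enriched dataset: return each fact together with its source.
    query = """
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?s ?p ?o ?source WHERE {
        GRAPH ?g { ?s ?p ?o }
        ?g prov:wasDerivedFrom ?source .
    }
    """
    for row in ds.query(query):
        print(row.s, row.p, row.o, "<-", row.source)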

 

Since traditional data integration approaches cannot easily scale to a large number of sources, ISL also focuses on methods for Large Scale Semantic Integration, including the following (an entity-matching sketch appears after the list):

  • indexes, algorithms, services and measurements about the connectivity of the entire LOD cloud
  • extensions of SPARQL for exploiting various kinds of sources in the query process
  • algorithms for entity matching (including matching anonymous entities)
  • tools for automating the creation of Semantic Warehouses
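
As a flavour of the entity-matching problem, the toy sketch below links entities from two hypothetical sources by the similarity of their labels, using only the Python standard library. The source data, URIs and threshold are invented, and ISL's actual algorithms (e.g. for matching anonymous entities) are considerably more sophisticated than this label comparison.

    # Toy sketch of label-based entity matching between two sources.
    # URIs, labels and the similarity threshold are placeholders.
    from difflib import SequenceMatcher

    source_a = {
        "http://a.example/ent/1": "Foundation for Research and Technology",
        "http://a.example/ent/2": "University of Crete",
    }
    source_b = {
        "http://b.example/res/10": "Found. for Research & Technology - Hellas",
        "http://b.example/res/11": "Univ. of Crete",
    }

    def similarity(x: str, y: str) -> float:
        """Normalised string similarity in [0, 1]."""
        return SequenceMatcher(None, x.lower(), y.lower()).ratio()

    THRESHOLD = 0.6
    for uri_a, label_a in source_a.items():
        # Pick the best candidate in the other source and propose a sameAs link.
        uri_b, label_b = max(source_b.items(), key=lambda kv: similarity(label_a, kv[1]))
        score = similarity(label_a, label_b)
        if score >= THRESHOLD:
            print(f"{uri_a} owl:sameAs {uri_b}  (score={score:.2f})")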

 

Datasets found on the Web exhibit highly volatile and dynamic behaviour, as the information they encode is constantly subject to change due to new data, the erroneous inclusion of information, the discovery of inconsistencies, etc. In this setting, several problems arise, including the following (a baseline change-detection sketch appears after the list):

  • Deciding how to include new information (at the data and/or schema level) without creating inconsistencies or incoherencies
  • Resolving inconsistency using logical rules and preferences that determine the most “plausible” resolution method
  • Detecting changes between different versions of a dataset in a manner that produces concise and intuitive deltas, while formally satisfying desirable properties such as non-ambiguity and completeness of deltas, efficient and deterministic detection, and appropriate semantics for executing the deltas, so that changes can be propagated forwards and backwards in a multi-version repository
  • Managing different versions of a dataset, and identifying efficient methods for storing and/or querying such versions, based on deltas
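
As a baseline for the change-detection problem, the sketch below (Python with rdflib) computes a raw, low-level delta between two versions of a dataset as the sets of added and deleted triples; executing the delta on the old version reconstructs the new one, and executing it in reverse propagates changes backwards. File names are placeholders, and ISL's change-detection languages produce far more concise, higher-level change operations than this triple diff.

    # Minimal sketch: low-level delta between two versions of an RDF dataset.
    # File names are placeholders; any two Turtle files will do.
    from rdflib import Graph

    def triple_delta(old: Graph, new: Graph):
        """Return (added, deleted) triple sets between two versions."""
        old_triples, new_triples = set(old), set(new)
        return new_triples - old_triples, old_triples - new_triples

    v_old, v_new = Graph(), Graph()
    v_old.parse("dataset_v1.ttl", format="turtle")
    v_new.parse("dataset_v2.ttl", format="turtle")

    added, deleted = triple_delta(v_old, v_new)
    print(f"{len(added)} triples added, {len(deleted)} triples deleted")

    # Executing the delta: applying (added, deleted) to the old version yields the new one.
    for t in deleted:
        v_old.remove(t)
    for t in added:
        v_old.add(t)
    assert set(v_old) == set(v_new)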

 

In support of national Precision Medicine Networks, the laboratory has designed a full-fledged ICT ecosystem that supports the integration of clinical, -omics and imaging data. It comprises a suite of tools for managing clinical, genetic, molecular, biochemical, and epidemiological parameters related to the development and evolution of diseases that have a genetic cause.