2. Methods for Populating the Semantic Database

Various methods were developed to prepare the central project database (a triple store) using semantic representations. The different terminology resources were represented in SKOS (RDF) format, and semantic terminology web services were developed for project use. Datasets were mapped to the core ontology, and extracts from the different databases were represented in RDF. Natural Language Processing (NLP) techniques were developed to generate rich metadata (semantic annotations) automatically from free-text grey literature documents, expressed in terms of the same core ontology, to enable subsequent cross-search.

2.1 Terminology resources and services

Various resources provided controlled terminology for data extraction and the NLP techniques. The main EH thesauri used were Monument Types, Archaeological Objects, Building Materials, Archaeological Sciences and the experimental Timelines Thesaurus. Various EH glossaries were also available (EH Recording Manual 2006), which proved especially useful for the NLP techniques. The Glossary of Simple Names for Deposits and Cuts, in particular, was developed for project purposes into a simple thesaurus for context and group types, and was expanded with some relevant terms derived from analysis of the OASIS corpus. This provides terminology control for contexts and group types in the Demonstrator, while the Archaeological Objects and Building Materials thesauri provide control for finds and materials types.

The W3C Recommendation SKOS (Simple Knowledge Organization System) was selected as the representation format for all thesauri and glossaries. For details of the conversion process, see Tudhope et al. (2008). Relevant glossaries were also represented as SKOS. A URI format was adopted for all IDs (Binding et al. 2008).
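As an illustration of the representation format, the sketch below serialises a single thesaurus entry as a SKOS concept in Turtle syntax. The URI scheme, the concept identifiers and the example terms are hypothetical placeholders, not the actual EH identifiers or the project's URI format (for which see Binding et al. 2008).

```python
# Minimal sketch: rendering one thesaurus entry as a SKOS concept (Turtle).
# URIs and terms below are illustrative, NOT the actual EH identifiers.

def concept_to_turtle(uri, pref_label, alt_labels=(), broader=()):
    """Render a thesaurus entry as a skos:Concept in Turtle syntax."""
    lines = [f"<{uri}> a skos:Concept ;"]
    lines.append(f'    skos:prefLabel "{pref_label}"@en ;')
    for alt in alt_labels:
        lines.append(f'    skos:altLabel "{alt}"@en ;')
    for b in broader:
        lines.append(f"    skos:broader <{b}> ;")
    # Turtle: the final property ends with '.' rather than ';'
    lines[-1] = lines[-1][:-1] + "."
    return "\n".join(lines)

PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n\n"

doc = PREFIX + concept_to_turtle(
    "http://example.org/star/concept/12345",   # hypothetical URI format
    "hearth",
    alt_labels=["fireplace"],
    broader=["http://example.org/star/concept/10000"],
)
print(doc)
```

Preferred and alternative labels (`skos:prefLabel`, `skos:altLabel`) carry the thesaurus terms and their variants, while `skos:broader` preserves the hierarchical relationships between concepts.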

A significant number of data fields were short, free-text strings: in some cases interpretations or descriptions, in others effectively acting as context types. This may in part result from a lack of need for control in databases not designed for cross-searching purposes (section 4.3 discusses the issue further). The need for data cleansing and control of these fields posed an unexpected challenge, and an 'alignment' step was required. The different data fields mapped to a particular ontology class were exported as unique data instances to a spreadsheet and ordered alphabetically to assist intellectual assignment to directly matching controlled terms. Variant spellings and synonyms were taken into account.
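The alignment step described above can be sketched as follows, assuming the manual spreadsheet assignments are captured as a lookup table of variant spellings and synonyms. The controlled terms and variants here are illustrative examples, not the project's actual vocabularies.

```python
# Sketch of the 'alignment' step: unique free-text field values are
# normalised and matched against controlled terms, with a small table
# of variant spellings and synonyms standing in for the intellectual
# assignments made in the spreadsheet. All terms are illustrative.

CONTROLLED_TERMS = {"pit", "ditch", "posthole", "hearth"}
VARIANTS = {                 # variant spelling / synonym -> preferred term
    "post hole": "posthole",
    "post-hole": "posthole",
    "fire place": "hearth",
}

def align(value):
    """Return the matching controlled term, or None for manual review."""
    v = " ".join(value.strip().lower().split())   # normalise case/whitespace
    if v in CONTROLLED_TERMS:
        return v
    return VARIANTS.get(v)                        # None if no match found

# Unique instances, ordered alphabetically as in the spreadsheet:
raw_values = sorted({"Pit", "post hole", "  DITCH ", "occupation layer"},
                    key=lambda s: s.strip().lower())
matches = {v: align(v) for v in raw_values}
print(matches)
```

Values that fail to match (here, "occupation layer") are the residue that still requires intellectual assignment or an addition to the controlled list.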

A web service architecture was developed for STAR (Fig. 1); the terminology and ontology resources are accessed via a web service API (Application Program Interface). While this was developed for project purposes, the web services can be accessed externally and have been employed in other research projects. The EH thesauri, including the experimental Timelines Thesaurus, and the various glossaries are available for programmatic access via the API, which is described on the STAR website along with a variety of user-interface 'widgets' that draw on the services (Binding and Tudhope 2010). As part of the project, some specific techniques were developed for handling archaeological time periods, which are expressed in a wide variety of formats. The STAR.TIMELINE console-based application assigns a known time period identifier to expressions of time spans in archaeological database records, using the EH Timelines thesaurus or MIDAS period list (Binding 2010).
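The core of the period-assignment idea can be sketched as below: a free-text date-span expression is parsed to a year range and matched against a period list. This is a simplified illustration in the spirit of STAR.TIMELINE, not its implementation; the period names, identifiers and date boundaries are placeholders rather than the EH Timelines or MIDAS data.

```python
import re

# Sketch of period assignment: parse a date-span expression to a year
# range and return the first period that fully contains it. Period ids,
# names and boundaries are illustrative placeholders only.

PERIODS = [  # (id, name, start_year, end_year)
    ("P1", "IRON AGE",       -800,   43),
    ("P2", "ROMAN",            43,  410),
    ("P3", "EARLY MEDIEVAL",  410, 1066),
    ("P4", "MEDIEVAL",       1066, 1540),
]

# Matches spans such as "AD 43 - 410" or "1100-1300"
SPAN = re.compile(r"(?:AD\s*)?(\d{1,4})\s*[-\u2013]\s*(?:AD\s*)?(\d{1,4})", re.I)

def assign_period(expression):
    """Return the id of the first period fully containing the span, or None."""
    m = SPAN.search(expression)
    if not m:
        return None
    start, end = int(m.group(1)), int(m.group(2))
    for pid, _name, p_start, p_end in PERIODS:
        if p_start <= start and end <= p_end:
            return pid
    return None

print(assign_period("AD 43 - 410"))
```

A production tool must of course handle BC dates, century expressions ("12th century"), circa qualifiers and spans straddling period boundaries, which is where a curated period thesaurus earns its keep.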


© Internet Archaeology/Author(s)