4.3 Terminology control and data cleansing

Mapping to the ontological framework is not in itself sufficient for semantic interoperability if the vocabulary problems mentioned in section 1 are not resolved. Even a trivial difference, such as that between posthole and post-hole, can thwart attempts at interoperability. These difficulties are illustrated in the Demonstrator with the Sample field, where no controlled type vocabulary was identified.
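As an illustration of the posthole/post-hole problem, the following minimal sketch groups spelling variants under a shared comparison key. It is not the STAR method; the function names and the sample terms are assumptions made for this example, and a real cleansing pass would still need expert review of the resulting groups.

```python
import re

def normalise_term(term: str) -> str:
    """Reduce a term to a comparison key: lower-case, with hyphens,
    dashes and whitespace removed."""
    key = term.strip().lower()
    # Treat hyphens, en dashes and runs of whitespace as insignificant.
    return re.sub(r"[\s\-\u2010\u2011\u2013]+", "", key)

def group_variants(terms):
    """Group the original spellings by their normalised key."""
    groups = {}
    for term in terms:
        groups.setdefault(normalise_term(term), []).append(term)
    return groups

# Hypothetical values as they might appear in a type field.
variants = group_variants(["posthole", "post-hole", "Post hole", "hearth"])
```

Keeping the original spellings in each group, rather than overwriting them, leaves the source data untouched while still allowing the variants to be searched as one term.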

One unanticipated outcome was the extent of data cleaning and terminology control work required to achieve the cross-searching aims. This was expected for the grey literature but not for the dataset sources. Some data fields were effectively controlled types (e.g. hearth) within an organisation's practice. However, different organisations tend to employ different glossaries. Some data fields were free text strings but effectively acting as a type designator (e.g. possible small hearth). Since the original datasets were created for excavation recording system purposes rather than digital publication for wider research access, the need to address semantic alignment was perhaps less relevant at that time. For STAR and future digital dissemination, however, it is a critical issue.

Section 2.1 describes the methods employed, which included intellectual alignment (with a little automatic assistance) of both type fields and free text note fields, where appropriate. The alignment was performed by team members and some interpretations may be subject to debate or qualification (the method should include a review step in operational systems). The spreadsheet method described offered a tractable solution for project purposes. However, it is possible to envisage further tool assistance to speed the process, where automatic suggestions are subject to expert intellectual review.
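One form such tool assistance might take is sketched below, using approximate string matching from the Python standard library to propose controlled terms for a free text value. The vocabulary, the function name and the similarity cutoff are all assumptions for illustration; the output is only a list of candidates for an expert to accept or reject.

```python
from difflib import get_close_matches

# Hypothetical controlled type vocabulary.
CONTROLLED_TYPES = ["hearth", "posthole", "pit", "ditch", "floor"]

def suggest_types(free_text, vocabulary=CONTROLLED_TYPES, cutoff=0.8):
    """Return candidate controlled terms found in a free text value.

    Each word is compared with the vocabulary; the best match at or above
    the similarity cutoff is proposed. Nothing is assigned automatically.
    """
    suggestions = []
    # Drop hyphens so that 'post-hole' is compared as 'posthole'.
    for word in free_text.lower().replace("-", "").split():
        for match in get_close_matches(word, vocabulary, n=1, cutoff=cutoff):
            if match not in suggestions:
                suggestions.append(match)
    return suggestions

candidates = suggest_types("possible small hearth")
```

A fairly high cutoff is chosen here so that unrelated words are not matched to a controlled term by accident; the appropriate value would need tuning against real data.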

Another possibility for automated terminology assistance can be observed in the Mosaic/Tessellated floors scenario, where the Demonstrator query for contexts is floor: tessellated. The term is suggested from the controlled terminology once the user begins to type the word floor. However, this requires the user to have a rough idea of the formulation, since they need to start typing it. Although mosaic is part of the controlled terminology for finds (as in the scenario), entering 'mosaic' as a context query will not deliver any results. Expanding the current term suggestion facility to use an 'entry vocabulary' of related search terms would improve the user interface. This was explored in previous work (Tudhope et al. 2006) but, as discussed in section 4.4, the integration of SKOS-based services is not facilitated by the current SPARQL platform.
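The entry vocabulary idea can be sketched as follows. This is a simplified illustration and not the Demonstrator's implementation: the term lists, the entry vocabulary mappings and the function name are hypothetical. Prefix matching is applied both to the controlled terms and to the non-preferred entry terms that lead to them.

```python
# Hypothetical controlled terms available for context queries.
CONTEXT_TERMS = ["floor", "floor: tessellated", "hearth", "posthole"]

# Hypothetical entry vocabulary: related search terms -> controlled terms.
ENTRY_VOCABULARY = {
    "mosaic": ["floor: tessellated"],
    "tesserae": ["floor: tessellated"],
    "fireplace": ["hearth"],
}

def suggest(typed, terms=CONTEXT_TERMS, entry_vocabulary=ENTRY_VOCABULARY):
    """Suggest controlled terms for the text typed so far."""
    prefix = typed.strip().lower()
    if not prefix:
        return []
    # Direct prefix matches against the controlled terminology.
    results = [term for term in terms if term.startswith(prefix)]
    # Matches reached through the entry vocabulary.
    for entry, targets in entry_vocabulary.items():
        if entry.startswith(prefix):
            for target in targets:
                if target not in results:
                    results.append(target)
    return results

by_entry_term = suggest("mos")
```

With a mapping of this kind, a user who begins typing mosaic in a context query would be offered floor: tessellated instead of receiving no results.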

There is potential for further terminology development to support cross search; for example, making a timelines thesaurus widely available and perhaps a thesaurus for context and group types (the thesaurus developed for STAR demonstration purposes could offer supporting material but is not intended as a publishable version). Is there potential for establishing standard (national) vocabularies at the excavation record level, perhaps unifying the various organisational glossaries to support cross-search semantic interoperability?

Hodder (1999, 94) discusses the introduction of typed categories into the recording system to support interpretation in the field, with hierarchies of terms ranging from the general to the specific and idiosyncratic. He goes on to observe that the process of categorisation is emergent, embedded within a reflexive 'process of discovery' that interprets the context. Premature imposition of rigid categories may distort a complex situation that is not yet understood. On the other hand, without some vocabulary control, the possibility for automated tools that support comparison via cross search is much reduced. Hodder also suggests that modern computing systems might be used for comparison and retrieval over large data stores, while employing codification with some flexibility, thus permitting contextualisation and reflexivity in automated approaches.

Retaining the original free text notes in a search system, alongside the semantic typing, is a basic step that could potentially support the provision of both string search and semantic search. A glossary of qualifiers would help NLP techniques take account of tentative interpretations, such as a possible hearth, although consideration would be required when modelling such statements within the ontological framework. Currently, STAR semantic annotations include qualifiers as strings for human inspection purposes but they are not modelled at a semantic level.

Regarding standardisation of terminology more generally, discussion with archaeologists at STAR project workshops tended to favour provision of mappings between major organisational glossaries, rather than any attempt at standardising on national glossaries for fields such as finds. While this was inevitably a small sample of opinion, mappings between SKOS glossaries are quite feasible. Semi-automatic tools for suggesting mappings could be a future direction in terminology services and STELLAR template work, along with semi-automatic tools to assist the semantic alignment of free text notes with controlled type glossaries or thesauri.

It is possible to combine standards-based terminology control with the flexibility of free text description/interpretation. To assist such work, some differentiation of the various types of recording system notes and their provenance would be useful if such categorisation were feasible. Flexible application of terminology services assisted by NLP techniques could be one avenue for exploring how to make the free text contextual information from context sheets available for subsequent reflection and comparison of interpretations.
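The role of a qualifier glossary, discussed above, can be illustrated with a minimal sketch that keeps the original note, the matched controlled type and any qualifiers side by side. The glossary entries, the type list and the function name are assumptions for this example, and simple word matching stands in for the NLP techniques referred to in the text.

```python
# Hypothetical glossary of qualifiers marking tentative interpretations.
QUALIFIERS = {"possible", "probable", "likely", "uncertain"}

# Hypothetical controlled type vocabulary.
CONTROLLED_TYPES = {"hearth", "posthole", "pit"}

def annotate(note, qualifiers=QUALIFIERS, types=CONTROLLED_TYPES):
    """Pair a free text note with a controlled type and its qualifiers.

    The original string is retained so that string search remains possible
    alongside search on the controlled type.
    """
    words = note.lower().split()
    return {
        "original": note,
        # First word that is a controlled type, or None if there is none.
        "type": next((word for word in words if word in types), None),
        # Qualifiers are kept as strings, not modelled semantically.
        "qualifiers": [word for word in words if word in qualifiers],
    }

annotation = annotate("Possible small hearth")
```

Recording the qualifier separately from the type means a search for hearth can still retrieve the record while showing that the interpretation was tentative.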


© Internet Archaeology/Author(s)
File last updated: Mon July 18 2011