2.2 Data extraction and mapping to CIDOC CRM

The selected datasets were inspected to assess their data structures and how best to relate their specific data fields to the CRM and CRM-EH. In most cases we had neither specific metadata nor descriptive scope notes on what the field content was intended to represent. It was therefore important to review the actual data within each field, as well as the field label, to judge the intended 'meaning' of that field. We recommend assessing the data content even when a field label seems familiar.

An initial mapping was carried out between the Raunds database fields (exported into a spreadsheet format for ease of reference and annotation) and the CRM-EH model. This was cross-referenced with the CRM-EH diagram and scope notes. In some cases, particularly for data fields containing literal data types (e.g. string, date or value), fields were mapped at the CRM level (e.g. Year of Excavation mapped to E61 Time Primitive) rather than to CRM-EH extensions. Additionally, decisions on the specific RDF implementation of some CRM primitives were required (Binding et al. 2008).
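
As an illustration only, a spreadsheet-style mapping of this kind might be held as a simple table that records, for each field, the target class and whether the mapping was made at the CRM or the CRM-EH level. The table and field names below are hypothetical and are not taken from the Raunds database; only the Year of Excavation example comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    table: str   # source database table (hypothetical name)
    field: str   # source field (hypothetical name)
    target: str  # label of the ontology class the field maps to
    level: str   # "CRM" for core classes, "CRM-EH" for extension classes


# A few example rows; a real mapping sheet would hold one row per mapped field.
MAPPINGS = [
    FieldMapping("context", "context_no", "Context", "CRM-EH"),
    FieldMapping("find", "find_no", "ContextFind", "CRM-EH"),
    # Literal-valued field mapped at the CRM level, as in the text.
    FieldMapping("excavation", "year_of_excavation", "E61 Time Primitive", "CRM"),
]


def mappings_at_level(level):
    """Return the mappings made at the given level ("CRM" or "CRM-EH")."""
    return [m for m in MAPPINGS if m.level == level]
```

Filtering by level in this way gives a quick check on which fields were handled as CRM literals and which rely on the CRM-EH extensions.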

Since the general meaning of many fields in RRAD corresponded to the EH Recording Manual, a conceptual mapping between the CRM-EH and the field heading descriptions used in the manual was produced. The conceptual definitions in a recording manual provide more certainty on the meaning of fields for mapping purposes and should hold for all database implementations based on that recording manual. This more general mapping also made it easier to identify fields corresponding to the concepts covered by the EH Recording Manual in both the Silchester LEAP and MoLA datasets.

The CRM and CRM-EH classes (and associated attributes) that were mapped included Context, Context Depiction (spatial coordinates), Context Find, Context Find Material, Context Find Intended Use, Context Find Note (as string), Context Event, Context Sample, Group, Timestamp (time periods) and Context Find Measurement Value. Stratigraphic relationships between contexts were also extracted.

The different datasets each hold between 100 and 500 individual data fields. Given the limited project resources, it was decided that the final research Demonstrator should focus on a subset of the data, for the purpose of demonstrating semantic interoperability in cross search. The selection of records covered four main areas of archaeological activity: Contexts, Finds, Samples, and the grouping of context records to form more generalised interpreted features, such as buildings, enclosures and structures. Section 4.2 discusses cost-benefit issues associated with granularity of detail in the mapping.

Mappings of excavation datasets to the CRM and CRM-EH were generally made intellectually by a domain expert (May) and communicated via spreadsheet to the development team. The team realised the mappings (in dialogue with May) by extracting the relevant data from the various databases via SQL queries and creating sets of RDF files. An interactive mapping and data extraction tool generated the RDF statements and proved vital for timely and accurate completion of the work. Further details of the tool are provided in Binding et al. (2008), which also reviews other CRM-based mapping work.
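
The general pattern of SQL extraction followed by RDF generation can be sketched as follows. This is a minimal illustration and not the project's extraction tool: the `context_find` table, its columns, and the `example.org` namespaces are all assumptions made for the example, and the output is written as plain N-Triples lines.

```python
import sqlite3

BASE = "http://example.org/data/"     # hypothetical base URI for extracted entities
CRMEH = "http://example.org/crmeh#"   # hypothetical placeholder for the CRM-EH namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def extract_find_triples(conn):
    """Query the (hypothetical) finds table and emit one rdf:type triple per row."""
    rows = conn.execute("SELECT find_id FROM context_find ORDER BY find_id")
    lines = []
    for (find_id,) in rows:
        subject = f"{BASE}find/{find_id}"
        lines.append(f"<{subject}> <{RDF_TYPE}> <{CRMEH}ContextFind> .")
    return lines


def demo_connection():
    """Build a small in-memory database standing in for an excavation dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE context_find (find_id INTEGER PRIMARY KEY, material TEXT)"
    )
    conn.executemany(
        "INSERT INTO context_find VALUES (?, ?)", [(1, "pottery"), (2, "bone")]
    )
    return conn


if __name__ == "__main__":
    for line in extract_find_triples(demo_connection()):
        print(line)
```

Keeping the query and the triple template side by side mirrors the way a mapping row in the spreadsheet pairs a source field with its target class.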

Frequently, the mappings did not result in a simple 1:1 correspondence between data fields and ontology entities. For example, a CRM-EH ContextFind class might be mapped to multiple fields in different database tables (the extracted data being a union of those fields), or a CRM class identifier might be derived by combining field and table names. The production of unique IDs for each entity extracted, expressed as URIs for semantic web purposes, was an essential element of the process. The event-based nature of the CRM also had to be taken into account, with events often only implicit in the original relational database structures. Intermediate 'virtual' entities had to be created automatically to model this event information. Thus a database element might be mapped to a chain of CRM relationships, expressed by multiple RDF statements.
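
The following sketch illustrates two of the points above: minting a unique URI from table and key values, and expanding a single database value into a chain of statements through an intermediate 'virtual' event. It uses core CRM classes and properties (E12 Production, E52 Time-Span, P108i, P4, P82) purely as an example; the chains actually used in the project are not given here, and the namespaces, the function names and the `small_finds` table are hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/data/"   # hypothetical base URI
CRM = "http://example.org/crm#"     # hypothetical placeholder, not the official CRM namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def mint_uri(entity, *key_parts):
    """Build a URI from an entity kind plus table/key values, percent-encoding each part."""
    path = "/".join(quote(str(part), safe="") for part in key_parts)
    return f"{BASE}{entity}/{path}"


def triple(subject, predicate, obj, literal=False):
    """Serialise one statement as an N-Triples line; literals are quoted and escaped."""
    if literal:
        escaped = str(obj).replace("\\", "\\\\").replace('"', '\\"')
        return f'<{subject}> <{predicate}> "{escaped}" .'
    return f"<{subject}> <{predicate}> <{obj}> ."


def production_chain(table, find_id, date_text):
    """Expand one 'date made' value into find -> production event -> time-span -> literal."""
    find = mint_uri("find", table, find_id)
    event = mint_uri("production", table, find_id)  # virtual entity, absent from the database
    span = mint_uri("timespan", table, find_id)     # virtual entity, absent from the database
    return [
        triple(find, CRM + "P108i_was_produced_by", event),
        triple(event, RDF_TYPE, CRM + "E12_Production"),
        triple(event, CRM + "P4_has_time-span", span),
        triple(span, RDF_TYPE, CRM + "E52_Time-Span"),
        triple(span, CRM + "P82_at_some_time_within", date_text, literal=True),
    ]


if __name__ == "__main__":
    for line in production_chain("small_finds", 42, "1st century AD"):
        print(line)
```

Deriving the event and time-span URIs from the same table and key as the find keeps the generated identifiers stable across repeated extractions, which matters when the RDF files are regenerated.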


© Internet Archaeology/Author(s)
File last updated: Mon July 18 2011