Internet Archaeol 15. Nick Ryan. Modern databases and archaeology

4. Modern databases and archaeology

Huggett (this volume) discusses the problems of misleading apparent objectivity in data that have been recorded using formal, highly structured, methods. A related issue, one that informs many of the articles in this volume, is the impact on archaeological thinking and processes of the underlying models and assumptions built into computer software. Are these models and assumptions appropriate to the nature of archaeological data, and are the transformations and analytic processes performed by the software appropriate to an archaeological process? For example, is it safe to assume that instances of a variable are normally distributed in the background population, or are we using a method designed to deal with continuous phenomena to describe or analyse a sparse discrete distribution?

Database systems, because they deal simply with the management of data that the user has explicitly chosen to record, are inherently more neutral than, say, statistical analysis methods or raster GIS models where the operations or representations require assumptions about the nature of the data that may not be entirely applicable. In practice, most modern database systems are capable of storing anything that is amenable to digital representation. Such a representation is, of course, no more than a digital surrogate for some original physical object or concept. There are always interpretative steps between the original and the representation. These steps may involve transformation, distortion or loss of information, but these are properties of the interpretation and the chosen representation, not of the database in which it is stored. The data model (relational, OR or OO) and, to some extent, the particular DBMS only limit the ways in which the representation can be manipulated by the DBMS. In some cases, additional programming effort may be required to overcome these limitations. The importance of careful choice of representation, including appropriate levels of precision and granularity, cannot be overstated.

It is no doubt true that many, potentially important, observations have gone unrecorded where site notebooks have been replaced by recording forms and data input screens. Nevertheless, it must be stressed that this is primarily a product of poor database and form design, rather than a necessary characteristic of structured records. Such poor design often results from a limited experience of modelling and, consequently, leads to a lack of generality in the solutions that are developed. The end result may be a system that forces users to distort their data to fit an inadequate model. Given the concentration on well-defined problem spaces and on imposing rigid structure found in most data modelling texts, it is hardly surprising that many people are unaware of the possibilities for building flexible systems. It is, however, quite feasible to design a database that allows the addition of types of information and relationships not envisaged at design time.

For example, despite widespread popular belief, it was always possible to store long text items, or even images, in most relational systems even before explicit methods were provided for such purposes. It was just rather difficult for most people to implement because it required appropriate experience of data modelling and some programming effort. In the case of long text items, these could be searched and modified in ways that are quite impossible with the primitive 'memo' field construct that was added to some desktop products. Similarly, it is possible to build general-purpose database structures that permit recording of ad hoc observations that were not envisaged in the initial design. An interesting example of this approach is seen in Madsen's IDEA and GUARD systems (Madsen 1999; 2001).
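One common way to build such a general-purpose structure is an entity-attribute-value layout, in which every observation is stored as a row naming the thing observed, the kind of observation and its value. The following is a minimal sketch under stated assumptions, using Python's built-in sqlite3 module. The table names (`context`, `observation`), the helper functions and the sample data are all hypothetical, and the sketch is not a description of how IDEA or GUARD are designed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE context (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE observation (
        context_id INTEGER NOT NULL REFERENCES context(id),
        attribute  TEXT NOT NULL,   -- kind of observation, chosen by the recorder
        value      TEXT NOT NULL    -- stored as text in this sketch
    );
""")

def record(context_id, attribute, value):
    """Store one observation; no schema change is needed for a new attribute."""
    conn.execute(
        "INSERT INTO observation VALUES (?, ?, ?)",
        (context_id, attribute, str(value)),
    )

def observations(context_id):
    """Return every observation recorded for one context as a dictionary."""
    rows = conn.execute(
        "SELECT attribute, value FROM observation WHERE context_id = ?",
        (context_id,),
    )
    return dict(rows.fetchall())

# Invented sample data, for illustration only.
conn.execute("INSERT INTO context VALUES (1, 'pit fill')")
record(1, "soil colour", "dark brown")
record(1, "smell", "strongly organic")  # an attribute nobody planned for
```

The price of this flexibility is that the DBMS can no longer check value types or required attributes by itself, so such checks have to be supplied by additional tables or by application code.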

Another issue raised by Huggett is the static nature of the data recorded in most databases. A typical database models a 'snapshot' of current values, with updates replacing previously recorded data. Several research prototype DBMS have explored alternative approaches in which data are never overwritten and remain accessible through query language extensions. Typically, these systems add 'start' and 'end' timestamps to each tuple/record to define the period of validity of the data. When a new record is inserted, it receives a 'start' timestamp but the 'end' remains empty. At any time, at most one version of a record has an empty 'end' and is regarded as the current version. Instead of overwriting the original record, subsequent updates set the 'end' timestamp of the current record to the current time and insert a new record with that same time as its 'start' timestamp. Deletion simply sets the 'end' timestamp of the current record, leaving no current version. More complex approaches in which the system maintains similar pairs of timestamps for individual attributes have also been used.
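The insert, update and delete rules just described can be sketched in a few lines. This is an illustrative sketch only, written in Python with the built-in sqlite3 module rather than in any of the research prototypes. The table `find_version`, its columns and the function names are hypothetical, and plain integers stand in for timestamps so that the behaviour is easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE find_version (
        find_id  INTEGER NOT NULL,
        material TEXT NOT NULL,
        tx_start INTEGER NOT NULL,  -- when this version was recorded
        tx_end   INTEGER            -- NULL marks the current version
    )
""")

def insert(find_id, material, now):
    """A new version receives a start time; its end stays empty."""
    conn.execute(
        "INSERT INTO find_version VALUES (?, ?, ?, NULL)",
        (find_id, material, now),
    )

def delete(find_id, now):
    """Deletion only closes the current version; nothing is removed."""
    conn.execute(
        "UPDATE find_version SET tx_end = ? "
        "WHERE find_id = ? AND tx_end IS NULL",
        (now, find_id),
    )

def update(find_id, material, now):
    """Close the current version, then open a new one at the same time."""
    delete(find_id, now)
    insert(find_id, material, now)

def as_of(find_id, when):
    """Return the value the database held at time `when`, or None."""
    row = conn.execute(
        "SELECT material FROM find_version "
        "WHERE find_id = ? AND tx_start <= ? "
        "AND (tx_end IS NULL OR tx_end > ?)",
        (find_id, when, when),
    ).fetchone()
    return row[0] if row else None

# Invented history for one find.
insert(7, "bronze", now=1)
update(7, "copper alloy", now=5)
delete(7, now=9)
```

Here `as_of(7, 3)` recovers the original identification even though it was later revised, and a query for time 9 or later finds no current version.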

Despite the attraction of being able to explore the historical development of database contents, such 'transaction time' temporal databases have not been widely developed beyond the research laboratories. It is, however, possible to add basic temporal capabilities to existing databases using triggers to generate log records of significant changes as in the example in the preceding section. Similar approaches may be taken to develop historical databases in which the timestamps represent the period when the stored 'fact' was 'true' in the real, historical, world.
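As a further, self-contained illustration of the trigger technique, the sketch below uses SQLite through Python's built-in sqlite3 module. The tables `find` and `find_log`, the trigger name and the sample values are hypothetical, and other DBMS use a somewhat different trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE find (id INTEGER PRIMARY KEY, material TEXT NOT NULL);
    CREATE TABLE find_log (
        find_id      INTEGER NOT NULL,
        old_material TEXT NOT NULL,
        new_material TEXT NOT NULL,
        changed_at   TEXT NOT NULL
    );
    -- Runs once per updated row, but only when the value really changes.
    CREATE TRIGGER log_material_change
    AFTER UPDATE OF material ON find
    FOR EACH ROW WHEN OLD.material <> NEW.material
    BEGIN
        INSERT INTO find_log
        VALUES (OLD.id, OLD.material, NEW.material, CURRENT_TIMESTAMP);
    END;
""")

# Invented sample data.
conn.execute("INSERT INTO find VALUES (7, 'bronze')")
conn.execute("UPDATE find SET material = 'copper alloy' WHERE id = 7")
# Repeating the same value changes nothing, so no log row is written.
conn.execute("UPDATE find SET material = 'copper alloy' WHERE id = 7")

history = conn.execute(
    "SELECT find_id, old_material, new_material FROM find_log"
).fetchall()
```

The main table still holds only the current value, while `find_log` accumulates the earlier ones together with the time of each change.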

Preserving any uncertainty implicit in original observations is more problematic. For example, the use of real numbers in the earlier examples of dimensions might lead to an unwarranted faith in the precision of the measurements. For many purposes, this might not be a serious problem, although integer numbers might be a better choice. In other cases, a single value may not be appropriate and the uncertainty might be better expressed as a range of values. In a conventional database this cannot be represented by a single attribute; instead, two values might be used to represent the ends of the range. Testing for equality, or overlap of ranges, would then require more complex query statements or programming. An alternative approach using a fuzzy data type developed for the OR DBMS PostgreSQL is demonstrated by Niccolucci et al. (2001). In their example, fuzzy values are used to record osteological determinations of gender and age. When used together with a 'fuzzy equality' function this data type enabled them to address demographic questions about their cemetery populations using simple query statements.
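Both ideas can be sketched briefly in Python. The first part shows the two-value representation of a range and the overlap test it requires. The second shows one common textbook measure of agreement between two fuzzy values: the maximum, over all categories, of the smaller of the two membership grades. This is a sketch under stated assumptions: the names `Range` and `fuzzy_match`, and the sample figures, are hypothetical, and the measure is not necessarily the 'fuzzy equality' function implemented by Niccolucci et al.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    """An uncertain quantity held as two values, e.g. an age estimate in years."""
    low: float
    high: float

    def overlaps(self, other):
        # Two closed ranges overlap when each starts before the other ends.
        return self.low <= other.high and other.low <= self.high

def fuzzy_match(a, b):
    """Degree (0 to 1) to which two fuzzy values could denote the same category.

    Each fuzzy value is a dict mapping a category to a membership grade.
    A category missing from a dict is treated as having grade 0.
    """
    categories = set(a) | set(b)
    return max(
        (min(a.get(c, 0.0), b.get(c, 0.0)) for c in categories),
        default=0.0,
    )

# Invented figures, for illustration only.
age_a = Range(30, 40)
age_b = Range(38, 45)
probably_male = {"male": 0.8, "female": 0.2}
certainly_male = {"male": 1.0}
```

With these definitions, `age_a.overlaps(age_b)` is true, and `fuzzy_match(probably_male, certainly_male)` gives 0.8. Packaging such logic as a data type and function inside the DBMS is what allows it to be used directly in simple query statements.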

A similar approach could be taken to record and manipulate date ranges. In this case the SQL date type is quite unsuitable for many archaeological purposes as it has a granularity of a single day and is usually limited to the Gregorian calendar. A fuzzy date range, or the more complex variable granularity approach used by Bagg and Ryan (1997), also using PostgreSQL, offers a possible solution. Other temporal database issues, particularly in relation to their use with GIS, are addressed by Johnson and Wilson (2002).