Internet Archaeol 17. Falkingham.

Section 4: A Practical Evaluation of XML Technologies and TEI P4 for Archaeological Markup and Multi-layered Presentation

4.3 Methodology: the encoding of archaeological text

4.3.5 The compilation of the document type definition (DTD)

All TEI-encoded documents use the same top-level DTD file, which refers to a number of other DTD files. The exact set of other files referred to depends upon which base and which additional tagsets are in use. The TEI DTD is described in detail in Chapter 3 of the XML version of the TEI Guidelines (Sperberg-McQueen and Burnard 2002). The main TEI DTD is always invoked by specifying the file 'tei2.dtd' at the start of an XML document.

The DTD used in this case study was customised and downloaded using the TEI Pizza Chef. The TEI DTD is XML-compliant and comprises a series of subsets which can be selected according to the particular needs of a project. Firstly, a base tagset is chosen; several base tagsets may be combined, although a single base is recommended. For this case study, the 'prose' base was selected, to which were added five 'toppings' from a choice of twelve additional tagsets: 'linking', 'certainty', 'transcr', 'names.dates' and 'figures'. These are declared in the document type declaration at the start of each document, as below:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [
<!ENTITY % TEI.XML 'INCLUDE'>
<!ENTITY % TEI.prose 'INCLUDE'>
<!ENTITY % TEI.linking 'INCLUDE'>
<!ENTITY % TEI.certainty 'INCLUDE'>
<!ENTITY % TEI.transcr 'INCLUDE'>
<!ENTITY % TEI.names.dates 'INCLUDE'>
<!ENTITY % TEI.figures 'INCLUDE'> ]>

Finally, an entity set was chosen; this inserts the appropriate Unicode declarations into the DTD so that any named character entities used in the markup are defined. In this instance ISO Latin 1 was selected. A full DTD can thus be generated and the result is a file named 'tei2.dtd'. All of the XML files created as part of this case study share this common DTD.

4.3.6 The TEI Header

The details of the TEI Header are contained within Chapter 5 of the XML version of the TEI Guidelines (Sperberg-McQueen and Burnard 2002). The TEI Header is mandatory if a document is to be considered TEI-conformant and is an essential source of information for users of a text, recording information about its print source. The more that is known about the source, the more comprehensive the header can be. The header is the TEI's approach to recording effective metadata documentation, similar to the use of Dublin Core metadata. However, unlike Dublin Core, the TEI Header is not intended only for describing and locating objects on the Web. It enables all aspects of an electronic text to be documented, including its source, encoding practices and creation (Morrison et al. 2002).

The TEI Header offers a number of structured, optional elements, which can be extended by the addition of attributes. As a result, a TEI Header can be a large and complex document, or a simple, short section of metadata (Mueller 2002). For the case-study report examples, as much detail as possible has been included within the TEI Header, which appears at the beginning of a document, after the XML prolog and before the <text> element that holds the <front> matter.

The TEI Header comprises four principal elements, only the first of which is mandatory (Sperberg-McQueen and Burnard 2002):

<fileDesc>: the file description. This element contains a full bibliographic description of an electronic file, including title, author, funder and encoder, edition, date, publisher, place, address, id number, availability and notes.
<encodingDesc>: the encoding description. This element documents the relationship between an electronic text and the source(s) from which it was derived, including the project description, standard values, and a taxonomy that refers to controlled vocabularies (see 4.3.4).
<profileDesc>: the profile description. This element provides a detailed description of the non-bibliographic aspects of a text, specifically the languages and sub-languages used, the situation in which it was produced, the participants and their setting, also including reference to the keyword schemes identified by the controlled vocabularies in the <encodingDesc>.
<revisionDesc>: the revision description. This element summarises the revision history of a file, the encoder and the dates.
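
As an illustration of how these four elements fit together, the following is a minimal sketch of a TEI P4 header. It is not taken from the case-study files: every bracketed value is a placeholder, and the taxonomy identifier 'vocab1' is a hypothetical name chosen only to show how <keywords> can point back to a vocabulary declared in the <encodingDesc>.

```xml
<teiHeader>
  <!-- fileDesc is the only mandatory part of the header -->
  <fileDesc>
    <titleStmt>
      <title>[Report title]: an electronic edition</title>
      <author>[Report author]</author>
      <respStmt>
        <resp>Encoded by</resp>
        <name>[Encoder]</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <publisher>[Publisher]</publisher>
      <pubPlace>[Place]</pubPlace>
      <date>[Date]</date>
      <availability><p>[Terms of use]</p></availability>
    </publicationStmt>
    <sourceDesc>
      <bibl>[Citation of the printed report]</bibl>
    </sourceDesc>
  </fileDesc>
  <!-- the remaining three parts are optional -->
  <encodingDesc>
    <projectDesc><p>[Aims of the encoding project]</p></projectDesc>
    <classDecl>
      <!-- 'vocab1' is a hypothetical identifier for a controlled vocabulary -->
      <taxonomy id="vocab1">
        <bibl>[Controlled vocabulary]</bibl>
      </taxonomy>
    </classDecl>
  </encodingDesc>
  <profileDesc>
    <langUsage>
      <language id="eng">English</language>
    </langUsage>
    <textClass>
      <!-- scheme refers to the taxonomy declared above -->
      <keywords scheme="vocab1">
        <list><item>[Keyword]</item></list>
      </keywords>
    </textClass>
  </profileDesc>
  <revisionDesc>
    <change>
      <date>[Date of change]</date>
      <respStmt><name>[Encoder]</name><resp>Encoder</resp></respStmt>
      <item>[Summary of change]</item>
    </change>
  </revisionDesc>
</teiHeader>
```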

TEI Header metadata can be extracted and mapped onto other well-established resource cataloguing standards, such as library MARC records, or onto other standards such as the Dublin Core element set and the Resource Description Framework (RDF) (Furrie 2003). This is a relatively simple procedure as the TEI Header was closely modelled on standards in library cataloguing (Morrison et al. 2002). The Oxford Text Archive, for example, uses the Header to manage its collection of electronic texts, to create a searchable catalogue for retrieving resources and also as a means of creating other forms of metadata that can be exchanged with other information systems (Morrison et al. 2002). However, Morrison et al. (2002) note that such exchange has been hindered by the flexibility of the rules for header creation: 'the variant way in which the guidelines are interpreted and put into practice make easy interoperability with other systems using TEI Headers more difficult than first imagined'.

Another observation of Morrison et al. (2002) is that the lack of affordable and user-friendly software for header production has contributed to 'the relatively slow uptake and implementation of the TEI Header as the predominant method of providing well-structured metadata to the electronic text community as a whole'.
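
The mapping of header metadata onto Dublin Core described above can be sketched as a short XSLT 1.0 stylesheet. This is an illustrative sketch only, not the procedure used by the Oxford Text Archive or in this case study: the <metadata> wrapper element and the choice of which header elements feed which Dublin Core elements are assumptions, and a production mapping would need to handle the variation in header practice noted by Morrison et al. (2002).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <xsl:output method="xml" indent="yes"/>

  <!-- TEI P4 elements are in no namespace, so paths are unprefixed -->
  <xsl:template match="/TEI.2">
    <!-- 'metadata' is a hypothetical wrapper element -->
    <metadata>
      <dc:title>
        <xsl:value-of select="normalize-space(teiHeader/fileDesc/titleStmt/title)"/>
      </dc:title>
      <!-- one dc:creator per author named in the title statement -->
      <xsl:for-each select="teiHeader/fileDesc/titleStmt/author">
        <dc:creator><xsl:value-of select="normalize-space(.)"/></dc:creator>
      </xsl:for-each>
      <dc:publisher>
        <xsl:value-of select="normalize-space(teiHeader/fileDesc/publicationStmt/publisher)"/>
      </dc:publisher>
      <dc:date>
        <xsl:value-of select="normalize-space(teiHeader/fileDesc/publicationStmt/date)"/>
      </dc:date>
      <!-- one dc:subject per keyword item in the profile description -->
      <xsl:for-each select="teiHeader/profileDesc/textClass/keywords//item">
        <dc:subject><xsl:value-of select="normalize-space(.)"/></dc:subject>
      </xsl:for-each>
    </metadata>
  </xsl:template>

</xsl:stylesheet>
```

Where a header contains more than one <title>, <publisher> or <date>, xsl:value-of takes only the first; this is one place where differences in header practice would require the mapping to be adjusted.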

4.3.7 The main text: front, body and back matter

The default text structure for all TEI-conformant documents is outlined in Chapter 7 of the XML version of the TEI Guidelines (Sperberg-McQueen and Burnard 2002). A document is marked with the <text> tag and may contain front matter, a body, and back matter:

<front> contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found at the start of a document, before the main body.
<body> contains the whole body of a single unitary text, excluding any front or back matter.
<back> contains any appendixes, etc. following the main part of a text.

The overall framework for the structure of a text is thus (Sperberg-McQueen and Burnard 2002):

<TEI.2>
  <teiHeader>  </teiHeader>
  <text>
    <front>

    </front>
    <body>

    </body>
    <back>

    </back>
  </text>
</TEI.2>

The structure outlines of the reports selected for the case-study were used to decide upon the approach to encoding within these three basic textual divisions. Apart from a slight difference in content between the front matter and body of the text, the Catterick and Roecliffe reports are very similar (Simpson 1996; Young and Fraser 1998). Where possible, a consistent approach to encoding was taken. The Thirsk report, however, by virtue of its brief nature and presentation as a list, was slightly different in overall structure and contained no back matter (Cooper 2003).

Front matter includes, for example, the main document title <docTitle> and date <docDate>. Body text has been divided into a series of <div1> elements for the main sections, <div2> elements for subsections and <p> elements for individual paragraphs. Divisions may have their own headings, identified by the <head> element, and the end of the main body of text may have a <closer> and a <byline> recording the report number and credits, such as the author and project manager. Divisions may also be further identified by a type attribute, such as abstract, chapter or sub-section, and allocated a specific number, for example:

<div1 type="Chapter" n="4">
<head rend="small-caps"> 4.0 ARCHAEOLOGICAL BACKGROUND </head>
<div2 type="sub-section" n="4.1">
<head> 4.1 Service Trenches </head>
<p>  </p>
</div2>
</div1>
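
The front-matter elements mentioned above might be arranged as in the following sketch. It assumes that the title and date sit within a <titlePage>, which is one arrangement the TEI P4 DTD permits; the case-study files may be organised differently, and the bracketed values are placeholders.

```xml
<front>
  <titlePage>
    <!-- docTitle holds one or more titlePart elements -->
    <docTitle>
      <titlePart type="main">[Report title]</titlePart>
    </docTitle>
    <docDate>[Report date]</docDate>
  </titlePage>
</front>
```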

Back matter is divided in a similar manner to the body of the text, using <div1> and <div2> elements for the main appendices and subsections. Bibliographies have been encoded as lists. Figures and plates have been marked up using the <figure> element, and the author has added <figDesc> elements to describe each one (see below).

<div1 type="appendix" n="5" id="human_remains">
<head id="1"> Appendix 5: Assessment of cremated human bone </head>
<persName type="author"> J. Langston </persName>
<div2 type="sub-section" n="1">
<head> Methodology </head>
<p>  </p>
</div2>
</div1>

<div1 type="appendix" n="6" id="figures">
<head id="1"> Appendix 6: Figures and Plates </head>
<figure entity="RFig1">
<head> Figure 1: Site Location </head>
<figDesc> Plan reproduced from the Ordnance Survey 1:25000 map, showing site location south west of the town of Boroughbridge, and east of the A1 road.
</figDesc>
</figure>
</div1>

The level of detail at which the structure and content of the reports have been encoded was influenced principally by user needs identified in recent national surveys and by the potential for exporting data to populate other heritage datasets (see 4.1.2). Individual elements and attributes used for encoding OASIS Project and HER HBSMR records were also identified (Falkingham 2004).

The author has made wide use of attributes in this case study. There is debate, however, about whether to use attributes rather than elements. Some see the choice as a matter of personal preference; others regard the use of attributes as bad practice (Castro 2001). Attributes present limitations when using CSS1, which can apply style only at the element level. However, XSLT and XPath can retrieve content on the basis of attribute data by navigating the node tree. In addition, certain special characters in XML must be represented by character references (Harold and Means 2002, 82). These have been added to the report encoding where appropriate.
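
Retrieval of content by attribute value, as described above, can be sketched with an XPath predicate inside a small XSLT 1.0 stylesheet. The stylesheet below is an illustration rather than one of the stylesheets produced for the case study; it assumes divisions encoded as in the appendix examples in 4.3.7, with type and n attributes on <div1>, and lists the number and heading of each appendix as plain text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- the predicate [@type='appendix'] selects on attribute data -->
    <xsl:for-each select="//div1[@type='appendix']">
      <xsl:value-of select="@n"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="normalize-space(head)"/>
      <!-- &#10; is a character reference for a line feed -->
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The same predicate syntax could be used in a match pattern, for example match="div1[@type='appendix']", to apply different presentation to divisions that share an element name but differ in type.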