
Section 4: A Practical Evaluation of XML Technologies and TEI P4 for Archaeological Markup and Multi-layered Presentation

4.3 Methodology: the encoding of archaeological text

Because the original word-processed versions of the reports were available, there was no need to digitise the majority of the report content. However, as noted above, images of the illustrations were created by scanning the original print versions and archive material, where available. An HP Scanjet 5370C scanner was used, at a resolution of 300dpi. Figures and plates are stored as JPEG image files, separate from the digital text files. The electronic versions of the reports were migrated from Microsoft Word 95 and 97 to Word 2000, and then into Microsoft Windows Notepad so that markup could be applied manually by the author. Subsequent client-side processing was undertaken using Microsoft Internet Explorer 6.0 SP1 on the Windows XP platform, with the MSXML 4.0 SP1 XML processor.

A number of issues were considered when deciding upon the approach to the markup of the reports, and the project evolved through a series of stages following the guidance of Morrison et al. (2002), and considering the previous approach adopted by Meckseper (2001).

In deciding upon encoding practice, Morrison et al. (2002) advise that 'the tools, techniques and standards applied should capture aspects of the report considered to be significant, whilst at the same time remaining practical and cost-effective to implement'. They also note that the design, construction and method of delivery of the markup (and metadata) will be influenced by the needs and ability of the creator, as well as by the anticipated users of the electronic text. It is recognised that there is no hard and fast, 'right' or 'wrong' way to apply markup to a document, but that the encoding decisions made will influence the usefulness and long-term viability of the final resource.

As part of this study, therefore, the first stage was to carry out document analysis. This entailed the examination of the three reports to understand their format, content and structure, and to finalise the general aims of the project (Morrison et al. 2002; see Table 4 below). Structure outlines were created to compare the reports and to identify their large structural units.

A further stage was to consider the intended readership and users of grey literature and what they might want from electronic access, as discussed above (see 4.2.2). It is recognised that every user may want something different, and that it is impossible to anticipate the needs and uses of a future audience. Whilst it may not be possible to please everyone, it is feasible to evaluate the features of key importance for specific categories of audience, and then address those needs and concerns (Morrison et al. 2002). For the majority of users, the key concerns are anticipated to be discovering that the resource exists, obtaining summary information and finding specific details, such as specialist data and the answers to who, what, where and when.

Within the discipline of archaeology, there is no current appraisal of user needs for electronic publication. The 1998 PUNS survey has, therefore, formed the basis for the assessment of user needs for the present case study, as well as the author's own experience of curatorial practice, based with the NYCC HER, especially the use of HBSMR software and the OASIS Project for recording archaeological event and source data (Jones et al. 2001; 2003).

4.3.1 The choice of XML technologies and the TEI P4 Guidelines

Having reviewed the nature and potential of the various electronic formats available (see 2.7.1, 3.2 and 3.3), the author decided to use XML and associated technologies and to follow the TEI Guidelines. Gray and Walford (1999) are strong proponents of XML, and Meckseper (2001) has previously assessed available markup languages for encoding archaeological reports, choosing TEI Lite above other DTDs, such as ArchaeoML.

Having identified the specific aims of content markup (see 4.2 above), both TEI Lite and the full XML version of TEI P4 were assessed for relevant element content. The author considered that TEI Lite did not provide coverage of elements for the full range of content markup desired. For the present case study, therefore, it was decided to follow the full P4 Guidelines (Sperberg-McQueen and Burnard 2002). As the AHDS guidance states, 'having identified the features you wish to encode, you will need to find a DTD which meets your requirements. If working with literary or linguistic materials, consider the work of the Text Encoding Initiative and think very carefully before rejecting use of their DTD' (Morrison et al. 2002). A system is needed that can be added onto existing practices, is easily understood and does not require complex programming skills (Wolle and Shennan 1996).

Initially, it was felt that the TEI DTD might require amendment to accommodate specific archaeological content, in particular the use of controlled vocabularies. However, through the use of a varied selection of elements and the application of attributes, it was possible to accommodate all the markup desired. The end products, therefore, have been TEI-compliant documents, unlike those produced by Meckseper (2001; see 4.3.2).

4.3.2 Structure versus content

A major factor to be considered was what was important about the text: its appearance, the analysis of its content, or a combination of both. In this case study, it was decided that a combination was appropriate (see Table 4 below). Encoding of elements relating to structure and appearance will enable a reproduction of the original format to be created, as well as the retrieval of certain sections, such as introduction, summary and appendices. Markup of content will be useful for readers to search for specific information to allow them to identify reports of interest and thence to retrieve relevant content. The PUNS report shows that reports are rarely read in their entirety from start to finish; certain sections are more popular than others and so it was felt important to encode these sections structurally (Jones et al. 2001; 2003).

Structure      Format                                              Content
Front matter   Special characters, e.g. µ, %                       Who
Body matter    Tables                                              What
Back matter    Paragraph numbers                                   When
               Font type and size, e.g. bold, italics, underline   Where

Structure to Encode:

Front matter        Body matter         Back matter
Title page          Introduction        References
Summary/abstract    Chapters/sections   Appendices
Table of contents   Sub-headings        Figures
Figure list         Paragraphs          Photographs
Photograph list     Conclusion
                    Discussion

NB: Separate pages and lines will not be encoded, as this level of detail is not felt to be appropriate. Components identified by the PUNS survey as those most commonly used are shown in italics (Jones et al. 2001)

Content to Encode:

Who                  What                     Where                 When
People               Specialist Interest      Geographical Places   Dates
Organisations        Site area                Grid references
Contractors          Techniques
Authors              Materials
Commissioning body   Monuments/objects etc.

TEI Header

NB: As part of this case study, the author carried out a concordance between data fields in an OASIS Project record, HBSMR records and the elements and attributes available in the TEI Guidelines (Falkingham 2004)

Table 4: Summary of general structure and content markup analysis undertaken by the author prior to encoding being applied to the grey literature reports.
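
The categories summarised in Table 4 map fairly directly onto standard TEI P4 elements. The outline below is a minimal sketch of that mapping and not the markup actually applied in the case study: the type attribute values on <div>, <name> and <rs> are illustrative assumptions, the place, organisation, grid reference and date are invented, and the ellipses stand for omitted content.

```xml
<TEI.2>
    <teiHeader>
        <!-- fileDesc, encodingDesc and profileDesc omitted from this sketch -->
    </teiHeader>
    <text>
        <front>
            <titlePage> ... </titlePage>
            <div type="summary"> ... </div>
            <div type="contents"> ... </div>
        </front>
        <body>
            <div type="introduction">
                <head>Introduction</head>
                <!-- Paragraph number kept in the global n attribute.
                     All names, places and dates here are invented. -->
                <p n="1.1">The <term type="ALGAOET">excavation</term> at
                    <name type="place">Exampleton</name>
                    (<rs type="gridref">SE 0000 0000</rs>) was carried out by
                    <name type="org">Example Archaeology</name> in
                    <date value="1999-06">June 1999</date>.</p>
            </div>
            <div type="discussion"> ... </div>
            <div type="conclusion"> ... </div>
        </body>
        <back>
            <div type="references"> ... </div>
            <div type="appendix"> ... </div>
        </back>
    </text>
</TEI.2>
```

In a layout of this kind, the typed divisions would support retrieval of individual sections such as the summary or appendices, while the inline elements carry the who, what, where and when.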

4.3.3 Previous examples of archaeological markup

One of the earliest discussions of electronic report creation and publication is that of Rahtz et al. (1999), who discuss the benefits of using hypertext and the application of LaTeX as an authoring system. One of the earliest articles to promote the use of XML in electronic archaeological publication is that by Gray and Walford (1999). They discuss the shortcomings of existing search tools on the Web (see also 2.8.2) and look to enhance the potential for resource discovery for the purposes of research. Gray and Walford (1999) propose the concept of Structured Site Descriptions (SSDs), and an example is given. These SSDs are proposed as an integral part of Web-published site reports or summaries, and to contain data similar to that found in the abstract or summary of a conventional printed report. The difference lies in the way that this information is structured, enabling users to conduct moderately complex searches more effectively than is currently possible. The use of controlled vocabularies is promoted for consistency of approach. The SSDs are mainly designed for indexing and searching, as a pointer to the report rather than an encoding of the full document. They comprise items such as a feature, find or site, and attributes such as location, date and size. These details are similar to those included by the author in the approach to provision of content, and referencing of external thesauri within the TEI Headers as part of the case-study (see 4.3.6).

As noted above, Meckseper (2001) has undertaken the pioneering work in the markup of archaeological reports using XML (Meckseper and Warwick 2003). Her study evaluated the views of commercial archaeological contractors (see 2.3.4) and the suitability of a range of available XML DTDs that might be used, including David Schloen's ArchaeoML (Schloen 2001, see 2.1.2) and the mda's SPECTRUM DTD (Degenhart Drenth 2001). The outcome, as per the AHDS guidance cited above, was the selection of the encoding scheme developed by the TEI and the application of TEI Lite to two archaeological reports prepared by ARCUS (Morrison et al. 2002). Meckseper's study focused upon the use of elements primarily to encode document structure, as well as the TEI Header (see 4.3.6). In addition, monuments, objects and materials, where mentioned, were encoded using controlled vocabularies within the non-technical summaries of the reports; so too were names, dates and places (Meckseper 2001). Illustrations were not included within the markup. Where material was included as appendices, Meckseper (2001, 40) restructured the reports and renumbered sections to include them in the main body of the text. In the author's own case-study, the reports have been left in their original format.

As a preliminary to the current study, therefore, the author contacted Meckseper and was kindly provided with a copy of an encoded text and associated DTD that had been compiled via the TEI Pizza Chef, and greatly reduced in size (Meckseper 2001, 39). These files were analysed to identify the way that element tags had been applied. Following further discussion with Meckseper and replies to an enquiry to the TEI-L email discussion list, it was decided to use a different approach to the referencing of external thesauri and standardised terminology (see 4.3.4). The approach used by Meckseper had necessitated invalid amendments to the TEI DTD. The result of the work of the author has been the creation of TEI.2 conformant documents (see 4.3.5). Meckseper displayed an XML document with a CSS stylesheet, but found that the lack of Web browser support limited her aspirations to use XSLT at that time (Meckseper 2001, 42).

4.3.4 Terminology and data standards (Inscription wordlists and thesauri)

In order to ensure compatibility of different systems for recording and description, and thus interoperability, it is important to ensure the use of a common terminology, shared by those within a discipline who wish to provide a collective information resource and to interchange data with one another. Within archaeology, the Forum on Information Standards in Heritage (FISH) works to co-ordinate, develop, maintain and promote standards for the recording of heritage information and has compiled a list of wordlists under the collective name of 'Inscription'.

Use of these wordlists provides a means by which different aspects relating to cultural heritage can be described and indexed consistently by a range of users from national and local inventories, such as HERs, through to local societies and specialist researchers. There is a FISH email discussion list to disseminate information amongst interested parties, and to encourage comment and debate. FISH recommends that wordlists from Inscription are built into inventory databases to improve standards, and that such databases should be designed to prompt users with the terms that are available, control the terms actually used, and allow the user to search the records with the same terms for retrieval.

For the purposes of this case-study, therefore, the author felt that it was essential to incorporate these nationally accepted vocabularies into the markup of the reports wherever possible. Accordingly, relevant keyword schemes have been cited within the TEI Header using the model of:

<encodingDesc>
    <classDecl>
        <taxonomy id="EHTMT">
            <bibl>
                <title><xref doc="EHTMT">English Heritage 1998 Thesaurus of Monument Types Version 2.0</xref></title>
            </bibl>
        </taxonomy>
    </classDecl>
</encodingDesc>

The relevant wordlists used are:

All of these, with the exception of the NYCC HBSMR Development Type Look up Table, are available online via the Inscription List of Wordlists. Abbreviations of the titles of these wordlists have been used for the keywords identified in the <profileDesc> section of the TEI Header, for example:

<profileDesc>
    <textClass>
        <keywords> Archaeology, XML, TEI P4 </keywords>
        <keywords scheme="EHTMT"> post hole, ditched enclosure, pit, ridge and furrow, trackway, palisade, field system, post built structure </keywords>
        <keywords scheme="mdaAOT"> loomweight, human remains, lithic implement, ecofacts </keywords>
        <keywords scheme="RCHMETBM"> </keywords>
        <keywords scheme="ALGAOET"> excavation, trial trench </keywords>
        <keywords scheme="RCHMEAPL"> Bronze Age, Iron Age, Roman, Medieval, Post Medieval, Uncertain </keywords>
        <keywords scheme="EHREP93LU"> cultivation to a depth >0.25m </keywords>
        <keywords scheme="NTSMRPGSL"> </keywords>
    </textClass>
</profileDesc>
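
The way these abbreviations tie back to the wordlist declarations can be illustrated with a short XSLT 1.0 stylesheet. This is a sketch under assumptions rather than part of the case study: it assumes that each scheme value has a matching <taxonomy> with the same id in the header, as in the model above, and it simply prints each wordlist title followed by the keywords drawn from it.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <xsl:template match="/">
        <!-- Only keyword lists that name a scheme and are not empty -->
        <xsl:for-each select="//profileDesc/textClass/keywords[@scheme][normalize-space(.)]">
            <!-- Look up the wordlist title declared under the same id -->
            <xsl:value-of select="normalize-space(//taxonomy[@id = current()/@scheme]/bibl/title)"/>
            <xsl:text>: </xsl:text>
            <xsl:value-of select="normalize-space(.)"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Keyword lists with no scheme attribute, and those left empty for a given report, are skipped by the two predicates.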

References within the main text have been encoded using the <term> element, with the type attribute value identifying the relevant wordlist. The aim has been to preserve the original report text and not to alter the original author's description. Where appropriate, therefore, a regularised form has also been included in the reg attribute to map the term used in the report to the equivalent term used in the wordlist, as in the example below:

<term type="ALGAOET" reg="trial trench">evaluation</term>
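
Recording the regularised form makes it possible to search on the controlled term while still displaying the report's own wording. The following XSLT 1.0 sketch illustrates the idea; the parameter names scheme and wanted are hypothetical, and this is not the stylesheet used in the case study.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <!-- Hypothetical parameters: the wordlist abbreviation and the
         controlled term to search for -->
    <xsl:param name="scheme" select="'ALGAOET'"/>
    <xsl:param name="wanted" select="'trial trench'"/>

    <xsl:template match="/">
        <!-- Match on reg where present, otherwise on the element text -->
        <xsl:for-each select="//term[@type = $scheme]
                              [@reg = $wanted or
                               (not(@reg) and normalize-space(.) = $wanted)]">
            <!-- Output the wording as it appears in the report -->
            <xsl:value-of select="normalize-space(.)"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

With the default parameter values, the <term> element shown above would be selected and its original wording, 'evaluation', written to the output.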

These wordlists are, of course, only those selected by the author as being relevant to the particular content of the three selected archaeological reports, and the aims of the markup for this case study. Any number of additional wordlists or thesauri could be included from other disciplines, such as biology, should the encoder wish to describe and highlight specific occurrences of particular species within, for example, the specialist reports on palaeoenvironmental material (see 4.2.2). Similarly, particular pottery fabrics and forms could be encoded according to agreed terminology for cross-report searching and retrieval. For archives, the 1999 Archaeological Archive Types wordlist, also available from Inscription, could be used with the <term> element in the same way as the wordlists above.

