Internet Archaeol 17. Falkingham.

Section 4: A Practical Evaluation of XML Technologies and TEI P4 for Archaeological Markup and Multi-layered Presentation

4.4 Methodology: transformation and presentation of multi-layered output

4.4.3 The use of XSLT and XPath for retrieval of selected content

It has been shown above (see 4.4.2) how CSS can be used to style an XML document in a manner similar to the viewing of an (X)HTML webpage. XSLT, however, can perform more powerful operations than style formatting: a document may be transformed into different formats, such as new XML, XHTML or plain text, or tailored for different media, such as a hand-held device (see 3.5.2). XSLT can also be used to reorder information, or to output a specific part of a document (Castro 2001, 135).

The TEI website provides access to a number of XSL stylesheets for TEI XML. However, for the present case-study, the author has scripted her own basic stylesheets for specific purposes. Two XSL transformations have been applied to each of the three reports, with the aim of extracting content and data for the general public and for curatorial purposes (see 4.2.2). As only the Roecliffe report contained specialist analyses and these are within appendices, XSL transformations have been applied to two examples of appendices to retrieve them for specialist viewing (Young and Fraser 1998). These are outlined below.

4.4.3.1 Public: the extraction of report summaries, conclusions and images

Market Place, Thirsk: Negative Watching Brief Report

Catterick Bridge: Archaeological Monitoring and Recording of Strengthening Works

Roecliffe Lane, Boroughbridge, North Yorkshire: Archaeological Evaluation Report

Click on a link above to view the XSL transformations discussed in the text.

For public viewing (see 4.1.2 and 4.2.2), user needs analysis has identified that summaries, conclusions and images are the most commonly sought report contents. Accordingly, the aim of these transformations has been to extract this content for viewing. The transformations retrieve content mainly on the basis of the encoding of report structure.

Each XSL document begins with an XML declaration, and its <xsl:stylesheet> element declares the namespace URI of the XSLT specification. Microsoft Internet Explorer 5.0 and 5.5 only partially support the working draft of XSLT, and do not support XSLT 1.0; therefore the examples presented here will not work client-side unless the user has Internet Explorer 6.0, or an equivalent XML/XSL-supporting Web browser.

The script of the basic XSL stylesheet for the Thirsk report transformation is shown below:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Root template: builds the HTML page and hands over to the TEI.2 template -->
  <xsl:template match="/">
    <html xml:lang="en">
      <head>
        <style type="text/css"> body {font-family : Arial, Helvetica, sans-serif; display:inline; }
        </style>
        <title>Extracting Summary, Conclusions and Images for the Public</title>
      </head>
      <body>
        <h3 id="top"> Results Extracted from the Thirsk Report for Public View:</h3>
        <xsl:apply-templates select="/TEI.2" />
      </body>
    </html>
  </xsl:template>
  <!-- Report title from the title page, then the 'abstract' division from the body -->
  <xsl:template match="TEI.2">
    <h2> <xsl:apply-templates select="text/front/titlePage/docTitle/titlePart"/> </h2>
    <p> <xsl:apply-templates select="text/body/div1[@type='abstract']"/> </p>
  </xsl:template>
  <!-- Heading of a first-level division -->
  <xsl:template match="div1/head">
    <h4><xsl:apply-templates /></h4>
  </xsl:template>
  <!-- Each TEI paragraph becomes an HTML paragraph -->
  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>

In the Thirsk report markup, the single paragraph of results has been encoded within the body of the report as <div1 type="abstract">, with a <head> heading and <p> paragraph. In the Catterick and Roecliffe reports, it is within the front matter <front>.

A variety of rules have been included in the above stylesheet example. A template rule, defined using <xsl:template>, controls what output is created from what input. Its 'match' attribute contains an XPath pattern identifying the input nodes to which the rule applies; for instance, the root template <xsl:template match="/"> matches the root node of the document. This root template contains literal result elements, such as the <html> tags, which enable content written directly into the stylesheet to be copied literally into the result tree when processed, such as the title at the top of the screen (Fitzgerald 2004, 23). The templates also contain the XSLT instruction <xsl:apply-templates>, which identifies a node set and makes explicit the choice of processing order. Its 'select' attribute uses an XPath expression to tell the XSLT processor which nodes to process at that point in the output tree; hence <xsl:apply-templates select="text/body/div1[@type='abstract']"/> instructs the processor to process the <div1> element of the type 'abstract' located within the body of the text (Harold and Means 2002, 140). Having found this section of the text, the processor applies the other templates to it to output the heading and paragraphs it contains.

In the Catterick and Roecliffe report examples, the same basic stylesheet has been used, with slight amendments and additions due to differences in report structure and content – for example searching the front matter for the summary and the body for the conclusions. To retrieve images, different approaches have been demonstrated in the Catterick and Roecliffe examples.
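The kind of amendment described above might look something like the following. This is a minimal sketch only, not the stylesheet used in the case-study: it assumes that the summary is a <div1> in the front matter and the conclusions a <div1> in the body, and the 'type' values 'summary' and 'conclusions' are hypothetical.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html>
      <head><title>Summary and Conclusions</title></head>
      <body>
        <!-- Summary: assumed to sit in the front matter; 'summary' is a hypothetical type value -->
        <xsl:apply-templates select="TEI.2/text/front/div1[@type='summary']"/>
        <!-- Conclusions: assumed to sit in the body; 'conclusions' is a hypothetical type value -->
        <xsl:apply-templates select="TEI.2/text/body/div1[@type='conclusions']"/>
      </body>
    </html>
  </xsl:template>
  <!-- Wrap each selected division so that its heading and paragraphs stay together -->
  <xsl:template match="div1">
    <div><xsl:apply-templates/></div>
  </xsl:template>
  <xsl:template match="div1/head">
    <h4><xsl:apply-templates/></h4>
  </xsl:template>
  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```

Only the two 'select' paths differ from report to report; the templates for headings and paragraphs can be shared.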

For the Catterick example, the model of:

has been used to retrieve a thumbnail of a figure, followed by an individually selected related figure heading, numbered uniquely for this purpose within the markup using an 'id' attribute.

For the Roecliffe example, the model of:

has been used to retrieve a list of all figure headings, followed by the retrieval of thumbnail images, displayed at a smaller size than those in the Catterick example, in a tabular format. This has been achieved using the literal elements of <th>, <tr> and <td> (the same as with (X)HTML). Individual images are retrieved with the same <a href> element as for the Catterick example above.
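The models referred to above are not reproduced here. As an illustration of the general approach only, the following sketch lists all figure headings and then presents linked thumbnails in a table. It assumes that each <figure> carries an 'entity' attribute naming an unparsed entity declared for the image file, and that each has a <head> child; the thumbnail width of 100 pixels is arbitrary.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html>
      <head><title>Figures</title></head>
      <body>
        <!-- List of all figure headings -->
        <ul>
          <xsl:for-each select="//figure">
            <li><xsl:value-of select="head"/></li>
          </xsl:for-each>
        </ul>
        <!-- Thumbnails in tabular format, built from literal result elements -->
        <table>
          <tr><th>Thumbnail</th><th>Figure</th></tr>
          <xsl:apply-templates select="//figure"/>
        </table>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="figure">
    <!-- Assumption: @entity names an unparsed entity declared for the image file -->
    <xsl:variable name="image" select="unparsed-entity-uri(@entity)"/>
    <tr>
      <!-- The thumbnail links to the full-size image -->
      <td><a href="{$image}"><img src="{$image}" width="100" alt="{head}"/></a></td>
      <td><xsl:value-of select="head"/></td>
    </tr>
  </xsl:template>
</xsl:stylesheet>
```

To retrieve a single figure, as in the Catterick approach, the 'select' path could be narrowed with a predicate on the 'id' attribute.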

It is recognised that the quality of some of the scanned images is less than desirable. This is a result of scanning photocopies, as most of the originals were not available except for the Catterick plates (see 4.3). For presentation of images on the Web, the guidance of TASI (2002-2004) is helpful; alternatively, SVG could be applied if digital vector drawings are available (see 3.5.4.2).

Prior to achieving the desired results with the template-matching examples shown above, the alternative method of using the XSLT instructions <xsl:for-each> and <xsl:value-of> to select content was tried. However, this was only partially successful, as <xsl:value-of> outputs only the string-value of the first node in the selected set (Castro 2001, 142). So, although the heading was returned, only the first paragraph appeared in the output. For the Thirsk example, this was not a problem, as there is only one paragraph to retrieve. However, for the other examples, with several paragraphs, the only solution found was to number each paragraph through a separate attribute, for example <p n="1">, <p n="2">, and to select each individually. This was felt to be too cumbersome for the application of markup, and template matching is seen as a far neater solution. For the retrieval of image files, an <xsl:variable> instruction could also be used (J. Cummings, pers. comm. August 2004). There are various methods within XSLT for achieving a similar result.
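The first-node behaviour described above can be illustrated with a short sketch. It is not the stylesheet that was originally tried; it reuses the path to the 'abstract' division from the Thirsk example and assumes that the division contains several <p> elements.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html>
      <head><title>value-of compared with for-each</title></head>
      <body>
        <!-- value-of on a node set: only the string-value of the FIRST p is output -->
        <p><xsl:value-of select="TEI.2/text/body/div1[@type='abstract']/p"/></p>
        <hr/>
        <!-- for-each visits every p in turn, so all paragraphs are output -->
        <xsl:for-each select="TEI.2/text/body/div1[@type='abstract']/p">
          <p><xsl:value-of select="."/></p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Even inside the loop, <xsl:value-of> flattens any markup within a paragraph to plain text, which is a further reason to prefer template matching where inline elements need their own handling.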

4.4.3.2 Curator: the extraction of OASIS Project and HBSMR-related data

Market Place, Thirsk: Negative Watching Brief Report

Catterick Bridge: Archaeological Monitoring and Recording of Strengthening Works

Roecliffe Lane, Boroughbridge, North Yorkshire: Archaeological Evaluation Report

Click on a link above to view the XSL transformations discussed in the text.

From a curatorial perspective, selected data have been extracted for input into other systems, such as HERs and the OASIS Project database. For this XSLT, a concordance was drawn up between the TEI elements, OASIS and exeGesIS Spatial Data Management Ltd's HBSMR software data fields (Falkingham 2004). This helped to formulate the approach to the content markup of the three reports for this case-study, and also to structure the scripts of the XSLT stylesheets. This transformation retrieves content based upon both the encoding of report structure and specific content.

Data have been selected from multiple parts of the document: the TEI Header, front, body and back matter. The aim has been to extract data for input into an OASIS Project record, where such data were present in the reports. A few additional fields relevant to HBSMR have been added to the end of the stylesheet. The output has been styled to reflect the layout of the HTML view of an OASIS Report. Alternatively, the stylesheet could be reordered to present a listing appropriate for HBSMR.

One XSL stylesheet has been produced for the Thirsk report and another that can be applied to both the Catterick and Roecliffe documents. As individual elements are being sought, the XSLT instructions <xsl:for-each> and <xsl:value-of> have been used. For multiple returns, template matching would, however, be more appropriate, similar to the examples in the previous section (see 4.4.3.1).

The <xsl:for-each> instruction is used to iterate over the nodes of the input tree selected by the XPath expression in its 'select' attribute; in this example, this is the whole document. Within it, the element <xsl:value-of> calculates the string-value of an XPath expression (abbreviated by use of the double forward slash '//' to select from all descendants of the context node), and inserts this in the output document; the 'select' attribute is used to choose the required value (Harold and Means 2002, 162). What appears in the output is the text content of the selected element after the tags have been removed and the entity and character references have been resolved (Harold and Means 2002, 142).
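A minimal sketch of this pattern is given below. It is not the curatorial stylesheet itself: the field labels are hypothetical stand-ins for OASIS fields, and only three standard TEI Header elements are retrieved.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html>
      <head><title>Data Extracted for the Curator</title></head>
      <body>
        <!-- for-each over the whole document, as described in the text -->
        <xsl:for-each select="/">
          <table>
            <!-- Each value-of returns the string-value of the first matching element -->
            <tr>
              <th>Project name</th>
              <td><xsl:value-of select="//teiHeader/fileDesc/titleStmt/title"/></td>
            </tr>
            <tr>
              <th>Author</th>
              <td><xsl:value-of select="//teiHeader/fileDesc/titleStmt/author"/></td>
            </tr>
            <tr>
              <th>Report date</th>
              <td><xsl:value-of select="//teiHeader/fileDesc/publicationStmt/date"/></td>
            </tr>
          </table>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Where a field may hold several values, such as multiple authors, a nested <xsl:for-each> or a matching template would be needed, since <xsl:value-of> returns only the first.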

Content could be tailored to XML, (X)HTML or plain text output through the use of <xsl:output method="xml"/>, for example (Fitzgerald 2004, 47). The client-side processing for this case-study, using Internet Explorer 6.0 SP1 and MSXML 4.0 SP1, did not render the output in this way. It would, however, work with a server-side transformation, or with a processor that fully implements the XSLT specification.
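As a sketch of how the output method is declared, the following stylesheet writes the report title as plain text. The 'Title:' label and the choice of the header title are illustrative assumptions; as noted above, the result depends on a processor that honours <xsl:output>.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Top-level declaration: serialise the result tree as plain text -->
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:text>Title: </xsl:text>
    <!-- normalize-space() trims line breaks and indentation from the source -->
    <xsl:value-of select="normalize-space(//teiHeader/fileDesc/titleStmt/title)"/>
    <!-- Character reference for a line feed -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Changing the 'method' attribute to 'xml' or 'html' alters only the serialisation; the templates themselves would then need to build the corresponding elements.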

It has not been possible to retrieve content for all equivalent OASIS and HBSMR database fields owing to lack of related information in the reports, such as the lack of a site status or significant finds, and non-specification of project personnel. In addition, no report made reference to the content or location of project archives, except for the location of the Catterick archive. Appraisal of the available TEI P4 elements indicated that there was no immediately obvious means to encode archive content details, had they been included. However, reference could be made to an external wordlist (see 4.3.4) and the <term> element used. The recently launched LEADERS Toolkit has sought to devise a means of bringing together TEI and EAD, and may prove useful in this respect. Comments to the TEI-L discussion list have suggested that a TEI special interest group should be established to develop a set of extensions to TEI to cover archival material in the spirit of the CIDOC CRM (see 3.3).

Another issue has been the retrieval of dates. The date of the report production is explicitly stated, and this presents no problem. However, with regard to the dating of individual monument types, at the time of writing, no means have been found to link particular dates with particular features. Instead, the relevant periods represented by the discoveries of each project have been included as keywords within the <profileDesc> of the TEI Header of each report, following the English Heritage 1998 RCHME Archaeological Periods List Version 1.0. In the Roecliffe report example, the issue has been exacerbated by the lack of absolute dating evidence from the evaluation and the uncertainty of the period attribution of particular features (Young and Fraser 1998). Whilst general conclusions can be drawn about the dating of the site, the details for individual features are less clear.

The author considered that extraction of data from reports to populate these heritage datasets would be a useful exercise. If it were possible to apply this on a national scale, this would have the potential benefit of reducing duplication of effort between a number of sectors of the profession, such as HERs completing event and source records, the AIP Project visiting each HER to complete their records, contractors completing OASIS Project records, and English Heritage populating NMR records, all of which are recording essentially the same data. If relevant XML encoding can be added to reports and stylesheets applied to export data, import scripts can be devised to automate the process of data exchange with these other heritage datasets. If this were implemented, then the manual input of data could be dispensed with, and potentially, backlogs could be reduced and currency of data in these systems improved. This is already one of the aims of the OASIS Project and FISH Interoperability Toolkit, and at the time of writing, it is envisaged that import and export scripts will be available soon. The above scenario, however, takes the process a step further back from the completion of an OASIS Project Form and into the writing of the report itself, thus blurring the boundaries between the textual material and the database.

4.4.3.3 Specialist: the retrieval of the specialist appendices

Appendix 3: Assessment of the environmental samples. J.P. Huntley

Appendix 5: Assessment of cremated human bone. J. Langston

Click on a link above to view the XSL transformations discussed in the text.

Only one of the reports contained specialist appendices, that for the Roecliffe evaluation (Young and Fraser 1998). To identify each appendix uniquely, it has been necessary to apply an 'id' attribute through the encoding (see 4.3.7). As with the stylesheets generated for public viewing (see 4.4.3.1), template matching has been applied to retrieve the content of selected appendices, specifically those dealing with human remains and environmental samples. The same base XSL stylesheet has been used for each appendix, the only difference being the attribute value chosen to select each appendix. This simple XSL transformation retrieves content based upon the encoding of report structure.

As mentioned in sections 4.2.2 and 4.3.4, the adoption of a wordlist, or use of terminology specific to archaeological science and/or other specialisms, would enable particular occurrences of species, materials, fabrics and techniques, for example, to be encoded and retrieved as desired.

Alternatively, an <xsl:param name="appendix" select="'human remains'"/> instruction could be declared at the top level of the stylesheet, outside the <xsl:template match="/"> element, to select this as the default, unless another appendix is selected (Fitzgerald 2004, 124). The inner single quotation marks are required so that the value is read as a string rather than as an XPath location path. In addition, <xsl:import> could be used to bring in other content from a separate XSL file (J. Cummings, pers. comm. August 2004).
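A parameterised version of the appendix stylesheet might be sketched as follows. This is an illustration rather than the stylesheet used: the 'id' value 'app5' is hypothetical, and the sketch assumes that each appendix element carries a unique 'id' attribute.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Top-level parameter; 'app5' is a hypothetical id used as the default.
       The inner quotes make the value a string, not a location path. -->
  <xsl:param name="appendix" select="'app5'"/>
  <xsl:template match="/">
    <html>
      <head><title>Specialist Appendix</title></head>
      <body>
        <!-- Select whichever element carries the requested id -->
        <xsl:apply-templates select="//*[@id=$appendix]"/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="head">
    <h4><xsl:apply-templates/></h4>
  </xsl:template>
  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```

A processor that accepts external parameters can then override the default without any edit to the stylesheet; with xsltproc, for instance, this is done with the --stringparam option.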

The display of data in tabular format has been achieved through detailed encoding of individual table cells, rows and labels. The application of these elements was a particularly laborious task when applied manually by the author, as empty cells still had to be identified to achieve the desired result. An automated means of applying markup would make this task much easier. Alternatively, a complex table could be scanned from a word-processed original and inserted into a document as an image.
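The transformation of such encoding into an (X)HTML table can be sketched as below. The sketch assumes the TEI <table>, <row> and <cell> elements, with label cells marked by role="label"; it is not the stylesheet used for the appendices, and the border setting is arbitrary.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html>
      <head><title>Tables from a Specialist Appendix</title></head>
      <body>
        <xsl:apply-templates select="//table"/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="table">
    <table border="1">
      <xsl:apply-templates select="row"/>
    </table>
  </xsl:template>
  <xsl:template match="row">
    <tr><xsl:apply-templates select="cell"/></tr>
  </xsl:template>
  <!-- Label cells become header cells; the predicate gives this rule priority -->
  <xsl:template match="cell[@role='label']">
    <th><xsl:apply-templates/></th>
  </xsl:template>
  <!-- Data cells; an empty cell receives a non-breaking space so that it is still drawn -->
  <xsl:template match="cell">
    <td>
      <xsl:choose>
        <xsl:when test="normalize-space(.) = ''">&#160;</xsl:when>
        <xsl:otherwise><xsl:apply-templates/></xsl:otherwise>
      </xsl:choose>
    </td>
  </xsl:template>
</xsl:stylesheet>
```

Empty cells must still be present in the markup for the columns to align, which is why the manual encoding described above was laborious.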