Section 3: Introducing the eXtensible Markup Language and Related Technologies

3.3 The Text Encoding Initiative and the XML version of the TEI Guidelines (TEI P4)

The TEI Guidelines, launched by the Text Encoding Initiative in 1987, are an international and interdisciplinary standard for the encoding of literary and linguistic electronic texts. The Guidelines are commonly used for the markup of electronic texts in the humanities by libraries, museums, publishers and scholars. Although originally created to serve the needs of the research community, they can be used by anyone creating and/or working with electronic texts.

'The TEI Guidelines provide means of representing those features of a text which need to be identified explicitly in order to facilitate processing of the text by computer programs. In particular, they specify a set of markers (or tags) which may be inserted in the electronic representation of the text, in order to mark the text structure and other textual features of interest' (Sperberg-McQueen and Burnard 2002).

The Guidelines are designed for the creation of documents conforming to either SGML or XML, for use in interchange between those using different programs and computer systems over a broad range of applications, and for the local storage of text which is to be processed with multiple software packages requiring different input formats. The specific design goals of the TEI have been that the Guidelines should provide a standard format for data interchange; provide guidance for the encoding of texts in this format; support the encoding of all kinds of features of all kinds of texts studied by researchers; and be application independent (Sperberg-McQueen and Burnard eds 2002).

When the case study was undertaken in 2003/4, TEI P4, issued in 2001, was the current XML version of the Guidelines. The full TEI DTD contains c. 450 elements, although it is recognised that most users will not use them all. A broad range of document structure and content may be encoded, and there is a detailed, online HTML-format manual explaining the nature of, and relationship between, elements and attributes. A wide variety of projects utilise the TEI Guidelines, and many are listed on the TEI website. However, different documents and document types demand different markup, and projects therefore draw on the range of TEI elements according to their own specific needs. Many projects use TEI Lite, a subset of c. 150 elements that aims to support 90% of the needs of 90% of the TEI user community (Mueller 2002).

The TEI DTD can be freely downloaded using the 'Pizza Chef'; the user has the choice of a series of core and base modular tag sets, which can be combined within the framework of the main DTD. Within each TEI document, the TEI Header provides a means of describing and documenting the text itself; in essence, this is the metadata about the file, its encoding, profile and revision history. All documents contain the core <text> element, which may hold <front>, <body> and <back> matter; these may be further subdivided by the <div> and <p> elements. For each element, the Guidelines specify which elements it may contain, which elements it may appear within, and which attributes it may carry.
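
The document structure described above can be illustrated with a minimal sketch. The sample document is an assumption made for illustration only: its title and paragraph text are invented, and the DOCTYPE declaration is omitted so that the snippet needs no DTD file. The root element <TEI.2> and the <fileDesc> children of the header are not mentioned in the text above but follow TEI P4 usage. Python's standard xml.etree.ElementTree module is used here simply to show the nesting; it does not validate against the TEI DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the title and text are invented for illustration.
SAMPLE = """<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample excavation report</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Created for this example.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front><div type="abstract"><p>Summary of the report.</p></div></front>
    <body>
      <div type="section"><p>First paragraph.</p><p>Second paragraph.</p></div>
    </body>
    <back><div type="appendix"><p>Context list.</p></div></back>
  </text>
</TEI.2>"""


def outline(element, depth=0):
    """Return the element tree as a list of tag names, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)

# The direct children of <text>: front, body and back matter.
text_parts = [child.tag for child in root.find("text")]

# The paragraphs held in the divisions of the body.
body_paragraphs = [p.text for p in root.findall("./text/body/div/p")]

print("\n".join(outline(root)))
print(text_parts)
print(body_paragraphs)
```

The header and the text are kept apart: everything under <teiHeader> describes the file, while everything under <text> is the encoded document itself.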

As with any standard or specification, there is a need to review and update. The latest version of the Guidelines, TEI P5, is still under development. It first appeared in January 2005, after the practical elements of this study were undertaken; work on it has been ongoing since early 2002 and, at the time of writing, a test release is available. Many ideas are being aired on the TEI-L email discussion list. In May 2004, for example, Christian-Emil Ore suggested that a TEI special interest group should be established with the objective of developing a set of recommendations and an extension/module in TEI in the spirit of the CIDOC CRM, with a well-defined mapping to the existing CIDOC CRM standard (Ore 2004). Having elements directly related to heritage data would certainly aid the encoding of archaeological reports and archive materials.

Expanding the topics covered by the Guidelines is discussed on the FAQ and Development Activities pages of the TEI website. The chief mechanism for extending the coverage of the Guidelines is by means of chartered TEI workgroups. Proposals from workgroups are made available for public discussion, and reviewed by the TEI Technical Council, before they become part of the standard.

3.3.1 Advantages and disadvantages of using the TEI Guidelines for document markup

The TEI Guidelines have several advantages. They are XML-based and therefore platform independent, and are easy to use without special-purpose software. Documents may be encoded manually using a plain text editor, such as Microsoft Windows Notepad (see Section 4). The architecture is widely accepted and maintained, and the TEI website offers extensive, freely available documentation together with links to a wide range of data and tutorials. Online support is available through the TEI-L public email discussion list and its archive, and the Guidelines contribute to open standards and recommendations.

Conversely, time is needed to explore and learn the Guidelines. Also, as they are a general-purpose recommendation, no one project is likely to use all the available elements. Consequently, the DTD will always be large and will contain a number of unused elements. However, as noted above (see 3.3), the shorter TEI Lite version is commonly used.

'For anyone seriously concerned about creating an electronic textual resource which will remain viable and usable in the "long-term" (which can be less than a decade in the rapidly changing world of information technology), the TEI's approach certainly merits very serious investigation, and you should think very carefully before deciding to reject the TEI's methods in favour of another apparent solution' (Morrison et al. 2002).

3.3.2 XML editors

Bradley (2002) notes that one of the great strengths of XML is that XML documents can be created using any tool that can produce a text file. All word processors have the option to save as text, or export in ASCII format (see 2.7.1). He argues that 'XML isn't really about markup, but about the concept of unambiguous identification of units of information that form meaningful sequential and hierarchical relationships'.
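
As a hedged illustration of this point, the sketch below treats an XML document as nothing more than a string of text, of the kind that could be typed in any plain text editor, and uses Python's standard-library parser to check that its tags nest properly. The function name is_well_formed and the two sample fragments are assumptions made for this example. The check covers well-formedness only; it does not validate a document against the TEI DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, otherwise False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical fragments, as they might be typed in a plain text editor.
good = "<div><head>Trench 1</head><p>A shallow pit was recorded.</p></div>"
# In this fragment the <head> element is never closed.
bad = "<div><head>Trench 1<p>A shallow pit was recorded.</p></div>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A parser rejects the second fragment because the hierarchy is ambiguous: without a closing tag there is no way to tell where the heading ends and the paragraph begins.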

There are a number of ways in which XML documents may be edited. Six distinct categories of editors have been discussed by Bradley (2002), including form-based editors, word processors, text editors, and XML word processors. He considers efficiency, ease of use, cost of ownership and ability to access document model rules defined in a DTD or schema definition.

There are several specialised XML editors available, which comprise a text window where the XML code can be written and a preview window that shows how the output will appear in a Web browser (McGrath 2002, 14). One of the most widely used is Altova's XML Spy, and there is also Wattle Software's XMLwriter editor. The TEI has a TEI Authoring Tools Special Interest Group and, whilst not directly endorsing particular software, maintains a list of commonly used tools. For the purposes of the case study presented in Section 4, the author has manually encoded digital text using Microsoft Windows Notepad. A review of currently available XML editors falls outside the scope of the present study.

© Internet Archaeology URL: http://intarch.ac.uk/journal/issue17/5/gf3-3to3-3-2.html
Last updated: Wed Apr 6 2005