Section 3: Introducing the eXtensible Markup Language and Related Technologies

3.7 Uses of XML for the provision of online content

This section summarises a number of examples of how XML technologies are being used for document encoding and transformation, both in the humanities and in archaeology. An extended discussion may be found in Falkingham (2004).

The Perseus Project is an evolving online digital library of resources for the study of the ancient world and beyond, developed at Tufts University, USA. It comprises a textual and visual collection of materials on subjects including the Classical Greek world and late Republican and early Imperial Rome. The project has encoded several thousand documents in SGML and XML markup, following the TEI Guidelines. The markup has focused on basic document structure, including chapters, sections, headers, notes and blockquotes, and on identifying individual bibliographic citations and linking these to formal bibliographic records for author and work. Most foreign language quotations, letters and extracts of poetry have been marked up by hand (Crane 2000).
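
The kind of citation markup described above can be illustrated with a minimal sketch in Python. The fragment below is invented for illustration and is not taken from the Perseus collection; the element names (div, head, bibl, author, title, note) follow common TEI usage, while the function name extract_citations and the placeholder authors and works are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment; not actual Perseus data.
SAMPLE = """
<text>
  <body>
    <div type="chapter" n="1">
      <head>Hypothetical chapter heading</head>
      <p>Prose with a citation
        <bibl n="ref-1"><author>Author A</author>, <title>Work A</title> 1.1</bibl>
        and a second one
        <bibl n="ref-2"><author>Author B</author>, <title>Work B</title> 2.3</bibl>.
      </p>
      <note>An editorial note.</note>
    </div>
  </body>
</text>
"""


def extract_citations(xml_text):
    """Collect author and title from every <bibl> element in the document."""
    root = ET.fromstring(xml_text.strip())
    citations = []
    for bibl in root.iter("bibl"):
        citations.append({
            "id": bibl.get("n"),
            "author": bibl.findtext("author"),
            "title": bibl.findtext("title"),
        })
    return citations


citations = extract_citations(SAMPLE)
for c in citations:
    print(c["id"], c["author"], c["title"])
```

Once citations are marked up in this way, each one could in principle be matched against a separate table of bibliographic records keyed on author and work, which is the linking step the project describes.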

Rydberg-Cox et al. (2000) have identified one of the challenges in building a digital library of this type as the ability to apply this approach to a large number of documents marked up to varying levels of specificity and with differing tagging conventions and document type definitions (DTDs). Whilst the use of a variety of DTDs and markup practices may make the encoding of individual documents easier, Rydberg-Cox et al. recognise that this can raise barriers to resource discovery within a digital library. To address this, an adaptable toolset has been developed for the Perseus Library. The Perseus XML Document Manager is able to process encoded texts and images, extract and index structural and descriptive metadata, deliver document fragments on demand, link geospatial data to a GIS, and support other tools that analyse linguistic features and manage document layout. In this way, many operations can be performed on the data to establish automatic connections between different and otherwise isolated parts of the collection and to deliver content on a variety of platforms (Crane 1998; 2000).
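
The general idea of indexing documents encoded under different DTDs can be sketched as follows. This is not the Perseus XML Document Manager's implementation, which is not described in detail here; it is a minimal illustration that assumes each DTD's element names can be mapped onto a shared vocabulary. The DTD labels (dtd_a, dtd_b), the mappings, the sample documents and the function index_sections are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from each DTD's element names to a shared vocabulary.
TAG_MAPS = {
    "dtd_a": {"chapter": "section", "heading": "title"},
    "dtd_b": {"div1": "section", "head": "title"},
}

# Hypothetical documents, each labelled with the DTD it was encoded against.
DOCS = {
    "doc1": ("dtd_a", "<book><chapter><heading>First</heading><p>Text.</p></chapter></book>"),
    "doc2": ("dtd_b", "<text><div1><head>Second</head><p>More text.</p></div1></text>"),
}


def index_sections(docs, tag_maps):
    """Build one index of (document id, section title) across all DTDs."""
    index = []
    for doc_id, (dtd, xml_text) in docs.items():
        mapping = tag_maps[dtd]
        section_tags = [tag for tag, common in mapping.items() if common == "section"]
        title_tags = [tag for tag, common in mapping.items() if common == "title"]
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            if elem.tag not in section_tags:
                continue
            # Use the first title-like child element found, if there is one.
            title = next(
                (elem.findtext(tag) for tag in title_tags if elem.find(tag) is not None),
                None,
            )
            index.append((doc_id, title))
    return index


index = index_sections(DOCS, TAG_MAPS)
print(index)
```

Keeping the mappings outside the documents means that the original encoding is left untouched, and a further DTD can be supported by adding one more mapping rather than re-encoding texts.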

The LEADERS Project has developed a set of generic computer-based tools to deliver integrated user access to archives over the Web (Leaders nd a; nd b). The project aims to build archive finding aids that identify, manage and locate records and that are linked to encoded transcripts and digitised images of paper-based archival materials. XML and the TEI Guidelines are being used to encode archive documents, the Encoded Archival Description (EAD) to encode archive finding aids, and the Encoded Archival Context (EAC) to encode administrative and biographical information about associated organisations and people (Sexton 2002; 2003). The project has developed a demonstrator application using seventeen documents from the George Orwell Archive and the University College London Archive, and this has been subjected to detailed user testing (Sexton 2004).

The National Database Project of Norwegian Universities: The Museum Project has converted information from a variety of disciplines and sources from paper-based archives into electronic form to enable interoperability via the Internet. Based at the University of Oslo, and begun in 1992, this is a collaborative project between the Faculties of Art in the Norwegian universities, each of which housed archaeological museums felt to be in need of revitalisation (Ore and Eide 2001). The project aimed to enhance accessibility for a range of users and to create a national database for language and culture, making possible multidisciplinary studies of material relating to subjects such as folklore, archaeology, runes, place names and folk music (Holmen and Uleberg 1996b). An object-oriented database and geographical information system have been devised to combine textual information with drawings, photographs, maps and sounds, utilising free text, hypertext, scanned documents and bitmap images. Work began by converting artefact catalogues into machine-readable formats using optical character recognition techniques and encoding these using SGML and TEI-conformant encoding schemes (Holmen et al. nd, fig. 1). Handwritten records were manually transcribed. A particular point of debate was the approach to markup and the terminology to be used to describe artefacts recorded over a period of 170 years. It was decided not to reclassify or modernise the original data, but to add interpretive SGML tags to solve the problem of old and new terms defining the same artefact classes. Holmen et al. (nd) realised that the encoding of content in the 20th century reflects our view of what constitutes important archaeological information, and does not necessarily coincide with the conception of the 19th-century author of the original material. To achieve a common vocabulary, a standardised system of markup was devised with a single DTD that can be mapped to the CIDOC CRM (Holmen and Uleberg 1996a).
The SGML tags create the link between the text and the database, as almost every aspect of the catalogue was encoded, including location, material and artefact types, decoration, and date (Holmen and Uleberg 1996b). A notable feature of this project is that a previously inexperienced workforce, with no prior knowledge of archaeology, was trained to perform the text analysis and encoding of the textual material, with different groups working in different locations. The aim was to convert a large amount of data in a relatively short timescale. Between 1992 and 2000, almost 30,000 printed pages of text were converted and tagged in SGML and, latterly, XML (Holmen et al. nd, fig. 2).
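
The principle of adding interpretive markup while leaving the original wording untouched can be shown with a short sketch. The element names (entry, artefact, material), the interp attribute, the English example terms and both functions are assumptions made for illustration; they are not the Museum Project's actual DTD or vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from historical catalogue terms to modern artefact classes.
MODERN_CLASS = {"celt": "axe", "fibula": "brooch"}

# Hypothetical catalogue entry; the original wording is kept as element text.
ENTRY = '<entry n="1"><artefact>celt</artefact> of <material>stone</material>.</entry>'


def add_interpretation(xml_text, mapping):
    """Add a modern class as an attribute without altering the original term."""
    root = ET.fromstring(xml_text)
    for artefact in root.iter("artefact"):
        original = (artefact.text or "").strip().lower()
        if original in mapping:
            artefact.set("interp", mapping[original])
    return root


def find_by_class(root, modern_class):
    """Return the original terms of artefacts assigned to a modern class."""
    return [a.text for a in root.iter("artefact") if a.get("interp") == modern_class]


tagged = add_interpretation(ENTRY, MODERN_CLASS)
print(ET.tostring(tagged, encoding="unicode"))
print(find_by_class(tagged, "axe"))
```

A search on the modern class then retrieves entries regardless of the term used by the original cataloguer, while the transcribed text remains a faithful record of the source.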

The XSTAR Project aims to create an electronic publication and research tool, comprising a database and related user-interface software (Schloen 2001). These have developed out of research into the electronic publication of Ancient Near Eastern texts and a desire to integrate texts with their archaeological and geographical contexts, for the purposes of large-scale comparison and querying (Jones and Schloen 1999). The hierarchical database structure is defined by the Archaeological Markup Language (ArchaeoML), devised by Schloen (2001) to enable the data to be uploaded onto the Web as a platform-independent database application. ArchaeoML is formally defined in terms of twenty-four interrelated schemas; details of these, along with a full element list, full documentation and source files currently under development, are available via the Project website. The XSTAR schemas do not define a purpose-specific data-exchange format, such as the TEI Guidelines, but a general-purpose database structure. At the time of writing, an online demonstration version of a sample XSTAR database is under development, using platform-independent Java user-interface software for viewing and querying XSTAR databases. It is intended that applications may be downloaded onto any researcher's computer, regardless of operating system. This software will interact with XSTAR data by means of the Tamino XML database management system to deliver information as XML text over the Web. It will be possible to search large quantities of structured information using a variety of criteria. Alternatively, the user will be able to navigate manually through large numbers of interlinked maps, images, and documents (Schloen 2001).

The Oxford Text Archive (OTA) has been collecting, preserving and redistributing electronic texts to the academic community for over twenty years. The preferred format for archiving and distributing literary and linguistic texts is TEI-conformant SGML. However, the OTA website provides users with a choice of formats in which to download texts, including ASCII, HTML and XML. Titles may be browsed alphabetically, and specific titles and authors can be found through full-text and advanced searches. The OTA presently distributes a collection of over 2500 literary and linguistic resources in more than twenty-five different languages.

