Section 3: Introducing the eXtensible Markup Language and Related Technologies

3.1 XML and markup

Section 2.1 reviewed examples of grey literature dissemination via the World Wide Web using word-processed, HTML and Adobe PDF file formats; this section explores the eXtensible Markup Language (XML) and related technologies. These offer a range of new possibilities for electronic document delivery and presentation, as well as benefits for longer-term archiving. A number of disciplines, in particular those working with literary and linguistic texts, have been applying XML (and prior to this SGML) encoding to textual material for many years. Some examples of current online initiatives are summarised in Section 3.7. Although XML technology has been available for some years now, it is only recently that its use has begun to be promoted within the discipline of archaeology in the UK.

XML is the core of a suite of layered Web technologies used for the management, display and exchange of data, as well as storage, reuse and repurposing of content. It has been hailed as the technology of the future, and large vendors have supported this by integrating XML into database engines, development tools, Web browsers and operating systems (Box et al. 2000). These uses of XML are often invisible to the end user. Two of the main applications are the publication of webpages and the exchange of information (McGrath 2002, 8).

The World Wide Web Consortium (W3C) publishes a set of recommendations which specify the syntax and semantics of XML. Some of these are still work in progress and others have been revised since their initial appearance in 1998 (Box et al. 2000). The main goal of XML development has been to overcome the limitations of HTML and to provide the better means of managing information that the growth of the Internet now demands. The W3C has an XML Core Working Group whose role is to develop and maintain the specifications for XML. Currently, these are XML 1.0 (Third Edition), a W3C Recommendation of 4 February 2004, which supersedes the original published on 10 February 1998, and XML 1.1, also a W3C Recommendation of 4 February 2004, which updates XML so that it no longer depends on a specific version of Unicode.

It is not intended here to describe the workings of XML in detail, as this has been done in numerous other publications and Web resources (such as Box et al. 2000; Castro 2001; Harold 2001; Harold and Means 2002 and Deane and Henderson 2004). An extensive list of XML applications is maintained by Robin Cover on the XML Cover Pages (Cover 2004) and detailed discussion of XML technologies and their use in the cultural heritage sector is presented in a recent DigiCULT report (Ross et al. 2004).

Box et al. (2000) observe that there are two camps in the XML community, divided between the 'document' and the 'data'. The specific focus of this chapter will be on the use of XML for document markup. This offers great potential for enhancing the accessibility of archaeological grey literature and repurposing of document content, such as for input into other heritage datasets (see Section 4). The term 'markup' was originally used by typists and printers to refer to the annotations or marks used to indicate layout or presentation within a text. However, the term is now used to indicate any means for making an interpretation of a text explicit (Ross et al. 2004, 43). Markup languages were originally developed for printing electronic documents and markup tags allowed processing software to determine the formatting, structure or meaning of encoded data.

Most markup languages have been developed from the Standard Generalized Markup Language (SGML), an ISO standard since 1986 (ISO 8879). This is a complex, rigid and widely used meta-language, but it is unsuited to data interchange on the Internet, as is HTML, which is derived from it. As discussed in Section 2.7.1, HTML is only intended for displaying documents in an HTML browser. It was originally created using SGML for Web delivery and has a finite, predetermined set of tags containing limited information about what the data represents. There are several different versions of the language, the latest of which is HTML 4.01, a W3C Recommendation of 24 December 1999.

As XML is also derived from SGML, it shares common features with HTML. However, XML is not intended as a replacement for HTML; it is a different concept altogether. The eXtensible HyperText Markup Language (XHTML) is the reformulation of HTML 4 in XML 1.0, a W3C Recommendation of 26 January 2000, revised on 1 August 2002. XHTML is designed to follow the strict grammatical rules of XML and is seen as the next step in the evolution of the Internet, as it has the benefits of XML whilst maintaining backwards compatibility of content (Deane and Henderson 2004).

The process of XML markup involves users creating structured documents and highlighting specific content through the insertion of element tags at appropriate points in the text. The resulting XML document is similar to a hierarchically structured database. XML tags alone, however, comprising elements and attributes, can only describe the data contained within them. An XML document itself contains no information about how to display the data it describes. Other technologies are needed to manipulate and transform XML-encoded data for presentation and exchange. These can include Cascading Style Sheets (CSS), as used with HTML; XML-related technologies such as eXtensible Stylesheet Language Transformations (XSLT), which is specifically designed for use with XML, together with XPath and XLink; and programming and scripting languages such as Java, Perl and Macromedia's ColdFusion, to name but a few. The Simple Object Access Protocol (SOAP) is commonly used to pass information 'between different applications in a decentralised and distributed environment by combining XML and HTTP to send and receive messages' (Ross et al. 2004, 50).
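As a minimal sketch of what such markup might look like, and of how software can read it, the following Python fragment parses a short, invented report using the standard library's xml.etree.ElementTree module. The element and attribute names (report, id, title, summary, period) and the report content are hypothetical, chosen only for illustration; they are not taken from any published DTD or from the documents discussed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a grey literature report. The element and
# attribute names are invented for illustration only.
REPORT_XML = """
<report id="r001">
  <title>Evaluation at Mill Lane</title>
  <summary>Two <period>Roman</period> ditches and a
    <period>medieval</period> pit were recorded.</summary>
</report>
"""

# Parse the text into a tree of elements rooted at <report>.
root = ET.fromstring(REPORT_XML)

# An attribute describes the element that carries it.
report_id = root.get("id")

# Child elements are reached by name through the hierarchy.
title = root.findtext("title")

# Elements nested at any depth can be picked out by their tag.
periods = [p.text for p in root.iter("period")]

print(report_id, title, periods)
```

The markup records what each piece of text is (a title, a period term) but nothing about how it should appear; that decision is left to whatever stylesheet or program consumes the document.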

XML element tags look much like those in HTML, but there are no pre-defined elements. XML is hierarchical and the proper nesting of elements is crucial. The power of XML is that this structure can be used to select specific elements and content for display and output. With both CSS and XSLT, formatting can be applied to individual elements, and with XSLT output can also be tailored for specific media (see Section 4.4). The XML specification sets out rules of syntax and structure that must be followed for an XML document to be well-formed. If a document is not well-formed, the XML parser will report an error. To be valid, an XML document must also comply with the rules defined in a Document Type Definition (DTD).
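The distinction between a well-formed and a badly nested document can be illustrated with a short Python sketch, again using the standard library parser. The helper name is_well_formed and the sample markup are assumptions made for this example. Note that this parser checks well-formedness only; it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser rejects documents that break the syntax rules,
        # for example through improperly nested or unclosed elements.
        return False
    return True


# Properly nested: each element closes inside its parent.
GOOD = "<site><name>Mill Lane</name><period>Roman</period></site>"

# Improperly nested: <name> is closed by a </period> tag.
BAD = "<site><name>Mill Lane</period></name></site>"

print(is_well_formed(GOOD), is_well_formed(BAD))
```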

The DTD is used to separate the XML data description from the individual XML documents, so that multiple XML documents can share a single description of the data. The DTD defines rules about the document structure: what elements may be included, how often, and what kind of data and attributes they may contain. The use of a standardised DTD is the key to interoperability as it 'enhances reliable data exchange between diverse applications and organisations since everyone is playing by the same set of rules' (Daly 2004). DTDs are gradually being replaced by XML Schemas, which extend the definition capability of DTDs (Ross et al. 2004, 48).
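The kind of rules a DTD expresses can be sketched as follows. The DTD text below is a hypothetical example for an invented report format and is shown for reference only: Python's standard library parser does not validate, so the function check_report is a hand-coded approximation of those rules rather than a real DTD validator. In practice a validating parser would read the DTD itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for an invented report format. It is shown for
# reference only and is not read by the code below.
REPORT_DTD = """
<!ELEMENT report (title, summary, period*)>
<!ATTLIST report id CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT summary (#PCDATA)>
<!ELEMENT period (#PCDATA)>
"""


def check_report(root):
    """Approximate the rules in REPORT_DTD by hand; return a list of problems."""
    problems = []
    if root.tag != "report":
        problems.append("root element must be <report>")
    if root.get("id") is None:
        problems.append("<report> requires an id attribute")
    children = [child.tag for child in root]
    # Exactly one <title> followed by one <summary> must come first.
    if children[:2] != ["title", "summary"]:
        problems.append("<report> must begin with <title> then <summary>")
    # Zero or more <period> elements may follow, and nothing else.
    if any(tag != "period" for tag in children[2:]):
        problems.append("only <period> elements may follow <summary>")
    return problems


# Two separate documents checked against the same shared description.
valid = ET.fromstring(
    '<report id="r001"><title>T</title><summary>S</summary>'
    "<period>Roman</period></report>"
)
invalid = ET.fromstring("<report><summary>S</summary></report>")

print(check_report(valid))
print(check_report(invalid))
```

Both documents are well-formed, but only the first satisfies the shared rules; the second lacks the required attribute and the required title element.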

© Internet Archaeology URL: http://intarch.ac.uk/journal/issue17/5/gf3-1.html
Last updated: Wed Apr 6 2005