Internet Archaeol 15. Julian Richards. Indexing and access

Indexing and access

It has been suggested that most University archaeology teaching ignores the results of the last ten years because of the difficulty of locating the growing body of grey literature resulting from post-PPG16 fieldwork, and of identifying significant finds (Bradley and Phillips 2002). If archaeology is to make increasing use of Internet technologies for publication and archiving then it is essential that such resources are catalogued so that they can be easily located. However, it is increasingly clear that it will never be possible to bring together or even catalogue all resources in a single physical location or database, such as a National Monuments Record. Fortunately, web technologies are based on the idea of distributed information nodes which can be variously grouped and searched from a single gateway or portal. Where several databases can be combined and searched from a single query they are said to be interoperable.

For most people, searching the Internet means using a search engine such as Google. The indices maintained by such engines are based on the automatic keywording and indexing of the titles, text and any hidden metatags in HTML pages. Hits are then ranked according to their popularity or the number of links, or citations, they receive. In Google, for example, a search for 'Archaeology' in January 2003 returned Internet Archaeology as third out of 1,830,000 hits. Search engines are therefore tremendously powerful tools, but most people will also be only too well aware of their disadvantages: the sheer volume of hits, the ultimate lack of human intervention (and therefore also of intelligent indexing or quality control), and the high proportion of dated or broken links. The automated robots and crawlers which build the search engine indices are also limited to open web sites, and to indexing text. They are incapable of searching within online databases, of indexing the content of visual or audio resources (except in so far as captions or metatags may be helpful), or of looking within the content of the increasing number of username and password protected web sites.

For users who know what they are looking for, gateway sites which provide sorted lists of web addresses, and sometimes content reviews, will often be preferable to a search engine. For Britain the CBA's Index to British Archaeology is the starting point for most users; for Europe they may consult ARGE, the Archaeological Resource Guide for Europe, and for the rest of the world, ARCHNET. UK academic users are being steered towards the RDN, or Resource Discovery Network, a series of hubs providing links to quality-assured sites. For Archaeology HUMBUL is the appropriate gateway; for specialist areas within archaeology there is no shortage of sets of links and gateways to cater for every possible interest group. Whilst such lists may be more helpful than those generated by search engines, their disadvantage is the high cost of maintaining a current list of resources. Few organisations have the time or money to maintain an up-to-date catalogue of web sites, especially as the number of sites is still growing at an exponential rate.

If we are going to maintain an effective index of online grey literature, publications, and archives it will therefore be necessary to investigate other solutions and develop interoperability between the various information providers. The purpose of interoperability is to allow users to search effectively across distributed resources, without the chore of undertaking a series of individual searches at the web site of each information provider. Such searches generally interrogate databases of resources and so should allow users to search inside the catalogues of web information providers, unlike search engines, which are restricted to HTML text pages. Therefore they need to be able to cross-search databases which may be held on different hardware and software platforms, which are differently structured, and which may use different terms or (in the case of international searches) even languages. In order to be effective, interoperable searching therefore requires three components:

agreed communications protocols (to allow the databases to talk to one another),
agreed resource discovery metadata or indexing standards (to allow the definition of a shared set of index fields, such as author, title, subject etc.), and
agreed standards for vocabulary control (or where resources use different schemes, a mapping between the different thesauri).

The Archaeology Data Service has taken a lead role in the promotion of metadata standards for the cataloguing of digital resources (Miller 1996; 1999; Wise and Miller 1997). A series of discipline-specific workshops led to the publication of a report (Miller and Greenstein 1997) which recommended the application of the 15-element Dublin Core standard as a cross-disciplinary means of resource discovery. The Dublin Core elements have been implemented as the key fields in the ADS online catalogue ArchSearch. ArchSearch contains over 400,000 index records to sites and monuments of the British Isles. It includes records drawn from the National Monuments Record of Scotland (NMRS), the Excavation Index for England, and a number of Sites and Monuments Records, including those for West of Scotland, Northumberland, Greater London, Clwyd Powys, South Gloucestershire, Somerset, and the National Trust, as well as the York and London Archive Gazetteers. Papers published within Internet Archaeology have also been catalogued in ArchSearch. These index records provide basic metadata, mapping from the local databases to a number of fields derived from the Dublin Core standard. They include site name and brief descriptions, geographical coordinates, period and subject keywords, bibliographic references, and rights management details. The system was developed through the ASP (Accessing Scotland's Past) project in 1998, which sought to map the NMRS and a number of regional Scottish SMRs to the Dublin Core. In 1999-2000, records for England were enhanced through the RSLP-funded OASIS project, which provided a single concorded database between the former RCHME Excavation Index and the English Heritage and Bournemouth University Archaeological Investigations Project, thereby providing a high level index to the mass of grey literature generated by developer-funded contract archaeology.
Subsequently an online OASIS data collection form has been developed to allow contractors and curators to update the index (Hardman and Richards 2003). The records are designed to act as tools for resource discovery, and provide users with details of how to get more information if appropriate. In some cases users are able to follow a live hyperlink, for example from an entry for the NMRS to the live record in the RCAHMS online database, CANMORE. Where the ADS holds a digital archive for the site then users can also drill down within ArchSearch to richer online resources or grey literature. Thus the brokered index records provide a high level backbone to the ADS catalogue such that users should find some information, no matter which site they are interested in, and as more digital archives are made available they may be able to access much more detailed information, down to complete site records.
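
The mapping from a local database to Dublin Core fields can be illustrated with a minimal sketch in Python. It is not the ArchSearch implementation: the local field names, the sample record and the choice of target element for each field are hypothetical, and only the Dublin Core element names themselves come from the standard.

```python
# Hypothetical mapping from one local database's field names to Dublin Core
# elements. The names on the left are illustrative assumptions, and the
# choice of "relation" for bibliography is only one plausible option.
LOCAL_TO_DC = {
    "record_id": "identifier",
    "site_name": "title",
    "summary": "description",
    "monument_type": "subject",
    "period": "coverage",
    "grid_ref": "coverage",
    "bibliography": "relation",
    "copyright": "rights",
}


def to_dublin_core(local_record):
    """Return a dict of Dublin Core element -> list of values.

    Dublin Core elements are repeatable, so every value is held in a list.
    Local fields with no mapping, or with empty values, are dropped.
    """
    dc = {}
    for field, value in local_record.items():
        element = LOCAL_TO_DC.get(field)
        if element is None or value in (None, ""):
            continue
        dc.setdefault(element, []).append(value)
    return dc


# An invented example record, not taken from any real index.
sample = {
    "record_id": "EX-0001",
    "site_name": "Example Farm",
    "summary": "Evaluation trenching; ditch and pits recorded.",
    "monument_type": "DITCH",
    "period": "Roman",
    "grid_ref": "SE 6000 5000",
    "internal_notes": "not for publication",
}

dc_record = to_dublin_core(sample)
```

In this sketch both period and location end up under coverage, echoing the way the two are treated as sub-elements of a single Dublin Core element, while the unmapped internal note is never exposed to the shared index.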

In principle the implementation of the fifteen Dublin Core elements should allow users to search by resource title, creator, subject keywords, period or location (both expressed as sub-elements of the Dublin Core element coverage). In practice, cross-searching of distinct resources is hindered by the lack of adherence to common standards for vocabulary control. Thus, for example, archaeological period is described according to different classifications in each of the major resources indexed in ArchSearch. Whilst the ability to accommodate different schemes for resource description was one of the attractions for the adoption of the Dublin Core, it will be necessary to develop the use of on-line thesauri to enable effective cross-searching, allowing the user to equate Norse in Scotland with Viking in England with Anglo-Scandinavian in Yorkshire, for example.
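
A thesaurus mapping of this kind might be sketched as follows. The concordance table, the shared concept label and the sample records are all hypothetical; a working thesaurus would be much larger and would need to handle partial rather than exact equivalence.

```python
# Hypothetical concordance: each (region, local term) pair points to a shared
# concept label. The label itself is an invented placeholder.
PERIOD_CONCORDANCE = {
    ("Scotland", "Norse"): "scandinavian-early-medieval",
    ("England", "Viking"): "scandinavian-early-medieval",
    ("Yorkshire", "Anglo-Scandinavian"): "scandinavian-early-medieval",
}


def equivalent_terms(region, term):
    """Return every (region, term) pair sharing a concept with the query."""
    concept = PERIOD_CONCORDANCE.get((region, term))
    if concept is None:
        # Unknown term: fall back to searching on the term exactly as given.
        return [(region, term)]
    return sorted(k for k, v in PERIOD_CONCORDANCE.items() if v == concept)


def expand_query(region, term, records):
    """Select records whose (region, period) matches any equivalent term."""
    wanted = set(equivalent_terms(region, term))
    return [r for r in records if (r["region"], r["period"]) in wanted]


# Invented index records from three differently classified resources.
records = [
    {"id": 1, "region": "Scotland", "period": "Norse"},
    {"id": 2, "region": "England", "period": "Viking"},
    {"id": 3, "region": "Yorkshire", "period": "Anglo-Scandinavian"},
    {"id": 4, "region": "England", "period": "Roman"},
]

matches = expand_query("England", "Viking", records)
```

A user who asks for Viking sites in England would then also retrieve the Scottish and Yorkshire records, without any of the contributing databases having to change its own terminology.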

The online catalogues of the five AHDS service providers were also linked in a single web gateway which used the Z39.50 communications protocol to allow a single query to be addressed simultaneously to five target databases, each structured in a different way using a different database management system. In response to a query for holdings relating to Shakespeare, for example, the interdisciplinary humanities scholar should recover an electronic text of the Complete Works from the Oxford Text Archive, a video of the Royal Shakespeare Company performance of King Lear from the Performing Arts Data Service, an unattributed portrait from the Visual Arts Data Service, an historical database of 16th century London from the History Data Service, and the excavation archive for the Rose Theatre from the Archaeology Data Service. The implementation of the Z39.50 gateway by Fretwell Downing proved the concept, but initial results were of limited utility because of the patchy nature of AHDS collections, and also problems caused by the lack of vocabulary control in the implementation of the Dublin Core metadata standard.
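
The gateway pattern, in which one query is sent to several differently structured targets at once, can be sketched in outline. This example does not use Z39.50 or any real client library: two in-memory lists stand in for remote databases, the adapter functions are hypothetical, and a thread pool merely imitates the simultaneous dispatch that the protocol would provide.

```python
from concurrent.futures import ThreadPoolExecutor

# Two stand-in "targets" with deliberately different internal structures.
TEXT_ARCHIVE = [{"work": "Complete Works", "author": "Shakespeare, William"}]
IMAGE_ARCHIVE = [("Portrait, unattributed", ["Shakespeare", "portrait"])]


def search_texts(keyword):
    """Adapter for the first target: match on the author field."""
    return [row["work"] for row in TEXT_ARCHIVE
            if keyword.lower() in row["author"].lower()]


def search_images(keyword):
    """Adapter for the second target: match on a list of subject tags."""
    return [caption for caption, tags in IMAGE_ARCHIVE
            if keyword.lower() in (t.lower() for t in tags)]


# The gateway only needs to know each target's name and adapter.
TARGETS = {"texts": search_texts, "images": search_images}


def cross_search(keyword):
    """Send one query to every target at once and group hits by target."""
    with ThreadPoolExecutor(max_workers=len(TARGETS)) as pool:
        futures = {name: pool.submit(fn, keyword)
                   for name, fn in TARGETS.items()}
        return {name: future.result() for name, future in futures.items()}


results = cross_search("Shakespeare")
```

Each adapter hides the structure of its own database, so the gateway sees only a common query going in and a list of hits coming out. The sketch also makes the weakness plain: the second target matches only if its subject tags happen to use the same word as the query.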

In some cases simultaneous searching of distributed databases using communications protocols such as Z39.50 may be unnecessary, and may simply impose an unnecessary load on target databases, especially where these only change infrequently. An alternative architecture proposed by the Open Archives Initiative (OAI) goes back to the automated harvesting of metadata from web sites and its storage in a single central database from where it is queried. This is not dissimilar to the approach adopted by the more sophisticated search engines, except that an OAI harvester will gather structured metadata from databases, allowing a deeper and richer level of interrogation. OAI harvesters are generally configured to harvest metadata compliant with the Dublin Core standard; that being developed by the AHDS also assumes that collection level metadata records have been catalogued according to the Dewey subject classification in an effort to try to overcome the different subject classifications adopted by the different disciplines within the arts and humanities. The OAI has developed from the work of Stevan Harnad and others in promoting self publication (Harnad 2001). Rather than researchers being dependent upon commercial publishers and thereby having to pay to read the results of the research which has already been paid for through the academic sector, they argue that researchers should mount their papers on their institutional web servers. John Hoopes (2000) has proposed that the Society for American Archaeology should develop a peer-reviewed web gateway for the dissemination of archaeological reports. Metadata tags within the papers could be harvested by OAI gateways and indexed in online virtual library databases. This provides a degree of interoperability, but searches will only be as up to date as the last harvest, because the online databases themselves are not queried simultaneously, as they are under Z39.50.
Nonetheless several universities are now setting up institutional e-print archives to allow rapid low-cost publication of doctoral theses and research monographs (Pinfield et al. 2002).
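
The harvesting model can be illustrated with a short sketch that reads a Dublin Core response in the OAI protocol's XML format and stores it in a central index, which is then queried locally. The sample response, the identifier and the report described are invented. A real harvester would fetch such responses over HTTP from each provider at intervals rather than reading an embedded string.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# An invented response standing in for what a provider might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:report-1</identifier>
        <datestamp>2002-11-30</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Evaluation at Example Farm</dc:title>
          <dc:subject>ditch</dc:subject>
          <dc:subject>pottery</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def harvest(xml_text, index):
    """Copy the metadata of each record in a response into a central index."""
    root = ET.fromstring(xml_text)
    for record in root.iterfind("oai:ListRecords/oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if identifier is None or dc is None:
            continue  # skip records that carry no metadata
        index[identifier] = {
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            "subjects": [s.text for s in dc.findall("dc:subject", NS)],
            "datestamp": record.findtext("oai:header/oai:datestamp",
                                         namespaces=NS),
        }
    return index


def search_subject(index, subject):
    """Query the central index only; no provider is contacted at search time."""
    return sorted(i for i, rec in index.items() if subject in rec["subjects"])


central_index = harvest(SAMPLE_RESPONSE, {})
hits = search_subject(central_index, "pottery")
```

Because the search function touches only the central index, the load on providers is confined to harvest time, and the stored datestamp shows how current each entry was when it was last gathered.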

The information architecture of the future therefore envisages a number of interoperable gateways and portals providing access to distributed resources. It is likely that multiple gateways and portals will develop, each intended for specific user groups. These groups may be defined by prior knowledge and user needs, such as the academic sector, or the schools sector; by disciplinary area: Archaeology, History, Arts and Humanities in general; or by user interface: map-based searching, text searching etc. In this vision, however, each information resource needs to be presented only once in order to be available from multiple 'shop windows' (Baker et al. 1999, Section 5). On behalf of the HEIRNET community the ADS has developed HEIRPORT, a Z39.50 portal for the Historic Environment, with initial targets including the ADS catalogue ArchSearch, the RCAHMS's CANMORE, the database of images maintained by the Scottish Cultural Resources Access Network, and the Portable Antiquities Scheme Database of metal-detected finds (Austin et al. 2002). These resources are dynamic and records are constantly being added. Therefore it is appropriate for each resource to live on the server of the resource maintainer and for each to be made available online as a target to any number of portals. However, it is also envisaged that further flexibility will be provided by adding OAI targets to the search options. OAI harvesting provides a less onerous solution to interoperability for smaller institutions and may be particularly appropriate for the indexing of grey literature reports and archives held on the web servers of individual Sites and Monuments Records, or even those of archaeological contractors, where it would be adequate to harvest metadata at less frequent intervals.

Within the ADS catalogue the user is able to search for archives by title, period and subject keywords, as well as by clicking on a map of the British Isles for archives within a specific area. Dublin Core metadata also allows users to search for archives which include particular application types, as well as for specific archaeological subjects. If users want a specific archive they can also go to it directly via the Project Archives section. However, although they can download archive files onto their own desktop computers, users are currently unable to search within an archive held by the ADS over the Internet. Thus a user interested in occurrences of a particular pottery type, for example, would have to search sequentially within all the pottery reports held on the ADS server.

Recent developments in means of structuring information provided on the Internet may help resolve this problem and lead to the creation of structured and searchable archives. The majority of current web content is designed for humans to read, not for computer programs to manipulate meaningfully. HTML markup tags indicate titles and headings but they do not encode the content of a web page in any structured fashion. In the Scientific American for April 2001 Tim Berners-Lee describes what he calls the Semantic Web, whereby web robots will be able to harvest structured collections of information.

An important technology for developing Berners-Lee's Semantic Web is already in place. XML provides an eXtensible Markup Language which permits the systematic tagging of the components of a text file (Kelly 1998). For excavation reports, for instance, where there is a fairly standard template for the organisation of information, XML tags could be used to identify each section. XML can also be used to create wrappers around non-text elements, and so images, database files, CAD drawings and so on could each be described by metadata held in XML tags. When XML tagged information is displayed over the Internet a browser application may be configured to display each component in a specific way. XML tags may also be used to allow search engines to harvest particular categories of information and to build sophisticated indices, as described above with reference to OAI metadata harvesting. To date only limited use has been made of XML in archaeological publication. Holmen and Uleberg (1996) describe the use of SGML for encoding archaeological archive documents in the Norwegian Museum Project and have subsequently developed XML applications. Gray and Walford (1999) recommended it as a means of structuring archaeological reports and enabling comparison. David Schloen of the University of Chicago has been a major proponent and has even compiled an archaeological DTD called ArchML (Schloen 2001), although his model is heavily based on object-orientated data structures in the context of Near Eastern research and is unlikely to gain widespread acceptance. The ADS is actively investigating the use of XML markup for the dissemination of online archives, with partners in Europe. As the numbers of digital archives around the world grow, the prospect of interoperable searching of their contents will surely open up archives for universal online access.
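
How XML tagging could open up the contents of an archive to searching can be shown with a brief sketch. The tag names, attributes and the two miniature reports are invented for the purpose and do not follow ArchML or any other published schema.

```python
import xml.etree.ElementTree as ET

# Two invented excavation reports using a hypothetical tagging scheme.
REPORTS = {
    "report-a.xml": """
<report>
  <title>Example Farm</title>
  <section type="pottery">
    <find ware="Samian" context="101"/>
    <find ware="Grey ware" context="102"/>
  </section>
</report>""",
    "report-b.xml": """
<report>
  <title>Example Street</title>
  <section type="pottery">
    <find ware="Samian" context="7"/>
  </section>
  <section type="stratigraphy">
    <p>Samian is mentioned here only in running text.</p>
  </section>
</report>""",
}


def find_ware(reports, ware):
    """List (file, report title, context) for each tagged find of a ware."""
    hits = []
    for name, xml_text in sorted(reports.items()):
        root = ET.fromstring(xml_text)
        title = root.findtext("title", default="")
        # Only look inside sections explicitly tagged as pottery reports.
        for find in root.iterfind("section[@type='pottery']/find"):
            if find.get("ware") == ware:
                hits.append((name, title, find.get("context")))
    return hits


samian = find_ware(REPORTS, "Samian")
```

Because the search follows the tags rather than the raw text, the passing mention of Samian in the stratigraphic narrative of the second report is ignored, and one query replaces the sequential reading of every pottery report.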