The procedures for converting the printed artefact catalogues into a format that is usable in GIS involved the following main steps: capturing the data; correcting them; tabulating them in a spreadsheet; importing them into a database; and querying them.
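The last three steps (tabulating, importing into a database, and querying) can be illustrated with a minimal sketch. It does not reproduce the project's actual spreadsheet or database software: it assumes the tabulated data have been exported as CSV and uses Python's built-in SQLite module instead. The column names, the sample rows, and the helper functions `load_records` and `count_by_find_spot` are hypothetical and serve only as an illustration.

```python
import csv
import io
import sqlite3

# Hypothetical tabulated records; column names and values are illustrative only.
CSV_TEXT = """catalogue_no,find_spot,artefact_type
1,Room A,Nail
2,Room A,Vessel
3,Room B,Nail
"""


def load_records(conn, csv_text):
    """Import the tabulated rows into a database table."""
    conn.execute(
        "CREATE TABLE artefacts ("
        "catalogue_no INTEGER PRIMARY KEY, find_spot TEXT, artefact_type TEXT)"
    )
    rows = csv.DictReader(io.StringIO(csv_text))
    # Each CSV row is a dict, so named placeholders map directly to columns.
    conn.executemany(
        "INSERT INTO artefacts VALUES (:catalogue_no, :find_spot, :artefact_type)",
        rows,
    )
    conn.commit()


def count_by_find_spot(conn, artefact_type):
    """Query: how many artefacts of one type were recorded at each find spot?"""
    cur = conn.execute(
        "SELECT find_spot, COUNT(*) FROM artefacts "
        "WHERE artefact_type = ? GROUP BY find_spot ORDER BY find_spot",
        (artefact_type,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
load_records(conn, CSV_TEXT)
print(count_by_find_spot(conn, "Nail"))
```

A query result grouped by find spot in this way is the kind of table that could then be joined to spatial features in a GIS.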
The first step was to scan the artefact catalogues into TIFF image files, as multiple-page images at a minimum resolution of 300 dpi. These TIFF files were then converted into digital text using ABBYY Fine Reader Optical Character Recognition (OCR) software, with the OCR output being saved as Microsoft Word files. The Fine Reader software is able to recognise and convert characters in most languages, including German, into electronic form and to check the spelling of the resulting text. In general, the OCR software produces reasonably reliable data from TIFF image files scanned at 300 dpi, but its reliability decreases significantly if the scans are produced at a lower resolution. Transcription errors still occurred, but the output could be easily checked against the original catalogue because errors and unrecognisable text are highlighted by the OCR program. The production of multiple-page scans, and the capacity of Fine Reader to process these multiple-page TIFF files, made this part of the process relatively speedy and effective.
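Because OCR reliability depends on the 300 dpi minimum, it can be useful to flag low-resolution pages before they are sent for recognition. The following is a minimal sketch only. It assumes the resolution of each scanned page has already been recorded in a simple list; the file names, the record layout, and the function `pages_to_rescan` are hypothetical and are not part of the Fine Reader workflow.

```python
MIN_DPI = 300  # minimum scan resolution stated in the text


def pages_to_rescan(pages, min_dpi=MIN_DPI):
    """Return (file, page) pairs whose resolution falls below min_dpi."""
    return [(p["file"], p["page"]) for p in pages if p["dpi"] < min_dpi]


# Hypothetical scan log: one record per page of a multiple-page TIFF file.
scans = [
    {"file": "catalogue_01.tif", "page": 1, "dpi": 300},
    {"file": "catalogue_01.tif", "page": 2, "dpi": 200},
    {"file": "catalogue_02.tif", "page": 1, "dpi": 600},
]

print(pages_to_rescan(scans))
```

Pages returned by this check would be rescanned at 300 dpi or higher before OCR.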
© Internet Archaeology/Author(s)
URL: http://intarch.ac.uk/journal/issue24/6/2.2.1.html
Last updated: Mon Jun 30 2008