2.2.2. Correcting the data from the artefact catalogues

While ABBYY FineReader produced editable digital text files of the original printed catalogues for each of the sites analysed, the formats of the originals meant that these text files could not be imported directly into the relational databases needed for use in GIS. Each entry in the printed artefact catalogues generally consists of a continuous block of text. In most cases, particularly for Vetera I, Ellingen and most of Oberstimm, each entry describes a single artefact. However, many catalogue entries, across all sites, list a number of artefacts of a particular type, from different provenances, followed by a description of that type. At all sites the entries are arranged into groups according to typological classification, with that classification as the heading for the group. To serve the aims of the project, the information contained in the headings and in each catalogue entry had to be re-organised and re-formatted into a series of fields.

The catalogue entries were re-formatted using the 'find and replace' tools in Microsoft Word and Microsoft Excel. Basic formatting of the OCR text files was carried out in Word, and the formatting of the data matrix in Excel. The first step was to split the OCR text output into paragraphs, one for each catalogue entry. Each paragraph, or row, was then split into a series of cells by inserting tab-stops in the Word documents at the points where cell boundaries were required. For example, an artefact's catalogue number, description, measurements, provenance and illustration references each needed to be assigned to a separate cell.
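The target structure of one row can be illustrated with a short Python sketch. This is an illustration only and not part of the workflow used in the project, which relied on Word and Excel. The class name `CatalogueRow`, its attribute names and the example values are hypothetical; only the list of fields is taken from the description above.

```python
from dataclasses import dataclass, astuple


@dataclass
class CatalogueRow:
    """One artefact per row; each attribute corresponds to one cell."""
    type_heading: str        # typological classification from the group heading
    catalogue_number: str
    description: str
    measurements: str
    provenance: str
    illustration_refs: str

    def to_tsv(self) -> str:
        # Tab characters mark cell boundaries, as in the Word documents.
        return "\t".join(astuple(self))


# Hypothetical example values, not taken from any of the catalogues.
example = CatalogueRow("Fibulae", "12", "Bronze fibula", "L 4.1 cm",
                       "Building A", "Pl. 3,12")
tsv_line = example.to_tsv()
```

A line produced in this way has the same shape as a paragraph with tab-stops inserted: one paragraph per artefact, one tab per cell boundary.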

This process worked well when reformatting the catalogues for Vetera I, Ellingen and Oberstimm, as these catalogues had abbreviations at the start of each catalogue entry sub-section, indicating the type of information it contained (e.g. 'Dm' for dimensions). These abbreviations could be automatically replaced with tab-stops using the 'find and replace' tool in Word. Once rows were arranged and cell boundaries defined, the text was simply copied and pasted into a new Excel worksheet. Excel automatically organised the data according to these cells and rows, using the paragraph mark to signify a row-ending (i.e. a single catalogue entry) and tab-stops to signify cell boundaries within a row (i.e. a sub-section of a catalogue entry).
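The replacement of abbreviations with tab-stops described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the procedure used in the project, which was carried out with the 'find and replace' tool in Word. Of the field markers listed, only 'Dm' is mentioned in the text; 'FO' and 'Taf' are hypothetical stand-ins, as are the function names and the sample entries. The sketch also assumes that catalogue entries are separated by blank lines in the OCR output.

```python
import re

# Field markers at the start of each sub-section of an entry.
# Only 'Dm' comes from the text; 'FO' and 'Taf' are hypothetical.
FIELD_MARKERS = ["Dm", "FO", "Taf"]


def entry_to_row(entry, markers=FIELD_MARKERS):
    """Replace each field marker with a tab, then split the entry into cells."""
    alternatives = "|".join(re.escape(m) for m in markers)
    # Whole-word marker, optionally followed by '.' and/or ':'.
    pattern = r"\s*\b(?:" + alternatives + r")\b\.?:?\s*"
    tabbed = re.sub(pattern, "\t", entry.strip())
    return [cell.strip() for cell in tabbed.split("\t")]


def text_to_rows(ocr_text, markers=FIELD_MARKERS):
    """Treat each blank-line-separated paragraph as one catalogue entry (row)."""
    paragraphs = [p for p in re.split(r"\n\s*\n", ocr_text) if p.strip()]
    # Collapse line breaks inside a paragraph so each entry is one line.
    return [entry_to_row(" ".join(p.split()), markers) for p in paragraphs]


# Hypothetical OCR output with two entries.
sample = (
    "12 Bronze fibula.\nDm. 3.2 cm. FO: Building A.\n\n"
    "13 Iron nail. Dm 0.5 cm. FO: Street 2."
)
rows = text_to_rows(sample)
```

Each inner list corresponds to one spreadsheet row. Joining its cells with tab characters and the rows with line breaks yields text that a spreadsheet program will split into cells and rows when pasted.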

Because the quality and detail of the published artefact catalogues varied considerably, both within and between the sites, this process had to be constantly adjusted. The adjustment was most significant for catalogue entries that do not conform to the general format of one artefact per entry. This applied, for example, to certain entries for Ellingen where the individual artefacts were not considered important (e.g. one entry with several nails from various provenances) and to the entire Hesselbach catalogue, where a single entry describes a particular type of artefact and includes a number of artefacts from different provenances. In such cases the data for each artefact needed to be extracted and separated, and then double-checked against the original printed catalogue, to make sure that each entry occupied a single row and was not split over several rows, and that no row contained several entries. This process was much more time-consuming for Hesselbach and Rottweil than for the first three sites. However, it provided adequate opportunity to check for misspellings and other transcription errors generated during the OCR process. Only after this double-checking could the data be exported into one Excel worksheet, kept as the original file and back-up, and then copied into another Excel file for manual reformatting by 'dragging and dropping' or 'cutting and pasting' the data. For Hesselbach and Rottweil in particular, this manual process helped reduce the number of errors that could occur through more automated data translation processes, and it ensured that information was assigned to the appropriate cell in the resulting spreadsheet.
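The separation of a multi-artefact entry into one row per artefact, and the check that every row has the expected number of cells, can be sketched in Python as follows. This is an illustrative sketch only: in the project this work was done manually and checked against the printed catalogues. The function names, the semicolon-separated list of provenances and the sample values are assumptions made for the example and do not reproduce the format of any of the catalogues.

```python
def explode_entry(type_heading, description, provenances, separator=";"):
    """Turn one entry that lists several provenances into one row per artefact.

    The shared type heading and description are repeated on every row.
    """
    return [
        {"type": type_heading, "description": description, "provenance": p.strip()}
        for p in provenances.split(separator)
        if p.strip()
    ]


def rows_needing_review(rows, expected_cells):
    """Return the indices of rows whose cell count differs from the expected one.

    Too few cells may indicate an entry split over several rows; too many
    may indicate several entries in a single row. Flagged rows still have
    to be compared with the printed catalogue by hand.
    """
    return [i for i, row in enumerate(rows) if len(row) != expected_cells]


# Hypothetical entry: several nails from various provenances.
nail_rows = explode_entry("Nails", "Iron nails, square section",
                          "Barrack 1; Barrack 3 ;Street")

# Hypothetical rows with three expected cells each.
flagged = rows_needing_review(
    [["1", "a", "b"], ["2", "a"], ["3", "a", "b", "c"]], expected_cells=3
)
```

A cell count can only point to rows that are likely to be malformed. It cannot detect a row with the right number of cells but the wrong content, which is why comparison with the original printed catalogue remained necessary.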


