The creation of the comparative data within the ArchAIDE Reference Database also allowed the Mappa Lab at the University of Pisa to examine statistical relationships between the variables in the data. Statistical techniques were used to explore and summarise the main characteristics of the data and to identify outliers, trends and patterns. Specifically, network analysis was used to identify significant temporal breaks in the data. The network structure was created by linking locations where ceramics were produced to locations where the same ceramic type was retrieved. This resulted in 3,853 location vertices throughout Europe, the Middle East and North Africa. The structure included 16,820 different edges, joining together 322,764 different data points.

Network analysis allowed the identification of communities within the network, i.e. groups of vertices that are densely connected internally but sparsely connected externally. Such communities may represent commercial routes adopted by producers, or routes established for geographical or historical reasons. Temporal breaks were identified by an algorithm that minimises the variance within intervals while maximising the variance between intervals. Production and supply of ceramics had a natural context only in certain temporal intervals, making it possible to distinguish four main periods, each characterised by different production centres emerging and declining (Italian, South-Gaulish, Rhine productions) and by different production dynamics.
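The source does not name the algorithm used to find the temporal breaks. As an illustration only, the following sketch assumes a Jenks-style optimisation solved by dynamic programming, which is one common way to minimise within-interval variance. The function names and the list of dates are hypothetical and are not taken from the ArchAIDE data.

```python
def sum_sq_dev(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for the sorted slice xs[i:j]."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def natural_breaks(values, k):
    """Split values into k contiguous classes with minimal within-class variance.

    Because the total sum of squares is fixed, minimising the within-class
    part is equivalent to maximising the between-class part.
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums let each candidate class be scored in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    inf = float("inf")
    # cost[c][j]: best total squared deviation when xs[:j] is cut into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + sum_sq_dev(prefix, prefix_sq, i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    cut[c][j] = i

    # Walk the stored cut points backwards to recover the classes.
    classes = []
    j = n
    for c in range(k, 0, -1):
        i = cut[c][j]
        classes.append(xs[i:j])
        j = i
    classes.reverse()
    return classes


# Hypothetical mid-point dates (negative = BC), invented for illustration.
dates = [-30, -20, -10, 40, 50, 60, 150, 160, 170, 300, 310, 320]
periods = natural_breaks(dates, 4)
```

With these invented dates, the four classes fall on either side of the three largest gaps. The cubic-time loop is acceptable for a sketch, but a real dataset of this size would probably need binned dates or a faster implementation.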

The main area of interest was networks, described and handled as mathematical graphs, obtained by linking locations where ceramics were produced to locations where the same ceramics were retrieved. This allowed the application of network theory techniques, specifically link analysis, classification and clustering. Roughly speaking, communities, or clusters, are defined as groups of vertices that have a higher probability of being connected to each other than to members of other groups; this can be computed and checked in terms of network links. Identifying significant communities in the network draws attention to the main 'import-export' systems and their dynamics. Analysing pottery from a spatial point of view allows a better understanding of the economic connections underlying traffic flows.
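The construction of such a graph can be sketched as follows. This is a minimal illustration, assuming each data point is a record of the form (production location, find location, ceramic type); the record layout, the placeholder names `P1`, `S1` and so on, and the function `build_network` are all hypothetical.

```python
from collections import Counter

# Hypothetical records: (production location, find location, ceramic type).
records = [
    ("P1", "S1", "type-a"),
    ("P1", "S1", "type-b"),
    ("P1", "S2", "type-a"),
    ("P2", "S2", "type-c"),
    ("P2", "S3", "type-c"),
]


def build_network(records):
    """Return the vertex set and the weighted directed edges of the network.

    Each edge runs from a production location to a find location, and its
    weight counts how many data points support that link.
    """
    vertices = set()
    edge_weights = Counter()
    for origin, find_site, _ceramic_type in records:
        vertices.update((origin, find_site))
        edge_weights[(origin, find_site)] += 1
    return vertices, edge_weights


vertices, edge_weights = build_network(records)
```

Several data points can collapse onto one edge, which is why a network can have far fewer edges than data points.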

This analysis is very valuable for archaeologists as it can provide information about aspects of economics and supply. For example, distribution maps represent the areas where a particular ceramic type was in use. When a distribution map is associated with an area of production or origin, it represents the supply movement of pottery. Even where the correlation between origin and destination (occurrence) can be described well enough to indicate a possible trade route, and so clarify the overall mechanism of the distribution process, it is better to analyse data on a scale larger than a single site can represent. In this way, as evidence grows, it is possible to create maps that show pottery supply and distribution across the region under investigation.

For quantitative information attached to points (e.g. the number of items on a site) it was possible to create more complex distribution maps, but this must be undertaken carefully, as in many cases there is no information about the size of an assemblage, such that some sites may be over-represented. Moreover, working on the variation of the assemblages over time, it was possible to understand the correlation between origins and occurrences in order to visualise the variation of the main route of commercial exchange over the centuries. In fact, on the basis of the disparities exhibited by the ceramic chronologies, it was also possible to identify temporal intervals illustrating different network behaviours, and then analyse these temporal intervals separately. These analyses show the dynamics of the major production sites and main export areas, the increase and decrease of production, and the spheres of influence of the major production poles over time.
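Analysing the temporal intervals separately amounts to building one network per interval. The sketch below is a simplified illustration, assuming each record carries a single representative year, whereas real ceramic chronologies are date ranges; the break years, the records and the function name are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Hypothetical records: (production location, find location, representative year).
dated_records = [
    ("P1", "S1", -20),
    ("P1", "S2", 50),
    ("P2", "S2", 60),
    ("P2", "S2", 70),
    ("P3", "S3", 150),
]

# Hypothetical break years separating three intervals.
breaks = [0, 100]


def networks_by_period(records, breaks):
    """Group weighted edges by the temporal interval that each record falls in.

    Interval 0 holds years before breaks[0], interval 1 holds years from
    breaks[0] up to (but not including) breaks[1], and so on.
    """
    periods = defaultdict(Counter)
    for origin, find_site, year in records:
        index = bisect_right(breaks, year)
        periods[index][(origin, find_site)] += 1
    return periods


period_networks = networks_by_period(dated_records, breaks)
```

Each per-interval edge list can then be clustered and ranked on its own, so that changes in the main routes between intervals become visible.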

We focused on the following tools:

- Classification and Clustering techniques, used to determine whether features of the data can be usefully classified into a number of categories or groups, which may then suggest meaningful interpretations of those categories.
- Dimensionality Reduction techniques, used to extract specific combinations of features that capture most of the information and variability contained within the data. Together, these combinations provide a way to summarise the data and to identify the major sources of variability.
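The source does not say which dimensionality reduction technique was used. As an illustration only, the sketch below uses principal component analysis (PCA), a common choice, and finds the first principal component by power iteration on the covariance matrix. The feature table and function name are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the leading principal direction and its share of total variance."""
    n = len(rows)
    d = len(rows[0])

    # Centre each feature on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix of the features.
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

    # Power iteration: repeated multiplication converges to the eigenvector
    # with the largest eigenvalue (assuming the start is not orthogonal to it).
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Rayleigh quotient gives the variance along the direction v.
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    total_variance = sum(cov[a][a] for a in range(d))
    return v, eigenvalue / total_variance


# Hypothetical table: three numeric features measured on four items.
features = [(1, 2, 0), (2, 4, 1), (3, 6, 0), (4, 8, 1)]
direction, explained_share = first_principal_component(features)
```

In this invented table the second feature is exactly twice the first, so a single combination of features accounts for most of the variability.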

Spatial statistical methods and related predictive modelling were applied directly within a GIS (geographic information system) module. These tools were used primarily to highlight possible patterns in the spatial distribution of the data, and to suggest where to look for more data or information, or which strategies would be optimal for testing.

Particular clustering algorithms were then applied to the resulting graph, and two alternatives were chosen. The first algorithm (Newman 2006) approaches community detection by optimising a benefit function known as 'modularity' over possible divisions of a network. In this algorithm, termed *leading eigenvector clustering*, the maximisation is solved on the basis of the eigenspectrum of the modularity matrix. The second algorithm (Pons and Latapy 2006), termed *walktrap clustering*, follows a completely different approach based on random walks on the graph. The walks are more likely to stay within the same community because only a few edges lead outside it, and in this way the walktrap algorithm captures the community structure of a network. The two algorithms gave very similar results, indicating that the communities identified were well defined and did not depend on the choice of clustering algorithm.
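To make the notion of modularity concrete, the sketch below scores a given division of a small graph. It only evaluates the benefit function and does not implement either clustering algorithm; it also assumes an undirected, unweighted graph for simplicity. The toy graph and the function name are hypothetical.

```python
def modularity(edges, community_of):
    """Modularity Q of a division of an undirected, unweighted graph.

    For each community, Q adds the fraction of edges that fall inside it
    minus the fraction expected if edges were placed at random while
    preserving vertex degrees.
    """
    m = len(edges)
    degree = {}
    internal_edges = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community_of[u] == community_of[v]:
            c = community_of[u]
            internal_edges[c] = internal_edges.get(c, 0) + 1

    # Total degree of the vertices in each community.
    degree_sum = {}
    for vertex, d in degree.items():
        c = community_of[vertex]
        degree_sum[c] = degree_sum.get(c, 0) + d

    return sum(
        internal_edges.get(c, 0) / m - (total / (2 * m)) ** 2
        for c, total in degree_sum.items()
    )


# Toy graph: two triangles joined by a single bridging edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "all" for v in range(6)}

q_split = modularity(toy_edges, two_groups)
q_whole = modularity(toy_edges, one_group)
```

Splitting the toy graph at the bridge gives a clearly positive score, while placing every vertex in one community gives zero. Modularity-based methods search for the division with the highest such score.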

After applying the clustering algorithm, an additional attribute indicating the community was added to the vertices of the graph. Each community was also given a colour so that it could be easily visualised. For the sake of visualisation, the four largest communities by number of vertices were prioritised as the most important. Every other edge or vertex was assigned to an additional (poor) community, made up of the vertices and edges not belonging to the four main communities identified by the clustering. The colours of the vertices therefore represent the communities identified by clustering.
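The colouring step can be sketched as follows, assuming the clustering result is available as a mapping from vertex to community label. The palette, the labels and the function name are hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical palette for the four main communities, plus one for the rest.
PALETTE = ["red", "blue", "green", "orange"]
OTHER_COLOUR = "grey"


def colour_by_community(membership, n_main=4):
    """Map each vertex to a colour, keeping only the n_main largest communities.

    Vertices outside the largest communities share a single residual colour.
    n_main must not exceed the number of colours in PALETTE.
    """
    sizes = Counter(membership.values())
    main = [community for community, _size in sizes.most_common(n_main)]
    colour_of = {community: PALETTE[i] for i, community in enumerate(main)}
    return {
        vertex: colour_of.get(community, OTHER_COLOUR)
        for vertex, community in membership.items()
    }


# Hypothetical clustering result: five communities of decreasing size.
membership = {}
for label, size in [("A", 5), ("B", 4), ("C", 3), ("D", 2), ("E", 1)]:
    for i in range(size):
        membership[f"{label}{i}"] = label

vertex_colours = colour_by_community(membership)
```

Here the smallest community falls outside the four largest, so its single vertex receives the residual colour.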

Another important feature of networks is the relative importance of their vertices: which vertices are more important and central in the network, and why? One measure of such importance is the *out-degree*, i.e. the quantity of ceramics 'exported' from a specific location, and another is the *in-degree*, i.e. the quantity of ceramics 'imported' to a specific location. These two measures show which places produced or imported many ceramics, but networks often have complex structures, so more refined measures of importance have been derived. One of the most useful and effective is PageRank (Brin and Page 1998), an algorithm developed by Google's founders and now in widespread use. PageRank works by counting the number and quality of links to a vertex to estimate its importance. The underlying assumption is that more important vertices in a network are likely to receive more links from other (important) vertices. A vertex has a high PageRank score if many vertices point to it, or if some of the vertices pointing to it have a high PageRank score of their own. PageRank handles both these cases, and everything in between, by recursively propagating weights through the link structure of the network. The PageRank score of each vertex was therefore also computed, giving a further view of the most important vertices. Confirming the complicated structure of the network, the PageRank results differ significantly from those based on in-degree or out-degree.
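The difference between degree and PageRank can be seen on a toy example. The sketch below is a minimal power-iteration version of PageRank with the conventional damping factor of 0.85, assuming that vertices without outgoing links spread their score evenly over all vertices. The graph, weights and function names are hypothetical and do not reproduce the computation carried out on the ArchAIDE network.

```python
def weighted_degrees(edges):
    """Return weighted in-degree and out-degree for each vertex."""
    in_degree, out_degree = {}, {}
    for (source, target), weight in edges.items():
        out_degree[source] = out_degree.get(source, 0) + weight
        in_degree[target] = in_degree.get(target, 0) + weight
    return in_degree, out_degree


def pagerank(edges, damping=0.85, iterations=100):
    """PageRank by power iteration on a weighted directed graph.

    edges maps (source, target) pairs to positive weights.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (source, _target), weight in edges.items():
        out_weight[source] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by vertices with no outgoing links is shared by everyone.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each vertex passes its score along its links, in proportion to weight.
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank


# Toy graph: three minor vertices feed X, and X feeds only Y.
toy_edges = {("a", "X"): 4, ("b", "X"): 3, ("c", "X"): 2, ("X", "Y"): 1}

in_degree, out_degree = weighted_degrees(toy_edges)
scores = pagerank(toy_edges)
```

In this toy graph X has by far the largest in-degree, yet Y obtains the higher PageRank score because its single incoming link comes from the well-connected X. This illustrates how the two kinds of ranking can disagree.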

The work carried out underlines how the availability of a high volume of data (which is often rare in archaeology), joined with explanatory analysis, allows new insights into archaeological research. More generally, this process was related both to the digitisation of existing data, such as the comparative catalogues, and to the datafication of archaeological processes, such as the data created to train the image recognition system within the ArchAIDE app. A key difference between digitisation and datafication relates to data analytics: digitisation uses data analytics based on traditional sampling mechanisms, while datafication fits a Big Data approach and relies on new forms of quantification and associated data mining techniques, permitting more sophisticated mathematical analyses to identify non-linear relationships among data.

Internet Archaeology is an open access journal based in the Department of Archaeology, University of York. Except where otherwise noted, content from this work may be used under the terms of the Creative Commons Attribution 3.0 (CC BY) Unported licence, which permits unrestricted use, distribution, and reproduction in any medium, provided that attribution to the author(s), the title of the work, the Internet Archaeology journal and the relevant URL/DOI are given.

