Internet Archaeol 1. Beardah and Baxter. 3 Examples of the use of univariate KDEs

3 Examples of the use of univariate KDEs

3.1 Introduction to examples of univariate KDEs

In this section we will illustrate some possible applications of KDEs to archaeological data. The first example is based on one given in Baxter and Beardah (1996) and illustrates the use of KDEs for comparative purposes. The second and third examples illustrate the use of adaptive and boundary KDEs (see Univariate KDEs - a non-technical introduction). In the first example the window-widths have been chosen using the STE method (see choice of window-width).

The fourth example applies KDEs to "processed" data, and specific details are provided with the example. In general, data may be processed by transformation which, at its simplest, might involve using logarithms of the raw data, but might also involve creating new variables that are combinations of several originally measured variables. One way of creating new variables is by applying techniques of multivariate analysis (Baxter, 1994) to several variables. In essence it means that univariate KDEs can be added to the tools available for exploring multivariate data, and Example 4 provides one such illustration. Extension of the ideas to bivariate KDEs and trivariate KDEs is direct.

Example 1 - Using KDEs for comparative purposes
Example 2 - Adaptive KDEs
Example 3 - Boundary KDEs
Example 4 - KDEs applied to processed (multivariate) data

3.2 Comparative use of univariate KDEs

It is common in archaeological publications to see histograms used for comparative purposes - to examine the distribution of lengths of flints, or some other continuous measure, from several contexts for example. In our view such usage is often unwieldy, since it can be difficult to make the necessary visual comparisons across a page or even across several pages. Often presentation in a single diagram would be preferable, but histograms are not particularly suited to this. Frequency or cumulative frequency polygons might be used, but the former can be distractingly unsmooth, while the latter can be difficult for the non-specialist to interpret. The smoother nature of a KDE makes it well-suited to comparative presentations, as illustrated here.

Figure 2: KDEs of Calcium Oxide composition of French Medieval glass from four sites

Barrera and Velde (1989) presented data on the chemical composition of specimens of French Medieval glass from eleven sites. The composition with respect to certain oxides is broadly related to different glass-making traditions that may, in turn, have regional and/or chronological associations. Figure 2 compares the distribution of the concentration of calcium oxide for specimens from four sites, using the site numbering from the original paper. Several features are immediately evident: for three of the sites there is a mode at about 12-13%; all the sites, with varying emphasis, have more than one mode; and the distribution for site 6 is similar in shape to that for site 2 but lies slightly to its left. The clearly bimodal distribution for site 9 is also associated with larger values, on average, than the other sites. Interpretation of these, and other, patterns is obviously an archaeological rather than statistical problem, but we would argue that the patterns to be interpreted are made much clearer in this sort of diagram than in a collection of histograms.
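The comparative presentation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the MATLAB routines used for the figures: the function names are hypothetical, the data are synthetic rather than the Barrera and Velde measurements, and a normal-scale window-width is assumed in place of the STE method. The key point is that each group's KDE is evaluated on one shared grid, so the curves can be drawn on a single set of axes.

```python
import numpy as np

def gaussian_kde(x_eval, data, h):
    """Fixed-width Gaussian KDE of `data`, evaluated at the points `x_eval`."""
    x_eval = np.asarray(x_eval, dtype=float)
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def normal_scale_bandwidth(data):
    """Normal-scale rule h = 1.06 * s * n^(-1/5); an assumed stand-in for STE."""
    data = np.asarray(data, dtype=float)
    return 1.06 * data.std(ddof=1) * data.size ** (-0.2)

def comparative_kdes(groups, n_grid=500, pad=4.0):
    """Evaluate one KDE per group on a shared grid so the curves can be overlaid."""
    h = {name: normal_scale_bandwidth(values) for name, values in groups.items()}
    lo = min(np.min(values) - pad * h[name] for name, values in groups.items())
    hi = max(np.max(values) + pad * h[name] for name, values in groups.items())
    grid = np.linspace(lo, hi, n_grid)
    curves = {name: gaussian_kde(grid, values, h[name]) for name, values in groups.items()}
    return grid, curves

# Synthetic percentages for two hypothetical sites (NOT the Barrera and Velde data).
rng = np.random.default_rng(0)
groups = {
    "site A": rng.normal(12.5, 1.0, size=40),
    "site B": np.concatenate([rng.normal(14.0, 0.8, size=25),
                              rng.normal(18.0, 1.0, size=25)]),
}
grid, curves = comparative_kdes(groups)
```

Each entry of `curves` can then be plotted against `grid` with a distinct colour or line type. Because every curve integrates to one, differences in sample size between groups do not affect the visual comparison.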

Figure 3: KDEs of Calcium Oxide composition of French Medieval glass from seven sites (Colouring for Site 1, 2, 6 and 9 is as in Figure 2, in addition: Site 4 - Green, Site 8 - Dark Blue, Site 10 - White)

Figure 2 is essentially that given in Baxter and Beardah (1996). In that paper we restricted comparisons to four sites because we judged that, in a black-and-white publication, inclusion of all sites using different line-types to differentiate between them would have been confusing. Figure 3 is as Figure 2, but using all those sites for which there is a reasonable amount of data. Using both colour and different line types allows, we believe, more information to be usefully included in a single diagram than is possible in black-and-white. Readers are invited to draw their own conclusions about the extent to which Figure 3 is comprehensible, and we invite comment on this.

3.3 Adaptive KDEs

If a distribution has a long tail, possibly containing outliers, this can have an undue influence on the KDE, causing it to be biased. One way round this problem is to use kernels of variable window widths, with the window widths being greater in regions of lower density. This is analogous to the use of varying interval widths in a histogram.

Figure 4: Example of an Adaptive KDE

The MATLAB routines used to generate Figure 4 are based on the description given of adaptive KDEs in Silverman (1986, pp.100-110). Essentially, at each data point the window width, h, is rescaled by a factor that reflects the density at that point. This density is initially unknown and is determined from a pilot estimate. In the example to follow an initial STE estimate (see choice of window-width) was used to obtain the pilot estimate and value of h.

Figure 4 uses data on sixty Bronze Age cup diameters, which are also used in Figure 28, Figure 29 and Figure 30. It compares the original STE estimate with the adaptive estimate derived from it. It can be seen that the bimodality of the data is accentuated, while the bump to the right, arising from some outlying values, is smoothed out.
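The rescaling step can be made concrete with a short Python sketch. It follows the general scheme in Silverman (1986): a pilot estimate is evaluated at each data point, and the window width at that point is multiplied by (pilot density / geometric mean of pilot densities) raised to the power -alpha, with alpha = 0.5 a common choice. This is a sketch under stated assumptions rather than the MATLAB routines themselves: the function names are hypothetical, the sample is synthetic, and a normal-scale width is assumed for the pilot where the text uses STE.

```python
import numpy as np

def gaussian_kde(x_eval, data, h):
    """Fixed-width Gaussian KDE of `data`, evaluated at the points `x_eval`."""
    x_eval = np.asarray(x_eval, dtype=float)
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def adaptive_kde(x_eval, data, h, alpha=0.5):
    """Adaptive Gaussian KDE with one window width per data point.

    h is the pilot (global) window width; alpha is the sensitivity parameter.
    Returns the density at x_eval and the per-point window widths.
    """
    x_eval = np.asarray(x_eval, dtype=float)
    data = np.asarray(data, dtype=float)
    pilot = gaussian_kde(data, data, h)        # pilot density at each data point
    g = np.exp(np.mean(np.log(pilot)))         # geometric mean of the pilot values
    widths = h * (pilot / g) ** (-alpha)       # wider windows where density is low
    u = (x_eval[:, None] - data[None, :]) / widths[None, :]
    kernels = np.exp(-0.5 * u**2) / (widths[None, :] * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1), widths

# Synthetic sample with two outlying values (NOT the cup-diameter data).
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(10.0, 1.0, size=50), [20.0, 22.0]])
h = 1.06 * data.std(ddof=1) * data.size ** (-0.2)   # assumed normal-scale pilot width
grid = np.linspace(-30.0, 70.0, 2001)
density, widths = adaptive_kde(grid, data, h)
```

In this sketch the outlying points receive the largest window widths, which is what flattens an isolated bump in the tail, while points in dense regions receive widths below h, which sharpens the modes.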

3.4 Boundary KDEs

One possible use of KDEs in archaeology that we have exploited is to examine the distribution of elements or oxides in artefact compositions. The statistical methods used to process such data often assume that the distributions are normal or log-normal, and inspection of KDEs allows a quick informal assessment of this assumption. The measurements involved are bounded below by zero and for some oxides may be close to this boundary. This gives rise to the problem, noted in the section on Univariate KDEs - a non-technical introduction, that the basic KDE will include negative values that are impossible in practice.

One way round the difficulty is to use a boundary kernel. Our MATLAB routines implement a method described by Jones (1993) that takes into account the proximity of the boundary and ensures that it is not crossed. The kernels associated with data points near the boundary may be asymmetric and can take on negative values. The example used to illustrate this is based on one in Beardah and Baxter (1996), and uses data from the same source as that used for Figure 2 and Figure 3. In those figures, data on the calcium oxide content of specimens of French Medieval glass were used. In Figure 5 KDEs for the soda content are shown. Here, to illustrate the method, a normal-scale estimate of h (see choice of window-width) is used; the basic KDE clearly extends over values that are negative, in contrast to the boundary kernel estimate. Also shown is a "reflected" KDE, obtained by doubling the amount of data by reflecting it at the origin and then applying the usual KDE.

Figure 5: Boundary KDEs of Soda content of French Medieval glass
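The "reflected" estimate is simple enough to sketch in Python. The sketch below covers only the reflection approach, not the Jones (1993) boundary kernel implemented in the MATLAB routines; the function names, the synthetic data and the window width are all assumptions made for illustration. The data are mirrored at the origin, the usual KDE is applied to the doubled sample, and the result is doubled and kept only for non-negative values so that it still integrates to one.

```python
import numpy as np

def gaussian_kde(x_eval, data, h):
    """Fixed-width Gaussian KDE of `data`, evaluated at the points `x_eval`."""
    x_eval = np.asarray(x_eval, dtype=float)
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def reflected_kde(x_eval, data, h):
    """KDE for data bounded below by zero, using reflection at the origin."""
    x_eval = np.asarray(x_eval, dtype=float)
    data = np.asarray(data, dtype=float)
    doubled = np.concatenate([data, -data])          # mirror the sample at zero
    dens = 2.0 * gaussian_kde(x_eval, doubled, h)    # factor 2 restores unit area on x >= 0
    return np.where(x_eval >= 0.0, dens, 0.0)        # no density below the boundary

# Synthetic non-negative sample piled up near zero (NOT the soda measurements).
rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=200)
h = 0.3                                              # illustrative window width
grid = np.linspace(-3.0, 30.0, 3301)
basic = gaussian_kde(grid, data, h)                  # leaks probability below zero
reflected = reflected_kde(grid, data, h)             # confined to x >= 0
```

Comparing `basic` and `reflected` on the same axes shows the basic estimate spilling over the boundary. A known property of reflection with a symmetric kernel is that the estimate has zero slope at the boundary, which may or may not suit the data.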

3.5 KDEs applied to processed data

It is possible to apply univariate KDEs to more complex data that have been "processed" in some way, and we give one such example in detail to illustrate the possibilities. Data on the chemical composition of 271 specimens of Early Medieval glass from excavations at Southampton have been used (Heyworth, 1991). A principal component analysis (PCA) was undertaken using standardised values of eleven major/minor oxides of the composition.

PCA is a standard multivariate technique for investigating pattern in complex multivariate data (Baxter, 1994). It transforms the 11 original variables into 11 new variables that are linear combinations of the originals. The hope is that most of these new variables are "unimportant" so that any structure in the data, such as groups, can be found by looking at plots based on the first two or three new variables or components. This hope was realised in the present instance; after eliminating seven clear multivariate outliers a bivariate plot of the first two components showed that there were two main concentrations of points in the data (5.6). These two concentrations were strongly associated with two colour groups - light-green and light-blue - identified by Heyworth (1991) that, in turn, are strongly related to the glass chemistry. In particular there was a very high correlation (0.89) between the first component and the concentration of iron. Since the structure in the data seems to be largely associated with the first component, univariate KDEs can be used to display this information in various ways.
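For readers who want to see the processing step spelled out, the following Python sketch computes principal component scores from standardised variables using a singular value decomposition. It is an illustration only: the function name is hypothetical, and the data are synthetic with an artificial two-group structure, not the Southampton glass compositions. The scores on the first component are the univariate values that would then be passed to a KDE.

```python
import numpy as np

def pca_standardised(X):
    """PCA of column-standardised data via the SVD.

    Returns the component scores, the loadings (rows of Vt) and the
    proportion of variance accounted for by each component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardise each variable
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                                     # equivalent to Z @ Vt.T
    explained = s**2 / np.sum(s**2)
    return scores, Vt, explained

# Synthetic data: 100 specimens, 11 variables, two groups that differ in the
# first three variables (NOT the Southampton glass data).
rng = np.random.default_rng(3)
group = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 11))
X[:, :3] += 4.0 * group[:, None]

scores, loadings, explained = pca_standardised(X)
first_component = scores[:, 0]                         # input for a univariate KDE

# Correlation between the first component and one original variable.
# The sign of a component is arbitrary, so only the magnitude is meaningful.
r = np.corrcoef(first_component, X[:, 0])[0, 1]
```

Because the sign of each component is arbitrary, which group appears to the left or right in a KDE of the scores can differ between software packages without changing the interpretation.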

The distribution of scores on the first component has longish tails so an adaptive KDE (3.3) was used and the result is shown in Figure 6.

Figure 6: Adaptive KDE of First Principal Component - Southampton Glass data

The modes to the left are associated with light-green glass that tends to have a lower iron concentration; the larger peak to the right is associated with light-blue glass. The disparity in the height of the modes simply reflects the fact that there are more light-blue specimens. To get a clearer picture of the colour differences, where the visual impression is not affected by sample size differences, separate KDEs can be plotted for each colour group on the same graph. This is done in Figure 7, using adaptive KDEs, with the first component replaced by the highly correlated iron concentration. There is some overlap in the distributions, but the graph shows clearly that the light-blue glass tends to have a higher iron concentration.

Figure 7: Adaptive KDEs by glass colour - Southampton glass data

Further analysis of this data set is given in 5.6 where bivariate kernel density estimates are used. The idea of separate plotting for different groups, introduced above, will be extended to the bivariate case.
