## 9 Kernel Density Estimation Demonstrations - Help Sheet

### Important Note:

What follows is an on-line copy of Help sheet notes designed to introduce the MATLAB Kernel Density Estimation Toolbox. The screenshots you see here are just that: there is NO interactivity. The software described here is free to anyone interested enough to download it; however, to run the software you will of course need MATLAB. All I ask in return is that you
• acknowledge use of this material (if you find a use for it);
• let me know via email of any bugs or suggestions for improvement.
Christian C. Beardah (25/3/96)

### 9.1 kdedemo1 - The Univariate case

The figure below shows a typical screen shot of the univariate demonstration software. This demo is invoked by typing kdedemo1 at the MATLAB prompt. Following the figure are some suggestions of things to try using the demo.

Figure 31: One dimensional KDE demonstration

1. Click on the Data menu with the Left Mouse Button (LMB). Choose another Dataset. The example datasets (not all archaeological!) are as follows:

(a) Suicide Times: the length of treatment spells (in days) of 86 control patients in a suicide study;

(b) Old Faithful: 107 measurements of the eruption length (in minutes) of the Old Faithful geyser in Yellowstone National Park, USA;

(c) Buffalo Snowfall: Taken from Silverman (1986). 63 measurements of total yearly snowfall (inches) taken at Buffalo, USA;

(d) Pot Diameters: the rim radii of 81 Danish neolithic pots;

(e) Hairpin Lengths: the lengths of 224 Romano-British hairpins from southern Britain;

(f) Cup Diameters: the rim diameters of 60 Bronze Age cups from Italy;

(g) Nickel: The percentage nickel content of 361 French medieval glass fragments;

(h) Manganese: The percentage manganese content of 361 French medieval glass fragments;

(i) N(0,1): 500 observations from a standard normal density (randomised);

(j) NMD: 500 observations from the Normal Mixture Density used by Wand and Jones (1995) (randomised);

(k) Small: A small dataset of 7 observations which can be used in conjunction with the "Add bumps" button to show how KDEs are formed as the sum of bumps.

2. Kernel selection: The default kernel function is the normal probability density function. To change the kernel, click with the LMB on the small downward pointing arrow to the right of the word "Normal". A menu will appear. To select another kernel function simply click on the name of the kernel function. To see the shape of the kernel function choose the "Small" dataset and add the bumps.

Figure 32: A KDE as a sum of "bumps" (Laplace kernel)
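
As a rough illustration of how a KDE is formed as a sum of bumps, the following sketch evaluates a fixed-bandwidth normal-kernel estimate by hand using only base MATLAB. The seven data values, the bandwidth h = 0.8 and all variable names are made up for illustration; they are not the "Small" dataset and this is not the toolbox's own code.

```matlab
% Hypothetical data: 7 made-up observations (NOT the toolbox's "Small" dataset).
x = [1.2 2.0 2.3 4.1 4.4 5.0 7.5];
h = 0.8;                          % smoothing parameter, chosen arbitrarily
n = numel(x);

% Evaluation grid extending 4 bandwidths beyond the data.
xg = linspace(min(x) - 4*h, max(x) + 4*h, 401);

% Normal kernel: the standard normal probability density function.
K = @(u) exp(-0.5*u.^2) / sqrt(2*pi);

% One scaled bump per observation: row i holds K((xg - x(i))/h) / (n*h).
bumps = zeros(n, numel(xg));
for i = 1:n
    bumps(i, :) = K((xg - x(i)) / h) / (n*h);
end

% The KDE is the sum of the bumps.
fhat = sum(bumps, 1);

% Plot the individual bumps (dashed) and their sum (solid black).
plot(xg, bumps', '--', xg, fhat, 'k-');
xlabel('x'); ylabel('density estimate');
```

Swapping the definition of `K` for another density changes the shape of every bump, which is the effect the kernel menu lets you explore.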

3. Adaptive KDEs: You can turn the adaptive method on and off by clicking in the box to the left of the word "Adaptive". If the box is empty the adaptive method is NOT used. Again, choosing "Add bumps" will show how the Adaptive KDE is formed.

4. Selecting the value of h: To vary the value of the smoothing parameter h, click somewhere on the numeric value of h (in the box at the top right of the demo). A small cursor will appear. You can now delete the current value and type in your own value using the keyboard. Try increasing or decreasing the value from that which is displayed above the KDE plot. For example, for the Cup diameters data, the default value (calculated using a normal scale rule) is 2.46, so try values of h=3, 2, 1.5, and 1 to see what effect the smoothing parameter has. Note that if the data are multimodal then the normal scale value of h will tend to oversmooth. In such cases it is particularly important to reduce the value of h.

5. Automatic selection of h: Many methods of automatically selecting an "optimal" value of h have been developed. Several of these have been implemented and are available to try within the demo. The default method of h selection is the normal scale rule, which will tend to oversmooth non-normal data. Clicking on the small arrow to the right of "Normal scale" will reveal a list of other h selection strategies, including Direct-Plug-In (DPI), Solve-The-Equation (STE) and various Cross-Validation (CV) methods. At this time, the methods based on Cross-Validation are rather slow; hopefully this will be improved in the future. While there is no overall "best" method of automatic h selection, Wand and Jones (1995) suggest that the STE method offers good overall performance.

6. Canonical kernels and fixing the value of h: Different canonical kernels give similar KDEs for the same value of h. The same is not true of standard kernels. To see this, fix a value of h, turn Hold on, and plot several KDEs with different kernel functions. With standard kernels the resulting KDEs will differ markedly; with canonical kernels they should all appear roughly the same.

7. Producing multiple plots and changing the print style and colour: Say you wanted to compare the KDEs obtained using different values of h by plotting them on the same axes. To do this, obtain the first plot (for example use h=1 for the Cup diameters data) then click in the box to the left of the word "Hold". Subsequent plots will now be added to the same axes allowing comparisons to be made directly. To see this, try changing the value of h to 2. You can also change the plot style of the next plot by selecting an entry from the "Print style" menu. This is useful when a colour printer is not available. A recent addition (not illustrated on the figures in this document) is a "Colour" menu, which allows you to choose the colour of the next plot.

8. Plotting Histograms: This can be achieved using the histogram check box and the number-of-bins field.

Clicking the check box turns the histogram plot on and off, while you can alter the number of bins by entering a new value in the appropriate place. Note that the histogram is added to the current plot.
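
For a histogram to be comparable with a KDE on the same axes, its bar heights need to be on a density scale. The sketch below shows one way to do this in base MATLAB; the random data and the choice of 10 bins are made up, and whether the demo scales its histogram in exactly this way is an assumption.

```matlab
% Hypothetical data: 200 standard normal observations.
data = randn(200, 1);
n = numel(data);
nbins = 10;                                % number of bins (user's choice)

% Bin counts and bin centres over the range of the data.
[counts, centres] = hist(data, nbins);
binWidth = centres(2) - centres(1);

% Rescale the counts so that the total area of the bars is 1.
dens = counts / (n * binWidth);

% Draw the bars with no gaps between them.
bar(centres, dens, 1);
xlabel('x'); ylabel('density');
```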

9. Sampling: You can take samples from a dataset and study the difference between the KDE of the sample and the KDE of the whole dataset using the sampling controls.

Ssize controls the size of the sample, while the check box lets you choose sampling with or without replacement. The default is sampling without replacement. Using this feature together with "Hold" lets you plot multiple KDEs on the same axes for comparison.
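
The difference between the two sampling modes can be sketched at the MATLAB prompt as follows. This is only one common way of drawing such samples; how the demo draws them internally is not documented here, and the data and variable names are hypothetical.

```matlab
% Hypothetical stand-in for a dataset: 500 standard normal observations.
data = randn(500, 1);
n = numel(data);
ssize = 50;                       % sample size (the role played by "Ssize")

% Without replacement: take the first ssize entries of a random permutation,
% so no observation can be selected twice.
perm = randperm(n);
sampleNoRep = data(perm(1:ssize));

% With replacement: draw ssize indices independently and uniformly from 1..n,
% so the same observation may be selected more than once.
idx = randi(n, ssize, 1);
sampleRep = data(idx);
```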

10. Exiting: Click on the "End" button to exit.
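
The effect of h described in items 4, 5 and 7 above can also be reproduced outside the demo. The sketch below computes the normal scale value of h for a normal kernel, h = (4/(3n))^(1/5) × scale, and overlays KDEs for three values of h. It assumes the sample standard deviation as the scale estimate; the toolbox may use a different estimate, so its values (such as the 2.46 quoted for the Cup diameters data) need not be reproduced by this formula. The bimodal sample is made up.

```matlab
% Hypothetical bimodal sample (made up for illustration).
data = [randn(100, 1); 4 + 0.5*randn(100, 1)];
n = numel(data);

% Normal scale rule for a normal kernel, using the sample standard deviation
% as the scale estimate (an assumption; the toolbox may differ).
hNS = (4 / (3*n))^(1/5) * std(data);

% Evaluation grid and a fixed-bandwidth normal-kernel KDE.
% bsxfun forms the n-by-m matrix of differences xg(j) - data(i).
xg = linspace(min(data) - 3*hNS, max(data) + 3*hNS, 301);
kde = @(h) mean(exp(-0.5*(bsxfun(@minus, xg, data)/h).^2), 1) / (h*sqrt(2*pi));

% Overlay estimates for decreasing h, as with the "Hold" check box.
plot(xg, kde(hNS), 'k-');
hold on
plot(xg, kde(hNS/2), 'b--');
plot(xg, kde(hNS/4), 'r:');
hold off
legend('h = hNS', 'h = hNS/2', 'h = hNS/4');
```

With multimodal data such as this, the curve for h = hNS will usually look smoother than those for the smaller values, in line with the remark that the normal scale rule tends to oversmooth.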

### 9.2 kdedemo2 - The Bivariate case

The figure below shows a typical screen shot of the bivariate demonstration software. This demo is invoked by typing kdedemo2 at the MATLAB prompt. Following the figure are some suggestions of things to try using the demo. The default dataset is "Cup diameter/height". The five datasets supplied are:

(a) Glass composition: Na2O and MgO composition of 361 samples of French medieval glass.

(b) Artifact location: x and y co-ordinates describing the location of 276 bone splinters (Mask site data).

(c) Cup diameter/height: Neck diameter and height of 60 Bronze Age cups from Italy.

(d) Leicester/Mancetter: The first two components of a PCA based upon chemical composition of 105 specimens of Romano-British waste glass.

(e) N(0,1): 500 observations from the bivariate standard normal density.

Figure 33: Two-dimensional KDE demonstration

1. Data, Kernel, Adaptive, Hold, End and Help functions are similar to those in the univariate demo and are not described further here. Note that "Hold" only works for 2D plots such as contours and percentage contours.

2. Altering the angle of view: Two sliders perform this function. You can alter the viewing angle to the left and right, or up and down. An overhead view is often quite useful and is obtained by sliding the vertical slider all the way to its highest position.

3. Altering the colour scheme: Many different colour schemes are provided via the Colour menu. Use this in the same way as the Data menu. If you don't like using colour, select "white".

4. Smoothed shading: Clicking on the check box to the left of the word "Shading" causes the 3D plot to be smoothed. This can be quite effective when viewed from above (see point 2 above). Clicking the box again turns off the shading.

5. Mesh Granularity: This control alters the size of the grid on which the KDE is plotted; click on it and type in a new value. We have used a default value of 32 as a compromise between speed and realism. Increasing this value to, say, 64 will result in a more pleasing image, but longer computing times. The minimum value supported is 8.

6. Altering the value of h: This works in the same way as for the univariate demo, except that as the data are bivariate, two smoothing parameters are used, one each for the x and y directions. Click on the displayed values of h to alter them. For example, consider the Cup data. The normal scale values of the smoothing parameters are 2.379 and 0.5091. To change these to, say, 2 and 1 respectively, enter [2,1]. (The square brackets are necessary; the comma is optional, so just use a space if you wish.) Entering a single number, e.g. 1.5, sets both smoothing parameters to that value. Automatic selection of the h values is again supported via the appropriate pop-up menu, though the approach taken is one of calculating separate h values for each variable using univariate techniques.

7. Contouring: A contour plot of the KDE can be obtained by selecting "Contour" from the pop-up menu in the top left corner of the figure. To return to a Surface view, select "Surface". You can add a scatter plot of the data to the contour plot by clicking on the "Scatter" check box.

8. Percentage Contouring: Because this technique requires estimates of the height of the KDE function at all the data points, it can be time consuming on slower PCs or for large datasets. When the calculation has been performed a percentage contour plot is presented. To change the position of the contour lines, use the Contour %'s box in the lower left of the figure. For example, if you wanted to draw the 50% and 100% contours, enter [50,100]. To draw the 10%, 20%, ..., 90% contours, enter [10:10:90].
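
To make the roles of the two smoothing parameters and the mesh granularity concrete, here is a minimal sketch of a bivariate KDE with a product of normal kernels, evaluated on a 32-by-32 grid and drawn as a contour plot with the data overlaid. The product-kernel form, the random data and all variable names are assumptions for illustration; this is not the toolbox's own code.

```matlab
% Hypothetical bivariate sample: columns are x and y (made up).
data = [10 + 3*randn(60, 1), 5 + randn(60, 1)];
n = size(data, 1);
h = [2 1];                 % smoothing parameters for the x and y directions
m = 32;                    % mesh granularity: an m-by-m evaluation grid

xg = linspace(min(data(:,1)) - 3*h(1), max(data(:,1)) + 3*h(1), m);
yg = linspace(min(data(:,2)) - 3*h(2), max(data(:,2)) + 3*h(2), m);

% Normal kernel contributions in each direction.
% Kx is n-by-m: row i is the x-direction bump of observation i on xg.
Kx = exp(-0.5*(bsxfun(@minus, xg, data(:,1))/h(1)).^2) / (h(1)*sqrt(2*pi));
Ky = exp(-0.5*(bsxfun(@minus, yg, data(:,2))/h(2)).^2) / (h(2)*sqrt(2*pi));

% Product-kernel estimate: fhat(j,k) = (1/n) * sum_i Ky(i,j) * Kx(i,k),
% so rows index y and columns index x, as contour and surf expect.
fhat = (Ky' * Kx) / n;

contour(xg, yg, fhat);
hold on
plot(data(:,1), data(:,2), 'k.');   % scatter plot of the data
hold off
```

Replacing `contour` with `surf(xg, yg, fhat)` gives a surface view of the same estimate, and raising `m` gives a finer grid at the cost of more computation.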

### 9.3 Importing your own data

In either the uni- or bivariate case it is possible to import your own data into the demo routines. For example, at the MATLAB prompt typing

>> data=randn(100,1);
>> kdedemo1

will invoke the univariate demo with a random sample of 100 data points from a standard normal density. If you have data stored on disk in ASCII format, use the load command (type help load at the MATLAB prompt for details) to import the data into MATLAB. For use with the KDE demos your data must be stored in a matrix named data.
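
For data held in an ASCII file, the steps might look like the sketch below. The file name mydata.txt and its contents are hypothetical; the file is written first only so that the example is self-contained. The column layout is an assumption based on the randn(100,1) example above.

```matlab
% Write a small made-up ASCII file in the temporary directory.
x = [2.1; 3.4; 3.9; 5.0; 6.2];
fname = fullfile(tempdir, 'mydata.txt');   % hypothetical file name
save(fname, 'x', '-ascii');

% Load the file; assigning the output to "data" avoids relying on the
% variable name that load would otherwise derive from the file name.
data = load(fname);

% Store the observations as a single column, matching randn(100,1) above.
data = data(:);

% With the KDE toolbox on the MATLAB path, the demo can now be started:
% kdedemo1
```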

Since MATLAB is always running in the background you have a great deal of flexibility in influencing the demo routines. The following show just a couple of the things that are possible.

Example 1: Changing the axis limits.

When you have a 2D image in the demo Window (this will occur only when you're using the Bivariate demo) you can alter the axes by clicking in the MATLAB command window and using the axis command to define the x and y axis limits. This command works like so:

>> axis([xmin xmax ymin ymax])

where xmin, xmax, etc. are numerical values separated by spaces.

Example 2: Investigating subgroups within data.

The Leicester/Mancetter dataset has been divided into two groups based upon site of origin. These two groups have been stored in lmgp1 and lmgp2. At the MATLAB prompt enter

>> data=lmgp1(:,1:2);

This assigns columns 1 and 2 of the dataset lmgp1 to data. Returning to the demo Window, select "Percent Contour" to see a contour plot of the dataset. Now click on "Hold" and select a new "Print style" from the appropriate menu. Returning to the MATLAB command Window type:

>> data=lmgp2(:,1:2);

Now return to the demo Window and click on "Normal" from the Kernels pop-up menu. This forces the routine to do a calculation based on the new dataset and the percentage contour will appear on the same axes as that for the first group of the dataset.

Figure 34: Investigating subgroups within data