Introduction

This document presents the results from my Topics in Machine Learning final project at Brown University, written in December 1996. The project was a “bakeoff” for machine learning algorithms that learned classification problems (categories instead of continuous functions). I discuss my implementation and results for Nearest-Neighbor and k-Nearest-Neighbor, two well-known machine learning algorithms, on three classification datasets. In addition, I introduce and discuss the Agglomerative-Nearest-Neighbor algorithm, a variation of nearest-neighbor that clusters points together.

Datasets

Three classification datasets from the UC Irvine Machine Learning Repository [3] were chosen for testing:

Wisconsin Breast Cancer from Dr. William H. Wolberg at the University of Wisconsin Hospitals. Records in the dataset represent the results of breast cytology tests and a diagnosis of benign or malignant. Attributes include clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. All values were integers in the range of 1-10.
BUPA Liver Disorders from Richard S. Forsyth at BUPA Medical Research Ltd. Each record in the dataset constitutes the record of a single male individual. The six attributes in each record are results from blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption, as well as the number of drinks per day.
Wine Recognition Data from Forina, M. et al at the Institute of Pharmaceutical and Food Analysis and Technologies. Each record represents results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

Nearest-Neighbor

The nearest-neighbor algorithm is extremely simple to implement and leaves itself open to a wide variety of variations. In brief, the training portion of nearest-neighbor does little more than store the data points presented to it. When asked to make a prediction about an unknown point, the nearest-neighbor classifier finds the closest (according to some distance metric) training-point to the unknown point and predicts the category of that training-point.

Put more formally, given a collection of points S and a query point q, what is the point x closest to q in S? The nearest-neighbor classifier predicts that the category of q is the same as the category of x.

Although nearest-neighbor can be efficiently implemented with a Vornoi diagram, I took the brute force approach. This method computes the distance from q to every point x_i in S.

Distance Metric

I chose to use plain Euclidean distance for the distance metric in all three algorithms. The distance from point x to q (where both x and q are d-dimensional vectors) is:

d(x, q) = sqrt((x – q)^T(x – q))

K-Nearest-Neighbor

The k-nearest-neighbor algorithm is a variation of nearest-neighbor. Instead of looking at only the point x that is closest to q, we instead look at the k points in S = {x₁, x₂, … , x_k} that are closest to q. Since we know the categories of each of the k nearest points, we find the majority category in this subset. If it exists, the query point q is predicted to be in the that category. If no majority exists, the result is considered inconclusive.

K-nearest-neighbor is more general than nearest-neighbor. Put another way, nearest-neighbor is a special case of k-nearest-neighbor, where k = 1. For the tests in this project, I chose k = 3, k = 5, andk = 7.

Agglomerative-Nearest-Neighbor

The agglomerative-nearest-neighbor algorithm is a local learning algorithm for classification. The idea is to capture the essence of nearest-neighbor without having to store all of the points. As the classifier sees new training points, it clusters points points together that predict the same category. The approach is named agglomerative-nearest-neighbor because points are put together in clusters as the learning process goes on. Once the clusters are formed, the prediction for unknown points (discussed below) is very similar to nearest neighbor.

Training

A cluster x_i = (c_i, r_i, g_i) where c_i is the center of a hypersphere with radius r_i, predicting category g_i. During the training stage, a new cluster (p, r_new, g) is formed for each training data (p, g), where r_newis a default radius for new clusters. For my tests, I arbitrarily chose r_new = 1.0.

Once initialized, the new cluster is then compared to every cluster in the system. Two clusters that are close together (within some epsilon of each other according to the distance metric d) and classify the same category are combined to build a new aggregate cluster (a hypersphere of a larger radius). If the new cluster is not close to any other clusters, it remains by itself. For my tests, I chose epsilon = 2.15.

When two clusters are merged together, the center and the radius are adjusted to be the average of the two clusters. If the center of cluster x_a is inside another cluster x_b, the inner cluster x_a is removed from the system and the outer cluster x_b remains unchanged.

Pre-processing and normalization

Since the Euclidean distance function treats every dimension equally, it may be necessary to normalize the data. By analyzing all of the training data in a pre-processing stage, we can determine the range of every attribute and transform the entire dataset appropriately [1]. Normalizing every point involves scaling every attribute between 1 and n. I chose n = 10.0 to facilitate merging of clusters. Forepsilon = 2.15, n << 10.0 would result in overlapping of conflicting classifiers (discussed below) and n >> 10.0 would result in degeneration to nearest-neighbor (also discussed below).

For some of the test datasets, normalization greatly decreased the misclassification rate and for others it actually contributed to more misclassified points. The wine recognition dataset, for example, had a wide variation among the ranges of its thirteen attributes. One attribute had values in the range of 0.1 – 1.7, while another ranged from 278 – 1403. Without pre-processing, the large-valued attribute dominated all distance comparisons, so performance was poor.

Normalization of the bupa liver disorders dataset, on the other hand, increased the misclassification rate. In this case, one or more attributes with large values probably contributed more to the classification of the data than the other attributes. By scaling all attributes equally, the relative importance of these attributes was diminished and other attributes were given equal weight in the clustering.

Advantages: space-efficiency

The goal of agglomerative-nearest-neighbor was to take advantage of the redundancy of data and save memory over nearest-neighbor. This goal was achieved with varying success depending on the dataset.

For the breast-cancer-wisconsin dataset, agglomerative-nn stored an average of only 45% of the points needed by nearest-neighbor and k-nearest-neighbor. For the normalized bupa and wine datasets, agglomerative-nn stored an average of 30% and 92%, respectively, of the points needed by nn and k-nn. The lesser memory savings for the normalized wine dataset are likely due to the compactness of that data; there was little redundancy to be removed by clustering data points.

Space efficiency suffered on the datasets that were not pre-processed. For bupa, 99% of the data was needed, and for wine 100% of the data was needed. As discussed below, agglomerative-nn degenerated into nearest-neighbor in these cases because the epsilon was set too high (clusters never merged). Looking at the error rates for these sets shows that agglomerative-nn behaved quite similarly to nearest-neighbor.

Problems with the approach

Space-efficiency and accuracy tradeoff

Since the agglomerative-nn algorithm throws away some data, performance cannot be any better than nearest-neighbor. In other words, the algorithm trades some accuracy for space-efficiency. This effect is most notable in the normalized bupa dataset: as noted above, agglomerative-nn used only 30% of the data-points needed by nearest-neighbor, but misclassified over 10% more query points. See theresults section for specifics.

Overlap of clusters

In the current implementation, clusters occasionally grow even when doing so overlaps with nearby clusters that classify differently. Suppose two nearby clusters x_m and x_n that classify into categories C_mand C_n. Clearly agglomerative-nn wouldn’t merge them. However, what happens when we introduce new training data x_new for category C_m that is right between x_m and x_n and is within the epsilon of both? Since x_m predicts the same category as the new data, the radius of x_m‘s hypersphere grows a little to accommodate the new data. Eventually, with enough x_new‘s of this type, either x_m or x_n could grow so that it overlaps the other nearby cluster. Then, query points in the area of the overlap could easily be misclassified.

One solution to the overlap problem might be to avoid merging clusters that are close to other clusters that classify differently. That is, if x_new causes x_m to grow, check to see if x_m overlaps x_n. If so, leavex_new as a separate cluster. This avoids overlap and blurring of the area of expertise.

Degeneratation

Another interesting problem is that agglomerative-nn degenerates into nearest-neighbor when all of the training points are far away from each other. That is, if no two points are within some epsilon of each other, they stay as separate clusters instead of being merged together. For every dataset there’s probably an ideal epsilon, but if it is chosen to be too large, clusters will never merge and we’re left with nearest-neighbor.

If epsilon is chosen too low, then every point of a given class is merged with every other point of that class; the system degenerates into c clusters where c is the number of categories in the dataset. Even for a linearly separable dataset, overlap of clusters is bound to occur because clusters are hyperspheres, not ellipsoids or rectangular prisms.

Test results summary

Eight training and test runs were performed for each algorithm using a randomly chosen 90% of the data for training and 10% for testing. With the bupa and wine datasets, tests were run with both the plain data as it appeared in the dataset, as well as normalized.

The average misclassification rates for several datasets on each of the three algorithms are reported below. As predicted, the agglomerative-nn approach had a higher misclassification rate than nearest-neighbor for all datasets.

breast-cancer-wisconsin, breast cancer data
- nearest-neighbor: 3.13%
- 3-nearest-neighbor: 2.02%
- 5-nearest-neighbor: 1.84%
- 7-nearest-neighbor: 1.84%
- agglomerative-nearest-neighbor: 5.7%
bupa, liver disorders
- nearest-neighbor: unweighted 40.81%, normalized 37.5%
- 3-nearest-neighbor: unweighted 37.5%, normalized 39.34%
- 5-nearest-neighbor: unweighted 39.34%, normalized 38.24%
- 7-nearest-neighbor: unweighted 35.29%, normalized 40.07%
- agglomerative-nearest-neighbor: unweighted 38.6%, normalized 48.53%
wine, wine recognition
- nearest-neighbor: unweighted 24.26%, normalized 2.21%
- 3-nearest-neighbor: unweighted 27.21%, normalized 2.94%
- 5-nearest-neighbor: unweighted 24.26%, normalized 1.47%
- 7-nearest-neighbor: unweighted 24.26%, normalized 1.47%
- agglomerative-nearest-neighbor: unweighted 33.82%, normalized 2.94%

Conclusions

The agglomerative-nearest-neighbor algorithm worked quite well with very redundant datasets. Although the average misclassification rates were higher with agglomerative-nn than nearest-neighbor for the breast cancer and wine recognition datasets, the memory savings were quite noticable. Other redundant datasets would probably see similar improvements.

Unfortunately, agglomerative-nn is substantially slower than nearest-neighbor in the training stage, since nearest-neighbor merely stores points during training. Since agglomerative-nn compares the new cluster to every other cluster in the system, worst case time-complexity is O(N²). For large datasets (more than 1000 records), agglomerative-nn is too slow to see real-time results. With nonredunant datasets, an efficient nearest-neighbor implementation would be preferred over agglomerative-nn for both speed and accuracy.

Choice of parameters is an area that needs more exploration. Although the choice for epsilon = 2.15 was a function of the values used for r_new and the normalization factor (1.0 and 10.0, respectively), these values were chosen somewhat arbitrarily. Varying r_new and the normalization factor with the number of attributes in a dataset might lead to better accuracy because as dimensionality grows, a radius of 1.0 becomes less and less significant.

Some more careful weighting of data in the pre-processing stage could also lead to better accuracy in all three algorithms. Heavily weighting those attributes that more greatly contribute to the classification of points would skew the distances more in the dimension that matters most.

References

Bishop, C. 1995. Neural Networks for Pattern Recognition, p. 6-7. Oxford: Clarendon Press.
Schaal, S., and Atkeson, C. 1996. From Isolation to Cooperation: An Alternative View of a System of Experts. In: Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., eds., Advances in Neural Information Processing Systems 8, pp. 605-11. Cambridge, Massachusetts: MIT Press. ftp://ftp.cc.gatech.edu/pub/people/sschaal/schaal-NIPS95.html.
Merz, C.J., & Murphy, P.M. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

Acknowledgements

Thanks to Prof. Leslie Pack Kaelbling for her suggestions on the paper and guidance in the machine learning seminar. Thanks also to Fred Piccioto and Jason Lango for algorithmic suggestions, and to Mike Cafarella for coining the name “agglomerative-nearest-neighbor”.

Source code (in Common Lisp)

learning.lsp, the nearest-neighbor, k-nn, and agglomerative-nn learning algorithms.
test.lsp, automated tests for the three algorithms.
datapoint.lsp, definitions for the data structures used in learning.
fileio.lsp, code to snarf up text files.

Entire project

You can grab all of the code, perl scripts for munging data into lisp-readable format, datasets, and results in a gzip’ed tarfile:

learning.tar.gz (645K)