Nearest-Neighbor Machine Learning Bakeoff
This document presents the results from my Topics in
Machine Learning final project at Brown University, written in April
1997. The project was a "bakeoff" for machine learning algorithms that
learned classification problems (categories instead of continuous
functions). I discuss my implementation and results for Nearest-Neighbor and k-Nearest-Neighbor, two well-known machine learning
algorithms, on three classification datasets. In addition, I introduce
and discuss the Agglomerative-Nearest-Neighbor algorithm, a
variation of nearest-neighbor that clusters points together.
Three classification datasets from the UC Irvine Machine Learning
Repository [3] were chosen for testing:
- Wisconsin Breast
Cancer from Dr. William H. Wolberg at the University
of Wisconsin Hospitals. Records in the
dataset represent the results of breast cytology tests and a
diagnosis of benign or malignant. Attributes include
clump thickness, uniformity of cell size, uniformity of cell
shape, marginal adhesion, single epithelial cell size, bare
nuclei, bland chromatin, normal nucleoli, and mitoses.
All values were integers in the range of 1-10.
- BUPA Liver Disorders
from Richard S. Forsyth at BUPA Medical Research Ltd. Each record
in the dataset constitutes the record of a single male
individual. The six attributes in each record are results from
blood tests which are thought to be sensitive to liver disorders
that might arise from excessive alcohol consumption, as well as
the number of drinks per day.
- Wine Recognition Data
from Forina, M. et al at the Institute of Pharmaceutical and Food
Analysis and Technologies. Each record represents results of a
chemical analysis of wines grown in the same region in Italy but
derived from three different cultivars. The analysis determined
the quantities of 13 constituents found in each of the three types
of wine.
The nearest-neighbor algorithm is extremely simple to implement and
leaves itself open to a wide variety of variations. In brief, the
training portion of nearest-neighbor does little more than store the
data points presented to it. When asked to make a prediction about an
unknown point, the nearest-neighbor classifier finds the closest
(according to some distance metric) training-point to the unknown point
and predicts the category of that training-point.
Put more formally, given a collection of points S and a query
point q, what is the point x closest
to q in S? The nearest-neighbor classifier
predicts that the category of q is the same as the
category of x.
Although nearest-neighbor can be efficiently implemented with a Vornoi
diagram, I took the brute force approach. This method computes the
distance from q to every point
xi in S.
I chose to use plain Euclidean distance for the distance metric in all
three algorithms. The distance from point x to
q (where both x and q
are d-dimensional vectors) is:
d(x, q) = sqrt((x - q)T(x - q))
The k-nearest-neighbor algorithm is a variation of nearest-neighbor.
Instead of looking at only the point x that is closest
to q, we instead look at the k points in
S = {x1, x2, ... , xk}
that are closest to q. Since we know the categories of
each of the k nearest points, we find the majority category in
this subset. If it exists, the query point q is
predicted to be in the that category. If no majority exists, the result
is considered inconclusive.
K-nearest-neighbor is more general than nearest-neighbor. Put another
way, nearest-neighbor is a special case of k-nearest-neighbor, where
k = 1. For the tests in this project, I chose
k = 3, k = 5, and
k = 7.
The agglomerative-nearest-neighbor algorithm is a local learning
algorithm for classification. The idea is to capture the essence of
nearest-neighbor without having to store all of the points. As the
classifier sees new training points, it clusters points points together
that predict the same category. The approach is named
agglomerative-nearest-neighbor because points are put together in
clusters as the learning process goes on. Once the clusters are formed,
the prediction for unknown points (discussed below) is very similar to
nearest neighbor.
A cluster
xi = (ci, ri, gi)
where ci is the center of a
hypersphere with radius ri, predicting category
gi. During the training stage, a new cluster
(p, rnew, g) is
formed for each training data (p, g),
where rnew is a default radius for new clusters.
For my tests, I arbitrarily chose
rnew = 1.0.
Once initialized, the new cluster is then compared to every cluster in
the system. Two clusters that are close together (within some
epsilon of each other according to the distance metric d) and classify the same
category are combined to build a new aggregate cluster (a hypersphere of
a larger radius). If the new cluster is not close to any other
clusters, it remains by itself. For my tests, I chose
epsilon = 2.15.
When two clusters are merged together, the center and the radius are
adjusted to be the average of the two clusters. If the center of
cluster xa is inside another
cluster xb, the inner cluster
xa is removed from the system and
the outer cluster xb remains
unchanged.
Since the Euclidean distance function treats every dimension equally, it
may be necessary to normalize the data. By analyzing all of the
training data in a pre-processing stage, we can determine the range of
every attribute and transform the entire dataset appropriately [1]. Normalizing every point involves scaling every
attribute between 1 and n. I chose n = 10.0
to facilitate merging of clusters. For
epsilon = 2.15, n << 10.0
would result in overlapping of conflicting classifiers (discussed below) and
n >> 10.0 would result in degeneration to
nearest-neighbor (also discussed below).
For some of the test datasets, normalization greatly decreased the
misclassification rate and for others it actually contributed to more
misclassified points. The wine recognition dataset, for example, had a
wide variation among the ranges of its thirteen attributes. One
attribute had values in the range of 0.1 - 1.7, while another
ranged from 278 - 1403. Without pre-processing, the
large-valued attribute dominated all distance comparisons, so
performance was poor.
Normalization of the bupa liver disorders dataset, on the other hand,
increased the misclassification rate. In this case, one or more
attributes with large values probably contributed more to the
classification of the data than the other attributes. By scaling all
attributes equally, the relative importance of these attributes was
diminished and other attributes were given equal weight in the
clustering.
The goal of agglomerative-nearest-neighbor was to take advantage of the
redundancy of data and save memory over nearest-neighbor. This goal was
achieved with varying success depending on the dataset.
For the breast-cancer-wisconsin dataset, agglomerative-nn stored an
average of only 45% of the points needed by nearest-neighbor and
k-nearest-neighbor. For the normalized bupa and wine datasets,
agglomerative-nn stored an average of 30% and 92%, respectively, of the
points needed by nn and k-nn. The lesser memory savings for the
normalized wine dataset are likely due to the compactness of that data;
there was little redundancy to be removed by clustering data points.
Space efficiency suffered on the datasets that were not pre-processed.
For bupa, 99% of the data was needed, and for wine 100% of the data was
needed. As discussed below,
agglomerative-nn degenerated into nearest-neighbor in these cases
because the epsilon was set too high (clusters never merged).
Looking at the error rates for these sets shows that agglomerative-nn
behaved quite similarly to nearest-neighbor.
Since the agglomerative-nn algorithm throws away some data, performance
cannot be any better than nearest-neighbor. In other words, the
algorithm trades some accuracy for space-efficiency. This effect is
most notable in the normalized bupa dataset: as noted above,
agglomerative-nn used only 30% of the data-points needed by
nearest-neighbor, but misclassified over 10% more query points. See the
results section for specifics.
In the current implementation, clusters occasionally grow even when
doing so overlaps with nearby clusters that classify differently.
Suppose two nearby clusters xm and
xn that classify into categories
Cm and Cn. Clearly
agglomerative-nn wouldn't merge them. However, what happens when we
introduce new training data xnew
for category Cm that is right between
xm and
xn and is within the
epsilon of both? Since xm
predicts the same category as the new data, the radius of
xm's hypersphere grows a little to
accommodate the new data. Eventually, with enough
xnew's of this type, either
xm or
xn could grow so that it overlaps
the other nearby cluster. Then, query points in the area of the overlap
could easily be misclassified.
One solution to the overlap problem might be to avoid merging clusters
that are close to other clusters that classify differently. That is, if
xnew causes
xm to grow, check to see if
xm overlaps
xn. If so, leave
xnew as a separate cluster. This
avoids overlap and blurring of the area of expertise.
Another interesting problem is that agglomerative-nn degenerates into
nearest-neighbor when all of the training points are far away from each
other. That is, if no two points are within some epsilon of
each other, they stay as separate clusters instead of being merged
together. For every dataset there's probably an ideal epsilon,
but if it is chosen to be too large, clusters will never merge and we're
left with nearest-neighbor.
If epsilon is chosen too low, then every point of a given class
is merged with every other point of that class; the system degenerates
into c clusters where c is the number of categories in
the dataset. Even for a linearly separable dataset, overlap of clusters
is bound to occur because clusters are hyperspheres, not ellipsoids or
rectangular prisms.
Eight training and test runs were performed for each algorithm using
a randomly chosen 90% of the data for training and 10% for testing.
With the bupa and wine datasets, tests were run with both the plain data
as it appeared in the dataset, as well as normalized.
The average misclassification rates for several datasets on each of the
three algorithms are reported below. As predicted, the agglomerative-nn
approach had a higher misclassification rate than nearest-neighbor for
all datasets.
- breast-cancer-wisconsin,
breast cancer data
- nearest-neighbor: 3.13%
- 3-nearest-neighbor: 2.02%
- 5-nearest-neighbor: 1.84%
- 7-nearest-neighbor: 1.84%
- agglomerative-nearest-neighbor: 5.7%
- bupa, liver disorders
- nearest-neighbor: unweighted 40.81%, normalized 37.5%
- 3-nearest-neighbor: unweighted 37.5%, normalized 39.34%
- 5-nearest-neighbor: unweighted 39.34%, normalized 38.24%
- 7-nearest-neighbor: unweighted 35.29%, normalized 40.07%
- agglomerative-nearest-neighbor: unweighted 38.6%, normalized 48.53%
- wine, wine recognition
- nearest-neighbor: unweighted 24.26%, normalized 2.21%
- 3-nearest-neighbor: unweighted 27.21%, normalized 2.94%
- 5-nearest-neighbor: unweighted 24.26%, normalized 1.47%
- 7-nearest-neighbor: unweighted 24.26%, normalized 1.47%
- agglomerative-nearest-neighbor: unweighted 33.82%, normalized 2.94%
The agglomerative-nearest-neighbor algorithm worked quite well with very
redundant datasets. Although the average misclassification rates were
higher with agglomerative-nn than nearest-neighbor for the breast cancer
and wine recognition datasets, the memory savings were quite noticable.
Other redundant datasets would probably see similar improvements.
Unfortunately, agglomerative-nn is substantially slower than
nearest-neighbor in the training stage, since nearest-neighbor merely
stores points during training. Since agglomerative-nn compares the new
cluster to every other cluster in the system, worst case time-complexity
is O(N2). For large datasets (more than
1000 records), agglomerative-nn is too slow to see real-time results.
With nonredunant datasets, an efficient nearest-neighbor implementation
would be preferred over agglomerative-nn for both speed and accuracy.
Choice of parameters is an area that needs more exploration. Although
the choice for epsilon = 2.15 was a function of the
values used for rnew and the normalization factor
(1.0 and 10.0, respectively), these values were chosen somewhat
arbitrarily. Varying rnew and the normalization
factor with the number of attributes in a dataset might lead to better
accuracy because as dimensionality grows, a radius of 1.0 becomes less
and less significant.
Some more careful weighting of data in the pre-processing stage could
also lead to better accuracy in all three algorithms. Heavily weighting
those attributes that more greatly contribute to the classification of
points would skew the distances more in the dimension that matters most.
- Bishop, C. 1995. Neural Networks for Pattern
Recognition, p. 6-7. Oxford: Clarendon Press.
- Schaal, S., and Atkeson, C. 1996. From
Isolation to Cooperation: An Alternative View of a System of
Experts. In: Touretzky, D. S., Mozer, M. C., and Hasselmo,
M. E., eds., Advances in Neural Information Processing Systems
8, pp. 605-11. Cambridge, Massachusetts: MIT Press. ftp://ftp.cc.gatech.edu/pub/people/sschaal/schaal-NIPS95.html.
- Merz, C.J., & Murphy, P.M. (1998). UCI
Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html].
Irvine, CA: University of California, Department of Information
and Computer Science.
Thanks to Prof. Leslie
Pack Kaelbling for her suggestions on the paper and guidance in the
machine learning seminar. Thanks also to Fred Piccioto and Jason Lango for
algorithmic suggestions, and to Mike Cafarella for
coining the name "agglomerative-nearest-neighbor".
- learning.lsp, the
nearest-neighbor, k-nn, and agglomerative-nn learning algorithms.
- test.lsp, automated tests for the
three algorithms.
- datapoint.lsp, definitions
for the data structures used in learning.
- fileio.lsp, code to snarf up text
files.
You can grab all of the code, perl scripts for munging data into
lisp-readable format, datasets, and results in a gzip'ed tarfile:
Michael J. Radwin
Last modified: Mon Nov 4 19:15:47 EST 2002