# Introduction

This document presents the results from my Topics in Machine Learning final project at Brown University, written in December 1996. The project was a “bakeoff” for machine learning algorithms that learned classification problems (categories instead of continuous functions). I discuss my implementation and results for Nearest-Neighbor and k-Nearest-Neighbor, two well-known machine learning algorithms, on three classification datasets. In addition, I introduce and discuss the Agglomerative-Nearest-Neighbor algorithm, a variation of nearest-neighbor that clusters points together.

## Datasets

Three classification datasets from the UC Irvine Machine Learning Repository [3] were chosen for testing:

• Wisconsin Breast Cancer from Dr. William H. Wolberg at the University of Wisconsin Hospitals. Records in the dataset represent the results of breast cytology tests and a diagnosis of benign or malignant. Attributes include clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. All values were integers in the range of 1-10.
• BUPA Liver Disorders from Richard S. Forsyth at BUPA Medical Research Ltd. Each record in the dataset constitutes the record of a single male individual. The six attributes in each record are results from blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption, as well as the number of drinks per day.
• Wine Recognition Data from Forina, M. et al at the Institute of Pharmaceutical and Food Analysis and Technologies. Each record represents results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

# Nearest-Neighbor

The nearest-neighbor algorithm is extremely simple to implement and leaves itself open to a wide variety of variations. In brief, the training portion of nearest-neighbor does little more than store the data points presented to it. When asked to make a prediction about an unknown point, the nearest-neighbor classifier finds the closest (according to some distance metric) training-point to the unknown point and predicts the category of that training-point.

Put more formally, given a collection of points S and a query point q, what is the point x closest to q in S? The nearest-neighbor classifier predicts that the category of q is the same as the category of x.

Although nearest-neighbor can be efficiently implemented with a Vornoi diagram, I took the brute force approach. This method computes the distance from q to every point xi in S.

## Distance Metric

I chose to use plain Euclidean distance for the distance metric in all three algorithms. The distance from point x to q (where both x and q are d-dimensional vectors) is:

d(xq) = sqrt((x – q)T(x – q))

# K-Nearest-Neighbor

The k-nearest-neighbor algorithm is a variation of nearest-neighbor. Instead of looking at only the point x that is closest to q, we instead look at the k points in S = {x1x2, … , xk} that are closest to q. Since we know the categories of each of the k nearest points, we find the majority category in this subset. If it exists, the query point q is predicted to be in the that category. If no majority exists, the result is considered inconclusive.

K-nearest-neighbor is more general than nearest-neighbor. Put another way, nearest-neighbor is a special case of k-nearest-neighbor, where k = 1. For the tests in this project, I chose k = 3, k = 5, andk = 7.

# Agglomerative-Nearest-Neighbor

The agglomerative-nearest-neighbor algorithm is a local learning algorithm for classification. The idea is to capture the essence of nearest-neighbor without having to store all of the points. As the classifier sees new training points, it clusters points points together that predict the same category. The approach is named agglomerative-nearest-neighbor because points are put together in clusters as the learning process goes on. Once the clusters are formed, the prediction for unknown points (discussed below) is very similar to nearest neighbor.

## Training

A cluster xi = (cirigi) where ci is the center of a hypersphere with radius ri, predicting category gi. During the training stage, a new cluster (prnewg) is formed for each training data (pg), where rnewis a default radius for new clusters. For my tests, I arbitrarily chose rnew = 1.0.

Once initialized, the new cluster is then compared to every cluster in the system. Two clusters that are close together (within some epsilon of each other according to the distance metric d) and classify the same category are combined to build a new aggregate cluster (a hypersphere of a larger radius). If the new cluster is not close to any other clusters, it remains by itself. For my tests, I chose epsilon = 2.15.

When two clusters are merged together, the center and the radius are adjusted to be the average of the two clusters. If the center of cluster xa is inside another cluster xb, the inner cluster xa is removed from the system and the outer cluster xb remains unchanged.

## Pre-processing and normalization

Since the Euclidean distance function treats every dimension equally, it may be necessary to normalize the data. By analyzing all of the training data in a pre-processing stage, we can determine the range of every attribute and transform the entire dataset appropriately [1]. Normalizing every point involves scaling every attribute between 1 and n. I chose n = 10.0 to facilitate merging of clusters. Forepsilon = 2.15, n << 10.0 would result in overlapping of conflicting classifiers (discussed below) and n >> 10.0 would result in degeneration to nearest-neighbor (also discussed below).

For some of the test datasets, normalization greatly decreased the misclassification rate and for others it actually contributed to more misclassified points. The wine recognition dataset, for example, had a wide variation among the ranges of its thirteen attributes. One attribute had values in the range of 0.1 – 1.7, while another ranged from 278 – 1403. Without pre-processing, the large-valued attribute dominated all distance comparisons, so performance was poor.

Normalization of the bupa liver disorders dataset, on the other hand, increased the misclassification rate. In this case, one or more attributes with large values probably contributed more to the classification of the data than the other attributes. By scaling all attributes equally, the relative importance of these attributes was diminished and other attributes were given equal weight in the clustering.

## Advantages: space-efficiency

The goal of agglomerative-nearest-neighbor was to take advantage of the redundancy of data and save memory over nearest-neighbor. This goal was achieved with varying success depending on the dataset.

For the breast-cancer-wisconsin dataset, agglomerative-nn stored an average of only 45% of the points needed by nearest-neighbor and k-nearest-neighbor. For the normalized bupa and wine datasets, agglomerative-nn stored an average of 30% and 92%, respectively, of the points needed by nn and k-nn. The lesser memory savings for the normalized wine dataset are likely due to the compactness of that data; there was little redundancy to be removed by clustering data points.

Space efficiency suffered on the datasets that were not pre-processed. For bupa, 99% of the data was needed, and for wine 100% of the data was needed. As discussed below, agglomerative-nn degenerated into nearest-neighbor in these cases because the epsilon was set too high (clusters never merged). Looking at the error rates for these sets shows that agglomerative-nn behaved quite similarly to nearest-neighbor.

## Problems with the approach

### Space-efficiency and accuracy tradeoff

Since the agglomerative-nn algorithm throws away some data, performance cannot be any better than nearest-neighbor. In other words, the algorithm trades some accuracy for space-efficiency. This effect is most notable in the normalized bupa dataset: as noted above, agglomerative-nn used only 30% of the data-points needed by nearest-neighbor, but misclassified over 10% more query points. See theresults section for specifics.

### Overlap of clusters

In the current implementation, clusters occasionally grow even when doing so overlaps with nearby clusters that classify differently. Suppose two nearby clusters xm and xn that classify into categories Cmand Cn. Clearly agglomerative-nn wouldn’t merge them. However, what happens when we introduce new training data xnew for category Cm that is right between xm and xn and is within the epsilon of both? Since xm predicts the same category as the new data, the radius of xm‘s hypersphere grows a little to accommodate the new data. Eventually, with enough xnew‘s of this type, either xm or xn could grow so that it overlaps the other nearby cluster. Then, query points in the area of the overlap could easily be misclassified.

One solution to the overlap problem might be to avoid merging clusters that are close to other clusters that classify differently. That is, if xnew causes xm to grow, check to see if xm overlaps xn. If so, leavexnew as a separate cluster. This avoids overlap and blurring of the area of expertise.

### Degeneratation

Another interesting problem is that agglomerative-nn degenerates into nearest-neighbor when all of the training points are far away from each other. That is, if no two points are within some epsilon of each other, they stay as separate clusters instead of being merged together. For every dataset there’s probably an ideal epsilon, but if it is chosen to be too large, clusters will never merge and we’re left with nearest-neighbor.

If epsilon is chosen too low, then every point of a given class is merged with every other point of that class; the system degenerates into c clusters where c is the number of categories in the dataset. Even for a linearly separable dataset, overlap of clusters is bound to occur because clusters are hyperspheres, not ellipsoids or rectangular prisms.

# Test results summary

Eight training and test runs were performed for each algorithm using a randomly chosen 90% of the data for training and 10% for testing. With the bupa and wine datasets, tests were run with both the plain data as it appeared in the dataset, as well as normalized.

The average misclassification rates for several datasets on each of the three algorithms are reported below. As predicted, the agglomerative-nn approach had a higher misclassification rate than nearest-neighbor for all datasets.

• breast-cancer-wisconsin, breast cancer data
• nearest-neighbor: 3.13%
• 3-nearest-neighbor: 2.02%
• 5-nearest-neighbor: 1.84%
• 7-nearest-neighbor: 1.84%
• agglomerative-nearest-neighbor: 5.7%
• bupa, liver disorders
• nearest-neighbor: unweighted 40.81%, normalized 37.5%
• 3-nearest-neighbor: unweighted 37.5%, normalized 39.34%
• 5-nearest-neighbor: unweighted 39.34%, normalized 38.24%
• 7-nearest-neighbor: unweighted 35.29%, normalized 40.07%
• agglomerative-nearest-neighbor: unweighted 38.6%, normalized 48.53%
• wine, wine recognition
• nearest-neighbor: unweighted 24.26%, normalized 2.21%
• 3-nearest-neighbor: unweighted 27.21%, normalized 2.94%
• 5-nearest-neighbor: unweighted 24.26%, normalized 1.47%
• 7-nearest-neighbor: unweighted 24.26%, normalized 1.47%
• agglomerative-nearest-neighbor: unweighted 33.82%, normalized 2.94%

# Conclusions

The agglomerative-nearest-neighbor algorithm worked quite well with very redundant datasets. Although the average misclassification rates were higher with agglomerative-nn than nearest-neighbor for the breast cancer and wine recognition datasets, the memory savings were quite noticable. Other redundant datasets would probably see similar improvements.

Unfortunately, agglomerative-nn is substantially slower than nearest-neighbor in the training stage, since nearest-neighbor merely stores points during training. Since agglomerative-nn compares the new cluster to every other cluster in the system, worst case time-complexity is O(N2). For large datasets (more than 1000 records), agglomerative-nn is too slow to see real-time results. With nonredunant datasets, an efficient nearest-neighbor implementation would be preferred over agglomerative-nn for both speed and accuracy.

Choice of parameters is an area that needs more exploration. Although the choice for epsilon = 2.15 was a function of the values used for rnew and the normalization factor (1.0 and 10.0, respectively), these values were chosen somewhat arbitrarily. Varying rnew and the normalization factor with the number of attributes in a dataset might lead to better accuracy because as dimensionality grows, a radius of 1.0 becomes less and less significant.

Some more careful weighting of data in the pre-processing stage could also lead to better accuracy in all three algorithms. Heavily weighting those attributes that more greatly contribute to the classification of points would skew the distances more in the dimension that matters most.

# References

1. Bishop, C. 1995. Neural Networks for Pattern Recognition, p. 6-7. Oxford: Clarendon Press.
2. Schaal, S., and Atkeson, C. 1996. From Isolation to Cooperation: An Alternative View of a System of Experts. In: Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., eds., Advances in Neural Information Processing Systems 8, pp. 605-11. Cambridge, Massachusetts: MIT Press. ftp://ftp.cc.gatech.edu/pub/people/sschaal/schaal-NIPS95.html.
3. Merz, C.J., & Murphy, P.M. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

# Acknowledgements

Thanks to Prof. Leslie Pack Kaelbling for her suggestions on the paper and guidance in the machine learning seminar. Thanks also to Fred Piccioto and Jason Lango for algorithmic suggestions, and to Mike Cafarella for coining the name “agglomerative-nearest-neighbor”.

# Source code (in Common Lisp)

• learning.lsp, the nearest-neighbor, k-nn, and agglomerative-nn learning algorithms.
• test.lsp, automated tests for the three algorithms.
• datapoint.lsp, definitions for the data structures used in learning.
• fileio.lsp, code to snarf up text files.

## Entire project

You can grab all of the code, perl scripts for munging data into lisp-readable format, datasets, and results in a gzip’ed tarfile:

## wine results (Nearest-Neighbor Machine Learning Bakeoff)

Wine Recognition Data from Forina, M. et al at the Institute of Pharmaceutical and Food Analysis and Technologies. Each record represents results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

# agglomerative-nearest-neighbor

## bupa results (Nearest-Neighbor Machine Learning Bakeoff)

BUPA Liver Disorders from Richard S. Forsyth at BUPA Medical Research Ltd. Each record in the dataset constitutes the record of a single male individual. The six attributes in each record are results from blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption, as well as the number of drinks per day.

# agglomerative-nearest-neighbor

## breast-cancer-wisconsin results (Nearest-Neighbor Machine Learning Bakeoff)

Wisconsin Breast Cancer from Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison. Records in the dataset represent the results of breast cytology tests and a diagnosis of benign or malignant. Attributes include clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. All values were integers in the range of 1-10.

# agglomerative-nearest-neighbor

## Wine recognition dataset (Nearest-Neighbor Machine Learning Bakeoff)

Brief information from the UC Irvine Machine Learning Repository:

• Donated by Stefan Aeberhard
• Using chemical analysis determine the origin of wines
• 13 attributes (all continuous), 3 classes, no missing values
• 178 instances
• Ftp Access
1. Title of Database: Wine recognition data
2. Sources:
(a) Forina, M. et al, PARVUS - An Extendible Package for Data
Exploration, Classification and Correlation. Institute of Pharmaceutical
and Food Analysis and Technologies, Via Brigata Salerno,
16147 Genoa, Italy.

(b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au
(c) July 1991
3. Past Usage:

(1)
S. Aeberhard, D. Coomans and O. de Vel,
Comparison of Classifiers in High Dimensional Settings,
Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Technometrics).

The data was used with many others for comparing various
classifiers. The classes are separable, though only RDA
has achieved 100% correct classification.
(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
(All results using the leave-one-out technique)

In a classification context, this is a well posed problem
with "well behaved" class structures. A good data set
for first testing of a new classifier, but not very
challenging.

(2)
S. Aeberhard, D. Coomans and O. de Vel,
"THE CLASSIFICATION PERFORMANCE OF RDA"
Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Journal of Chemometrics).

Here, the data was used to illustrate the superior performance of
the use of a new appreciation function with RDA.

4. Relevant Information:

-- These data are the results of a chemical analysis of
wines grown in the same region in Italy but derived from three
different cultivars.
The analysis determined the quantities of 13 constituents
found in each of the three types of wines.

-- I think that the initial data set had around 30 variables, but
for some reason I only have the 13 dimensional version.
I had a list of what the 30 or so variables were, but a.)
I lost it, and b.), I would not know which 13 variables
are included in the set.

5. Number of Instances

class 1 59
class 2 71
class 3 48

6. Number of Attributes

13

7. For Each Attribute:

All attributes are continuous

No statistics available, but suggest to standardise
variables for certain uses (e.g. for us with classifiers
which are NOT scale invariant)

NOTE: 1st attribute is class identifier (1-3)

8. Missing Attribute Values:

None

9. Class Distribution: number of instances per class

class 1 59
class 2 71
class 3 48

Further information from one of the authors of the wine database:

Date: Mon, 9 Mar 1998 15:44:07 GMT
To: michael@radwin.org
From: riclea@crazy.anchem.unige.it (Riccardo Leardi)
Subject: wines

Dear Michael,
I'm one of the authors of PARVUS.
I saw your site and the reference to the data set wines.
Therefore, I think it could be interesting for you to know the names of the
variables (from the 27 original ones):

1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10)Color intensity
11)Hue
12)OD280/OD315 of diluted wines
13)Proline

If you are interested, I can send you by snail mail the original paper
(please send me your address).

Best regards
Riccardo Leardi

## BUPA liver disorders dataset (Nearest-Neighbor Machine Learning Bakeoff)

Brief information from the UC Irvine Machine Learning Repository:

• BUPA Medical Research Ltd. database donated by Richard S. Forsyth
• 7 numeric-valued attributes
• 345 instances (male patients)
• Includes cost data (donated by Peter Turney)
• Ftp Access
1. Title: BUPA liver disorders

2. Source information:
-- Creators: BUPA Medical Research Ltd.
-- Donor: Richard S. Forsyth
8 Grosvenor Avenue
Mapperley Park
Nottingham NG3 5DX
0602-621676
-- Date: 5/15/1990

3. Past usage:
-- None known other than what is shown in the PC/BEAGLE User's Guide
(written by Richard S. Forsyth).

4. Relevant information:
-- The first 5 variables are all blood tests which are thought
to be sensitive to liver disorders that might arise from
excessive alcohol consumption.  Each line in the bupa.data file
constitutes the record of a single male individual.
-- It appears that drinks>5 is some sort of a selector on this database.
See the PC/BEAGLE User's Guide for more information.

5. Number of instances: 345

6. Number of attributes: 7 overall

7. Attribute information:
1. mcv	mean corpuscular volume
2. alkphos	alkaline phosphotase
3. sgpt	alamine aminotransferase
4. sgot 	aspartate aminotransferase
5. gammagt	gamma-glutamyl transpeptidase
6. drinks	number of half-pint equivalents of alcoholic beverages
drunk per day
7. selector  field used to split data into two sets

8. Missing values: none

## beast-cancer-wisconsin dataset (Nearest-Neighbor Machine Learning Bakeoff)

Brief information from the UC Irvine Machine Learning Repository:

• Donated by Olvi Mangasarian
• Located in breast-cancer-wisconsin sub-directory, filenames root: breast-cancer-wisconsin
• Currently contains 699 instances
• 2 classes (malignant and benign)
• 9 integer-valued attributes
• Ftp Access

Citation Request:
This breast cancer databases was obtained from the University of Wisconsin
Hospitals, Madison from Dr. William H. Wolberg.  If you publish results
when using this database, then please include this information in your
acknowledgements.  Also, please cite one or more of:

1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear
programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of
pattern separation for medical diagnosis applied to breast cytology",
Proceedings of the National Academy of Sciences, U.S.A., Volume 87,
December 1990, pp 9193-9196.

3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition
via linear programming: Theory and application to medical diagnosis",
in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming
discrimination of two linearly inseparable sets", Optimization Methods
and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

1. Title: Wisconsin Breast Cancer Database (January 8, 1991)

2. Sources:
-- Dr. WIlliam H. Wolberg (physician)
University of Wisconsin Hospitals
Madison, Wisconsin
USA
-- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
Received by David W. Aha (aha@cs.jhu.edu)
-- Date: 15 July 1992

3. Past Usage:

Attributes 2 through 10 have been used to represent instances.
Each instance has one of 2 possible classes: benign or malignant.

1. Wolberg,~W.~H., \& Mangasarian,~O.~L. (1990). Multisurface method of
pattern separation for medical diagnosis applied to breast cytology. In
{\it Proceedings of the National Academy of Sciences}, {\it 87},
9193--9196.
-- Size of data set: only 369 instances (at that point in time)
-- Collected classification results: 1 trial only
-- Two pairs of parallel hyperplanes were found to be consistent with
50% of the data
-- Accuracy on remaining 50% of dataset: 93.5%
-- Three pairs of parallel hyperplanes were found to be consistent with
67% of data
-- Accuracy on remaining 33% of dataset: 95.9%

2. Zhang,~J. (1992). Selecting typical instances in instance-based
learning.  In {\it Proceedings of the Ninth International Machine
Learning Conference} (pp. 470--479).  Aberdeen, Scotland: Morgan
Kaufmann.
-- Size of data set: only 369 instances (at that point in time)
-- Applied 4 instance-based learning algorithms
-- Collected classification results averaged over 10 trials
-- Best accuracy result:
-- 1-nearest neighbor: 93.7%
-- trained on 200 instances, tested on the other 169
-- Also of interest:
-- Using only typical instances: 92.2% (storing only 23.1 instances)
-- trained on 200 instances, tested on the other 169

4. Relevant Information:

Samples arrive periodically as Dr. Wolberg reports his clinical cases.
The database therefore reflects this chronological grouping of the data.
This grouping information appears immediately below, having been removed
from the data itself:

Group 1: 367 instances (January 1989)
Group 2:  70 instances (October 1989)
Group 3:  31 instances (February 1990)
Group 4:  17 instances (April 1990)
Group 5:  48 instances (August 1990)
Group 6:  49 instances (Updated January 1991)
Group 7:  31 instances (June 1991)
Group 8:  86 instances (November 1991)
-----------------------------------------
Total:   699 points (as of the donated datbase on 15 July 1992)

Note that the results summarized above in Past Usage refer to a dataset
of size 369, while Group 1 has only 367 instances.  This is because it
originally contained 369 instances; 2 were removed.  The following
statements summarizes changes to the original Group 1's set of data:

#####  Group 1 : 367 points: 200B 167M (January 1989)
#####  Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
#####  Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
#####                  : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
#####                  : Changed 0 to 1 in field 6 of sample 1219406
#####                  : Changed 0 to 1 in field 8 of following sample:
#####                  : 1182404,2,3,1,1,1,2,0,1,1,1

5. Number of Instances: 699 (as of 15 July 1992)

6. Number of Attributes: 10 plus the class attribute

7. Attribute Information: (class attribute has been moved to last column)

#  Attribute                     Domain
-- -----------------------------------------
1. Sample code number            id number
2. Clump Thickness               1 - 10
3. Uniformity of Cell Size       1 - 10
4. Uniformity of Cell Shape      1 - 10
5. Marginal Adhesion             1 - 10
6. Single Epithelial Cell Size   1 - 10
7. Bare Nuclei                   1 - 10
8. Bland Chromatin               1 - 10
9. Normal Nucleoli               1 - 10
10. Mitoses                       1 - 10
11. Class:                        (2 for benign, 4 for malignant)

8. Missing attribute values: 16

There are 16 instances in Groups 1 to 6 that contain a single missing
(i.e., unavailable) attribute value, now denoted by "?".

9. Class distribution:

Benign: 458 (65.5%)
Malignant: 241 (34.5%)