WhatsForDinner
was a perl script that printed out today's University Food Services lunch and dinner menus. It made use of lynx and the BDH's daily menu.
This document presents the results from my Topics in Machine Learning final project at Brown University, written in December 1996. The project was a “bakeoff” for machine learning algorithms that learned classification problems (categories instead of continuous functions). I discuss my implementation and results for Nearest-Neighbor and k-Nearest-Neighbor, two well-known machine learning algorithms, on three classification datasets. In addition, I introduce and discuss the Agglomerative-Nearest-Neighbor algorithm, a variation of nearest-neighbor that clusters points together.
Three classification datasets from the UC Irvine Machine Learning Repository [3] were chosen for testing: the Wine Recognition data, the BUPA Liver Disorders data, and the Wisconsin Breast Cancer data, each described in the datasets section below.
The nearest-neighbor algorithm is extremely simple to implement and lends itself to a wide variety of variations. In brief, the training portion of nearest-neighbor does little more than store the data points presented to it. When asked to make a prediction about an unknown point, the nearest-neighbor classifier finds the training point closest to the unknown point (according to some distance metric) and predicts that training point's category.
Put more formally, given a collection of points S and a query point q, what is the point x closest to q in S? The nearest-neighbor classifier predicts that the category of q is the same as the category of x.
Although nearest-neighbor can be efficiently implemented with a Voronoi diagram, I took the brute-force approach. This method computes the distance from q to every point xi in S.
I chose to use plain Euclidean distance for the distance metric in all three algorithms. The distance from point x to q (where both x and q are d-dimensional vectors) is:
d(x, q) = sqrt((x - q)^T (x - q))
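For illustration, here is a minimal brute-force sketch in Python (not the code I wrote for the project; the function names are mine):

    import math

    def euclidean(x, q):
        """d(x, q) = sqrt((x - q)^T (x - q)) for two equal-length vectors."""
        return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, q)))

    def nearest_neighbor_predict(training, q):
        """training is a list of (point, category) pairs; q is the query point.
        Predict the category of the training point closest to q."""
        _, category = min(training, key=lambda pc: euclidean(pc[0], q))
        return category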
The k-nearest-neighbor algorithm is a variation of nearest-neighbor. Instead of looking at only the point x that is closest to q, we look at the k points {x1, x2, …, xk} in S that are closest to q. Since we know the categories of each of the k nearest points, we find the majority category in this subset. If it exists, the query point q is predicted to be in that category. If no majority exists, the result is considered inconclusive.
K-nearest-neighbor is more general than nearest-neighbor. Put another way, nearest-neighbor is a special case of k-nearest-neighbor, where k = 1. For the tests in this project, I chose k = 3, k = 5, and k = 7.
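Continuing the sketch above, a hedged Python version of this rule, reading "majority" as a strict majority (more than k/2 votes) and reusing the euclidean() helper defined earlier:

    from collections import Counter

    def k_nearest_neighbor_predict(training, q, k=3):
        """Predict the majority category among the k training points closest
        to q; return None ("inconclusive") if no strict majority exists."""
        neighbors = sorted(training, key=lambda pc: euclidean(pc[0], q))[:k]
        votes = Counter(category for _, category in neighbors)
        category, count = votes.most_common(1)[0]
        return category if count > k / 2 else None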
The agglomerative-nearest-neighbor algorithm is a local learning algorithm for classification. The idea is to capture the essence of nearest-neighbor without having to store all of the points. As the classifier sees new training points, it clusters together points that predict the same category. The approach is named agglomerative-nearest-neighbor because points are put together in clusters as the learning process goes on. Once the clusters are formed, the prediction for unknown points (discussed below) is very similar to nearest-neighbor.
A cluster is a triple xi = (ci, ri, gi), where ci is the center of a hypersphere with radius ri, predicting category gi. During the training stage, a new cluster (p, rnew, g) is formed for each training point (p, g), where rnew is a default radius for new clusters. For my tests, I arbitrarily chose rnew = 1.0.
Once initialized, the new cluster is then compared to every cluster in the system. Two clusters that are close together (within some epsilon of each other according to the distance metric d) and classify the same category are combined to build a new aggregate cluster (a hypersphere of a larger radius). If the new cluster is not close to any other clusters, it remains by itself. For my tests, I chose epsilon = 2.15.
When two clusters are merged together, the center and the radius are adjusted to be the average of the two clusters. If the center of cluster xa is inside another cluster xb, the inner cluster xa is removed from the system and the outer cluster xb remains unchanged.
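A rough Python sketch of the training and prediction steps as I read the description above (this is not the project's original code; in particular, merging each new point with at most one existing cluster is my own guess, and euclidean() comes from the nearest-neighbor sketch earlier):

    R_NEW = 1.0      # default radius for a brand-new cluster
    EPSILON = 2.15   # merge threshold on the distance between cluster centers

    def train_agglomerative_nn(training):
        """training is a list of (point, category) pairs; returns a list of
        clusters, each a (center, radius, category) triple."""
        clusters = []
        for p, g in training:
            center, radius = list(p), R_NEW
            merged = False
            for i, (c, r, cat) in enumerate(clusters):
                if cat != g:
                    continue
                d = euclidean(c, center)
                if d <= r:
                    # New cluster's center lies inside an existing cluster:
                    # drop the inner one, keep the outer one unchanged.
                    merged = True
                    break
                if d <= EPSILON:
                    # Close enough and same category: combine the two clusters
                    # by averaging their centers and their radii.
                    clusters[i] = ([(a + b) / 2 for a, b in zip(c, center)],
                                   (r + radius) / 2, g)
                    merged = True
                    break
            if not merged:
                clusters.append((center, radius, g))
        return clusters

    def agglomerative_nn_predict(clusters, q):
        """Prediction is nearest-neighbor over the stored cluster centers."""
        _, _, category = min(clusters, key=lambda crg: euclidean(crg[0], q))
        return category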
Since the Euclidean distance function treats every dimension equally, it may be necessary to normalize the data. By analyzing all of the training data in a pre-processing stage, we can determine the range of every attribute and transform the entire dataset appropriately [1]. Normalizing every point involves scaling every attribute between 1 and n. I chose n = 10.0 to facilitate merging of clusters. For epsilon = 2.15, n << 10.0 would result in overlapping of conflicting classifiers (discussed below) and n >> 10.0 would result in degeneration to nearest-neighbor (also discussed below).
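A sketch of that pre-processing step, rescaling each attribute into [1, n] using the per-attribute ranges seen in the data (my reading of the description above; n = 10.0 as in the paper):

    def normalize(points, n=10.0):
        """Rescale each attribute of every point into the range [1, n]."""
        lo = [min(col) for col in zip(*points)]
        hi = [max(col) for col in zip(*points)]
        def scale(v, a, b):
            return 1.0 if b == a else 1.0 + (n - 1.0) * (v - a) / (b - a)
        return [[scale(v, a, b) for v, a, b in zip(p, lo, hi)] for p in points]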
For some of the test datasets, normalization greatly decreased the misclassification rate and for others it actually contributed to more misclassified points. The wine recognition dataset, for example, had a wide variation among the ranges of its thirteen attributes. One attribute had values in the range of 0.1 – 1.7, while another ranged from 278 – 1403. Without pre-processing, the large-valued attribute dominated all distance comparisons, so performance was poor.
Normalization of the bupa liver disorders dataset, on the other hand, increased the misclassification rate. In this case, one or more attributes with large values probably contributed more to the classification of the data than the other attributes. By scaling all attributes equally, the relative importance of these attributes was diminished and other attributes were given equal weight in the clustering.
The goal of agglomerative-nearest-neighbor was to take advantage of the redundancy of data and save memory over nearest-neighbor. This goal was achieved with varying success depending on the dataset.
For the breast-cancer-wisconsin dataset, agglomerative-nn stored an average of only 45% of the points needed by nearest-neighbor and k-nearest-neighbor. For the normalized bupa and wine datasets, agglomerative-nn stored an average of 30% and 92%, respectively, of the points needed by nn and k-nn. The lesser memory savings for the normalized wine dataset are likely due to the compactness of that data; there was little redundancy to be removed by clustering data points.
Space efficiency suffered on the datasets that were not pre-processed. For bupa, 99% of the data was needed, and for wine 100% of the data was needed. As discussed below, agglomerative-nn degenerated into nearest-neighbor in these cases because the epsilon was effectively too small relative to the large un-normalized distances (clusters never merged). Looking at the error rates for these sets shows that agglomerative-nn behaved quite similarly to nearest-neighbor.
Since the agglomerative-nn algorithm throws away some data, performance cannot be any better than nearest-neighbor. In other words, the algorithm trades some accuracy for space-efficiency. This effect is most notable in the normalized bupa dataset: as noted above, agglomerative-nn used only 30% of the data points needed by nearest-neighbor, but misclassified over 10% more query points. See the results section for specifics.
In the current implementation, clusters occasionally grow even when doing so overlaps with nearby clusters that classify differently. Suppose two nearby clusters xm and xn classify into categories Cm and Cn. Clearly agglomerative-nn wouldn't merge them. However, what happens when we introduce new training data xnew for category Cm that lies right between xm and xn and is within epsilon of both? Since xm predicts the same category as the new data, the radius of xm's hypersphere grows a little to accommodate the new data. Eventually, with enough xnew's of this type, either xm or xn could grow so that it overlaps the other nearby cluster. Then, query points in the area of the overlap could easily be misclassified.
One solution to the overlap problem might be to avoid merging clusters that are close to other clusters that classify differently. That is, if xnew causes xm to grow, check to see if xm overlaps xn. If so, leave xnew as a separate cluster. This avoids overlap and blurring of the area of expertise.
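In code, the check might look like this (a sketch of the proposed fix, reusing the helpers above; this was not actually implemented for the project):

    def would_overlap(center, radius, clusters, category):
        """True if a hypersphere (center, radius) intersects any existing
        cluster that predicts a different category."""
        return any(cat != category and euclidean(center, c) < radius + r
                   for c, r, cat in clusters)

In train_agglomerative_nn, the averaged center and radius would be computed first and the merge committed only when would_overlap() returns False.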
Another interesting problem is that agglomerative-nn degenerates into nearest-neighbor when all of the training points are far away from each other. That is, if no two points are within some epsilon of each other, they stay as separate clusters instead of being merged together. For every dataset there's probably an ideal epsilon, but if it is chosen to be too small, clusters will never merge and we're left with nearest-neighbor.
If epsilon is chosen too high, then every point of a given class is merged with every other point of that class; the system degenerates into c clusters where c is the number of categories in the dataset. Even for a linearly separable dataset, overlap of clusters is bound to occur because clusters are hyperspheres, not ellipsoids or rectangular prisms.
Eight training and test runs were performed for each algorithm using a randomly chosen 90% of the data for training and 10% for testing. With the bupa and wine datasets, tests were run with both the plain data as it appeared in the dataset, as well as normalized.
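A sketch of that evaluation protocol (the split and averaging bookkeeping is mine; predict stands for any classifier with the (training, query) signature used in the sketches above):

    import random

    def misclassification_rate(predict, train, test):
        wrong = sum(1 for p, g in test if predict(train, p) != g)
        return wrong / len(test)

    def run_trials(data, predict, trials=8, train_fraction=0.9):
        """Average misclassification rate over several random 90%/10% splits."""
        rates = []
        for _ in range(trials):
            shuffled = random.sample(data, len(data))
            cut = int(train_fraction * len(shuffled))
            rates.append(misclassification_rate(predict, shuffled[:cut], shuffled[cut:]))
        return sum(rates) / len(rates)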
The average misclassification rates for several datasets on each of the three algorithms are reported below. As predicted, the agglomerative-nn approach had a higher misclassification rate than nearest-neighbor for all datasets.
The agglomerative-nearest-neighbor algorithm worked quite well with very redundant datasets. Although the average misclassification rates were higher with agglomerative-nn than nearest-neighbor for the breast cancer and wine recognition datasets, the memory savings were quite noticeable. Other redundant datasets would probably see similar improvements.
Unfortunately, agglomerative-nn is substantially slower than nearest-neighbor in the training stage, since nearest-neighbor merely stores points during training. Since agglomerative-nn compares each new cluster to every other cluster in the system, the worst-case time complexity is O(N^2). For large datasets (more than 1000 records), agglomerative-nn is too slow to see real-time results. With non-redundant datasets, an efficient nearest-neighbor implementation would be preferred over agglomerative-nn for both speed and accuracy.
Choice of parameters is an area that needs more exploration. Although the choice for epsilon = 2.15 was a function of the values used for rnew and the normalization factor (1.0 and 10.0, respectively), these values were chosen somewhat arbitrarily. Varying rnew and the normalization factor with the number of attributes in a dataset might lead to better accuracy because as dimensionality grows, a radius of 1.0 becomes less and less significant.
Some more careful weighting of data in the pre-processing stage could also lead to better accuracy in all three algorithms. Heavily weighting those attributes that more greatly contribute to the classification of points would skew the distances more in the dimension that matters most.
Thanks to Prof. Leslie Pack Kaelbling for her suggestions on the paper and guidance in the machine learning seminar. Thanks also to Fred Piccioto and Jason Lango for algorithmic suggestions, and to Mike Cafarella for coining the name “agglomerative-nearest-neighbor”.
You can grab all of the code, perl scripts for munging data into lisp-readable format, datasets, and results in a gzip’ed tarfile:
Wine Recognition Data from Forina, M. et al at the Institute of Pharmaceutical and Food Analysis and Technologies. Each record represents results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.
BUPA Liver Disorders from Richard S. Forsyth at BUPA Medical Research Ltd. Each record in the dataset constitutes the record of a single male individual. The six attributes in each record are results from blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption, as well as the number of drinks per day.
Wisconsin Breast Cancer from Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison. Records in the dataset represent the results of breast cytology tests and a diagnosis of benign or malignant. Attributes include clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. All values were integers in the range of 1-10.
Brief information from the UC Irvine Machine Learning Repository:
1. Title of Database: Wine recognition data
2. Sources:
   (a) Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
   (b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au
   (c) July 1991
3. Past Usage:
   (1) S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Technometrics.) The data was used with many others for comparing various classifiers. The classes are separable, though only RDA has achieved 100% correct classification (RDA: 100%, QDA: 99.4%, LDA: 98.9%, 1NN: 96.1% on z-transformed data; all results using the leave-one-out technique). In a classification context, this is a well posed problem with "well behaved" class structures. A good data set for first testing of a new classifier, but not very challenging.
   (2) S. Aeberhard, D. Coomans and O. de Vel, "The Classification Performance of RDA", Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Journal of Chemometrics.) Here, the data was used to illustrate the superior performance of a new appreciation function with RDA.
4. Relevant Information: These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The donor notes that the initial data set had around 30 variables, but only the 13-dimensional version, without a list of the original variables, is available.
5. Number of Instances: class 1: 59, class 2: 71, class 3: 48
6. Number of Attributes: 13
7. For Each Attribute: All attributes are continuous. No statistics available, but it is suggested to standardise the variables for certain uses (e.g. for use with classifiers which are NOT scale invariant). NOTE: the 1st attribute is the class identifier (1-3).
8. Missing Attribute Values: None
9. Class Distribution (number of instances per class): class 1: 59, class 2: 71, class 3: 48
Further information from one of the authors of the wine database:
Date: Mon, 9 Mar 1998 15:44:07 GMT
To: michael@radwin.org
From: riclea@crazy.anchem.unige.it (Riccardo Leardi)
Subject: wines

Dear Michael,

I'm one of the authors of PARVUS. I saw your site and the reference to the data set wines. Therefore, I think it could be interesting for you to know the names of the variables (from the 27 original ones):

1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

If you are interested, I can send you by snail mail the original paper (please send me your address).

Best regards
Riccardo Leardi
Brief information from the UC Irvine Machine Learning Repository:
1. Title: BUPA liver disorders
2. Source information:
   -- Creators: BUPA Medical Research Ltd.
   -- Donor: Richard S. Forsyth, 8 Grosvenor Avenue, Mapperley Park, Nottingham NG3 5DX, 0602-621676
   -- Date: 5/15/1990
3. Past usage:
   -- None known other than what is shown in the PC/BEAGLE User's Guide (written by Richard S. Forsyth).
4. Relevant information:
   -- The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the bupa.data file constitutes the record of a single male individual.
   -- It appears that drinks>5 is some sort of a selector on this database. See the PC/BEAGLE User's Guide for more information.
5. Number of instances: 345
6. Number of attributes: 7 overall
7. Attribute information:
   1. mcv       mean corpuscular volume
   2. alkphos   alkaline phosphotase
   3. sgpt      alamine aminotransferase
   4. sgot      aspartate aminotransferase
   5. gammagt   gamma-glutamyl transpeptidase
   6. drinks    number of half-pint equivalents of alcoholic beverages drunk per day
   7. selector  field used to split data into two sets
8. Missing values: none
Brief information from the UC Irvine Machine Learning Repository:
Citation Request: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. If you publish results when using this database, then please include this information in your acknowledgements. Also, please cite one or more of:
   1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
   2. William H. Wolberg and O. L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
   3. O. L. Mangasarian, R. Setiono, and W. H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
   4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

1. Title: Wisconsin Breast Cancer Database (January 8, 1991)
2. Sources:
   -- Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA
   -- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu), received by David W. Aha (aha@cs.jhu.edu)
   -- Date: 15 July 1992
3. Past Usage: Attributes 2 through 10 have been used to represent instances. Each instance has one of 2 possible classes: benign or malignant.
   1. Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences, 87, 9193-9196.
      -- Size of data set: only 369 instances (at that point in time)
      -- Collected classification results: 1 trial only
      -- Two pairs of parallel hyperplanes were found to be consistent with 50% of the data; accuracy on remaining 50% of dataset: 93.5%
      -- Three pairs of parallel hyperplanes were found to be consistent with 67% of data; accuracy on remaining 33% of dataset: 95.9%
   2. Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470-479). Aberdeen, Scotland: Morgan Kaufmann.
      -- Size of data set: only 369 instances (at that point in time)
      -- Applied 4 instance-based learning algorithms; collected classification results averaged over 10 trials
      -- Best accuracy result: 1-nearest neighbor: 93.7%, trained on 200 instances, tested on the other 169
      -- Also of interest: using only typical instances: 92.2% (storing only 23.1 instances), trained on 200 instances, tested on the other 169
4. Relevant Information: Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data.
This grouping information appears immediately below, having been removed from the data itself:

   Group 1: 367 instances (January 1989)
   Group 2:  70 instances (October 1989)
   Group 3:  31 instances (February 1990)
   Group 4:  17 instances (April 1990)
   Group 5:  48 instances (August 1990)
   Group 6:  49 instances (Updated January 1991)
   Group 7:  31 instances (June 1991)
   Group 8:  86 instances (November 1991)
   -----------------------------------------
   Total:   699 points (as of the donated database on 15 July 1992)

Note that the results summarized above in Past Usage refer to a dataset of size 369, while Group 1 has only 367 instances. This is because it originally contained 369 instances; 2 were removed. The following statements summarize changes to the original Group 1's set of data:

   ##### Group 1 : 367 points: 200B 167M (January 1989)
   ##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
   ##### Revised Nov 22, 1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
   #####                     : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
   #####                     : Changed 0 to 1 in field 6 of sample 1219406
   #####                     : Changed 0 to 1 in field 8 of following sample:
   #####                     : 1182404,2,3,1,1,1,2,0,1,1,1

5. Number of Instances: 699 (as of 15 July 1992)
6. Number of Attributes: 10 plus the class attribute
7. Attribute Information: (class attribute has been moved to last column)

   #   Attribute                      Domain
   --  -----------------------------  ---------
   1.  Sample code number             id number
   2.  Clump Thickness                1 - 10
   3.  Uniformity of Cell Size        1 - 10
   4.  Uniformity of Cell Shape       1 - 10
   5.  Marginal Adhesion              1 - 10
   6.  Single Epithelial Cell Size    1 - 10
   7.  Bare Nuclei                    1 - 10
   8.  Bland Chromatin                1 - 10
   9.  Normal Nucleoli                1 - 10
   10. Mitoses                        1 - 10
   11. Class                          (2 for benign, 4 for malignant)

8. Missing attribute values: 16. There are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, now denoted by "?".
9. Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%)
Priority search tree demo was my final project for Brown’s Computational Geometry course. A priority search tree is a data structure that allows for efficient searching and point location in one and one-half dimensions (the upper bound on Y is missing). It is a hybrid of a heap and a balanced search tree (heap for y-coordinates, search tree for x-coordinates).
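A small Python sketch of the idea (not the course project's code; the names and the query convention, reporting points with y at or above a threshold, are my own):

    class PSTNode:
        def __init__(self, point, split_x, left=None, right=None):
            self.point = point      # point with the largest y in this subtree (heap on y)
            self.split_x = split_x  # x value separating the left and right subtrees (tree on x)
            self.left = left
            self.right = right

    def build(points):
        """Build a priority search tree from a list of (x, y) pairs."""
        return _build(sorted(points))

    def _build(pts):                # pts are sorted by x
        if not pts:
            return None
        i = max(range(len(pts)), key=lambda j: pts[j][1])
        top, rest = pts[i], pts[:i] + pts[i + 1:]
        mid = len(rest) // 2
        split_x = rest[mid][0] if rest else top[0]
        return PSTNode(top, split_x, _build(rest[:mid]), _build(rest[mid:]))

    def query(node, x1, x2, y0, out):
        """Collect every stored point with x1 <= x <= x2 and y >= y0."""
        if node is None or node.point[1] < y0:
            return                  # heap property: no point below has y >= y0
        if x1 <= node.point[0] <= x2:
            out.append(node.point)
        if x1 <= node.split_x:
            query(node.left, x1, x2, y0, out)
        if node.split_x <= x2:
            query(node.right, x1, x2, y0, out)

For example, building the tree over [(1, 5), (2, 9), (4, 2), (6, 7), (8, 3)] and querying x in [2, 7] with y >= 4 reports (2, 9) and (6, 7).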
wristsavr — forces you to take a break.
wristsavr saves your wrists: it periodically zwrites & xlocks your screen to remind you to take a 2 minute break. This is a little ditty that I whipped up last night to avoid working on my operating system.
usage: wristsavr [-hb] [-m mins]
  -h       Display usage information.
  -b       Bully mode. If xlock terminates before two minutes, xlock the screen again.
  -m mins  Wait mins minutes between wristsavr notices (default 45).