New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Soft and hard classification by reproducing kernel Hilbert space methods

Contributed by Grace Wahba
Abstract
Reproducing kernel Hilbert space (RKHS) methods provide a unified context for solving a wide variety of statistical modelling and function estimation problems. We consider two such problems: We are given a training set {y_{i}, t_{i}, i = 1, … , n}, where y_{i} is the response for the ith subject, and t_{i} is a vector of attributes for this subject. The value of y_{i} is a label that indicates which category it came from. For the first problem, we wish to build a model from the training set that assigns to each t in an attribute domain of interest an estimate of the probability p_{j}(t) that a (future) subject with attribute vector t is in category j. The second problem is in some sense less ambitious; it is to build a model that assigns to each t a label, which classifies a future subject with that t into one of the categories or possibly “none of the above.” The approach to the first of these two problems discussed here is a special case of what is known as penalized likelihood estimation. The approach to the second problem is known as the support vector machine. We also note some alternate but closely related approaches to the second problem. These approaches are all obtained as solutions to optimization problems in RKHS. Many other problems, in particular the solution of illposed inverse problems, can be obtained as solutions to optimization problems in RKHS and are mentioned in passing. We caution the reader that although a large literature exists in all of these topics, in this inaugural article we are selectively highlighting work of the author, former students, and other collaborators.
1. Introduction
For this article we define “soft classification” as the problem of classifying an object, based on a vector t of its attributes, into one of two or more categories, where it is desired to estimate the probability p_{j}(x) that the correct identification is category j. We define “hard classification” as the problem of classification where these probabilities are not of primary interest, for example, in cases where the object is easily classifiable by a human. An example of the first problem might be: Given the medical record of a subject, consisting of age, sex, blood pressure, body mass index, cholesterol level, and other relevant variables, provide the 10year probability of a heart attack. Here k = 2 (heart attack or not), and one might want a soft classification model for use in “evidencebased” medicine. The physician can refer to this model and tell a patient, for example, that if they change some of these variables by specific amounts they reduce the estimated 10year risk of a heart attack by so much. Examples of the second problem include character and speech recognition and the recognition of objects in images. In these cases the potential for highly accurate classification is there (because humans can do it), but an estimate of the probability of correct classification is generally of interest primarily as a tool in evaluating or comparing the efficacy of competing classifiers. There are, of course, intermediate situations for which the probability of correct classification is mostly high, but it is desired to identify examples for which the classification may be dubious. Other examples include cases for which the number of potential variables available for classification is much higher than the number of samples available for study, i.e., gene expression data. In this case it might be desired to estimate a probability, but there is not enough information to do it accurately. A modest number of statistical tools are available in the literature for soft classification, whereas hard classification techniques are ubiquitous. Descriptions of many of the latter techniques may be found in the Proceedings of the Neural Information Processing Society available at http://nips.djvuzone.org.
Together with colleagues and students I have spent a number of years developing tools for soft classification based on solving optimization methods in reproducing kernel Hilbert spaces (RKHS, to be described). The hard classification technique known as support vector machines (SVMs; refs. 1–3) has recently become popular in the classification literature and in practice. Although the original arguments deriving the SVM were quite different, it is well known that they may be obtained as solutions to optimization problems in RKHS (see refs. 4–6).
In this inaugural article we compare and contrast hard and soft classification methods that are obtained as solutions to optimization problems in RKHS with reference (essentially only) to published and inprogress work by the present author and collaborators despite the fact that there are many other important papers. The focus here is on classification problems with a small number of well defined categories, as opposed to problems such as speech or handwriting recognition.
We will describe RKHS in terms that do not require understanding of Hilbert spaces. Then we will identify the optimization problems of interest, those for penalized likelihood (soft classification) and SVMs (hard classification), give some examples, and comment on problems of computing and tuning the methods and the comparative advantages and disadvantages of both. Other related (hard) classification methods will be noted also, as will some of the other problems that can be solved via optimization problems in RKHS.
2. RKHS
We first need to define positive definite functions. Let 𝒯 be an index set; for example 𝒯 may be the unit interval or the unit cube, Euclidean dspace, or a finite set, say {1, 2, … , N}. Let s, t ∈ 𝒯. K(s, t) is said to be (strictly) positive definite if, for any ℓ and any distinct t_{1}, … , t_{ℓ} ∈ 𝒯, the ℓ × ℓ matrix with j, k entry K(t_{j}, t_{k}) is (strictly) positive definite, that is, ∑_{j,k} a_{j}a_{k}K(t_{j}, t_{k}) > 0 for any nonzero a_{1}, … , a_{ℓ}. Fix t_{∗} ∈ 𝒯 and let K_{t∗}(s) be the function of s defined by K_{t∗}(s) ≡ K(t_{∗}, s). If 𝒯 = {1, … , N}, then this function is simply an N vector; if 𝒯 is the unit interval, then this function is defined on the unit interval, etc. It is a theorem (see ref. 7) that 〈K_{t∗}, K_{t∗∗}〉 = K(t_{∗}, t_{∗∗}) defines an inner product and hence a distance (norm) on the class of all finite linear combinations of functions of this form. (This relationship is also the source of the term “reproducing kernel.”) Thus this class of functions has a geometry, which includes the notion of orthogonal projections. ∥f − g∥ = 〈f, f〉 −2 〈f, g〉 + 〈g, g〉 and the reader may think of 〈f, g〉/∥f∥_{ℋK}∥g∥_{ℋK} as the cosine of the angle between f and g. The (uniquely determined) collection of all finite linear combinations of these functions and their limits in this norm constitute an RKHS, which is uniquely determined by K. Hereafter this space will be called ℋ_{K} (see refs. 7 and 8). The geometry of projecting one element of this space onto a subspace of ℋ_{K} spanned by a finite number n of elements of ℋ_{K} proceeds just as it would in projecting an element in Euclidean space onto an n dimensional subspace using all the known inner products.
We describe RKHS at the above level of abstraction to emphasize how general the concept is, but in the sequel we will only be concerned with a couple of examples. A simple example is a kernel K associated with the space of periodic functions on [0, 1], which integrate to 0 and have square integrable second derivative. It is K(s, t) = B_{2}(s)B_{2}(t)/(2!)^{2} − B_{4}(s − t)/4!, where s, t ∈ [0, 1], and B_{m} is the mth Bernoulli polynomial (see ref. 8) where other spline kernels may be found. The square norm is known to be ∫(f"(s))^{2}ds, and the K_{t∗} are splines. Another popular kernel is the Gaussian kernel, K(s, t) =exp(−(1/σ^{2})∥s − t∥^{2}) defined for s, t in Euclidean d space, E^{d}, where the norm in the exponent is the Euclidean norm. Elements of this space are generated from functions of the form f_{t∗}(s) =exp(−(1/σ^{2})∥s − t_{∗}∥^{2}), t_{∗} ∈ E^{d}. The square norm of a function in the RKHS associated with this space involves division of the function's squared Fourier transform by the Fourier transform of the Gaussian function, but we do not need to know that here. However, it is useful to know that tensor sums and products of positive definite functions are also positive definite, which allows building positive definite functions on tensor products of all kinds of domains. The trivial case in 𝒯 = E^{d} is K(s, t) = s′t, and then ℋ_{K} is the ddimensional space of homogeneous linear functions, and ∥K_{t∗}∥= ∥t_{∗}∥. The original SVMs were linear and corresponded to this case.
We are now ready to write a (very special case of) a general lemma about optimization problems in RKHS (6).
Lemma.
Given observations {y_{i}, t_{i}, i = 1, 2, … , n}, where y_{i}is a real number and t_{i} ∈ 𝒯, and given K and (possibly) given some particular functions{φ_{1}, … , φ_{M}} on 𝒯, find f of the form f(s) = ∑ d_{v}φ_{v}(s) + h(s) where h ∈ ℋ_{K} to minimizewhere 𝒞 is a convex function off(t_{i}). It is assumed that the minimizer of𝒞(y_{i}, f(t_{i})) in the span of the {φ_{v}} is unique. Then the minimizer ofℐ{f, y} has a representation of the form The coefficient vectors d = (d_{1}, … , d_{M})′ and c = (c_{1}, … , c_{n})′ are found by substituting Eq. 2 into the first term in Eq. 1 and using the fact that ∥∑c_{i}K(t_{i},⋅)∥= c′K_{n}c, where K_{n} is the n × n matrix with i, jth entry K(t_{i}, t_{j}). The minimization generally has to be done numerically by an iterative descent method except in the case where 𝒞 is quadratic in f, in which case a linear system has to be solved.
When K(⋅, ⋅) is a smooth function of its arguments and n is large, it has been found that excellent approximations to the minimizer of Eq. 1 for various 𝒞 can be found with functions of the form where the t_{i1}, … , t_{iL} are a relatively small subset of t_{1}, …, t_{n}, thus reducing the computational load. The t_{i1}, … , t_{iL} may be chosen in various ways: as a random subset, by clustering the t_{i} and selecting from each cluster (9), or by a greedy algorithm (as for example in ref. 10), depending on the problem.
3. Penalized Likelihood Estimation: Two Categories
We first consider only two categories, or outcomes, labeled 1 and 0; for example, 1 is the outcome that a subject gets a disease and 0 is the outcome that they do not. Let p(t) be the probability that a subject with attribute vector t ∈ 𝒯 gets the disease and define the log odds ratio f(t) = log[p(t)/(1 − p(t)]. The likelihood function for data {y_{i}, t_{i}} is p(t_{i}) if y_{i} = 1 and (1 − p(t_{i})) if y_{i} = 0, which may be written p(t_{i})^{yi}(1 − p(t_{i}))^{1−yi}. Substituting f(t_{i}) into the likelihood function, we obtain the negative log likelihood Given a training set {y_{i}, t_{i}, i = 1, … n}, an estimate for f and hence p may be obtained by setting 𝒞(y_{i}, f(t_{i})) in Eq. 1 to be ℒ(y_{i}, f(t_{i})) of Eq. 4. This kind of penalized likelihood estimate goes back at least to ref. 11.
Let t ∈ [0, 1]^{d}, the unit cube, and let t = (x_{1}, …, x_{d}) and t_{i} = (x_{i1}, … , x_{id}). Under general conditions we can expand any integrable f(t) as follows: f(t) = μ + ∑_{α} f_{α}(x_{α}) + ∑_{α<β} f_{αβ}(x_{α}, x_{β}) + ⋯ + f_{αβ…}(x_{1}, … , x_{d}), where the components satisfy side conditions that make them identifiable and generalize the usual side conditions for parametric ANOVA, that is, all averages with respect to each x_{α} are 0 (see ref. 12). The series is truncated in some manner. By using weighted tensor sums and products of reproducing kernels, the theory above may be generalized to replace λ∥f∥ in Eq. 1 by ∑_{α} λ_{α} J_{α}(f_{α}) + ∑_{α<β}λ_{αβ}J_{αβ}(f_{αβ}) + ⋯, where J_{α}, J_{αβ}, … are square norms in the component subspaces. Details may be found in refs. 8, 12, and 13. The λ_{α}, λ_{αβ}, … are smoothing parameters to be chosen. The right panel in Fig. 1 gives the estimated 4year risk of progression of diabetic retinopathy based on data from the Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR) (14). The attribute vector is t = (x_{1}, x_{2}, x_{3}) = (duration of diabetes, glycosylated hemoglobin, body mass index), and was observed in a group of n = 669 subjects at the start of the study, and the outcome (progression or not) was observed at the 4year followup and coded as y = 1 or 0. The plot was based on the model f(x_{1}, x_{2}, x_{3}) = μ + f_{1} (x_{1}) + d_{2}x_{2} + f_{3}(x_{3}) + f_{13}(x_{1}, x_{3}). The figure is plotted for x_{2} fixed at its median. The software that produced this analysis can be found in the code grkpack (15). The method for choosing smoothing parameters behind this plot is discussed in ref. 12 (see also ref. 16). A later method, which is used in some subsequent work, is called generalized approximate cross validation (GACV) (17, 18) and will be discussed further in Section 8. An important problem is the selection of terms that are to be retained in the model. This was done in an informal manner in ref. 12. A recent approach to the problem of selecting terms is discussed in refs. 19 and 20.
This penalized likelihood method provides an estimate p_{λ} for p which is known to converge to the true p as the sample size becomes large, under various conditions, including a good choice of the λ (see refs. 13, 21, and 22).
4. SVM Classification: Two Categories
Suppose that we have two categories as before, but we are interested only in making a decision as to which category a future object with attribute vector t_{∗} is in. It will be convenient for this purpose to change the coding. We let y_{i} = 1 if the ith subject or object from the training set is in the first category, and y_{i} = −1 if it is in the second category. It can be shown that with this coding the negative log likelihood function becomes Fig. 2 gives a plot of log(1 + e^{−τ}) as a function of τ. Now we are not interested in the complete function p(t) but only in obtaining enough information to make a classification decision. In the socalled standard case where the costs of both kinds of misclassification are equal and the training set is representative of the population to be classified in the future, the optimal classifier for a subject with attribute vector t_{∗} is determined by whether p(t_{∗}) is greater or less than 1/2, equivalently, whether f(t_{∗}) is positive or negative. A member of the training set with label y_{i} will be classified correctly by f_{λ}(t_{i}) if τ_{i} ≡ y_{i}f_{λ}(t_{i}) is positive and incorrectly if τ_{i} is negative. Thus, if only classification is of interest, one could consider letting 𝒞(y_{i}, f(t_{i})) of Eq. 1 be the ∗ function [−y_{i}f(t_{i})]_{∗}, where [τ]_{∗} = 1 if τ > 0 and 0 otherwise. Then the first term on the righthand side of Eq. 1 would simply count the number of misclassified subjects in the training set. [−τ]_{∗} is plotted in Fig. 2. However, ℐ of Eq. 1 with 𝒞 the ∗ function can have multiple minima and is difficult to compute, because it is not convex. Define the “plus” function (τ)_{+} = τ if τ > 0 and 0 otherwise. The socalled hinge function (1 − τ)_{+} is also plotted in Fig. 2 and is seen to be the closest convex upper bound to the ∗ function that has a slope of −1 at 0. Setting 𝒞(y_{i}, f(t_{i})) in Eq. 1 to be (1 − y_{i}f(t_{i}))_{+} and setting M = 1 and φ_{1} (s) = 1 in Eq. 2 results in the optimization problem, the minimizer of which is known as an SVM. In the last few years the SVM has become very popular in the machine learning community for classification problems, because in practice it has been empirically observed to provide excellent classification results in a wide variety of applications. It is to be noted that once the reproducing kernel K(s, t) is defined, the nature of the domain of s and t is not relevant to the calculations, because only values of K(s, t) are needed to define the SVM. Thus problems where s and t are in a high dimensional space while only a small number of samples are available may be treated, at least mathematically, in a unified manner. Up until recently a theoretical understanding of what the SVM was actually estimating was not available. Recently Lin (23) has shown that if the RKHS is sufficiently rich, then under certain circumstances the SVM is estimating the sign of (p − 1/2), equivalently, the sign of f. This is exactly what you need to classify according to Bayes rule. This is illustrated in Fig. 3. To obtain Fig. 3, Lin let p(t) = Pr(Y = 1t) = 2t, for t ∈ [0, 0.5] and 1 − 2t for t ∈ [0.5, 1]. Thus sign[p(t) − 1/2] = 1, t ∈ (0.25, 0.75), −1, otherwise, which is also the desired sign of f = log[p/(1 − p)]. n = 257 equally spaced values of t_{i} on the unit interval were selected, and y_{i} was generated randomly to be 1 with probability p(t_{i}) and −1 with probability 1 − p(t_{i}). A sufficiently rich spline kernel was chosen to compute the 25 SVMs in Fig. 3 for 25 values of log_{2}λ from −1 to −25, left to right, top to bottom. It can be seen that for log_{2}λ in the neighborhood of −13 to −18, the estimate is very close to signf. λ may be chosen by the GACV method for SVMs (see ref. 24 and Section 8).
5. Penalized Likelihood Estimation: Multiple Categories
Suppose now that there are k + 1 possible outcomes, with k > 1. Let p_{j}(t), j = 0, 1, … , k be the probability that a subject with attribute vector t is in category k, ∑ p_{j}(t) = 1. The following approach was proposed in ref. 25: Let f^{j}(t) =log[p_{j}(t)/p_{0}(t)], j = 1, … , k. Then The class label for the ith subject is coded as y_{i} = (y_{i1}, … , y_{ik}), where y_{ij} = 1 if the ith subject is in class j and 0 otherwise. Letting f = (f^{1}, … , f^{k}) the negative log likelihood can be written as where the h^{j} can have an ANOVA decomposition as in Section 3. Generally the f^{j} will have the same terms, with their own d, c. Then λ∥h∥ in Eq. 1 is replaced by Fig. 4 is based on 10year mortality data of a group of n = 646 subjects from the WESDR study. Their age (x_{1}), glycosylated hemoglobin (x_{2}), and systolic blood pressure (x_{3}) were (among other things) recorded at baseline, and they were divided into four categories with respect to their status after 10 years, as 0 = alive, 1 = died of diabetes, 2 = died of heart disease, and 3 = died of other causes. Each of the f^{j}, j = 1, 2, 3 was modeled as f^{j}(x_{1}, x_{2}, x_{3}) = μ^{j} + f(x_{1}) + f(x_{2}) + f(x_{3}) + f(x_{2}, x_{3}). The p_{j}, j = 0, … , 3 were estimated by minimizing ℐ(y, f) = Eq. 8 + Eq. 9, and the multiple smoothing parameters were estimated by the GACV for polychotomous data (25). For the figure, x_{2} and x_{3} were set at their medians, and the differences between adjacent curves, from bottom to top, are probabilities for categories 0, 1, 2, and 3, respectively.
6. SVMs: Multiple Categories
Thus far, we have been assuming that the observational or training data {y_{i}, t_{i}} is a representative sample from the population of interest. We return to the classificationonly problem, now with several categories. If the cost of misclassification is the same for each of the categories of interest, then the optimum classifier would choose category k if p_{k}(t) is larger than p_{ℓ}(t) for each ℓ not equal to k. We first consider the socalled standard case, where the cost of misclassification is the same for each category, and as before the training sample is representative. In the next section we will consider the nonstandard case, for which these conditions are relaxed. Many classification problems involve more than two categories, and most authors use some version of oneversusmany or multiple pairwise comparisons (see refs. 1 and 26 for example). Recently a generalization of the SVM that treats all categories simultaneously and symmetrically has been obtained in ref. 27, called the multicategory SVM (MSVM). In the MSVM we have k categories labeled as j = 1, … , k. The class label y_{i} will be coded as a k dimensional vector with 1 in the jth position if example i is in category j and −(1/k−1) otherwise. For example y_{i} = (1, −(1/k−1),…, −(1/k−1)) indicates that the ith example is in category 1. We define a ktuple of separating functions f(t) = (f^{1}(t), … , f^{k}(t)), with each f^{j} = d^{j} + h^{j} with h^{j} ∈ ℋ_{K}, and which will be required to satisfy a sumtozero constraint, ∑f^{j}(t) = 0, for all t in 𝒯. Note that unlike the estimate of Section 5, all categories are treated symmetrically.
Let L_{jr} = 1, r ≠ j, L_{jj} = 0, j, r = 1, … , k. Let cat(y_{i}) = j if y_{i} is from category j. Then, if y_{i} is from category j, L_{cat(yi}_{)r} = 0 if r = j and 1 otherwise. Then the MSVM is defined as the vector of functions f_{λ} = (f, … , f), with each h^{k} in ℋ_{K} satisfying the sumtozero constraint, which minimizes Generalizations of the penalty term are possible if necessary. It can be shown that the k = 2 case reduces to the usual twocategory SVM just discussed, and it is shown in ref. 27 that the target for the MSVM is f(t) = (f^{1}(t), … , f^{k}(t)) with f^{j}(t) = 1 if p_{j}(t) is bigger than the other p_{l}(t) and f^{j}(t) = −(1/k−1) otherwise. Fig. 5 describes a simulated example to suggest this result. In Upper Left are given p_{j}(t), j = 1, 2, 3, and in Upper Right, Lower Right, and Lower Left the three optimum f^{j} are superimposed on the p_{j}. The f^{j} take on only the values 1 and −(1/2) ≡ −(1/k−1). For the experiment n = 200, values of t_{i} were chosen according to a uniform distribution on the unit interval, and the class labels were generated according to the p_{j}. Fig. 6 gives the estimated f^{1}, f^{2}, and f^{3}. The Gaussian kernel was used. In Fig. 6 Left λ and σ were chosen with the knowledge of the “right” answer. It is strongly suggestive that the target functions are as claimed. In Fig. 6 Right, both λ and σ^{2} were chosen by the GACV for the MSVM, given in refs. 28 and 29, which is a generalization of the GACV for the twocategory SVM previously given in refs. 24 and 30. For comparison purposes, λ and σ^{2} were chosen by fivefold cross validation on the MSVM functional (first sum in Eq. 10), with very similar results (not shown). In this example, λ was chosen somewhat larger than its optimum value both by the GACV and the fivefold cross validation, but it can be seen that the implied classifier is quite accurate nevertheless. This is the kind of example where the MSVM will beat a onevs.many twocategory SVM: category 2 would be missed, because the probability of category 2 is less than the probability of not 2 over a region, although it is the most likely category there.
Two interesting applications of the MSVM appear in refs. 29 and 31 and Y. Lee, G.W., and S. Ackerman, in preparation. The first, also discussed in ref. 28, concerns the application of the MSVM to the classification of a set of cDNA profiles. There were four categories of profiles from small round blue cell tumors (SRBCTs) of childhood. The training set consisted of 63 samples falling into the four categories. Initially, each sample consisted of a vector of expression values for 2,308 genes but was first reduced to the 100 most informative expression values by simple genebygene tests, and then three principal components were extracted from the 100 such that the t vectors are of dimension 3. Scatter plots of the three components suggested that this is a relatively easy classification problem, using only three dimensional vectors extracted from the initial 2,308 dimensional vectors. To study the efficacy of the MSVM, a test set of 20 SRBCT samples and 5 nonSRBCT samples were used. All of the 20 testsample SRBCTs were classified correctly, and the classification of the 5 nonSRBCTs was ambiguous in the sense that no one of the four components of the MSVM was very large. A measure of the strength of the classification based on how close the MSVM is to a target vector [that is, one with entries 1 and −1/(k − 1)] was obtained in ref. 29, and it indicated that all the classifications of the 5 nonSRBCT samples were weak, along with 3 of the 20 SRBCT classifications. The second example from ref. 29 and Y. Lee, G.W., and S. Ackerman (in preparation) concerns the classification of satelliteobserved radiance profiles that contained one of two types of clouds or no clouds. The data came from a simulation system that generates model atmospheric profiles and the response of the observing instrument to them and consisted of 744 profiles, each containing a 12vector of the responses from 12 different instrument channels. The data set was divided randomly in half for a training set and a test set, and the threecategory MSVM was applied to the test set. It was clear there was some modest overlap in data from the three classes. Two cases were tried where the 12vector was reduced to 2 and then to 5 components nonlinearly, using domain knowledge; then in a third analysis the original 12vector was used, resulting in a slightly worse testerror rate than the previous two cases. But transforming the 12vectors into their logs turned out to be the best of the four. Both of these examples illustrate an open question discussed by various authors: How best to choose dimension reducing, possibly nonlinear transformations on the observations to improve classification rates.
7. The Nonstandard MSVM
We now consider the case where the proportion π, j = 1, … , k of examples in the training set in each category is not representative of the proportion π_{j} in the population as a whole, and the costs of misclassification are different for different mistakes. Let C_{jr} be the cost of classifying a j subject as an r, with C_{jj} = 0. Then the Bayes rule (which minimizes expected cost) is to choose the j for which ∑C_{ℓj}p_{ℓ}(t) is minimized, where p_{ℓ}(t) is the probability that a subject (in the population as a whole) with attribute vector t is in category ℓ. Now let p(t) be the probability that a subject in the training set (with proportions π of the different categories) with attribute vector t is in category j. Let now One then can show that the optimum classifier chooses the j for which ∑L_{ℓj}p(t) is minimized. The nonstandard MSVM is defined as the vector of functions f_{λ} satisfying the sumtozero constraint, which minimizes Eq. 10, with the L_{jr} there given by Eq. 11. It is shown in ref. 27 that the target of this SVM is: f_{j}(t) = 1 for the j which minimizes ∑C_{ℓj}p_{ℓ}(t) and −(1/k−1) otherwise. Further results and some “toy” examples in the twocategory nonstandard case are in ref. 32. Applications of the nonstandard SVM include, for example, evidencebased medical decision making, where different patients might have different personal “costs” with respect to the risks of erroneous treatment decisions. Similarly, both the costs and relative frequencies of the classes in the satellite profile problem discussed previously may be nonstandard, because if the system were to be automated in a numerical weather prediction model, the different kinds of misclassifications may affect the system differently, with more or less costly consequences. When the classes are well separated, taking costs into account will not make much difference, but in modest overlap cases, they can. The GACV for the nonstandard twocategory SVM is given in ref. 32 and for the MSVM in ref. 29.
8. Choosing the Smoothing and Tuning Parameters
Methods for choosing the smoothing/tuning parameters in optimization problems such as Eq. 1 have generated lively interest in statistical and machine learning circles for some time. The λs control the tradeoff between the first term on the right of Eq. 1, the fit to the observations, and the second, the complexity of the solution. For the penalized likelihood situation this is known as the biasvariance tradeoff, and this terminology has migrated to the hard classification case. One of the oldest methods, based on 𝒞(y_{i}, f(t_{i})) = (y_{i} − f(t_{i}))^{2} in Eq. 1 is the generalized cross validation (GCV), obtained in refs. 33 and 34. It is targeted at minimizing the predictive mean square error, the mean square difference between the estimate f_{λ} and the (unknown) truth, f_{true}, when y_{i} = f_{true}(t_{i}) + ɛ_{i}, where ɛ_{i} is white Gaussian noise with common unknown variance. The GCV may be derived beginning with a leavingoutone argument and an invariance argument and coincides with leavingoutone in special circumstances. Theoretical properties have been obtained in various places including refs. 8, 35, and 36. The GACV for Bernoulli (0, 1) data is targeted at the Kullback–Leibler distance from the estimated p_{λ} to the true p_{true}, a commonly used measure (but not actually a distance) between two distributions (see ref. 17). Both the GCV and the GACV require the computation of the trace of a difficult to compute matrix when n is large. Randomization techniques for doing this efficiently for the GCV were proposed in refs. 37 and 38 and for the GACV in ref. 18. The GACV for SVMs was obtained by starting with a similar leavingoutone argument followed by a series of approximations and is targeted at minimizing the expected value of (1 − y_{future}f_{λ})_{+} in the twocategory standard SVM case, where y_{future} is a new observation from the population under consideration. The ξα method of ref. 39 is similar to and behaves very much like the GACV (32), although the ξα method is targeted directly at the misclassification rate [−y_{future}f_{λ}]_{∗}, whereas the GACV is targeted at (1 − y_{future}f)_{+}, an upper bound on the misclassification rate. There are several other methods in the SVM literature that begin with the same leavingoutone argument or an upper bound for it (see ref. 40) with similar results. The ingredients for these are available when the SVM is computed, so no special calculations are required. Leaving out 1/2, 1/5, 1/10, and other crossvalidation procedures for estimating the tuning parameters in SVMs are popular, especially when there is a copious training set available.
9. Which Cost Function?
Returning to the twocategory case for ease of exposition, it can be asked when is it better to use the SVM or the penalized likelihood estimate. The penalized likelihood estimate can be used for classification in the standard and nonstandard cases (in the multicategory as well as the twocategory case) via the Neyman–Pearson Lemma. If probabilities are desired and several conditions are met, then the penalized likelihood estimate is the more appropriate. These conditions generally include a large data set relative to the number of dimensions and probabilities that are expected to be bounded away from 0 or 1 in regions of interest (because the true f will tend to infinity as p tends to 0 or 1). For classification, the SVM does not suffer from either of these problems and furthermore tends to give a sparse solution, that is, many c_{i} are 0, at the cost of a lack of direct interpretability of classification results that are “weak.” Alternatively, the penalized likelihood estimate can somewhat mitigate these problems for classification by an appropriate thinning of the basis functions. Letting the “cost function” c(τ) be 𝒞(y_{i}, f(t_{i})) with τ = y_{i}f(t_{i}), a hybrid 𝒞 may be defined, which combines features of the SVM and the penalized likelihood estimate by letting c(τ) = ln(1 + e^{−τ}) for −∞ ≤ τ ≤ θ and c(τ) extrapolated linearly for τ > θ by matching c(τ) and c′(τ) at θ and linearly continuing until c(τ) becomes 0, after which it remains 0. With a judicious choice of θ > 0, this c might combine the best properties of the SVM and the penalized likelihood, having the sparsity and ease with p near 0 or 1 of the SVM while estimating the log odds ratio and hence the probability near the intermediate case. Other cs for the classification problem have been proposed by various authors including (1 − τ), where q is some power greater than 1. An argument was provided recently for c(τ) linear with a negative slope for τ less than θ, joined smoothly on the right to c(τ) = 1/τ for τ > θ (M. Todd and S. Marron, personal communication). 𝒞(y_{i}, f(t_{i})) = (y_{i} − f(t_{i}))^{2}, a.k.a. penalized least squares, a.k.a. regularized least squares, a.k.a. ridge regression, long known in the statistics literature for regression problems (see e.g. the references in ref. 34), corresponds to c(τ) = (1 − τ)^{2} in the case where y_{i} = ±1. Several authors have proposed it in the classification context (T. Poggio, personal communication, and O. Mangasarian, personal communication),† although different names have been attached to it in some cases. Poggio's group found in the cases tried that it compared in classification accuracy with the SVM. It is the easiest to compute, requiring only the solution of a linear system. The extension to the multicategory case is straightforward, replacing (f^{r}(t_{i}) − y_{ir})_{+} in Eq. 10 by (f^{r}(t_{i}) − y_{ir})^{2}, and imposing the sumtozero constraint by requiring the coefficients in the estimates to satisfy a linear system subject to linear equality constraints. A number of other choices of c are discussed in ref. 41, where it is shown for the twocategory standard case that under very weak assumptions on c the resulting solution will tend to have the same sign as (p − 1/2).
Classification problems may have few to extremely many variables (our examples here are the few variables case), may have few to extremely many observations available for a training set, and may be very easy to fairly difficult to classify; “fairly difficult” may include data from the different classes overlapping and/or atypical samples. It is safe to claim, bolstered by theory as well recent simulation results of various authors, that no one c or 𝒞 is going to dominate all others over the range of classification problems. The choice of K can be important or unimportant depending on the example. The Gaussian kernel appears to be a good general purpose kernel for classification in many examples. Other examples of radial kernels (that is, depending on ∥s − t∥) may be found in ref. 42, at ftp://ftp.stat.wisc.edu/pub/wahba/talks/nips.96/mc.talk.ps, and elsewhere. Some information related to the sensitivity/insensitivity of solutions to optimization problems in RKHS to various parameters in K may be found in chapter 3 of ref. 8. Especially if ℋ_{K} is a space of flexible functions it is necessary to control the biasvariance tradeoff; choices here may well be the most important. This tradeoff involves the λs, but other choices may also be important. When the Gaussian kernel is used with the SVM, the results are generally sensitive to the choice of σ, which therefore must also be tuned.
10. Concluding Remarks
In the last few years there has been an explosion of classification methods and results, both theoretical and practical, that are related to optimization problems in RKHS. SVMs and related methods have become the method of choice in many classification applications as their properties are becoming known. In each problem a cost function, a kernel K, and a tuning method must be selected, along with the sometimes nontrivial problem of choosing a numerical algorithm. The trick of thinning the representers can sometimes be used to assist in the biasvariance tradeoff while at the same time making the calculations easier. Early stopping of iterative methods, which we haven't discussed, can also help in this tradeoff (43, 44). Relative sensitivity of the results to the various choices is an active area of recent research. Penalized likelihood methods for regression and function estimation, with data involving random variables from various known distributions, are older (11). The numerical solution of illposed inverse problems, where the data are modeled as y_{i} = ∫ F(t_{i}, s)f(s)ds + ɛ_{i}, where F(t, s) is given, ɛ_{i} are noise variables, and it is desired to recover f, may proceed by replacing 𝒞(y_{i}, f(t_{i})) by 𝒞(y_{i}, ∫ F(t_{i}, s)f(s)ds) in Eq. 1. Then the K(t_{i}, ⋅) in Eq. 3 are replaced by (other) socalled representers, which may be found in refs. 6 and 8 (see also refs. 45–47). Other references can be found via the publication list on my web site (www.stat.wisc.edu/^{∼}wahba).
Optimization methods in RKHS have turned out to be useful in a wide variety of problems in statistical model building, machine learning, curve and surface fitting, illposed inverse problems, and elsewhere. Technical reports and Ph.D. theses since mid1993 are available via the TRLIST link on my web site.
Acknowledgments
I owe a great debt to my colleague Yi Lin in the Statistics Department at the University of Wisconsin (Madison) and to my most recent former students (now assistant professors) Yoonkyung Lee (Ohio State University, Columbus) and Hao Helen Zhang (North Carolina State University, Raleigh). I owe much to Drs. Ron Klein and Barbara E. K. Klein of the University of Wisconsin Ophthalmology Department for many fruitful discussions that suggested some of the problems studied here and for making data from the WESDR study available, and to former student Xiwu Lin (now at GlaxoSmithKline) for Fig. 4 and the work behind it. I am grateful to my many other former students who had a hand in earlier problems involving optimization problems in RKHS and to many mentors, including especially my thesis advisor, Manny Parzen, who introduced me to RKHS. This research was supported by National Institutes of Health Grant RO1 EY09946, National Science Foundation Grant DMS 0072292, and National Aeronautics and Space Administration Grant NAGW 10273.
Footnotes

↵* Email: wahba{at}stat.wisc.edu.

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected on May 2, 2000.

↵† Poggio, T., Mukherjee, S., Rifkin, R. & Suykens, J., Foundation of Computational Mathematics 2002 Meeting, Aug. 5–14, 2002, Minneapolis, MN.
Abbreviations

RKHS, reproducing kernel Hilbert space

SVM, support vector machine

WESDR, Wisconsin Epidemiologic Study of Diabetic Retinopathy

GACV, generalized approximate cross validation

GCV, generalized cross validation

MSVM, multicategory SVM

SRBCT, small round blue cell tumor
 Accepted September 23, 2002.
 Copyright © 2002, The National Academy of Sciences
References
 ↵
 Vapnik V.

 Scholkopf B.

 Cristianini N.
 ↵
 Wahba G.
 ↵
 ↵
 ↵
 Wahba G.
 ↵
 Xiang D.
 ↵
 ↵
 ↵
 ↵
 Gu C.
 ↵
 Klein R.
 ↵
 ↵
 ↵
 Xiang D.
 ↵
 Wahba G.
 ↵
 Zhang H.
 ↵
Zhang, H., Wahba, G., Lin, Y., Voelker, M., Ferris, M., Klein, R. & Klein, B. (2000) Tech. Rep. 1059 (Dept. of Stat., Univ. of Wisconsin, Madison).
 ↵
 ↵
 ↵
 ↵
 Wahba G.
 ↵
 Lin X.
 ↵
 Ramaswamy S.
 ↵
Lee, Y., Lin, Y. & Wahba, G. (2002) Comput. Sci. 33, in press; Tech. Rep. 1043 (Dept. of Stat., Univ. of Wisconsin, Madison).
 ↵
Lee, Y. & Lee, C.K. (2002) Tech. Rep. 1051 (Dept. of Stat., Univ. of Wisconsin, Madison).
 ↵
 Lee Y.
 ↵
 ↵
Lin, Y., Lee, Y. & Wahba, G. (2002) Tech. Rep. 1063 (Dept. of Stat., Univ. of Wisconsin, Madison).
 ↵
 Wahba G.
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Joachims T.
 ↵
 ↵
Lin, Y. (2002) Tech. Rep. 1044 (Dept. of Stat., Univ. of Wisconsin, Madison).
 ↵
 ↵
 Wahba G.
 ↵
 ↵
Citation Manager Formats
Sign up for Article Alerts
Jump to section
 Article
 Abstract
 1. Introduction
 2. RKHS
 3. Penalized Likelihood Estimation: Two Categories
 4. SVM Classification: Two Categories
 5. Penalized Likelihood Estimation: Multiple Categories
 6. SVMs: Multiple Categories
 7. The Nonstandard MSVM
 8. Choosing the Smoothing and Tuning Parameters
 9. Which Cost Function?
 10. Concluding Remarks
 Acknowledgments
 Footnotes
 Abbreviations
 References
 Figures & SI
 Info & Metrics