
Information Theoretic Learning - Renyi's Entropy and Kernel Perspectives

Jose C. Principe

 

Publisher: Springer-Verlag, 2010

ISBN 9781441915702, 448 pages

Format: PDF, OL

Copy protection: watermark

213.99 EUR


 

"9 A Reproducing Kernel Hilbert Space Framework for ITL (p. 351-352)

9.1 Introduction


During the last decade, research on Mercer kernel-based learning algorithms has flourished [226,289,294]. These algorithms include, for example, the support vector machine (SVM) [63], kernel principal component analysis (KPCA) [289], and kernel Fisher discriminant analysis (KFDA) [219]. The common property of these methods is that they operate linearly, as they are explicitly expressed in terms of inner products in a transformed data space that is a reproducing kernel Hilbert space (RKHS).
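As an illustration of how such algorithms are written entirely in terms of inner products, the following minimal sketch (not taken from the book) implements kernel PCA with nothing more than a Gram matrix and an eigendecomposition; the Gaussian kernel and the parameter gamma are arbitrary illustrative choices.

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gram matrix of the Gaussian (RBF) kernel: K[i, j] = exp(-gamma * ||x_i - y_j||^2)
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, n_components=2, gamma=1.0):
    # Kernel PCA expressed purely through inner products (the Gram matrix).
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Center the data implicitly in feature space: K_c = (I - 1/n) K (I - 1/n)
    one_n = np.ones((n, n)) / n
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigendecomposition of the centered, symmetric Gram matrix
    eigvals, eigvecs = np.linalg.eigh(K_c)
    idx = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    # Projections of the training points onto the leading directions in feature space
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

X = np.random.RandomState(0).randn(100, 5)
Z = kernel_pca(X, n_components=2, gamma=0.5)
print(Z.shape)  # (100, 2)

Note that the input data enter the computation only through kernel evaluations; no explicit coordinates in the feature space are ever formed.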

Most often they correspond to nonlinear operators in the data space, yet they are still relatively easy to compute using the so-called "kernel trick". The kernel trick is no trick at all; it refers to a property of the RKHS that enables the computation of inner products in a potentially infinite-dimensional feature space by a simple kernel evaluation in the input space. As one may expect, this computational saving is one of the big appeals of the RKHS.
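To make the kernel trick concrete, the following sketch (an illustration, not part of the excerpt) compares an explicit feature map for the homogeneous degree-2 polynomial kernel on R^2 with a single kernel evaluation in the input space; the two numbers coincide. For the Gaussian kernel the implicit feature space is infinite-dimensional, so the explicit map cannot even be written down, but the kernel evaluation is just as cheap.

import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous degree-2 polynomial kernel on R^2:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly2_kernel(x, y):
    # Kernel evaluation in the input space: k(x, y) = (x . y)^2
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The inner product in feature space equals a single kernel evaluation in input space.
print(np.dot(phi(x), phi(y)))   # 1.0 (up to floating-point rounding)
print(poly2_kernel(x, y))       # 1.0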

At first glance one may even think that it defeats the "no free lunch" theorem (get something for nothing), but the fact of the matter is that the price of the RKHS is the need for regularization and the memory requirements, because these are memory-intensive methods. Kernel-based methods (sometimes also called Mercer kernel methods) have been applied successfully in several applications, such as pattern and object recognition [194], time series prediction [225], and DNA and protein analysis [350], to name just a few.

Kernel-based methods rely on the assumption that projection to the high-dimensional feature space simplifies data handling, as suggested by Cover's theorem: Cover showed that the probability of shattering the data (i.e., separating it exactly by a hyperplane) approaches one with a linear increase in the dimension of the space [64].
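The classic XOR pattern gives a small, concrete instance of this effect (an illustrative sketch, not part of the excerpt): no hyperplane separates the four points in the original two-dimensional space, but after the degree-2 polynomial feature map used above the classes become linearly separable.

import numpy as np

# The XOR pattern: no hyperplane in the original 2-D space separates the two classes.
X = np.array([[ 1.0,  1.0],
              [-1.0, -1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0]])
y = np.array([+1, +1, -1, -1])

# Lift to 3-D with the degree-2 polynomial feature map (x1^2, x2^2, sqrt(2)*x1*x2).
Phi = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

# In the lifted space the third coordinate alone separates the classes,
# i.e. the hyperplane w = (0, 0, 1), b = 0 shatters this data set.
w = np.array([0.0, 0.0, 1.0])
print(np.sign(Phi @ w))  # [ 1.  1. -1. -1.]  -- matches y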

In the case of the SVM, the assumption is that the data classes become linearly separable, and therefore a separating hyperplane is sufficient for perfect classification. In practice, one cannot know for sure whether this assumption holds. In fact, one has to hope that the user chooses a kernel (and its free parameter) that shatters the data, and because this is improbable, the need to include slack variables arises. The innovation of SVMs lies exactly in how to train the classifiers with the principle of structural risk minimization [323].
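A minimal soft-margin SVM sketch, assuming scikit-learn (a library choice of this note, not of the book): the RBF kernel width gamma plays the role of the free parameter mentioned above, and C weights the slack variables that absorb any remaining non-separability.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two-class data that is not linearly separable in the input space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Soft-margin SVM with a Gaussian (RBF) kernel.
# C penalizes the slack variables (small C tolerates more slack);
# gamma is the kernel's free parameter, which the user must choose.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_vectors_))

The values C = 1.0 and gamma = 1.0 are arbitrary here; in practice both are typically selected by cross-validation, which is one face of the regularization cost mentioned earlier.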