Transformation Knowledge in Pattern Analysis with Kernel Methods
– Distance and Integration Kernels –

Albert-Ludwigs-Universität Freiburg im Breisgau

Dean (Dekan): Prof. Dr. Jan G. Korvink
Examination committee (Prüfungskommission): Prof. Dr. Wolfram Burgard (chair), Prof. Dr. Luc De Raedt (assessor), Prof. Dr. Hans Burkhardt (referee), Prof. Dr. Bernhard Schölkopf (referee)
Date of disputation: 18 November 2005
Acknowledgement
Firstly, I want to thank my supervisor Prof. Dr.-Ing. Hans Burkhardt for giving me the opportunity and wide support for the research which has led to this thesis. In particular, the excellent technical environment, the availability of various interesting application fields and the scientific freedom have combined to be an excellent basis for independent research. The generous support of research travel enabled me to establish many important and fruitful contacts. Similarly, I am deeply grateful to Prof. Dr. Bernhard Schölkopf, who was a constant source of motivation through his own related work and various guiding hints, many of which find themselves realized in the present thesis. I am very glad that he agreed to act as the second referee. In particular, I am very thankful for being given the opportunity to visit his group for a talk, several weeks of research and the machine learning summer school MLSS 2003. During these occasions, many fruitful discussions were possible, especially with Dr. Ulrike von Luxburg, Matthias Hein and Dr. Olivier Bousquet. Large parts of the experiments were based on third-party data which were kindly provided by Dr. Elżbieta Pękalska, Dr. Thore Graepel, Daniel Keysers and Rainer Typke. I also want to mention my former and current colleagues at the pattern recognition group, who contributed through discussions, by providing data and, last but not least, with encouragement when required. The whole group and also the members of the associated group of Prof. Dr. Thomas Vetter provided a wonderful, friendly and personal atmosphere, which played a very important role for me. I therefore particularly want to mention Nikos Canterakis, Olaf Ronneberger, Dr.-Ing. Lothar Bergen, Dimitrios Katsoulas, Claus Bahlmann, Stefan Rahmann, Dr. Volker Blanz and Klaus Peschke. A big “thank you” also goes to three of my former students, Nicolai Mallig, Harald Stepputtis and Anselm Vossen, who all contributed through discussions, ideas, implementations and scientific results to the development of the subjects in three main chapters.
Last but not least, I dedicate the thesis to other important
persons. On the one hand, to my parents, who supported the
unhindered development of my work in various ways. On the other
hand, to my girlfriend Heide, who also had to live with all the ups
and downs of my work during the last several years, but always
managed to remind me of other important things in life.
Kunheim, April 2005 Bernard Haasdonk
Zusammenfassung
This thesis concentrates on a particular kind of prior knowledge, namely knowledge about transformations. This means that explicit knowledge of pattern variations is available which do not, or only insignificantly, change the inherent meaning of the objects. Examples are rigid motions of 2D and 3D objects, or transformations such as slight stretching, shifting or rotation of characters in optical character recognition. Several generic methods are presented and investigated which incorporate such prior knowledge into kernel functions.

1. Invariant distance substitution kernels (IDS-kernels): In many practical problems the transformations are implicitly captured by sophisticated distance measures between objects, for instance nonlinear deformation models between images. Here an explicit parameterization of the transformations would require arbitrarily many parameters. Such measures can be incorporated into distance- and inner-product-based kernels.

2. Tangent distance kernels (TD-kernels): Special instances of the IDS-kernels are investigated in more detail because they can be computed efficiently and applied widely. We assume differentiable transformations of the patterns. Given such prior knowledge, one can construct linear approximations of the transformation manifolds and, by means of suitable distance functions, use them efficiently for the construction of kernel functions.

3. Transformation integration kernels (TI-kernels): The technique of group integration over transformations for feature extraction can be extended in a suitable way to kernel functions and to more general transformations which do not necessarily form a group.

Theoretically, these methods differ in how they represent the transformations and in how the transformation extent can be controlled. More fundamentally, kernels of category 3 turn out to be positive definite, whereas kernels of types 1 and 2 are not positive definite, which is generally regarded as a necessary prerequisite for use in kernel methods. This was the motivation to investigate the theoretical meaning of such indefinite kernels. The result shows that, on given data, these kernels represent inner products in pseudo-Euclidean spaces, in which certain kernel methods, in particular the SVM, have a meaningful geometric and theoretical interpretation.

In addition to the theoretical properties, the practical applicability of the kernels is demonstrated. For these experiments, support vector classification was performed on a variety of datasets. These datasets comprise standard benchmark datasets of optical character recognition, such as USPS and MNIST, and biological application data which stem from Raman microspectroscopy and serve the identification of bacteria.

Beyond the finding that transformation knowledge can be incorporated into kernel functions in various ways and that these kernels are practically applicable, there are more fundamental insights and perspectives. Using the SVM as an example, we demonstrate and explain that indefinite kernels can be used or tolerated in kernel methods: statements about the training algorithm and the properties of the solutions exist, together with a meaningful geometric interpretation. This essentially opens up two directions. Firstly, these insights simplify the process of kernel design, which so far has mainly been restricted to positive definite kernels. In particular, this opens the possibility of a wide applicability of SVMs in other fields such as distance-based learning, i.e. for analysis problems in which dissimilarity measures between objects are available. Secondly, the investigation of the applicability of indefinite kernels in further kernel methods appears very promising.
Abstract
Among modern techniques for data analysis and machine learning are the so-called kernel methods. The most famous and successful one is the support vector machine (SVM) for classification or regression tasks. Further examples are kernel principal component analysis for feature extraction or other linear classifiers like the kernel perceptron. The fundamental ingredient in these methods is the choice of a kernel function, which computes a similarity measure between two input objects. For good generalization abilities of a learning algorithm it is indispensable to incorporate problem-specific a-priori knowledge into the learning process. The kernel function is an important element for this.

This thesis focusses on a certain kind of a-priori knowledge, namely transformation knowledge. This comprises explicit knowledge of pattern variations that do not or only slightly change a pattern's inherent meaning, e.g. rigid movements of 2D/3D objects or transformations like slight stretching, shifting or rotation of characters in optical character recognition. Several methods for incorporating such knowledge in kernel functions are presented and investigated.
1. Invariant distance substitution kernels (IDS-kernels): In many practical problems the transformations are implicitly captured by sophisticated distance measures between objects. Examples are nonlinear deformation models between images. Here an explicit parameterization would require an arbitrary number of parameters. Such distances can be incorporated in distance- and inner-product-based kernels.

2. Tangent distance kernels (TD-kernels): Specific instances of IDS-kernels are investigated in more detail as these can be efficiently computed. We assume differentiable transformations of the patterns. Given such knowledge, one can construct linear approximations of the transformation manifolds and use these efficiently for kernel construction by suitable distance functions.

3. Transformation integration kernels (TI-kernels): The technique of integration over transformation groups for feature extraction can be extended in a suitable way to kernel functions and to more general group, non-group, discrete or continuous transformations.
Theoretically, these approaches differ in the way the transformations are represented and in the adjustability of the transformation extent. More fundamentally, kernels from category 3 turn out to be positive definite, whereas kernels of types 1 and 2 are not positive definite, a property which is generally required for usability in kernel methods. This is the motivation to investigate the theoretical meaning of such indefinite kernels. The finding is that on given data these kernels correspond to inner products in pseudo-Euclidean spaces. Here certain kernel methods, in particular SVMs, have a reasonable geometrical and theoretical interpretation.

Practical applicability of the kernels is demonstrated in addition to the theoretical properties. For these experiments, support vector classification on various types of data has been performed. The datasets comprise standard benchmark datasets for optical character recognition like USPS and MNIST as well as real-world biological data resulting from micro-Raman spectroscopy with the goal of bacteria identification.
In addition to the demonstration that transformation knowledge can be incorporated in kernel functions in different ways and that these can be practically applied, there are more fundamental findings and perspectives. We demonstrate and theoretically argue that indefinite kernels can be used or tolerated by kernel methods, as exemplified for the SVM. There exist statements about the training algorithm, the resulting solutions and a reasonable geometric interpretation. This opens up mainly two directions. Firstly, these insights facilitate the process of kernel design, which hitherto has mainly been restricted to positive definite functions. In particular, this enables SVMs to be used widely in other fields like distance-based learning, i.e. in all analysis problems where dissimilarities between objects are available. Secondly, the investigation of the suitability or robustness of kernel methods other than SVMs with respect to indefinite kernels seems very promising.
Contents

1 Introduction
  1.1 Pattern Analysis and Kernel Methods
  1.2 Prior Knowledge by Transformations
  1.3 Main Motivating Questions
  1.4 Structure of the Thesis

2 Background
  2.1 Transformation Knowledge
  2.2 Distances
  2.3 Kernel Methods
  2.4 Support Vector Machines
  2.5 Goals for Invariance in Kernel Methods
  2.6 Related Work

3 Invariant Distance Substitution Kernels
  3.1 Distance Substitution Kernels
  3.2 Definiteness of DS-Kernels
  3.3 Examples of Hilbertian Metrics
  3.4 Symmetrization
  3.5 Choice of Origin O
  3.6 Transformation Knowledge in DS-Kernels
  3.7 Related Work

4 Tangent Distance Kernels
  4.1 Regularized Tangent Distance Measures
  4.2 Definiteness of TD-Kernels
  4.3 Invariance of TD-Kernels
  4.4 Separability Improvement
  4.5 Computational Complexity
  4.6 Related Work

5 Transformation Integration Kernels
  5.1 Partial Haar-Integration Features
  5.2 Transformation Integration Kernels
  5.3 Invariance of TI-Kernels
  5.4 Separability Improvement
  5.5 Computational Complexity
  5.6 Kernel Trick
  5.7 Acceleration
  5.8 Related Work

6 Learning with Indefinite Kernels
  6.1 Feature Space Representation
  6.2 VC-bound
  6.3 Convex Hull Separation in pE Spaces
  6.4 SVM in pE Spaces
  6.5 Uniqueness of Solutions
  6.6 Practical Implications
  6.7 Related Work

7 Experiments - Support Vector Classification
  7.1 General Experimental Setup
    7.1.1 SVM Implementation
    7.1.2 Multiclass Architectures
    7.1.3 Model Selection
  7.2 Invariant Distance Substitution Kernels
    7.2.1 Application of SVM Suitability Indicators
    7.2.2 Comparison to k-NN Classification
    7.2.3 Indefinite versus Positive Definite Kernel Matrix
    7.2.4 Large Scale Experiments
    7.2.5 Summary of DS-Kernel Experiments
  7.3 Tangent Distance Kernels
    7.3.1 USPS Digits
    7.3.2 Micro-Raman Spectra
    7.3.3 Summary of TD-Kernel Experiments
  7.4 Transformation Integration Kernels
    7.4.1 Toy Examples
    7.4.2 USPS Digits
    7.4.3 Summary of TI-Kernel Experiments

8 Summary and Conclusions
  8.1 IDS and TD-Kernels
  8.2 TI-Kernels
  8.3 Indefinite Kernels in SVMs
  8.4 Invariant Kernels versus Invariant Representations
  8.5 Perspectives

A Datasets
  A.1 USPS Digits
  A.2 MNIST Digits
  A.3 Micro-Raman Spectra
  A.4 Kimia
  A.5 Unipen
  A.6 Proteins
  A.7 Cat-Cortex
  A.8 Music-EMD and Music-PTD

B Mathematical Details
  B.1 Invariance of TD-Kernels
  B.2 Relations of Distances and Kernels
  B.3 VC-Bound
  B.4 Derivation of CH Classification
  B.5 CH Primal Optimization Problem
  B.6 Equivalence of CH and SVM
  B.7 Uniqueness of Stationary Points

C Notation

D Abbreviations
Chapter 1

Introduction
In the present chapter, a short illustrative motivation for the thesis is given, targeted at general readers. We comment on the concepts in the title, remaining slightly informal by avoiding precise notions, definitions and formulas, which will follow in the subsequent chapter. The motivating concepts are underlined by some intuitive diagrams. We conclude this introductory part with two central questions and comments on the structure of the thesis.
1.1 Pattern Analysis and Kernel Methods
Our considerations mainly target pattern analysis systems. This notion covers algorithms which are able to derive regularities from given sets of data observations and to perform some kind of prediction on new observations based on the derived regularities [119]. The most common examples are pattern classification systems, by means of which we will explain the general concepts. A corresponding illustration is given in Fig. 1.1.
The typical pattern classification task consists of some type of objects (e.g. handwritten letters as abstract entities) and some finite set of target classes (e.g. the 26 classes of letters). The goal is to have a system that assigns an estimated class label to any formerly unseen object in the best possible way. The lower row in Fig. 1.1 indicates how such a system performs classification: The abstract objects are observed, measured, discretized or preprocessed in some way, resulting in patterns (e.g. sequences of digitized 2D points). The next step is the so-called feature extraction stage, where an arbitrarily structured pattern is converted into a compact vectorial representation of numerical values, the features. This so-called feature vector is a simple object that represents the discriminative information of the original object as well as possible. This feature vector representation can then be fed into a classifier, which is the assignment rule, and results in an estimated class label for each newly observed sample.
In order to produce good estimates, such classifiers need a learning or training stage, indicated in the upper row. This is frequently based on a set of objects with known class labels, a so-called labelled training set. These training objects are subject to the same measurement, preprocessing and feature extraction process as the patterns during later prediction. But instead of prediction, the set of feature vectors and their corresponding labels is used to derive some model of the relation between the feature vectors and their class labels.
Figure 1.1: Illustration of training and prediction stage of a typical pattern classification process. The abstract objects are observed and preprocessed into concrete representations of patterns. From each of these specific patterns a feature vector is extracted, which allows a classification rule to be derived (training) or which can be predicted by the obtained classifier.
This model or hypothesis can then be used to build a classification rule to predict the later unlabelled observations, indicated by the black box in the upper right.
Such analysis systems can be used to solve various learning tasks:
Classification: Learn to assign unknown observations to a finite number of categories, denoted as classes.

Regression: Learn to assign unknown observations a real or vector-valued quantity.

Clustering: Given a set of observations, find a grouping of the points into clusters of similar points.

Novelty Detection: Given a set of observations, decide whether unknown observations are likely to be related to the set of initial observations or not.
The list could be continued, but all of these tasks share a training phase, where some data is given and a model is derived, and a prediction stage, where predictions on known or unknown observations are performed.
These algorithms are of interest in various disciplines. The
derivation of regularities can be seen as a kind of learning, a property commonly understood to be characteristic of intelligent systems. This is why such analysis algorithms are
investigated and applied in research fields ranging from pattern
recognition to machine learning and artificial intelligence. The
target of learning a reasonable hypothesis, i.e. a regularity which
is guaranteed to generalize to unseen examples, can be formalized
by methods from probability theory. Deriving statistical statements
for such pattern analysis algorithms is a topic of statistical
learning theory. For details on this very
interesting field of pattern analysis or pattern recognition, we
recommend the excellent standard textbooks [36, 125, 35].
A particularly successful branch of such analysis methods has developed in the last decade: the so-called kernel methods. These are methods which enable nice geometrical interpretations, statistical generalization statements and efficient implementations, and which yield excellent results on the variety of problem classes mentioned above [130, 112, 119]. The fundamental ingredient in these methods is a so-called kernel function, which basically is a similarity measure between two patterns. The investigation of such kernel methods and, especially, the design of kernel functions will be the main focus of the thesis.
1.2 Prior Knowledge by Transformations
It is reasonable to expect that the performance of a suitable
learner increases with more knowledge of the learning task. The
knowledge of a learning task comprises the set of learning
observations and additional more abstract information, called prior
or a-priori knowledge. Figure 1.2 illustrates this relation
qualitatively. By increasing the sample number and/or the
incorporated prior knowledge, the performance of an artificial or
biological learning system (including humans) is expected to
increase, as indicated in the first and second plot. On the other hand, if a specific performance is desired, it can be reached with fewer examples if more prior knowledge is available. This is not only intuitively plausible, but frequently observed in machine learning. Also, theoretical statements about the importance
of prior knowledge exist, e.g. the no free lunch theorem, the
conclusion of which is summarized in [36, p. 454]: “If the goal is
to obtain good generalization performance, there are no
context-independent or usage-independent reasons to favor one
learning or classification method over another.” A similar
statement for the choice of patterns is given by the ugly duckling
theorem which is similarly summarized [36, p. 458]: “in the absence
of assumptions there is no privileged or ’best’ feature
representation.” So, both state that no generalization is possible
without any prior assumptions in the pattern recognition
chain.
Obtaining a large “number of examples” is often expensive in both time and material. For example, the measurements can be difficult, or the labelling of samples requires human interaction, possibly even by a highly qualified expert. The number of examples is also relevant during the learning process: due to the frequently super-linear time and memory complexity of learning methods, large training sets often result in infeasibly long training times. Thus, incorporating prior knowledge into a learning system by means other than training examples is of utmost importance.
A very frequent type of a-priori knowledge is transformation knowledge, i.e. knowledge about typical variations of single patterns. For instance, it might be known that some modification of an observation keeps its inherent meaning unchanged. In the example of the handwritten letter concept, it is irrelevant how large the individual letter is: changing the size keeps the inherent meaning of the letter unchanged. This insight makes learning easier: instead of requiring instances of all letters in all sizes, merely one instance per letter type is sufficient.
In practice, different “kinds” of transformations can be found and applied in learning tasks. In general, we require that some meaningful transformations of objects are known and can be modelled.
Figure 1.2: Influence of the number of examples and prior knowledge
on the recognition performance. a), b) Improvement of
generalization with increasing number of samples or enhanced prior
knowledge, c) for similar performance, fewer training examples are
required if more knowledge is available.
Here we want to note that we use the notion transformation for arbitrary mappings of objects to objects. We do not require invertibility, bijectivity, etc., as often found in traditional examples like the Fourier, Laplace or wavelet transforms. We will give some simple examples of transformations by
referring to the case of image data, where meaningful variations of a pattern can be formalized explicitly. In Figure 1.3 b)-h) some transformations of a specific image pattern depicted in a) are given. These transformations differ in nature. Most of them can be continuously or even differentiably parameterized, such as the translation b), the rotation c), scaling e), contrast reduction g) and smoothing h). In contrast, reflection d) or morphological operations like erosion f) require a discrete parameterization.
Some of the transformations can be modelled as so-called transformation groups, or Lie groups [64] in the case of differentiable transformations. This implies in particular that the transformations can be reversed. Examples are most of the given transformations if the image is assumed to be embedded in the real plane. But the requirement that transformations stem from a group is too strict in many practical cases. Firstly, in some applications only parts of transformation groups will be reasonable. For instance, in optical character recognition (OCR) only small rotations are acceptable for maintaining a letter's class: large rotations will confuse M/W, N/Z or 6/9. Secondly, non-reversible transformations can occur, which consequently cannot be exactly modelled by transformation groups. An example is the line thinning operation by morphological erosion in f), which will produce a uniformly white image for any non-empty input pattern. In general, reversibility is violated as soon as different objects are transformed to one common pattern by a particular transformation, which implies that information is lost. For instance, large enough non-cyclic translations will result in the zero image for any input image if incoming pixels are set to zero and non-zero image pixels are lost. Other transformations operating on image matrices, like deleting single pixels, rows or columns, are also non-reversible lossy transformations, which might still be reasonable for modelling certain noise, squeezing or general elastic deformations of an image.
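To make these kinds of transformations concrete, the following is a minimal sketch in Python of how transformations like those in Fig. 1.3 b)-h) could be applied to an image array. It assumes NumPy and SciPy are available; the toy pattern and parameter values are purely illustrative and not taken from the experiments of this thesis.

    import numpy as np
    from scipy import ndimage

    # A toy grey-value image pattern (purely illustrative).
    x = np.zeros((28, 28))
    x[8:20, 12:16] = 1.0

    # Continuously / differentiably parameterizable transformations:
    x_translated = ndimage.shift(x, shift=(2, -3))              # b) translation
    x_rotated    = ndimage.rotate(x, angle=10, reshape=False)   # c) rotation
    x_scaled     = ndimage.zoom(x, zoom=1.2)                    # e) scaling (changes the array size)

    # Discretely parameterized or non-reversible transformations:
    x_reflected  = x[:, ::-1]                                   # d) reflection
    x_eroded     = ndimage.grey_erosion(x, size=(3, 3))         # f) erosion (lossy line thinning)

    # A large non-cyclic translation loses information irreversibly:
    x_lost = ndimage.shift(x, shift=(0, 25))                    # incoming pixels set to zero
    print(x_lost.sum())                                         # nearly zero: the pattern is gone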
This variety of transformations is not restricted to image data. For instance, in string processing the deletion of a single character is also a non-reversible transformation, but represents a meaningful transformation, as the resulting string has a small edit distance to the original. In the present work we will use both real-world image and non-image data for demonstrating the applicability of our approaches.
Figure 1.3: Illustration of image transformations. a) Original
pattern, b) translation, c) rotation, d) reflection, e) scaling, f)
erosion, g) contrast reduction, h) smoothing.
These different natures of the transformations, depending on the application, will require different means for being incorporated in kernel methods and kernel functions.
1.3 Main Motivating Questions
The starting point for the thesis is the observation that
transformation knowledge in kernel methods is established mainly
for either finitely many, small transformations, i.e. local
invariance with respect to these transformations [34, 109], or
global invariances by invariant feature vectors and subsequent
application of common kernels [18, 98]. The whole spectrum of more
general transformation knowledge as indicated in the previous
section has not been properly investigated. The main questions
therefore are:
• How can infinite or finite, group or non-group, continuous or discrete transformations be incorporated in kernel functions?
• What are practical and theoretical properties and consequences of
the resulting kernels?
The relevance of the study will mainly be to obtain building blocks
for invariance in kernel methods by the proposed kernel functions.
Further, a unification of global invariant feature extraction and
local invariance in machine learning is obtained. A smooth
transition between the non-invariant, locally invariant and totally
invariant cases will be possible. New theoretical investigations in
the field of indefinite kernels in machine learning will be
required and performed. In particular, this promises wide
application and theoretical perspectives for general kernel
methods.
1.4 Structure of the Thesis
After this introduction, Chapter 2 starts with the required background for the present work. Basic notions and common aspects are formally introduced, on which the remaining exposition builds. The first contribution of this thesis is a part consisting of
the Chapters 3 to 5, which propose kernel functions that include transformation knowledge. The kernels will be based on two generic approaches, namely distance measures and integration. Only
experiments for enhancing the understanding of the concepts are
presented in these first chapters on kernels, whereas real world
performance analysis is shifted to the later experimental Chapter
7. Chapter 3 assumes that suitable distance measures are given,
from which corresponding distance substitution kernel functions are
constructed as presented in [49]. By involving invariant distances,
this results in invariant distance substitution kernels. Chapter 4
focusses on a special instance of the kernels from the preceding
chapter. It assumes that the invariance is explicitly given as
transformations of vector valued patterns. If these transformations
are continuous and differentiable, tangent distance kernels can be
efficiently constructed. These have been presented in [51]. The
second kind of generic approach is presented in Chap. 5, where the
technique of integration over transformations, as used for
invariant features, is applied to kernel functions. By this, one
can construct kernels, which capture wanted invariances. The
presentation is an extension of the one in [52] to arbitrary
continuous, discrete, group or non-group transformations.
Some of the given approaches lead to kernels which are not positive definite. This property is commonly required for the application of kernels in methods like SVMs. Chapter 6 therefore investigates the effects of indefinite kernels. It concludes that the common geometric feature space interpretation can be extended to these cases and that common algorithms like SVMs can be used with indefinite kernels. However, some care has to be taken, and corresponding suitability criteria are presented. This has mainly been published in [48].
The last part of the thesis, consisting of Chapter 7, presents the application and experimental evaluation of the presented kernels. The flexibility, applicability, but also limitations of the approaches are demonstrated by experiments on a variety of problems, including well-known benchmark settings and data from current research projects. The experiments also include comparisons with state-of-the-art methods. Chapter 8 summarizes and comments on the main findings of the present work and gives perspectives for possible further research. The Appendix provides information on the used datasets, mathematical details which are left out of the main text for better readability, a list of general notation and frequent abbreviations.
Chapter 2
Background
This chapter introduces the background on which the thesis is
based. This consists of the formalization of transformation
knowledge and invariance as illustrated in the introduction, the
required notation for kernel methods in general and in particular
one instance, the support vector machine. We list the detailed
goals for the approaches to be developed in the sequel. For a
collected list of notation and symbols we refer to the table in
Appendix C.
2.1 Transformation Knowledge
The basic starting point in pattern analysis is the representation of the objects under investigation. As indicated in the pattern analysis chain in Fig. 1.1, the abstract objects are not accessible directly, but mainly through certain representations resulting from observations or higher-level post-processing, feature extraction etc. The resulting representations will be called patterns:
Definition 1 (Pattern). We denote an element x from a pattern space
X as a pattern. Patterns play the role of the observations of
objects which are to be processed by pattern analysis
techniques.
A pattern is in particular not required to be vectorial; it can be an arbitrary object. If it happens to be a vectorial representation, it will be denoted boldface x ∈ X, as done for general vectorial variables.
We make a structural assumption on the generation of the patterns. We restrict ourselves to the knowledge of a set of transformed patterns for each sample. The transformation knowledge is the assumption that these patterns have the same or a similar meaning as the original pattern itself. So, replacing an individual point by one of its transformed patterns should keep the output of the analysis task roughly unchanged. As argued in the preceding chapter, these transformations can be of different kinds, which are all covered by the following formalization:
Definition 2 (Transformation Knowledge). We assume to have a set T of transformations t : X → X which define a set of transformed patterns Tx := {t(x) | t ∈ T} ⊂ X for any x. These patterns are assumed to have an identical or similar inherent meaning as the pattern x itself.
At this point we do not put any further assumptions on T. In particular, we do not assume an explicit parameterization of these sets, nor do we assume that they are finite. Neither do we require specific relations between the sets Tx, Tx′ of different patterns. They may be disjoint, may be equal or may intersect. Of course, for computationally dealing with these sets, one must assume enumerability of the sets, characterization of the sets by constraints or an explicit parameterization of the transformations. If such a parameterization with a real-valued parameter vector p exists, we denote the transformations as t(x, p).
A traditional way in pattern analysis of involving such transformation knowledge is to perform a preprocessing step by mapping the objects into an invariant representation in some real vector space H. Instead of working on the original patterns, the transformed samples are taken as a basis for investigation. This step was denoted feature extraction in Fig. 1.1.
Definition 3 (Invariant Function, Single Argument). We call a
function f : X → H invariant, if for all patterns x and all
transformations t ∈ T holds f(x) = f(t(x)). The vector f(x) is then
called an invariant representation or invariant feature vector of
x.
To emphasize the invariance of an arbitrary function f(x), we will occasionally denote it by I(x). In traditional invariant pattern analysis, this notion is used for transformation groups T = G, cf. [116, 115, 139]. In this case the pattern space X is nicely partitioned into equivalence classes Tx, which correspond to the orbits of the patterns under the group action. Then, invariants are exactly those functions which are constant on each equivalence class, cf. Fig. 2.1 a). Various methods for constructing such invariant features are known, e.g. normalization approaches like moments [20], averaging methods [114] or differential approaches. For a general overview of invariance in pattern recognition and computer vision we refer to [18, 85]. This definition of invariance, however, is also meaningful for equivalence classes not induced by groups.
Pattern analysis targets can often be modelled as functions of several input objects, for instance the training data and the data for which predictions are required. For such functions, different notions of invariance can be given, each with its own practical relevance. For distinguishing between these different definitions, we introduce discriminating extensions of Def. 3, which will be used throughout the thesis.
Definition 4 (Invariant Function, Several Arguments). We call a function f : X^n → H

i) simultaneously invariant, if for all patterns x1, . . . , xn ∈ X and transformations t ∈ T holds

f(x1, . . . , xn) = f(t(x1), . . . , t(xn)).

ii) totally invariant, if for all patterns x1, . . . , xn ∈ X and transformations t1, . . . , tn ∈ T holds

f(x1, . . . , xn) = f(t1(x1), . . . , tn(xn)).
Obviously, for the case of a function with a single argument, both definitions correspond to the invariance according to Def. 3. The first notion i) is used in [116] for polynomial functions under group transformations. In general, this is a common understanding of invariance: the function does not change if the whole space X is globally transformed, i.e. all inputs are transformed simultaneously with an identical
transformation, cf. Fig. 2.1 b). For example, the Euclidean distance is called translation invariant and the standard inner product rotation invariant [132, 112]. From a practical viewpoint, this type of invariance is useful, as it guarantees that the function is independent of the global constellation of the data. By this it is unaffected, e.g., by changes of the experimental setup: a simultaneously translation-invariant system can operate on data without preprocessing like centering, a simultaneously scale-invariant system will produce the same output on differently scaled datasets, making a uniform scale normalization superfluous, etc. So, these transformations can be ignored in the subsequent analysis chain.
This notion, however, does not capture the transformation knowledge as given above: it only guarantees that the function remains constant under a global transformation of the whole input space. However, if we only translate/rotate one of the several patterns, the Euclidean distance and the inner product will in general change. Therefore, we introduce the notion ii) of total invariance to denote functions which are guaranteed to maintain their value if any single argument is (or equivalently all simultaneously are) transformed independently, cf. Fig. 2.1 c). Note that this is equivalent to the statement that they are invariant as functions of one argument, fixing the remaining ones arbitrarily. The total invariance ii) implies the simultaneous invariance i).
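The distinction between i) and ii) can be checked numerically. The following toy sketch (with a translation playing the role of the transformation) illustrates that the Euclidean distance is simultaneously invariant but not totally invariant:

    import numpy as np

    d = lambda x, xp: np.linalg.norm(x - xp)   # Euclidean distance
    t = lambda z: z + 3.0                      # a translation as transformation
    x, xp = np.random.rand(5), np.random.rand(5)

    # i) simultaneous invariance: both arguments transformed identically.
    print(np.isclose(d(x, xp), d(t(x), t(xp))))   # True

    # ii) no total invariance: transforming only one argument changes the value.
    print(np.isclose(d(x, xp), d(x, t(xp))))      # in general False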
Further variations of invariance exist in invariant theory, such as
relative versus absolute invariance, covariance, semi-invariance,
etc. [116]. These notions, however, are not relevant in the
sequel.
Note that the requirement of invariance is sometimes too strict for practical problems. The points within Tx are sometimes not to be regarded as identical, but only as similar, where the similarity can even vary over Tx. Such approximate invariance is called transformation tolerance in [139], quasi-invariance in [12] or additive invariance in [16]. A well-known and intuitive example is optical character recognition (OCR): the sets Tx of 'similar' patterns might be defined as rotations of the pattern x by small rotation angles. Exact invariance is not wanted with respect to these transformations, for an invariant function is not only invariant with respect to T, but, due to transitivity, invariant to the larger set T̄ of finite compositions of transformations from T, i.e. T̄_0 := T, T̄_{i+1} := T ∘ T̄_i and T̄ := ∪_{i=0}^∞ T̄_i. The relation x ∼ x′ :⇔ T̄x ⊂ T̄x′ or T̄x′ ⊂ T̄x is exactly the desired equivalence relation. Intuitively this means that in OCR an invariant function f will not only be constant on the set of small rotations of a pattern, but by transitivity it must be constant for all rotations. As indicated in the introduction, this results in the letters M/W, Z/N, 6/9 etc. not being discriminable.
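The transitivity argument can be illustrated with plain rotation matrices; the following small sketch (angles chosen arbitrarily) shows how compositions of small rotations from T reach arbitrary rotations in the closure T̄:

    import numpy as np

    def R(angle_deg):
        # 2D rotation matrix; rotations by 5 degrees play the role of T.
        a = np.deg2rad(angle_deg)
        return np.array([[np.cos(a), -np.sin(a)],
                         [np.sin(a),  np.cos(a)]])

    # Composing 36 small rotations from T yields a 180-degree rotation,
    # an element of the closure T-bar but not of T itself.
    composed = np.linalg.multi_dot([R(5.0)] * 36)
    print(np.allclose(composed, R(180.0)))   # True

    # Hence a function that is exactly invariant under T is, by transitivity,
    # constant on all multiples of 5 degrees and cannot separate e.g. 6 from 9.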
We want to cover both the exact invariance and these transformation
tolerant cases. Therefore, the kernels proposed in the thesis are
in general not precisely invariant, but capture the similarity of
the patterns along Tx. In all cases where exact invariance is
wanted and the sets Tx form a partition of X , the proposed tools
will result in totally invariant functions.
2.2 Distances
Some constructions in the sequel will be based on suitable dissimilarity measures and are in principle also applicable to arbitrary non-invariant proximity data.
Figure 2.1: Illustration of different notions of invariance. a) Invariance of a function with a single argument, b) simultaneous invariance of a function with multiple arguments, c) total invariance of a function with multiple arguments.
So, we continue by settling the basic notions related to this. In the present work, we call arbitrary functions which represent some measure of proximity or dissimilarity of patterns dissimilarity functions. In contrast to this unspecified notion, we take the following definition of a distance function.
Definition 5 (Distance, Distance Matrix). A function d : X × X → R will be called a distance function if it is nonnegative and has zero diagonal, i.e. d(x, x′) ≥ 0 and d(x, x) = 0 for all x, x′ ∈ X. Given a set of observations xi ∈ X, i = 1, . . . , n, the matrix D := (d(xi, xj))_{i,j=1}^n is called the distance matrix.
If a dissimilarity function d does not satisfy these conditions, it can easily be given a zero diagonal by d(x, x′) := d(x, x′) − (d(x, x) + d(x′, x′))/2 and made nonnegative by d(x, x′) := |d(x, x′)|. In particular, this definition is more general than a metric, as we allow non-symmetric distances. We further do not require the triangle inequality to be satisfied, and we allow d(x, x′) = 0 for patterns x ≠ x′.
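For illustration, a minimal sketch of these two repair steps and of the distance matrix from Def. 5; the dissimilarity used here is an arbitrary made-up example, not one of the measures studied later.

    import numpy as np

    def repaired_distance(d):
        # Turn an arbitrary dissimilarity d into a distance in the sense of
        # Def. 5: first enforce a zero diagonal, then nonnegativity.
        def d_zero(x, xp):
            return d(x, xp) - 0.5 * (d(x, x) + d(xp, xp))
        return lambda x, xp: abs(d_zero(x, xp))

    def distance_matrix(d, X):
        # D = (d(x_i, x_j)) for the observations X = [x_1, ..., x_n].
        n = len(X)
        return np.array([[d(X[i], X[j]) for j in range(n)] for i in range(n)])

    # A made-up dissimilarity: non-symmetric, with a nonzero diagonal.
    diss = lambda x, xp: float(np.sum((x - xp) ** 2) + 0.3 * np.sum(x))
    X = [np.random.rand(4) for _ in range(5)]
    D = distance_matrix(repaired_distance(diss), X)
    print(np.allclose(np.diag(D), 0.0), (D >= 0).all())   # True True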
The reason for fixing such a definition is that the notion of distance is treated quite diversely in the literature. Different assumptions and requirements for a notion of distance have been discussed. At least 12 systematically differing notions of proximities can be found in [31]. The possibilities range from the strong structure S1 (Euclidean distance) to the highly unstructured S12, where the dissimilarity has no quantitative meaning except dividing the set into equivalence classes of similar objects. Our definition above refers to the structure denoted S4. If d additionally is symmetric, the corresponding structure is S3. Further categorizations of dissimilarity measures exist, e.g. [136].
In general, different choices of the distance model imply that more or less mathematical theory is available for the analysis. For instance, restricting to metrics would make the mathematical background of well-studied metric spaces available. Such strict assumptions are, however, often not satisfied by the experimental measurements which are to be analyzed. Hence, the reason for our notion above is mainly practical: when constructing dissimilarity measures in practice, it is hard to incorporate special prior knowledge while satisfying strict conditions. When investigating given dissimilarities, the data source frequently does not care about nice mathematical properties. Consequently, various existing established dissimilarity measures do not satisfy certain conditions. For instance, it has already been stated in [90] that many dissimilarity measures in pattern analysis and computer vision turn out to be non-metric. In psychometric experiments, non-symmetric dissimilarities are frequently observed. In particular, the former will be exemplified in the sequel.
Special distances which can be interpreted as an L2-norm, i.e. the metric of a Hilbert space, will play a role in the sequel, so we define, similar to [57]:

Definition 6 (Hilbertian Metric, Euclidean Distance Matrix). A distance function d on the pattern space X is called a Hilbertian metric if the data can be embedded in a real Hilbert space H by Φ : X → H such that d(x, x′) = ‖Φ(x) − Φ(x′)‖. A matrix D is called a Euclidean distance matrix if it is a distance matrix of a Hilbertian metric.
Note that these distances necessarily are semi-metrics, i.e. symmetric and satisfying the triangle inequality. A Hilbertian metric d, however, does not have to be a metric, as different patterns x ≠ x′ can still result in d(x, x′) = 0. But in this case, those patterns will in general not be discriminable, as d(x, x′′) = d(x′, x′′) holds for all other points x′′, due to the triangle inequality and the fact that x and x′ must be mapped to the same point in any embedding Hilbert space. The motivation for the latter notion of a Euclidean distance matrix is that in the case of finite data, the span of these finitely many points in any embedding real Hilbert space is finite dimensional, hence isomorphic to a real Euclidean space.
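Whether a given distance matrix is a Euclidean distance matrix can be checked numerically with the classical scaling construction: D is Euclidean exactly when the doubly centered matrix of squared distances is positive semi-definite. A minimal sketch of this standard test (not specific to this thesis):

    import numpy as np

    def is_euclidean_distance_matrix(D, tol=1e-10):
        # D is a Euclidean distance matrix (Def. 6) iff the doubly centered
        # matrix B = -1/2 * J (D*D) J is positive semi-definite,
        # where J = I - (1/n) 1 1^T (classical scaling / Torgerson).
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (D ** 2) @ J
        return np.linalg.eigvalsh((B + B.T) / 2).min() >= -tol

    X = np.random.rand(10, 3)
    D_euc = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    print(is_euclidean_distance_matrix(D_euc))        # True
    print(is_euclidean_distance_matrix(D_euc ** 2))   # typically False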
2.3 Kernel Methods
From dissimilarities, we turn our focus to the class of similarity measures, the so-called kernels, which are the fundamental ingredient in kernel methods, cf. [112]. We restrict ourselves to real-valued kernels, as the resulting applications only require those, though complex-valued definitions are possible, cf. [11].
Definition 7 (Kernel, Kernel Matrix). A function k : X × X → R which is symmetric is called a kernel. Given a set of observations xi ∈ X, i = 1, . . . , n, the matrix K := (k(xi, xj))_{i,j=1}^n is called the kernel matrix.
We emphasize that this use of the notion kernel is wider than is frequent in the literature, which often requires positive definiteness, defined below. As we will also discuss the use of non-positive-definite functions, we extend the notion kernel to cover these cases as well.
Definition 8 (Definiteness). A kernel k is called positive definite (pd) if for all n and all sets of data points (xi)_{i=1}^n ⊂ X^n the kernel matrix K is positive semi-definite, i.e. for all vectors v ∈ R^n holds v^T K v ≥ 0. If this is only satisfied for those v with 1_n^T v = 0, then k is called conditionally positive definite (cpd). A kernel is indefinite if a kernel matrix K exists which is indefinite, i.e. vectors v and v′ exist with v^T K v > 0 and v′^T K v′ < 0.
The cpd kernels are related to pd kernels [112, Prop. 2.22]: for any choice of origin O ∈ X, the kernel k is cpd if and only if k̃ is pd, where k̃ is given as

k̃(x, x′) = k(x, x′) − k(x, O) − k(O, x′) + k(O, O).     (2.1)
We denote some particular inner-product- and distance-based kernels by

klin(x, x′) := ⟨x, x′⟩                      knd(x, x′) := −‖x − x′‖^β,   β ∈ [0, 2]
kpol(x, x′) := (1 + γ ⟨x, x′⟩)^p            krbf(x, x′) := e^{−γ ‖x − x′‖²},   p ∈ N, γ ∈ R₊.
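A minimal numerical sketch of these four kernels, of the definiteness notions of Def. 8 and of relation (2.1); the random toy data and tolerance values are chosen arbitrarily for illustration:

    import numpy as np

    def k_lin(x, xp):                 return float(np.dot(x, xp))
    def k_nd(x, xp, beta=1.0):        return -float(np.linalg.norm(x - xp) ** beta)
    def k_pol(x, xp, gamma=1.0, p=3): return float((1.0 + gamma * np.dot(x, xp)) ** p)
    def k_rbf(x, xp, gamma=1.0):      return float(np.exp(-gamma * np.linalg.norm(x - xp) ** 2))

    def kernel_matrix(k, X):
        n = len(X)
        return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

    def min_eig(K):
        return np.linalg.eigvalsh((K + K.T) / 2).min()

    X = [np.random.rand(4) for _ in range(20)]

    # klin, kpol, krbf are pd: their kernel matrices are positive semi-definite.
    for k in (k_lin, k_pol, k_rbf):
        print(k.__name__, min_eig(kernel_matrix(k, X)) >= -1e-10)   # True

    # knd is only cpd: its kernel matrix is in general indefinite, but the
    # kernel centered at an origin O as in (2.1) yields a psd matrix.
    K = kernel_matrix(k_nd, X)
    O = 0                                              # index of the origin O
    K_tilde = K - K[:, [O]] - K[[O], :] + K[O, O]
    print(min_eig(K) < 0, min_eig(K_tilde) >= -1e-10)  # True True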
Here, the linear, polynomial and Gaussian radial basis function (rbf) kernels are pd for the given parameter ranges. The negative distance kernel is cpd, which is completely sufficient for application in certain kernel methods such as SVMs, cf. [11, 107].

We now come to the main field of application of such kernel functions in pattern analysis, the so-called kernel methods, cf. [119, 112] and references therein for details on the notions and concepts. In general, a kernel method is a nonlinear data analysis method for patterns x from some set X, which is obtained by application of the kernel trick to a given linear method: assume some linear analysis method operating on vectors x from some Hilbert space H, which accesses patterns x only in terms of inner products ⟨x, x′⟩. Examples of such methods are principal component analysis (PCA), linear classifiers like the perceptron or the Fisher linear discriminant, etc. If we assume some nonlinear mapping Φ : X → H, the linear method can be applied to the images Φ(x) as long as the inner products ⟨Φ(x), Φ(x′)⟩ are available. This results in a nonlinear analysis method on the original space X. The kernel trick now consists in replacing these inner products by a kernel function k(x, x′) := ⟨Φ(x), Φ(x′)⟩: as soon as the kernel function k is known, the Hilbert space H and the particular embedding Φ are no longer required. For a suitable choice of kernel function k, one obtains methods which are very expressive due to the nonlinearity and cheap to compute, as explicit embeddings are omitted.
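As an illustration of the kernel trick, the following sketch kernelizes the perceptron mentioned above: the algorithm is written purely in terms of kernel evaluations, so any kernel matrix K can be plugged in. This is a generic textbook variant for illustration only, not a method proposed in this thesis.

    import numpy as np

    def kernel_perceptron_train(K, y, epochs=10):
        # Kernelized perceptron: the linear perceptron accesses data only via
        # inner products, so these are replaced by the kernel matrix K.
        # Returns dual coefficients alpha with f(x) = sum_j alpha_j y_j k(x_j, x).
        n = len(y)
        alpha = np.zeros(n)
        for _ in range(epochs):
            for i in range(n):
                if y[i] * np.dot(alpha * y, K[:, i]) <= 0:   # misclassified
                    alpha[i] += 1.0                          # mistake-driven update
        return alpha

    def kernel_perceptron_predict(alpha, y, K_test):
        # K_test[j, i] = k(x_j, x_test_i) for the training patterns x_j.
        return np.sign((alpha * y) @ K_test)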
The question whether a given kernel allows a representation in a Hilbert space as k(x, x′) = ⟨Φ(x), Φ(x′)⟩ is, interestingly, completely characterized by the positive definiteness of the kernel. Various methods of explicit feature space construction can be given. Theoretically most relevant is the so-called reproducing kernel Hilbert space (RKHS), which enables the embedding of the whole space X into a Hilbert space of functions. Practically most intuitive is the empirical kernel PCA map, which performs an embedding of given data points into a finite-dimensional Euclidean vector space. A modification of this will be applied in Chap. 6, where embeddings of indefinite kernels will be the basis for further investigation.
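A minimal sketch of such an empirical kernel PCA map for a psd kernel matrix; the function name and the tolerance are illustrative, and the indefinite-kernel variant of Chap. 6 is not reproduced here.

    import numpy as np

    def empirical_kernel_pca_map(K, dim=None, tol=1e-10):
        # Embed the n given patterns into a Euclidean space such that ordinary
        # inner products of the embedded points reproduce the centered kernel.
        n = K.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
        Kc = J @ K @ J
        eigvals, eigvecs = np.linalg.eigh((Kc + Kc.T) / 2)
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        keep = eigvals > tol if dim is None else slice(0, dim)
        # Rows are the embedded feature vectors of the n input patterns.
        return eigvecs[:, keep] * np.sqrt(np.clip(eigvals[keep], 0.0, None))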
By this kernel trick, various kernel methods have been developed and successfully applied in the last decade. This enables a wide variety of analysis algorithms, once a suitable kernel is chosen for the data. In addition to the flexibility of analysis algorithms, a multitude of kernel functions is also being developed. In contrast to many traditional methods, these are not restricted to vectorial representations of the objects. Instead, there are kernels for a wide range of structured or unstructured data types, e.g. general discrete structures [54], data sequences [135], strings [77], weighted automata [30], dynamical systems [124] etc.
This modularity in the choice of kernel function and the choice of analysis method is a major feature of kernel methods. Additionally, it emphasizes the importance of the kernel choice: the only view on the data which the analysis algorithm obtains is the kernel matrix K, i.e. the kernel evaluated on all pairs of input objects. Therefore, the kernel matrix is very reasonably denoted an information bottleneck in any such analysis system [119]. Hence, a good solution for any analysis task will require a well-designed kernel function.
2.4 Support Vector Machines
The most popular representative of kernel methods is the support vector machine (SVM) for classification problems. This will also be the main method to which the proposed kernels are applied in Chap. 7. In recent years SVMs have been established as methods of first choice for various learning problems in many fields of application, cf. [33, 119]. There are several reasons for their success. The main arguments for practitioners are existing fast implementations like SVMLIGHT [63], LIBSVM [21], LIBSVMTL [97] and the general ease of use, as in SVMs only few architectural decisions have to be taken: only a positive definite kernel function and some parameters have to be provided. Even the choice of these few parameters can be automatized by model selection strategies, e.g. [26].
Very important arguments for theoreticians are the foundations in statistical learning theory and the clear, intuitive geometric interpretation [129]. SVMs are hyperplane classifiers in implicitly defined Hilbert spaces. They perform optimal separation of patterns by margin maximization. This is the basis for general understanding, adequate practical application and improvements. However, this learning-theoretic and geometric interpretation is only available in the case of conditionally positive definite (cpd) kernel functions; it will be extended to the non-cpd case by this thesis in Chap. 6.
The formal derivation of the SVM training and classification
methodology can be found in almost every tutorial and textbook on
SVMs, e.g. [129, 15, 112]. It starts with the geometrical
motivation of large margin separation with linear decision lines,
derives the dual optimization problem resulting from the intuitive
primal problem, introduces slack-variables for tolerating outliers
and performs the kernelization step resulting in the powerful
nonlinear SVM. So, we refrain from reproducing these detailed
derivations and
state the final optimization problem corresponding to the so called
C-SVM. Assuming training data (xi, yi) ∈ X × {±1} for i = 1, . . .
, n and a kernel function k, the usual SVM classification approach
solves the dual optimization problem which we will refer to as
(SVM-DU)

max over α1, . . . , αn of  ∑_i αi − (1/2) ∑_{i,j} αi αj yi yj k(xi, xj)     (2.2)

subject to 0 ≤ αi ≤ C for i = 1, . . . , n, and ∑_i αi yi = 0.
Here C > 0 is a factor to be chosen in advance penalizing data fitting errors; in particular, C = ∞ results in the hard margin SVM, which tolerates no outliers. Lowering C increases the tolerance against margin violations and results in the soft margin SVM. The classification of new patterns x is then based on the sign of
f(x) = ∑_i αi yi k(xi, x) + b,     (2.3)
where b is determined such that f has identical absolute values on
unbounded SVs, e.g. by choosing xk and xl of opposite classes yk =
+1 and yl = −1 with 0 < αk, αl < C and setting
b := −(1/2) ∑_i αi yi (k(xi, xk) + k(xi, xl)).     (2.4)
The interpretation of SVMs as separation of convex hulls [32, 10] even suggests that the value b could be chosen slightly differently in the soft margin case.
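For concreteness, the following sketch shows the C-SVM (2.2)-(2.4) applied with a precomputed kernel matrix, which is the route by which arbitrary, e.g. distance-based, kernels can be supplied. It assumes that scikit-learn is available and uses its SVC interface as an illustration; it is not the SVMLIGHT/LIBSVM/LIBSVMTL setup used in the experiments of Chap. 7.

    import numpy as np
    from sklearn.svm import SVC

    def rbf_kernel_matrix(A, B, gamma=0.5):
        # krbf evaluated between the rows of A and the rows of B.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    # Toy training data (x_i, y_i) with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(40, 2))
    y_train = np.sign(X_train[:, 0] * X_train[:, 1])
    X_test = rng.normal(size=(10, 2))

    # C-SVM on a precomputed kernel matrix: any symmetric kernel, e.g. a
    # distance-based one, can be supplied this way instead of the built-in rbf.
    clf = SVC(C=10.0, kernel="precomputed")
    clf.fit(rbf_kernel_matrix(X_train, X_train), y_train)
    y_pred = clf.predict(rbf_kernel_matrix(X_test, X_train))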
2.5 Goals for Invariance in Kernel Methods
SVMs are regarded as discriminative approaches, in contrast to generative approaches. The latter involve structural knowledge about the generative origin of the patterns. A reasonable goal is stated in [59]: “An ideal classifier should combine these two complementary approaches.” This also holds for other discriminative kernel methods. As transformation knowledge can be seen as a kind of generative knowledge, the approaches described in the thesis can be seen as combining generative and discriminative ideas, if used in discriminative analysis methods. We briefly list the desired properties which will be addressed and satisfied by the proposed approaches.
1. Various Kernel Methods: The approaches should not be
learning-task-specific, especially not restricted to SVMs or other
optimization-based techniques. In principle, the approaches should
support arbitrary analysis methods implicitly working in the
kernel-induced feature space.
2. Various Kernel Types: The approaches should be applicable to various kernel functions. In particular, arbitrary distance-based or inner-product-based kernels should be supported, in contrast to methods which are only used with the Gaussian rbf.
3. Various Transformations: The methods should allow modelling both infinite and finite sets of transformations. The transformations should comprise group and non-group transformations, discrete and continuous ones. The modelled set T should possibly consist of exponentially many combinations of such base transformations.
4. Adjustable Invariance: The extent of the invariance should be explicitly adjustable from the non-invariant to the totally invariant case. In cases where the sets Tx form a partition of the pattern space, exact invariance should be possible.
5. Separability: Invariance alone is not a desirable property; e.g. any constant function is invariant, but cannot reasonably be used for pattern analysis. Equally important is the separability of the methods, i.e. the ability to discriminate between patterns. A corresponding improvement must be demonstrated.
6. Complexity: The computation of the kernels should be efficient, in particular cheaper than template matching, i.e. generating and comparing all transformed samples. The proposed methods should ideally not compute all transformed patterns explicitly, but rather avoid large parts of them. Additional acceleration by precomputation of certain pattern-specific quantities and by reuse of preceding results should be possible. The methods should result in adequately sized models.
7. Applicability: The methods should be applicable to real-world problems with respect to both computational demands and generalization performance. At least standard benchmark datasets must be treatable. The methods should compete with or outperform existing approaches.
The first goal will clearly be satisfied if we focus on the construction of kernel functions. Practical applications, however, will in the sequel only be demonstrated for SVMs. In Fig. 2.2 we indicate a general motivation for considering invariant kernels. They nicely interpolate between the well-known method of invariant features applied in ordinary kernels and the other extreme of template matching. Invariant kernels therefore conceptually cover but also extend these traditional methods. The main difference between the approaches lies in the distribution of the computational load, indicated by the area of the blue rectangles. The kernel computation process can be divided into a stage of sample-wise precomputations, e.g. the invariant features of two patterns, followed by the kernel evaluation, which uses the precomputed quantities. On the left it can be seen that the main contribution is the expensive computation of invariant features, whereas the kernel evaluation is quite cheap, usually requiring only a few arithmetic operations for the simple standard kernels. In the case of template matching on the right, few precomputations are possible; the computational load is concentrated on the kernel evaluation, which involves matching all transformed patterns against each other. Invariant kernels between these extremes allow some sample-wise precomputation of (possibly non-invariant) quantities and require somewhat more involved kernel evaluations than the simple kernels.
A more detailed discussion of the benefits of invariant kernel
functions and their relation to invariant features will follow in
the concluding Chapter 8.
2.6 Related Work
Few publications on invariance in kernel methods existed at the initial phase of the thesis, but meanwhile further methods have been proposed by other parties. Some of these methods are conceptually very interesting; however, none of them meets
Figure 2.2: Qualitative complexity comparison of invariant kernels
covering the whole spectrum from invariant feature extraction to
template matching. The computational weight is shifted from
sample-wise precomputations to the kernel evaluation.
all of the requirements listed in the previous section. Overviews of different techniques for combining transformation invariances with SV-learning are given, e.g., in [34, 112]. We give details of two methods described there, namely the virtual support vector and the jittering kernels method, as they are the most widely accepted and are therefore used for comparisons in the sequel. Further methods are briefly mentioned. In the following chapters, separate sections will be devoted to additional specific related work.
Virtual Support Vector Method: The virtual support vector (VSV) method modifies the training set. The starting point is the idea of generating virtual training data by applying a finite number n_T of transformations T = {t_i}_{i=1}^{n_T} to the training points and training on this extended set. This approach can be traced back to [87, 92].
As the size of this extended set is a multiple of the original one, training is computationally demanding, in particular if the training time scales polynomially with the number of training samples, as with SVMs. To reduce this burden, the VSV method proceeds in two steps: First, an ordinary SVM training is performed on the original training set and the set of resulting SVs, which usually has a largely reduced size, is extracted. These SVs are then multiplied by small transformations, and finally a second SVM training is performed on this extended set of samples.
The advantage of training set modification is that no kernel modification is required. All standard kernels can be applied; in particular, positive definiteness is guaranteed. This is the reason why the VSV method is the most widely accepted method for invariances in SVMs.
Problems are the two training stages and a highly increased number of SVs after the second stage, which leads to longer training and classification times [109]. Additionally, this method is only applicable for finitely many transformed samples.
The independent combination of different transformations results in an exponentially growing number of virtual SVs, which is prohibitive for several simultaneous transformations. So, this method has some problems with the goals 1, 3 and 6.
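For illustration only, a minimal sketch of the two-stage VSV procedure follows; it uses scikit-learn's SVC as a stand-in SVM solver, and transformations is an assumed list of small pattern transformations (e.g. image shifts). It is not the implementation used in the thesis.

    # Illustrative sketch of the two-stage VSV procedure (assumed setup, not thesis code).
    import numpy as np
    from sklearn.svm import SVC

    def vsv_train(X, y, transformations, C=10.0, gamma=0.02):
        # Stage 1: ordinary SVM training on the original data.
        svm1 = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
        X_sv, y_sv = X[svm1.support_], y[svm1.support_]   # extracted support vectors
        # Generate virtual SVs by applying each small transformation to the SVs only.
        X_virt = np.vstack([X_sv] + [np.array([t(x) for x in X_sv]) for t in transformations])
        y_virt = np.concatenate([y_sv] * (1 + len(transformations)))
        # Stage 2: second SVM training on the extended set of (virtual) SVs.
        return SVC(C=C, kernel="rbf", gamma=gamma).fit(X_virt, y_virt)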
Jittering Kernels: The so-called jittering kernels are also based on the idea of a finite number of small transformations T = {t_i}_{i=1}^{n_T} of the training points. Instead of performing these
transformations before training, they are performed during kernel
evaluation. Starting with an arbitrary kernel function k, the
computation of the jittered kernel kJ(x, x′) is done in two steps:
Firstly, determine the points minimizing the distance of the sets
Φ(Tx) and Φ(Tx′) in the kernel induced feature space by explicitly
performing all transformations and computing the squared distances.
Secondly, take the original kernel at these points:
(i, i′) := arg min_{i,i′=1,...,n_T} [ k(t_i(x), t_i(x)) − 2 k(t_i(x), t_{i′}(x′)) + k(t_{i′}(x′), t_{i′}(x′)) ],
k_J(x, x′) := k(t_i(x), t_{i′}(x′)).    (2.5)
This scheme nicely reflects the idea of operating in the
kernel-induced feature space. It is, however, only well-defined for
kernels with k(x, x) = const like distance-based kernels. For
inner-product-based kernels the definition is not proper. The
reason is that occasionally multiple minima (i, i′) of the
nonlinear optimization problem can yield different values if
inserted in the kernel function. So it is undecided which kernel value is to be used in this case. Additionally, the minimization of the distance between the so-called jittered example sets destroys the positive definiteness of the base kernel. The method is only
applicable to finitely many transformed patterns. So, this method
does not completely meet the goals 2, 3 and 6.
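As a small illustration (an assumption-laden sketch, not the thesis implementation), the jittered kernel of (2.5) can be computed by exhaustively comparing all transformed copies of both arguments; transformations is an assumed finite list containing the identity.

    # Illustrative sketch of a jittered kernel k_J following (2.5).
    import numpy as np

    def jittered_kernel(x, xp, base_kernel, transformations):
        Tx, Txp = [t(x) for t in transformations], [t(xp) for t in transformations]
        best_d2, k_best = np.inf, None
        for u in Tx:
            for v in Txp:
                # squared feature-space distance ||Phi(u) - Phi(v)||^2
                d2 = base_kernel(u, u) - 2.0 * base_kernel(u, v) + base_kernel(v, v)
                if d2 < best_d2:
                    best_d2, k_best = d2, base_kernel(u, v)
        return k_best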
Further Methods: In [16], invariance in kernel methods is studied theoretically and formalized for differentiable transformations. In our terms, the main idea can be expressed as requiring that the kernel function does not change in transformation directions, expressed by corresponding differential equations, i.e. for all x and x_0 it holds that ∇_x k(x_0, x) · ∂_p t(x, p) = 0. This yields partial differential equations, which must be solved to obtain invariant kernel functions. This turns out to be a highly nontrivial task, as many integrals have to be determined. For academic examples of cyclically translating a vector with few entries this seems to be applicable, as the nontrivial exact integrals solving the PDEs can be stated. This method, however, is far from being applicable to even small-sized image data.
The recent method of tangent vector kernels [93] mainly reuses the
motivation of the tangent distance kernels presented in Chap. 4,
where additional comments will follow.
The method called invariant hyperplane [110], and its nonlinear extension called invariant SVM [24], modify the method's specific optimization target. The hyperplane is altered in such a way that it globally fits all local invariance directions best. This turns out to be equivalent to a pre-whitening in feature space along directions that simultaneously fit all local invariance directions best. The advantage is that the original SVM training procedure can be used after pre-whitening the training data. The method is very elegant, as it nicely
can be interpreted interchangeably as a kernel modification or as a method changing the optimization target in terms of a tangent covariance matrix. However, this pre-whitening involves (kernel) PCA and appears to be computationally very hard in the nonlinear case. The method is therefore restricted to differentiable transformations with few parameters. Only experiments on fractions of USPS are reported for the general nonlinear case, in contrast to the whole USPS set as presented in Section 7.4.2. A similar approach has been applied to the kernel Fisher discriminant, which can also be reformulated using the tangent covariance matrix [82]. A conceptual argument against these methods might be that they effectively do not respect the local invariances: they rather use the average of the local invariance directions of all training samples in feature space. These average tangent covariance matrices per transformation parameter are then additionally averaged over multiple transformation directions.
Other conceptually appealing methods consist of sophisticated new optimization problems, which encode the sets of transformed samples. By enforcing separability constraints as in the SVM, new classifiers are produced. For instance, assuming polyhedral sets Tx results in the approach of the knowledge-based SVM [39]. If, instead, polynomial trajectories of the patterns are assumed, the resulting infinitely many constraints are condensed into a semi-definite programming (SDP) problem. This problem is more complex, but can still be solved for small sizes. However, it is again problematic for standard benchmark datasets such as the USPS digits [43].
Some further studies do not aim at total invariance in SVMs, but investigate the behaviour of an SVM under uniform global transformations. For instance, [62, Lemma 2] states that the SVM solution (the coefficients α_i and the decision function) is invariant with respect to the global addition of a real value c to the kernel function. The simultaneous rotation invariance of the Euclidean inner product and the additional translation invariance of the induced distance transfer to a similar transformation behaviour of the resulting SVM solutions with various distance or inner product kernels, cf. [1, 101].
Basically, [112] distinguishes between methods for introducing transformation knowledge into the object representation, the training set or the learning method itself. Although this is an intuitive categorization, certain methods can be assigned to several of these categories. For instance, as explained above, the invariant hyperplane method can be seen as modifying the object representations by a pre-whitening operation, but it was initially motivated as modifying the learning target of an SVM. In [73] it is argued that the extension of the training set is, in general, also equivalent to adding a suitable regularization term to the learning target.
Similar to the jittering kernels approach, we concentrate in the
following explicitly on methods modifying the learning method by
designing suitable kernel functions.
Chapter 3
Invariant Distance Substitution Kernels
This chapter proposes a method for incorporating general
transformation knowledge into kernel functions by suitable distance
functions. We first formalize a general method to construct kernel
functions from distances by distance substitution (DS) in common
distance-based or inner-product-based kernels. In particular, these DS-kernels do not explicitly require transformation knowledge. Additionally, they can be used for arbitrary distance measures. We investigate theoretical properties and comment on practical issues. Then, we focus on the main topic of the thesis, the
incorporation of transformation knowledge. We explain how many
existing distance measures represent transformation knowledge
according to Def. 2, and how such distance measures can be
constructed. Using such invariant measures in DS-kernels yields
invariant distance substitution kernels. In this part we will restrict ourselves to the theoretical framework, whereas the following
chapter and the experiments will give specific instances of the
resulting invariant kernels.
3.1 Distance Substitution Kernels
A general construction method for kernels involving arbitrary distances has been proposed in our study [49]. The resulting kernels are not only applicable to invariant distances, but can also be applied to various common dissimilarity measures. So, until Sec. 3.6 we do not explicitly require that the distance measure represents transformation knowledge. This makes the kernels very widely applicable. In particular, they offer the wide perspective of applying modern kernel methods to structural pattern recognition problems, where sophisticated distance measures, matching procedures etc. are given. Correspondingly, the experiments in Chap. 7 include both invariant and non-invariant distance measures.
In the literature, several studies construct SVM kernels from special distances d(x, x′) between objects by taking the Gaussian rbf-kernel and plugging in the problem-specific distance measure d. We give a short but possibly non-exhaustive list of these. Initial experimental results, mainly classification with certain Hilbertian metrics on histogram data, have been presented in [23]. In [8] another analysis problem is tackled, namely the segmentation of images by spectral partitioning. The Gaussian kernel with the χ2 and an
intervening contour distance is applied, the former proven to be pd, the latter stated to be indefinite. The Kullback-Leibler divergence between distributions has been used in SVMs [84] after making the kernel matrix positive definite. Recent examples of distances in kernels can be found in [57], where again SVM classification of histogram data was performed, and in [28], where DS-kernels were applied to semi-supervised learning problems. Further DS-kernel applications are available, e.g. M-estimators in [29]. We have demonstrated the applicability of transformation-invariant distances to online handwriting recognition [4] and optical character recognition [51].
The aim of this section with respect to the existing work is to give a common framework, to cover more kernel types including linear and polynomial kernels, and to discuss upcoming questions. We further explicitly allow general symmetric distances instead of regarding only Hilbertian metrics. For such an arbitrary symmetric distance measure d, the choice of an origin O ∈ X induces a generalization of an inner product by the function
⟨x, x′⟩_d^O := −(1/2) (d(x, x′)² − d(x, O)² − d(x′, O)²).    (3.1)
This notation reflects the idea that, in the case of d being the L2-norm in a Hilbert space X, ⟨x, x′⟩_d^O corresponds to the inner product in this space with respect to the origin O. Note that this is only a formal definition; usual properties like bilinearity are not satisfied. Still, a suitable interpretation as an inner product exists after suitable embeddings into pseudo-Euclidean spaces, cf. Chap. 6. With this notion we define the distance substitution kernels as follows.
Definition 9 (DS-Kernels). For any distance-based kernel k, i.e. k(x, x′) = k(‖x − x′‖), and symmetric distance measure d we call k_d(x, x′) := k(d(x, x′)) its distance substitution kernel (DS-kernel). Similarly, for an inner-product-based kernel k, i.e. k(x, x′) = k(⟨x, x′⟩), we call k_d(x, x′) := k(⟨x, x′⟩_d^O) its DS-kernel.
This notion in the case of inner-product-based kernels is
reasonable as in terms of (3.1) indeed distances are substituted.
Note that they are well-defined for these inner-product-based
cases in contrast to the jittering kernels, cf. the comments
following (2.5). In particular, for the simple linear,
negative-distance, polynomial and Gaussian kernels, we denote their
DS-kernels by
k^lin_d(x, x′) := ⟨x, x′⟩_d^O
k^nd_d(x, x′) := −d(x, x′)^β,  β ∈ [0, 2]    (3.2)
k^pol_d(x, x′) := (1 + γ ⟨x, x′⟩_d^O)^p,  p ∈ N
k^rbf_d(x, x′) := e^{−γ d(x, x′)²},  γ ∈ R⁺.
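As a small illustration (a sketch under the assumption of a user-supplied symmetric distance function d, not code from the thesis), the generalized inner product (3.1) and the simple DS-kernels (3.2) can be written as:

    # Illustrative sketch of the simple DS-kernels (3.2) for an arbitrary symmetric distance d.
    import numpy as np

    def ds_inner(d, x, xp, O):
        # <x, x'>_d^O := -1/2 (d(x,x')^2 - d(x,O)^2 - d(x',O)^2), cf. (3.1)
        return -0.5 * (d(x, xp) ** 2 - d(x, O) ** 2 - d(xp, O) ** 2)

    def k_lin(d, x, xp, O):
        return ds_inner(d, x, xp, O)

    def k_nd(d, x, xp, beta=1.0):                 # beta in [0, 2]
        return -d(x, xp) ** beta

    def k_pol(d, x, xp, O, gamma=1.0, p=2):
        return (1.0 + gamma * ds_inner(d, x, xp, O)) ** p

    def k_rbf(d, x, xp, gamma=1.0):
        return np.exp(-gamma * d(x, xp) ** 2)

    # With the Euclidean distance the standard kernels are recovered:
    d_eucl = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))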
Of course, more general distance- or dot-product-based kernels exist and corresponding DS-kernels can be defined, e.g. sigmoid, multiquadric, Bn-spline, exponential, rational quadratic or Matérn kernels [112, 40], etc. The motivation for this ad-hoc definition of the simple DS-kernels is twofold: In the case of Euclidean distances, this obviously recovers the standard kernels. Additionally, in the case of non-Euclidean distances, the kernels are expected to reflect the same behaviour as the standard kernels, in particular the nonlinearity and hence improved separability of the polynomial or the Gaussian rbf-kernel. Indeed, these properties can be validated experimentally. Further, the handling of the kernels in practice is expected to be similar, e.g. with respect to parameter selection.
A comment on the computational complexity is in order, in particular for the case of expensive distances. Equation (3.1) might indicate that the inner-product-based kernels are several times slower than the distance-based ones. For single kernel evaluations this is true, as indeed three distance computations are required instead of one. However, as is typical in kernel methods, a whole kernel matrix must be computed for the analysis of a dataset. If the distances of the data points to the origin O are computed in advance in linear time, the dominating operation is the O(n²) computation of the pairwise distances between the data points. So, in terms of distance computations the slowdown factor reduces from 3 to arbitrarily close to 1 for large enough datasets. Computationally, the simple DS-kernels in (3.2) can therefore be implemented with a comparable number of distance evaluations. A generally very convenient property of the DS-kernels is the possibility of precomputing the distance matrices. Usual model selection in kernel methods requires modification of basic kernel parameters such as γ, β, p for the simple kernels. These variations are completely independent of the used distances, which might be arbitrarily complex. So, DS-kernels allow fast kernel parameter variations and model selection.
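A minimal sketch of this precomputation argument (an illustration with assumed inputs, not thesis code): the pairwise distance matrix and the distances to the origin are computed once, and the kernel matrices for new parameter values then follow by cheap elementwise operations.

    # Illustrative sketch: precompute distances once, then vary kernel parameters cheaply.
    import numpy as np

    def distance_matrices(X, d, O):
        n = len(X)
        D = np.array([[d(X[i], X[j]) for j in range(n)] for i in range(n)])  # O(n^2) distances
        dO = np.array([d(x, O) for x in X])                                  # O(n) distances
        return D, dO

    def kernel_matrices(D, dO, gamma, beta, p):
        G = -0.5 * (D ** 2 - dO[:, None] ** 2 - dO[None, :] ** 2)   # <x_i, x_j>_d^O, cf. (3.1)
        return {"lin": G,
                "nd": -D ** beta,
                "pol": (1.0 + gamma * G) ** p,
                "rbf": np.exp(-gamma * D ** 2)}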
We continue with an investigation of the DS-kernels’ general properties. As the kernels provide a common framework for the existing experimental studies listed above, the conclusions of the following considerations are similarly valid for the cited studies.
3.2 Definiteness of DS-Kernels
The most interesting question posed about new kernels is whether they are (c)pd. In fact, for the DS-kernels given by (3.2) the definiteness can be summed up quite easily. The necessary tools are well known from early references [140, 104, 105, 81, 42], which have since been reformulated in various ways, cf. [11, 112]. Very nice and extensive characterizations for constructing pd and cpd functions from Hilbertian metrics are given in [104], based on the notion of completely monotone functions: a continuous f(t), t ≥ 0, is called completely monotone if (−1)^k f^(k)(t) ≥ 0 for all 0 < t < ∞ and k ∈ N₀, a criterion which can easily be checked for a given function, e.g. by mathematical induction.
Proposition 10 (Definiteness of General DS-Kernels). Let d be a
Hilbertian metric.
i) If k(t) is completely monotone, then k(d²) is pd.
ii) If −(d/dt) k(t) is completely monotone and k(0) = 0, then k(d²) is cpd.
Proof. i) is one direction of [104, Thm. 3], for which a very elegant proof has recently been given in [5]. ii) is a combination of one direction of [104, Thm. 6’] with a Hilbert space representation of cpd functions [11, p. 82].
These statements allow us to conclude that if d is a Hilbertian metric, then e.g. the kernels 1/(γ + d²) and e^{−γd²} are pd for γ > 0. Further statements about the definiteness of more general dot-product- or distance-based kernels are possible, e.g. by Taylor series arguments, cf. [11, p. 70, Cor. 1.14]. Additionally, known closure properties of the set of positive definite kernels can be applied, e.g. positive linear combinations of pd functions yield pd functions, products of pd functions are pd, etc. These result in further DS-kernels.
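As a brief worked illustration (added for clarity; not part of the original argument), the complete monotonicity check for the two kernels just mentioned reads as follows: for f(t) = e^{−γt} one has (−1)^k f^(k)(t) = γ^k e^{−γt} ≥ 0, and for f(t) = 1/(γ + t) one has (−1)^k f^(k)(t) = k!/(γ + t)^{k+1} ≥ 0, so both functions are completely monotone and Prop. 10 i) yields that e^{−γd²} and 1/(γ + d²) are pd for any Hilbertian metric d.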
For simple DS-kernels, the (c)pd-ness can be nicely concluded with equivalences in all combinations:
Proposition 11 (Definiteness of Simple DS-Kernels). The following statements are equivalent for a (nonnegative, symmetric, zero-diagonal) distance d:
i) d is a Hilbertian metric
ii) k^nd_d is cpd for all β ∈ [0, 2]
iii) k^lin_d is pd
iv) k^rbf_d is pd for all γ ∈ R⁺
v) k^pol_d is pd for all p ∈ N, γ ∈ R⁺.
Proof. i) implies ii): [112, Prop. 2.22] covers the case β = 2 and Prop. 10 ii) settles the statement for arbitrary β ∈ [0, 2], as −(d/dt)(−t^α) is completely monotone for 0 ≤ α ≤ 1. The reverse implication ii) ⇒ i) follows by [112, Prop. 2.24]. The equivalence of ii) and iii) is also a consequence of [112, Prop. 2.22]. [105, Thm. 1] states the equivalence of ii) and iv). Statement v) follows from iii), as the set of pd functions is closed under products and linear combinations with positive coefficients. The reverse can be obtained from the pd functions (1/γ) k^pol_d: with p = 1 and γ → ∞ these functions converge to ⟨x, x′⟩_d^O, hence the latter is also pd.
In practice, however, problem-specific distance measures often lead to DS-kernels which are not (c)pd. A practical criterion for disproving (c)pd-ness is the following: if d violates the triangle inequality, then it is in particular not a Hilbertian metric, so the resulting DS-kernel k^nd_d is not cpd and k^lin_d, k^rbf_d, k^pol_d are not pd due to Prop. 11. This can immediately be applied to kernels based on the tangent distance (TD) [51], the dynamic time warping (DTW) distance [4] or the Kullback-Leibler divergence [84]. Also, applying further common non-metric distance measures, e.g. [60], will therefore produce non-pd kernels. Note that for certain values of β, γ, p, the resulting DS-kernels are possibly (c)pd, but not for all parameters simultaneously. Recall also that the reverse is not true, i.e. metricity is not sufficient for (c)pd-ness of the resulting DS-kernels. In particular, the Lp-metrics for p > 2 can be shown to produce non-pd DS-kernels, in contrast to lower values of p, as discussed in the next section.
If a kernel is not positive definite in general, it can still be positive definite on a certain given dataset. Even if it is not positive definite on a given dataset, it can be applied in certain kernel methods, as long as they can work with indefinite matrices or tolerate a certain amount of negative eigenvalues, e.g. matrix centering, inversion, application of SVMs or the kernel perceptron. Details of the effects of indefinite kernels on SVMs will be elucidated in Chap. 6. In real-world problems, the indefinite kernel matrices often still offer a high degree of positivity. For instance, the krbf-DS-kernels will always generate positive semi-definite 2 × 2 matrices: the assumptions on the distance d imply a matrix diagonal with value 1 and off-diagonal elements between 0 and 1, which always yields two nonnegative eigenvalues. For larger matrices, the number of negative eigenvalues and the overall sum of negative eigenvalues turn out to be small.
We demonstrate this for the simple DS-kernels on various datasets
in Fig. 3.1. The datasets are denoted as proteins, kimia-1,
cat-cortex and unipen-1 and will be mainly used for the experiments
in Chap. 7. Details on the setup and the sources of the datasets
are given in Appendix A. Here it is sufficient to know that they consist of problem-specific distance matrices with two- or multiclass labellings. The distances used are a) an evolutionary distance between proteins, b) a modified Hausdorff distance between
Figure 3.1: Illustration of DS-kernel spectra for different
distance matrices and the resulting characteristics n−, n+ (number
of negative/positive eigenvalues) and the ratio r of the overall
negative and positive variance. a) knd on proteins, b) krbf on
kimia-1, c) klin on cat-cortex, d) kpol on unipen-1.
binary shape-images, c) a dissimilarity based on the connectivity
strengths of regions in a cat’s cerebral cortex and d) a dynamic
time warping distance between handwritten characters from the
UNIPEN dataset. In addition to the sorted eigenvalue spectra, we added some quantitative measures of the definiteness. These comprise n−, n+, the numbers of negative and nonnegative eigenvalues, and a measure of the negative variance contribution. As the eigenvalues of a kernel matrix can be interpreted as variances of appropriate data embeddings, the sums correspond to the amount of information (variance) captured in the negative and positive directions, respectively. So, we define the ratio r of the negative to the positive eigenvalue sum

r(K) := (∑_{λi<0} |λi|) / (∑_{λi>0} λi) ∈ [0, ∞].    (3.3)

Ideally, this quantity is 0 for a positive definite matrix, whereas it tends to infinity if a matrix has only negative eigenvalues. The upper row shows the spectra of the distance-based kernels. Plot a) contains the spectrum of the knd kernel matrix on the proteins dataset, b) the krbf kernel on the kimia-1 dataset. krbf nicely demonstrates how the Gaussian kernel is largely dominated by the positive eigenvalues; only
few negative values are observed. One single large negative
eigenvalue of knd would be acceptable, as cpd kernel matrices have
at most one (arbitrarily large) negative eigenvalue. The second row
depicts the spectra of the inner-product-based kernels obtained by
choosing the point with minimum squared-distance sum as the origin O. The reason for this choice will be given in Sec. 3.5. The klin kernel on the cat-cortex dataset in plot c) is also largely dominated by the positive eigenvalues. Finally, the kpol kernel on the unipen-1 data in d) demonstrates similar properties. These are not accidental patterns, but can be widely observed: even for kernel matrices based solely on random distances, these characteristics appear.
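A minimal sketch (with assumed inputs; not the code used for Fig. 3.1) of how these characteristics can be computed from a symmetric kernel matrix K:

    # Illustrative definiteness diagnostics: n-, n+ and the ratio r(K) from (3.3).
    import numpy as np

    def spectrum_characteristics(K, tol=1e-10):
        lam = np.linalg.eigvalsh((K + K.T) / 2.0)          # eigenvalues of the symmetrized matrix
        n_neg = int(np.sum(lam < -tol))                    # number of negative eigenvalues
        n_pos = int(np.sum(lam >= -tol))                   # number of nonnegative eigenvalues
        pos_sum = np.sum(lam[lam > 0])
        r = np.sum(np.abs(lam[lam < 0])) / pos_sum if pos_sum > 0 else np.inf
        return n_neg, n_pos, r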
3.3 Examples of Hilbertian Metrics
As argued in the preceding section, practical dissimilarity measures will frequently result in indefinite DS-kernels. Learning from such kernels is indeed possible, as we will argue in Chap. 6, but it is still largely uninvestigated. One purist attitude towards kernels is not to accept anything beyond the positive definite ones. Therefore, in this section we want to comment on why DS-kernels can be useful even from this purist view.
Actually, many cases exist where DS-kernels can be used completely safely, as the distances indeed are Hilbertian metrics. For some distance measures, the relation to an L2-norm is apparent. An example is the Hellinger distance H(p, p′) between probability distributions, which is defined by (H(p, p′))² := ∫ (√p − √p′)². This is obviously the usual L2-norm applied to the square roots of the distributions. However, the class of distances which are Hilbertian metrics is much wider than the obvious forms d = ‖x − x′‖₂.
For instance, many of the simple Lp-norms ‖·‖_p are related to Hilbertian metrics for p other than 2. The paper [6] concludes that −‖x − x′‖₁ is cpd, based on the older result from [105, 81] that −‖x − x′‖₂ is cpd. Further conclusions are that −‖x − x′‖_p^p is also cpd for p ∈ (0, 2]. This implies that ‖x − x′‖₁^{1/2} and, more generally, ‖x − x′‖_p^{p/2} are Hilbertian metrics. As fractional powers of nonnegative cpd functions may be taken, e.g. [11, p. 78], this implies that ‖x − x′‖_p^q is a Hilbertian metric for p ∈ [0, 2] and q ∈ [0, p/2].
binary random variables i, j and the frequen-
√
1 − S(i, j), where S is one of various suit- able similarity
measures [42]. For various choices of S = Sθ or S = Tθ, the
resulting dissimilarity will generate Euclidean distance matrices.
The reference proves this for parametric families of
dissimilarities
Sθ = A + D
A
A + θ(B + C) ,
if 0 < θ ≤ 1 for Sθ and 0 < θ ≤ 1/2 for Tθ. The same
exposition investigates several dissimilarities for real valued
data, and concludes their metric and Euclidean properties. Among
others, the measure
d(x,x′) := ∑
i
is proven to generate Euclidean distance matrices. Similarly, [8] proves very elegantly that k^rbf_d with d = √χ² is pd, where χ²(x, x′) := ∑_i (x_i − x′_i)² / (x_i + x′_i), i.e. that √χ² is a Hilbertian metric. Only looking at the χ²-distance, the corresponding Hilbert space is not apparent.
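For illustration (a sketch with assumed histogram inputs, not thesis code), both of the Hilbertian metrics mentioned above can be computed directly and substituted into the Gaussian DS-kernel:

    # Illustrative Hellinger distance and sqrt(chi^2) on nonnegative histograms p, p'.
    import numpy as np

    def hellinger(p, pp):
        p, pp = np.asarray(p, float), np.asarray(pp, float)
        return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(pp)) ** 2))

    def sqrt_chi2(p, pp, eps=1e-12):
        p, pp = np.asarray(p, float), np.asarray(pp, float)
        return np.sqrt(np.sum((p - pp) ** 2 / (p + pp + eps)))

    def k_rbf_d(d, x, xp, gamma=1.0):
        # Gaussian DS-kernel with substituted distance d; pd for Hilbertian d.
        return np.exp(-gamma * d(x, xp) ** 2)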
Other examples of Hilbertian metrics are given in a recent paper
[57]. The objects between which dissimilarities are constructed are
probability measures on an arbitrary set X . A parametric class of
Hilbertian metrics is constructed based on suitable integrals of Hilbertian metrics on R+. Interesting spec