Transformation Knowledge in Pattern Analysis with Kernel Methods
– Distance and Integration Kernels –

Albert-Ludwigs-Universität Freiburg im Breisgau

Dean (Dekan): Prof. Dr. Jan G. Korvink
Examination committee (Prüfungskommission): Prof. Dr. Wolfram Burgard (chair), Prof. Dr. Luc De Raedt (assessor), Prof. Dr. Hans Burkhardt (referee), Prof. Dr. Bernhard Schölkopf (referee)
Date of disputation: 18 November 2005
Acknowledgement
Firstly, I want to thank my supervisor Prof. Dr.-Ing. Hans Burkhardt for giving me the opportunity and wide support for the research which has led to this thesis. In particular, the excellent technical environment, the availability of various interesting application fields and the scientific freedom have combined to be an excellent basis for independent research. The generous support of research travel enabled me to establish many important and fruitful contacts. Similarly, I am deeply grateful to Prof. Dr. Bernhard Schölkopf, who was a constant source of motivation through his own related work and various guiding hints, many of which find themselves realized in the present thesis. I am very glad that he agreed to act as the second referee. In particular, I am very thankful for being given the opportunity to visit his group for a talk, several weeks of research and the machine learning summer school MLSS 2003. During these occasions, many fruitful discussions were possible, especially with Dr. Ulrike von Luxburg, Matthias Hein and Dr. Olivier Bousquet. Large parts of the experiments were based on third-party data which were kindly provided by Dr. Elżbieta Pękalska, Dr. Thore Graepel, Daniel Keysers and Rainer Typke. I also want to mention my former and current colleagues at the pattern recognition group, who contributed through discussions, by providing data and, last but not least, with encouragement when required. The whole group and also the members of the associated group of Prof. Dr. Thomas Vetter provided a wonderful, friendly and personal atmosphere, which played a very important role for me. I therefore particularly want to mention Nikos Canterakis, Olaf Ronneberger, Dr.-Ing. Lothar Bergen, Dimitrios Katsoulas, Claus Bahlmann, Stefan Rahmann, Dr. Volker Blanz and Klaus Peschke. A big “thank you” also goes to three of my former students, Nicolai Mallig, Harald Stepputtis and Anselm Vossen, who all contributed through discussions, ideas, implementations and scientific results to the development of the subjects in three main chapters.
Last but not least, I dedicate the thesis to other important
persons. On the one hand, to my parents, who supported the
unhindered development of my work in various ways. On the other
hand, to my girlfriend Heide, who also had to live with all the ups
and downs of my work during the last several years, but always
managed to remind me of other important things in life.
Kunheim, April 2005 Bernard Haasdonk
Zusammenfassung
This thesis concentrates on a particular kind of prior knowledge, namely knowledge about transformations. This means that explicit knowledge of pattern variations is available which do not, or only insignificantly, change the inherent meaning of the objects. Examples are rigid motions of 2D and 3D objects, or transformations such as slight stretching, shifting or rotation of characters in optical character recognition. Several generic methods are presented and investigated which incorporate such prior knowledge into kernel functions.

1. Invariant distance substitution kernels (IDS-kernels): In many practical problems the transformations are implicitly captured by sophisticated distance measures between objects, for instance nonlinear deformation models between images. Here an explicit parameterization of the transformations would require arbitrarily many parameters. Such measures can be incorporated into distance- and inner-product-based kernels.

2. Tangent distance kernels (TD-kernels): Special instances of the IDS-kernels are investigated in more detail because they can be computed efficiently and applied widely. We assume differentiable transformations of the patterns. Given such prior knowledge, one can construct linear approximations of the transformation manifolds and, by means of suitable distance functions, use them efficiently for the construction of kernel functions.

3. Transformation integration kernels (TI-kernels): The technique of group integration over transformations for feature extraction can be extended in a suitable way to kernel functions and to more general transformations which do not necessarily form a group.

Theoretically, these methods differ in how they represent the transformations and in how the transformation extent can be controlled. More fundamentally, kernels of category 3 turn out to be positive definite, whereas kernels of types 1 and 2 are not positive definite, which is generally regarded as a necessary prerequisite for use in kernel methods. This was the motivation to investigate the theoretical meaning of such indefinite kernels. The result shows that, on given data, these kernels represent inner products in pseudo-Euclidean spaces, in which certain kernel methods, in particular the SVM, have a meaningful geometric and theoretical interpretation.

In addition to the theoretical properties, the practical applicability of the kernels is demonstrated. For these experiments, support vector classification was performed on a variety of datasets. These datasets comprise standard benchmark datasets of optical character recognition, such as USPS and MNIST, and biological application data which stem from Raman microspectroscopy and serve the identification of bacteria.

Beyond the finding that transformation knowledge can be incorporated into kernel functions in various ways and that these kernels are practically applicable, there are more fundamental insights and perspectives. Using the SVM as an example, we demonstrate and explain that indefinite kernels can be used or tolerated in kernel methods: statements about the training algorithm and the properties of the solutions exist, together with a meaningful geometric interpretation. This essentially opens up two directions. Firstly, these insights simplify the process of kernel design, which so far has mainly been restricted to positive definite kernels. In particular, this opens the possibility of a wide applicability of SVMs in other fields such as distance-based learning, i.e. for analysis problems in which dissimilarity measures between objects are available. Secondly, the investigation of the applicability of indefinite kernels in further kernel methods appears very promising.
Abstract
Among modern techniques for data analysis and machine learning are the so-called kernel methods. The most famous and successful one is the support vector machine (SVM) for classification or regression tasks. Further examples are kernel principal component analysis for feature extraction or other linear classifiers like the kernel perceptron. The fundamental ingredient in these methods is the choice of a kernel function, which computes a similarity measure between two input objects. For good generalization abilities of a learning algorithm it is indispensable to incorporate problem-specific a-priori knowledge into the learning process. The kernel function is an important element for this.

This thesis focusses on a certain kind of a-priori knowledge, namely transformation knowledge. This comprises explicit knowledge of pattern variations that do not or only slightly change a pattern's inherent meaning, e.g. rigid movements of 2D/3D objects or transformations like slight stretching, shifting or rotation of characters in optical character recognition. Several methods for incorporating such knowledge in kernel functions are presented and investigated.
1. Invariant distance substitution kernels (IDS-kernels): In many practical problems the transformations are implicitly captured by sophisticated distance measures between objects. Examples are nonlinear deformation models between images. Here an explicit parameterization would require an arbitrary number of parameters. Such distances can be incorporated in distance- and inner-product-based kernels.

2. Tangent distance kernels (TD-kernels): Specific instances of IDS-kernels are investigated in more detail as these can be efficiently computed. We assume differentiable transformations of the patterns. Given such knowledge, one can construct linear approximations of the transformation manifolds and use these efficiently for kernel construction by suitable distance functions.

3. Transformation integration kernels (TI-kernels): The technique of integration over transformation groups for feature extraction can be extended in a suitable way to kernel functions and to more general group, non-group, discrete or continuous transformations.
Theoretically, these approaches differ in the way the transformations are represented and in the adjustability of the transformation extent. More fundamentally, kernels from category 3 turn out to be positive definite, whereas kernels of types 1 and 2 are not positive definite, a property which is generally required for usability in kernel methods. This is the motivation to investigate the theoretical meaning of such indefinite kernels. The finding is that on given data these kernels correspond to inner products in pseudo-Euclidean spaces. Here certain kernel methods, in particular SVMs, have a reasonable geometrical and theoretical interpretation.

Practical applicability of the kernels is demonstrated in addition to the theoretical properties. For these experiments, support vector classification on various types of data has been performed. The datasets comprise standard benchmark datasets for optical character recognition like USPS and MNIST as well as real-world biological data resulting from micro-Raman spectroscopy with the goal of bacteria identification.
In addition to the demonstration that transformation knowledge can be incorporated in kernel functions in different ways and that these can be practically applied, there are more fundamental findings and perspectives. We demonstrate and theoretically argue that indefinite kernels can be used or tolerated by kernel methods, as exemplified for the SVM. There exist statements about the training algorithm, the resulting solutions and a reasonable geometric interpretation. This opens up mainly two directions. Firstly, these insights facilitate the process of kernel design, which hitherto has mainly been restricted to positive definite functions. In particular, this enables SVMs to be used widely in other fields like distance-based learning, i.e. in all analysis problems where dissimilarities between objects are available. Secondly, the investigation of the suitability or robustness of kernel methods other than SVMs with respect to indefinite kernels seems very promising.
Contents

1 Introduction
  1.1 Pattern Analysis and Kernel Methods
  1.2 Prior Knowledge by Transformations
  1.3 Main Motivating Questions
  1.4 Structure of the Thesis

2 Background
  2.1 Transformation Knowledge
  2.2 Distances
  2.3 Kernel Methods
  2.4 Support Vector Machines
  2.5 Goals for Invariance in Kernel Methods
  2.6 Related Work

3 Invariant Distance Substitution Kernels
  3.1 Distance Substitution Kernels
  3.2 Definiteness of DS-Kernels
  3.3 Examples of Hilbertian Metrics
  3.4 Symmetrization
  3.5 Choice of Origin O
  3.6 Transformation Knowledge in DS-Kernels
  3.7 Related Work

4 Tangent Distance Kernels
  4.1 Regularized Tangent Distance Measures
  4.2 Definiteness of TD-Kernels
  4.3 Invariance of TD-Kernels
  4.4 Separability Improvement
  4.5 Computational Complexity
  4.6 Related Work

5 Transformation Integration Kernels
  5.1 Partial Haar-Integration Features
  5.2 Transformation Integration Kernels
  5.3 Invariance of TI-Kernels
  5.4 Separability Improvement
  5.5 Computational Complexity
  5.6 Kernel Trick
  5.7 Acceleration
  5.8 Related Work

6 Learning with Indefinite Kernels
  6.1 Feature Space Representation
  6.2 VC-bound
  6.3 Convex Hull Separation in pE Spaces
  6.4 SVM in pE Spaces
  6.5 Uniqueness of Solutions
  6.6 Practical Implications
  6.7 Related Work

7 Experiments - Support Vector Classification
  7.1 General Experimental Setup
    7.1.1 SVM Implementation
    7.1.2 Multiclass Architectures
    7.1.3 Model Selection
  7.2 Invariant Distance Substitution Kernels
    7.2.1 Application of SVM Suitability Indicators
    7.2.2 Comparison to k-NN Classification
    7.2.3 Indefinite versus Positive Definite Kernel Matrix
    7.2.4 Large Scale Experiments
    7.2.5 Summary of DS-Kernel Experiments
  7.3 Tangent Distance Kernels
    7.3.1 USPS Digits
    7.3.2 Micro-Raman Spectra
    7.3.3 Summary of TD-Kernel Experiments
  7.4 Transformation Integration Kernels
    7.4.1 Toy Examples
    7.4.2 USPS Digits
    7.4.3 Summary of TI-Kernel Experiments

8 Summary and Conclusions
  8.1 IDS and TD-Kernels
  8.2 TI-Kernels
  8.3 Indefinite Kernels in SVMs
  8.4 Invariant Kernels versus Invariant Representations
  8.5 Perspectives

A Datasets
  A.1 USPS Digits
  A.2 MNIST Digits
  A.3 Micro-Raman Spectra
  A.4 Kimia
  A.5 Unipen
  A.6 Proteins
  A.7 Cat-Cortex
  A.8 Music-EMD and Music-PTD

B Mathematical Details
  B.1 Invariance of TD-Kernels
  B.2 Relations of Distances and Kernels
  B.3 VC-Bound
  B.4 Derivation of CH Classification
  B.5 CH Primal Optimization Problem
  B.6 Equivalence of CH and SVM
  B.7 Uniqueness of Stationary Points

C Notation

D Abbreviations
Chapter 1

Introduction
In the present chapter, a short illustrative motivation for the thesis is given, targeted at general readers. We comment on the concepts in the title, remaining slightly informal by avoiding precise notions, definitions and formulas, which will follow in the subsequent chapter. The motivating concepts are underlined by some intuitive diagrams. We conclude this introductory part with two central questions and comments on the structure of the thesis.
1.1 Pattern Analysis and Kernel Methods
Our considerations mainly target pattern analysis systems. This notion covers algorithms which are able to derive regularities from given sets of data observations and to perform some kind of prediction on new observations based on the derived regularities [119]. The most common examples are pattern classification systems, by means of which we will explain the general concepts. A corresponding illustration is given in Fig. 1.1.
The typical pattern classification task consists of some type of objects (e.g. handwritten letters as abstract entities) and some finite set of target classes (e.g. the 26 classes of letters). The goal is to have a system that assigns an estimated class label to any formerly unseen object in the best possible way. The lower row in Fig. 1.1 indicates how such a system performs classification: The abstract objects are observed, measured, discretized or preprocessed in some way, resulting in patterns (e.g. sequences of digitized 2D points). The next step is the so-called feature extraction stage, where an arbitrarily structured pattern is converted into a compact vectorial representation of numerical values, the features. This so-called feature vector is a simple object that represents the discriminative information of the original object as well as possible. This feature vector representation can then be fed into a classifier, which is the assignment rule, and results in an estimated class label for each newly observed sample.
In order to produce good estimates, such classifiers need a learning or training stage, indicated in the upper row. This is frequently based on a set of objects with known class labels, a so-called labelled training set. These training objects are subject to the same measurement, preprocessing and feature extraction process as the patterns during later prediction. But instead of prediction, the set of feature vectors and their corresponding labels is used to derive some model of the relation between the feature vectors and their class labels.
Figure 1.1: Illustration of training and prediction stage of a typical pattern classification process. The abstract objects are observed and preprocessed into concrete representations of patterns. From each of these specific patterns a feature vector is extracted, which allows a classification rule to be derived (training) or which can be predicted by the obtained classifier.
This model or hypothesis can then be used to build a classification rule to predict the later unlabelled observations, indicated by the black box in the upper right.
Such analysis systems can be used to solve various learning tasks:
Classification: Learn to assign unknown observations to a finite number of categories, denoted as classes.

Regression: Learn to assign unknown observations a real or vector-valued quantity.

Clustering: Given a set of observations, find a grouping of the points into clusters of similar points.

Novelty Detection: Given a set of observations, decide whether unknown observations are likely to be related to the set of initial observations or not.
The list could be continued, but all of these tasks share a training phase, where some data is given and a model is derived, and a prediction stage, where predictions on known or unknown observations are performed.
These algorithms are of interest in various disciplines. The
derivation of regularities can be seen as a kind of learning, a property commonly understood to be characteristic of intelligent systems. This is why such analysis algorithms are
investigated and applied in research fields ranging from pattern
recognition to machine learning and artificial intelligence. The
target of learning a reasonable hypothesis, i.e. a regularity which
is guaranteed to generalize to unseen examples, can be formalized
by methods from probability theory. Deriving statistical statements
for such pattern analysis algorithms is a topic of statistical
learning theory. For details on this very
interesting field of pattern analysis or pattern recognition, we
recommend the excellent standard textbooks [36, 125, 35].
A particularly successful branch of such analysis methods has developed in the last decade: the so-called kernel methods. These are methods which enable nice geometrical interpretations, statistical generalization statements and efficient implementations, and which yield excellent results on the variety of problem classes mentioned above [130, 112, 119]. The fundamental ingredient in these methods is a so-called kernel function, which basically is a similarity measure between two patterns. The investigation of such kernel methods and, especially, the design of kernel functions will be the main focus of the thesis.
1.2 Prior Knowledge by Transformations
It is reasonable to expect that the performance of a suitable
learner increases with more knowledge of the learning task. The
knowledge of a learning task comprises the set of learning
observations and additional more abstract information, called prior
or a-priori knowledge. Figure 1.2 illustrates this relation
qualitatively. By increasing the sample number and/or the
incorporated prior knowledge, the performance of an artificial or
biological learning system (including humans) is expected to
increase, as indicated in the first and second plot. On the other hand, if a specific performance is desired, it can be reached with fewer examples if more prior knowledge is available. This is not only intuitively plausible, but frequently observed in machine learning. Also, theoretical statements about the importance
of prior knowledge exist, e.g. the no free lunch theorem, the
conclusion of which is summarized in [36, p. 454]: “If the goal is
to obtain good generalization performance, there are no
context-independent or usage-independent reasons to favor one
learning or classification method over another.” A similar
statement for the choice of patterns is given by the ugly duckling
theorem which is similarly summarized [36, p. 458]: “in the absence
of assumptions there is no privileged or ’best’ feature
representation.” So, both state that no generalization is possible
without any prior assumptions in the pattern recognition
chain.
Obtaining a large “number of examples” is often expensive in both time and material. For example, the measurements can be difficult, or the labelling of samples requires human interaction, possibly even by a highly qualified expert. The number of examples is also relevant during the learning process: due to the frequently super-linear time and memory complexity of learning methods, large training sets often result in infeasibly long training times. Thus, incorporating prior knowledge into a learning system by means other than training examples is of utmost importance.
A very frequent type of a-priori knowledge is transformation knowledge, i.e. knowledge about typical variations of single patterns. For instance, it might be known that some modification of an observation keeps its inherent meaning unchanged. In the example of the handwritten letter concept, it is irrelevant how large the individual letter is: changing the size keeps the inherent meaning of the letter unchanged. This insight makes learning easier: instead of requiring instances of all letters in all sizes, merely one instance per letter type is sufficient.
In practice, different “kinds” of transformations can be found and applied in learning tasks. In general, we require that some meaningful transformations of objects are known and can be modelled.
Figure 1.2: Influence of the number of examples and prior knowledge
on the recognition performance. a), b) Improvement of
generalization with increasing number of samples or enhanced prior
knowledge, c) for similar performance, fewer training examples are
required if more knowledge is available.
Here we want to note that we use the notion transformation for arbitrary mappings of objects to objects. We do not require invertibility, bijectivity, etc., as often found in traditional examples like the Fourier, Laplace or wavelet transforms. We will give some simple examples of transformations by
referring to the case of image data, where meaningful variations of a pattern can be formalized explicitly. In Figure 1.3 b)-h) some transformations of a specific image pattern depicted in a) are given. These transformations differ in nature. Most of them can be continuously or even differentiably parameterized, such as the translation b), the rotation c), scaling e), contrast reduction g) and smoothing h). In contrast, reflection d) or morphological operations like erosion f) require a discrete parameterization.
Some of the transformations can be modelled as so-called transformation groups, or Lie groups [64] in the case of differentiable transformations. This implies in particular that the transformations can be reversed. Examples are most of the given transformations if the image is assumed to be embedded in the real plane. But the requirement that transformations stem from a group is too strict in many practical cases. Firstly, in some applications only parts of transformation groups will be reasonable. For instance, in optical character recognition (OCR) only small rotations are acceptable for maintaining a letter's class: large rotations will confuse M/W, N/Z or 6/9. Secondly, non-reversible transformations can occur, which consequently cannot be exactly modelled by transformation groups. An example is the line thinning operation by morphological erosion in f), which will produce a uniformly white image for any non-empty input pattern. In general, reversibility is violated as soon as different objects are transformed to one common pattern by a particular transformation, which implies that information is lost. For instance, large enough non-cyclic translations will result in the zero image for any input image if incoming pixels are set to zero and non-zero image pixels are lost. Other transformations operating on image matrices, like deleting single pixels, rows or columns, are also non-reversible lossy transformations, which might still be reasonable for modelling certain noise, squeezing or general elastic deformations of an image.
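To make these kinds of transformations concrete, the following is a minimal sketch in Python of how transformations like those in Fig. 1.3 b)-h) could be applied to an image array. It assumes NumPy and SciPy are available; the toy pattern and parameter values are purely illustrative and not taken from the experiments of this thesis.

    import numpy as np
    from scipy import ndimage

    # A toy grey-value image pattern (purely illustrative).
    x = np.zeros((28, 28))
    x[8:20, 12:16] = 1.0

    # Continuously / differentiably parameterizable transformations:
    x_translated = ndimage.shift(x, shift=(2, -3))              # b) translation
    x_rotated    = ndimage.rotate(x, angle=10, reshape=False)   # c) rotation
    x_scaled     = ndimage.zoom(x, zoom=1.2)                    # e) scaling (changes the array size)

    # Discretely parameterized or non-reversible transformations:
    x_reflected  = x[:, ::-1]                                   # d) reflection
    x_eroded     = ndimage.grey_erosion(x, size=(3, 3))         # f) erosion (lossy line thinning)

    # A large non-cyclic translation loses information irreversibly:
    x_lost = ndimage.shift(x, shift=(0, 25))                    # incoming pixels set to zero
    print(x_lost.sum())                                         # nearly zero: the pattern is gone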
This variety of transformations is not restricted to image data. For instance, in string processing the deletion of a single character is also a non-reversible transformation, but represents a meaningful transformation, as the resulting string has a small edit distance to the original. In the present work we will use both real-world image and non-image data for demonstrating the applicability of our approaches.
Figure 1.3: Illustration of image transformations. a) Original
pattern, b) translation, c) rotation, d) reflection, e) scaling, f)
erosion, g) contrast reduction, h) smoothing.
These different natures of the transformations, depending on the application, will require different means for being incorporated in kernel methods and kernel functions.
1.3 Main Motivating Questions
The starting point for the thesis is the observation that
transformation knowledge in kernel methods is established mainly
for either finitely many, small transformations, i.e. local
invariance with respect to these transformations [34, 109], or
global invariances by invariant feature vectors and subsequent
application of common kernels [18, 98]. The whole spectrum of more
general transformation knowledge as indicated in the previous
section has not been properly investigated. The main questions
therefore are:
• How can infinite or finite, group or non-group, continuous or discrete transformations be incorporated in kernel functions?
• What are practical and theoretical properties and consequences of
the resulting kernels?
The relevance of the study will mainly be to obtain building blocks
for invariance in kernel methods by the proposed kernel functions.
Further, a unification of global invariant feature extraction and
local invariance in machine learning is obtained. A smooth
transition between the non-invariant, locally invariant and totally
invariant cases will be possible. New theoretical investigations in
the field of indefinite kernels in machine learning will be
required and performed. In particular, this promises wide
application and theoretical perspectives for general kernel
methods.
1.4 Structure of the Thesis
After this introduction, Chapter 2 starts with the required background for the present work. Basic notions and common aspects are formally introduced, on which the remaining exposition builds. The first contribution of this thesis is a part consisting of
the Chapters 3 to 5, which propose kernel functions that include transformation knowledge. The kernels will be based on two generic approaches, namely distance measures and integration. Only
experiments for enhancing the understanding of the concepts are
presented in these first chapters on kernels, whereas real world
performance analysis is shifted to the later experimental Chapter
7. Chapter 3 assumes that suitable distance measures are given,
from which corresponding distance substitution kernel functions are
constructed as presented in [49]. By involving invariant distances,
this results in invariant distance substitution kernels. Chapter 4
focusses on a special instance of the kernels from the preceding
chapter. It assumes that the invariance is explicitly given as
transformations of vector valued patterns. If these transformations
are continuous and differentiable, tangent distance kernels can be
efficiently constructed. These have been presented in [51]. The
second kind of generic approach is presented in Chap. 5, where the
technique of integration over transformations, as used for
invariant features, is applied to kernel functions. By this, one
can construct kernels, which capture wanted invariances. The
presentation is an extension of the one in [52] to arbitrary
continuous, discrete, group or non-group transformations.
Some of the given approaches lead to kernels which are not positive definite. This property is commonly required for the application of kernels in methods like SVMs. Chapter 6 therefore investigates the effects of indefinite kernels. It concludes that the common geometric feature space interpretation can be extended to these cases and that common algorithms like SVMs can be used with indefinite kernels. However, some care has to be taken, and corresponding suitability criteria are presented. This has mainly been published in [48].
The last part of the thesis, consisting of Chapter 7, presents the application and experimental evaluation of the presented kernels. The flexibility, applicability, but also limitations of the approaches are demonstrated by experiments on a variety of problems, including well-known benchmark settings and data from current research projects. The experiments also include comparisons with state-of-the-art methods. Chapter 8 summarizes and comments on the main findings of the present work and gives perspectives for possible further research. The Appendix provides information on the used datasets, mathematical details which are left out of the main text for better readability, a list of general notation and frequent abbreviations.
Chapter 2
Background
This chapter introduces the background on which the thesis is
based. This consists of the formalization of transformation
knowledge and invariance as illustrated in the introduction, the
required notation for kernel methods in general and in particular
one instance, the support vector machine. We list the detailed
goals for the approaches to be developed in the sequel. For a
collected list of notation and symbols we refer to the table in
Appendix C.
2.1 Transformation Knowledge
The basic starting point in pattern analysis is the representation of the objects under investigation. As indicated in the pattern analysis chain in Fig. 1.1, the abstract objects are not accessible directly, but mainly through certain representations resulting from observations or higher-level post-processing, feature extraction etc. The resulting representations will be called patterns:
Definition 1 (Pattern). We denote an element x from a pattern space
X as a pattern. Patterns play the role of the observations of
objects which are to be processed by pattern analysis
techniques.
A pattern is in particular not required to be vectorial; it can be an arbitrary object. If it happens to be a vectorial representation, it will be denoted boldface x ∈ X, as done for general vectorial variables.
We make a structural assumption on the generation of the patterns. We restrict ourselves to the knowledge of a set of transformed patterns for each sample. The transformation knowledge is the assumption that these patterns have the same or a similar meaning as the original pattern itself. So, replacing an individual point by one of its transformed patterns should keep the output of the analysis task roughly unchanged. As argued in the preceding chapter, these transformations can be of different kinds, which are all covered by the following formalization:
Definition 2 (Transformation Knowledge). We assume to have a set T of transformations t : X → X which define a set of transformed patterns Tx := {t(x) | t ∈ T} ⊂ X for any x. These patterns are assumed to have an identical or similar inherent meaning as the pattern x itself.
At this point we do not put any further assumptions on T. In particular, we do not assume an explicit parameterization of these sets, nor do we assume that they are finite. Neither do we require specific relations between the sets Tx, Tx′ of different patterns. They may be disjoint, may be equal or may intersect. Of course, for computationally dealing with these sets, one must assume enumerability of the sets, characterization of the sets by constraints or an explicit parameterization of the transformations. If such a parameterization with a real-valued parameter vector p exists, we denote the transformations as t(x, p).
A traditional way in pattern analysis of involving such transformation knowledge is to perform a preprocessing step by mapping the objects into an invariant representation in some real vector space H. Instead of working on the original patterns, the transformed samples are taken as a basis for investigation. This step was denoted feature extraction in Fig. 1.1.
Definition 3 (Invariant Function, Single Argument). We call a
function f : X → H invariant, if for all patterns x and all
transformations t ∈ T holds f(x) = f(t(x)). The vector f(x) is then
called an invariant representation or invariant feature vector of
x.
To emphasize the invariance of an arbitrary function f(x), we will occasionally denote it by I(x). In traditional invariant pattern analysis, this notion is used for transformation groups T = G, cf. [116, 115, 139]. In this case the pattern space X is nicely partitioned into equivalence classes Tx, which correspond to the orbits of the patterns under the group action. Then, invariants are exactly those functions which are constant on each equivalence class, cf. Fig. 2.1 a). Various methods for constructing such invariant features are known, e.g. normalization approaches like moments [20], averaging methods [114] or differential approaches. For a general overview of invariance in pattern recognition and computer vision we refer to [18, 85]. This definition of invariance, however, is also meaningful for equivalence classes not induced by groups.
Pattern analysis targets can often be modelled as functions of several input objects, for instance the training data and the data for which predictions are required. For such functions, different notions of invariance can be given, each with its own practical relevance. For distinguishing between these different definitions, we introduce discriminating extensions of Def. 3, which will be used throughout the thesis.
Definition 4 (Invariant Function, Several Arguments). We call a function f : X^n → H

i) simultaneously invariant, if for all patterns x1, . . . , xn ∈ X and transformations t ∈ T holds

f(x1, . . . , xn) = f(t(x1), . . . , t(xn)).

ii) totally invariant, if for all patterns x1, . . . , xn ∈ X and transformations t1, . . . , tn ∈ T holds

f(x1, . . . , xn) = f(t1(x1), . . . , tn(xn)).
Obviously, for the case of a function with a single argument, both definitions correspond to the invariance according to Def. 3. The first notion i) is used in [116] for polynomial functions under group transformations. In general, this is a common understanding of invariance: the function does not change if the whole space X is globally transformed, i.e. all inputs are transformed simultaneously with an identical
transformation, cf. Fig. 2.1 b). For example, the Euclidean distance is called translation invariant and the standard inner product rotation invariant [132, 112]. From a practical viewpoint, this type of invariance is useful, as it guarantees that the function is independent of the global constellation of the data. By this it is unaffected, e.g., by changes of the experimental setup: a simultaneously translation-invariant system can operate on data without preprocessing like centering, a simultaneously scale-invariant system will produce the same output on differently scaled datasets, making a uniform scale normalization superfluous, etc. So, these transformations can be ignored in the subsequent analysis chain.
This notion, however, does not capture the transformation knowledge as given above: it only guarantees that the function remains constant under a global transformation of the whole input space. However, if we only translate/rotate one of the several patterns, the Euclidean distance and the inner product will in general change. Therefore, we introduce the notion ii) of total invariance to denote functions which are guaranteed to maintain their value if any single argument is (or equivalently all simultaneously are) transformed independently, cf. Fig. 2.1 c). Note that this is equivalent to the statement that they are invariant as functions of one argument, fixing the remaining ones arbitrarily. The total invariance ii) implies the simultaneous invariance i).
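The distinction between i) and ii) can be checked numerically. The following toy sketch (with a translation playing the role of the transformation) illustrates that the Euclidean distance is simultaneously invariant but not totally invariant:

    import numpy as np

    d = lambda x, xp: np.linalg.norm(x - xp)   # Euclidean distance
    t = lambda z: z + 3.0                      # a translation as transformation
    x, xp = np.random.rand(5), np.random.rand(5)

    # i) simultaneous invariance: both arguments transformed identically.
    print(np.isclose(d(x, xp), d(t(x), t(xp))))   # True

    # ii) no total invariance: transforming only one argument changes the value.
    print(np.isclose(d(x, xp), d(x, t(xp))))      # in general False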
Further variations of invariance exist in invariant theory, such as
relative versus absolute invariance, covariance, semi-invariance,
etc. [116]. These notions, however, are not relevant in the
sequel.
Note that the requirement of invariance is sometimes too strict for practical problems. The points within Tx are sometimes not to be regarded as identical, but only as similar, where the similarity can even vary over Tx. Such approximate invariance is called transformation tolerance in [139], quasi-invariance in [12] or additive invariance in [16]. A well-known and intuitive example is optical character recognition (OCR): the sets Tx of 'similar' patterns might be defined as rotations of the pattern x by small rotation angles. Exact invariance is not wanted with respect to these transformations, for an invariant function is not only invariant with respect to T, but, due to transitivity, invariant to the larger set T̄ of finite compositions of transformations from T, i.e. T̄_0 := T, T̄_{i+1} := T ∘ T̄_i and T̄ := ∪_{i=0}^∞ T̄_i. The relation x ∼ x′ :⇔ T̄x ⊂ T̄x′ or T̄x′ ⊂ T̄x is exactly the desired equivalence relation. Intuitively this means that in OCR an invariant function f will not only be constant on the set of small rotations of a pattern, but by transitivity it must be constant for all rotations. As indicated in the introduction, this results in the letters M/W, Z/N, 6/9 etc. not being discriminable.
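The transitivity argument can be illustrated with plain rotation matrices; the following small sketch (angles chosen arbitrarily) shows how compositions of small rotations from T reach arbitrary rotations in the closure T̄:

    import numpy as np

    def R(angle_deg):
        # 2D rotation matrix; rotations by 5 degrees play the role of T.
        a = np.deg2rad(angle_deg)
        return np.array([[np.cos(a), -np.sin(a)],
                         [np.sin(a),  np.cos(a)]])

    # Composing 36 small rotations from T yields a 180-degree rotation,
    # an element of the closure T-bar but not of T itself.
    composed = np.linalg.multi_dot([R(5.0)] * 36)
    print(np.allclose(composed, R(180.0)))   # True

    # Hence a function that is exactly invariant under T is, by transitivity,
    # constant on all multiples of 5 degrees and cannot separate e.g. 6 from 9.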
We want to cover both the exact invariance and these transformation
tolerant cases. Therefore, the kernels proposed in the thesis are
in general not precisely invariant, but capture the similarity of
the patterns along Tx. In all cases where exact invariance is
wanted and the sets Tx form a partition of X , the proposed tools
will result in totally invariant functions.
2.2 Distances
Some constructions in the sequel will be based on suitable dissimilarity measures and are in principle also applicable to arbitrary non-invariant proximity data.
Figure 2.1: Illustration of different notions of invariance. a) Invariance of a function with a single argument, b) simultaneous invariance of a function with multiple arguments, c) total invariance of a function with multiple arguments.
So, we continue by settling the basic notions related to this. In the present work, we call arbitrary functions which represent some measure of proximity or dissimilarity of patterns dissimilarity functions. In contrast to this unspecified notion, we take the following definition of a distance function.
Definition 5 (Distance, Distance Matrix). A function d : X × X → R will be called a distance function if it is nonnegative and has zero diagonal, i.e. d(x, x′) ≥ 0 and d(x, x) = 0 for all x, x′ ∈ X. Given a set of observations xi ∈ X, i = 1, . . . , n, the matrix D := (d(xi, xj))_{i,j=1}^n is called the distance matrix.
If a dissimilarity function d does not satisfy these conditions, it can easily be given a zero diagonal by d(x, x′) := d(x, x′) − (d(x, x) + d(x′, x′))/2 and made nonnegative by d(x, x′) := |d(x, x′)|. In particular, this definition is more general than a metric, as we allow non-symmetric distances. We further do not require the triangle inequality to be satisfied, and we allow d(x, x′) = 0 for patterns x ≠ x′.
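For illustration, a minimal sketch of these two repair steps and of the distance matrix from Def. 5; the dissimilarity used here is an arbitrary made-up example, not one of the measures studied later.

    import numpy as np

    def repaired_distance(d):
        # Turn an arbitrary dissimilarity d into a distance in the sense of
        # Def. 5: first enforce a zero diagonal, then nonnegativity.
        def d_zero(x, xp):
            return d(x, xp) - 0.5 * (d(x, x) + d(xp, xp))
        return lambda x, xp: abs(d_zero(x, xp))

    def distance_matrix(d, X):
        # D = (d(x_i, x_j)) for the observations X = [x_1, ..., x_n].
        n = len(X)
        return np.array([[d(X[i], X[j]) for j in range(n)] for i in range(n)])

    # A made-up dissimilarity: non-symmetric, with a nonzero diagonal.
    diss = lambda x, xp: float(np.sum((x - xp) ** 2) + 0.3 * np.sum(x))
    X = [np.random.rand(4) for _ in range(5)]
    D = distance_matrix(repaired_distance(diss), X)
    print(np.allclose(np.diag(D), 0.0), (D >= 0).all())   # True True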
The reason for fixing such a definition is that the notion of distance is treated quite diversely in the literature. Different assumptions and requirements for a notion of distance have been discussed. At least 12 systematically differing notions of proximities can be found in [31]. The possibilities range from the strong structure S1 (Euclidean distance) to the highly unstructured S12, where the dissimilarity has no quantitative meaning except dividing the set into equivalence classes of similar objects. Our definition above refers to the structure denoted S4. If d additionally is symmetric, the corresponding structure is S3. Further categorizations of dissimilarity measures exist, e.g. [136].
In general, different choices of the distance model imply that more or less mathematical theory is available for the analysis. For instance, restricting to metrics would make the mathematical background of well-studied metric spaces available. Such strict assumptions are, however, often not satisfied by the experimental measurements which are to be analyzed. Hence, the reason for our notion above is mainly practical: when constructing dissimilarity measures in practice, it is hard to incorporate special prior knowledge while satisfying strict conditions. When investigating given dissimilarities, the data source frequently does not care about nice mathematical properties. Consequently, various existing established dissimilarity measures do not satisfy certain conditions. For instance, it has already been stated in [90] that many dissimilarity measures in pattern analysis and computer vision turn out to be non-metric. In psychometric experiments, non-symmetric dissimilarities are frequently observed. In particular, the former will be exemplified in the sequel.
Special distances which can be interpreted as an L2-norm, i.e. the metric of a Hilbert space, will play a role in the sequel, so we define, similar to [57]:

Definition 6 (Hilbertian Metric, Euclidean Distance Matrix). A distance function d on the pattern space X is called a Hilbertian metric if the data can be embedded in a real Hilbert space H by Φ : X → H such that d(x, x′) = ‖Φ(x) − Φ(x′)‖. A matrix D is called a Euclidean distance matrix if it is a distance matrix of a Hilbertian metric.
Note that these distances necessarily are semi-metrics, i.e. symmetric and satisfying the triangle inequality. A Hilbertian metric d, however, does not have to be a metric, as different patterns x ≠ x′ can still result in d(x, x′) = 0. But in this case, those patterns will in general not be discriminable, as d(x, x′′) = d(x′, x′′) holds for all other points x′′, due to the triangle inequality and the fact that x and x′ must be mapped to the same point in any embedding Hilbert space. The motivation for the latter notion of a Euclidean distance matrix is that in the case of finite data, the span of these finitely many points in any embedding real Hilbert space is finite dimensional, hence isomorphic to a real Euclidean space.
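Whether a given distance matrix is a Euclidean distance matrix can be checked numerically with the classical scaling construction: D is Euclidean exactly when the doubly centered matrix of squared distances is positive semi-definite. A minimal sketch of this standard test (not specific to this thesis):

    import numpy as np

    def is_euclidean_distance_matrix(D, tol=1e-10):
        # D is a Euclidean distance matrix (Def. 6) iff the doubly centered
        # matrix B = -1/2 * J (D*D) J is positive semi-definite,
        # where J = I - (1/n) 1 1^T (classical scaling / Torgerson).
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (D ** 2) @ J
        return np.linalg.eigvalsh((B + B.T) / 2).min() >= -tol

    X = np.random.rand(10, 3)
    D_euc = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    print(is_euclidean_distance_matrix(D_euc))        # True
    print(is_euclidean_distance_matrix(D_euc ** 2))   # typically False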
2.3 Kernel Methods
From dissimilarities, we turn our focus to the class of similarity measures, the so-called kernels, which are the fundamental ingredient in kernel methods, cf. [112]. We restrict ourselves to real-valued kernels, as the resulting applications only require those, though complex-valued definitions are possible, cf. [11].
Definition 7 (Kernel, Kernel Matrix). A function k : X × X → R which is symmetric is called a kernel. Given a set of observations xi ∈ X, i = 1, . . . , n, the matrix K := (k(xi, xj))_{i,j=1}^n is called the kernel matrix.
We emphasize that this use of the notion kernel is wider than is frequent in the literature, which often requires positive definiteness, defined below. As we will also discuss the use of non-positive-definite functions, we extend the notion kernel to cover these cases as well.
Definition 8 (Definiteness). A kernel k is called positive definite (pd) if for all n and all sets of data points (xi)_{i=1}^n ⊂ X^n the kernel matrix K is positive semi-definite, i.e. for all vectors v ∈ R^n holds v^T K v ≥ 0. If this is only satisfied for those v with 1_n^T v = 0, then k is called conditionally positive definite (cpd). A kernel is indefinite if a kernel matrix K exists which is indefinite, i.e. vectors v and v′ exist with v^T K v > 0 and v′^T K v′ < 0.
The cpd kernels are related to pd kernels [112, Prop. 2.22]: for any choice of origin O ∈ X, the kernel k is cpd if and only if k̃ is pd, where k̃ is given as

k̃(x, x′) = k(x, x′) − k(x, O) − k(O, x′) + k(O, O).     (2.1)
We denote some particular inner-product- and distance-based kernels by

klin(x, x′) := ⟨x, x′⟩                      knd(x, x′) := −‖x − x′‖^β,   β ∈ [0, 2]
kpol(x, x′) := (1 + γ ⟨x, x′⟩)^p            krbf(x, x′) := e^{−γ ‖x − x′‖²},   p ∈ N, γ ∈ R₊.
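A minimal numerical sketch of these four kernels, of the definiteness notions of Def. 8 and of relation (2.1); the random toy data and tolerance values are chosen arbitrarily for illustration:

    import numpy as np

    def k_lin(x, xp):                 return float(np.dot(x, xp))
    def k_nd(x, xp, beta=1.0):        return -float(np.linalg.norm(x - xp) ** beta)
    def k_pol(x, xp, gamma=1.0, p=3): return float((1.0 + gamma * np.dot(x, xp)) ** p)
    def k_rbf(x, xp, gamma=1.0):      return float(np.exp(-gamma * np.linalg.norm(x - xp) ** 2))

    def kernel_matrix(k, X):
        n = len(X)
        return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

    def min_eig(K):
        return np.linalg.eigvalsh((K + K.T) / 2).min()

    X = [np.random.rand(4) for _ in range(20)]

    # klin, kpol, krbf are pd: their kernel matrices are positive semi-definite.
    for k in (k_lin, k_pol, k_rbf):
        print(k.__name__, min_eig(kernel_matrix(k, X)) >= -1e-10)   # True

    # knd is only cpd: its kernel matrix is in general indefinite, but the
    # kernel centered at an origin O as in (2.1) yields a psd matrix.
    K = kernel_matrix(k_nd, X)
    O = 0                                              # index of the origin O
    K_tilde = K - K[:, [O]] - K[[O], :] + K[O, O]
    print(min_eig(K) < 0, min_eig(K_tilde) >= -1e-10)  # True True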
Here, the linear, polynomial and Gaussian radial basis function (rbf) kernels are pd for the given parameter ranges. The negative distance kernel is cpd, which is completely sufficient for application in certain kernel methods such as SVMs, cf. [11, 107].

We now come to the main field of application of such kernel functions in pattern analysis, the so-called kernel methods, cf. [119, 112] and references therein for details on the notions and concepts. In general, a kernel method is a nonlinear data analysis method for patterns x from some set X, which is obtained by application of the kernel trick to a given linear method: assume some linear analysis method operating on vectors x from some Hilbert space H, which accesses patterns x only in terms of inner products ⟨x, x′⟩. Examples of such methods are principal component analysis (PCA), linear classifiers like the perceptron or the Fisher linear discriminant, etc. If we assume some nonlinear mapping Φ : X → H, the linear method can be applied to the images Φ(x) as long as the inner products ⟨Φ(x), Φ(x′)⟩ are available. This results in a nonlinear analysis method on the original space X. The kernel trick now consists in replacing these inner products by a kernel function k(x, x′) := ⟨Φ(x), Φ(x′)⟩: as soon as the kernel function k is known, the Hilbert space H and the particular embedding Φ are no longer required. For a suitable choice of kernel function k, one obtains methods which are very expressive due to the nonlinearity and cheap to compute, as explicit embeddings are omitted.
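As an illustration of the kernel trick, the following sketch kernelizes the perceptron mentioned above: the algorithm is written purely in terms of kernel evaluations, so any kernel matrix K can be plugged in. This is a generic textbook variant for illustration only, not a method proposed in this thesis.

    import numpy as np

    def kernel_perceptron_train(K, y, epochs=10):
        # Kernelized perceptron: the linear perceptron accesses data only via
        # inner products, so these are replaced by the kernel matrix K.
        # Returns dual coefficients alpha with f(x) = sum_j alpha_j y_j k(x_j, x).
        n = len(y)
        alpha = np.zeros(n)
        for _ in range(epochs):
            for i in range(n):
                if y[i] * np.dot(alpha * y, K[:, i]) <= 0:   # misclassified
                    alpha[i] += 1.0                          # mistake-driven update
        return alpha

    def kernel_perceptron_predict(alpha, y, K_test):
        # K_test[j, i] = k(x_j, x_test_i) for the training patterns x_j.
        return np.sign((alpha * y) @ K_test)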
The question whether a given kernel allows a representation in a Hilbert space as k(x, x′) = ⟨Φ(x), Φ(x′)⟩ is, interestingly, completely characterized by the positive definiteness of the kernel. Various methods of explicit feature space construction can be given. Theoretically most relevant is the so-called reproducing kernel Hilbert space (RKHS), which enables the embedding of the whole space X into a Hilbert space of functions. Practically most intuitive is the empirical kernel PCA map, which performs an embedding of given data points into a finite-dimensional Euclidean vector space. A modification of this will be applied in Chap. 6, where embeddings of indefinite kernels will be the basis for further investigation.
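A minimal sketch of such an empirical kernel PCA map for a psd kernel matrix; the function name and the tolerance are illustrative, and the indefinite-kernel variant of Chap. 6 is not reproduced here.

    import numpy as np

    def empirical_kernel_pca_map(K, dim=None, tol=1e-10):
        # Embed the n given patterns into a Euclidean space such that ordinary
        # inner products of the embedded points reproduce the centered kernel.
        n = K.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
        Kc = J @ K @ J
        eigvals, eigvecs = np.linalg.eigh((Kc + Kc.T) / 2)
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        keep = eigvals > tol if dim is None else slice(0, dim)
        # Rows are the embedded feature vectors of the n input patterns.
        return eigvecs[:, keep] * np.sqrt(np.clip(eigvals[keep], 0.0, None))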
By this kernel trick, various kernel methods have been developed and successfully applied in the last decade. This enables a wide variety of analysis algorithms, once a suitable kernel is chosen for the data. In addition to the flexibility of analysis algorithms, a multitude of kernel functions is also being developed. In contrast to many traditional methods, these are not restricted to vectorial representations of the objects. Instead, there are kernels for a wide range of structured or unstructured data types, e.g. general discrete structures [54], data sequences [135], strings [77], weighted automata [30], dynamical systems [124] etc.
This modularity in the choice of kernel function and the choice of analysis method is a major feature of kernel methods. Additionally, it emphasizes the importance of the kernel choice: the only view on the data which the analysis algorithm obtains is the kernel matrix K, i.e. the kernel evaluated on all pairs of input objects. Therefore, the kernel matrix is very reasonably denoted an information bottleneck in any such analysis system [119]. Hence, a good solution for any analysis task will require a well-designed kernel function.
2.4 Support Vector Machines
The most popular representative of kernel methods is the support vector machine (SVM) for classification problems. This will also be the main method to which the proposed kernels are applied in Chap. 7. In recent years SVMs have been established as methods of first choice for various learning problems in many fields of application, cf. [33, 119]. There are several reasons for their success. The main arguments for practitioners are existing fast implementations like SVMLIGHT [63], LIBSVM [21], LIBSVMTL [97] and the general ease of use, as in SVMs only few architectural decisions have to be taken: only a positive definite kernel function and some parameters have to be provided. Even the choice of these few parameters can be automatized by model selection strategies, e.g. [26].
Very important arguments for theoreticians are the foundations in statistical learning theory and the clear, intuitive geometric interpretation [129]. SVMs are hyperplane classifiers in implicitly defined Hilbert spaces. They perform optimal separation of patterns by margin maximization. This is the basis for general understanding, adequate practical application and improvements. However, this learning-theoretic and geometric interpretation is only available in the case of conditionally positive definite (cpd) kernel functions; it will be extended to the non-cpd case by this thesis in Chap. 6.
The formal derivation of the SVM training and classification
methodology can be found in almost every tutorial and textbook on
SVMs, e.g. [129, 15, 112]. It starts with the geometrical
motivation of large margin separation with linear decision lines,
derives the dual optimization problem resulting from the intuitive
primal problem, introduces slack-variables for tolerating outliers
and performs the kernelization step resulting in the powerful
nonlinear SVM. So, we refrain from reproducing these detailed
derivations and
state the final optimization problem corresponding to the so called
C-SVM. Assuming training data (xi, yi) ∈ X × {±1} for i = 1, . . .
, n and a kernel function k, the usual SVM classification approach
solves the dual optimization problem which we will refer to as
(SVM-DU)

max over α1, . . . , αn of  ∑_i αi − (1/2) ∑_{i,j} αi αj yi yj k(xi, xj)     (2.2)

subject to 0 ≤ αi ≤ C for i = 1, . . . , n, and ∑_i αi yi = 0.
Here C > 0 is a factor to be chosen in advance penalizing data fitting errors; in particular, C = ∞ results in the hard margin SVM, which tolerates no outliers. Lowering C increases the tolerance against margin violations and results in the soft margin SVM. The classification of new patterns x is then based on the sign of
f(x) = ∑_i αi yi k(xi, x) + b,     (2.3)
where b is determined such that f has identical absolute values on
unbounded SVs, e.g. by choosing xk and xl of opposite classes yk =
+1 and yl = −1 with 0 < αk, αl < C and setting
b := −(1/2) ∑_i αi yi (k(xi, xk) + k(xi, xl)).     (2.4)
The interpretation of SVMs as separation of convex hulls [32, 10] even suggests that the value b could be chosen slightly differently in the soft margin case.
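For concreteness, the following sketch shows the C-SVM (2.2)-(2.4) applied with a precomputed kernel matrix, which is the route by which arbitrary, e.g. distance-based, kernels can be supplied. It assumes that scikit-learn is available and uses its SVC interface as an illustration; it is not the SVMLIGHT/LIBSVM/LIBSVMTL setup used in the experiments of Chap. 7.

    import numpy as np
    from sklearn.svm import SVC

    def rbf_kernel_matrix(A, B, gamma=0.5):
        # krbf evaluated between the rows of A and the rows of B.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    # Toy training data (x_i, y_i) with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(40, 2))
    y_train = np.sign(X_train[:, 0] * X_train[:, 1])
    X_test = rng.normal(size=(10, 2))

    # C-SVM on a precomputed kernel matrix: any symmetric kernel, e.g. a
    # distance-based one, can be supplied this way instead of the built-in rbf.
    clf = SVC(C=10.0, kernel="precomputed")
    clf.fit(rbf_kernel_matrix(X_train, X_train), y_train)
    y_pred = clf.predict(rbf_kernel_matrix(X_test, X_train))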
2.5 Goals for Invariance in Kernel Methods
SVMs are regarded as discriminative approaches, in contrast to generative approaches. The latter involve structural knowledge about the generative origin of the patterns. A reasonable goal is stated in [59]: “An ideal classifier should combine these two complementary approaches.” This also holds for other discriminative kernel methods. As transformation knowledge can be seen as a kind of generative knowledge, the approaches described in the thesis can be seen as combining generative and discriminative ideas, if used in discriminative analysis methods. We briefly list the desired properties which will be addressed and satisfied by the proposed approaches.
1. Various Kernel Methods: The approaches should not be
learning-task-specific, especially not restricted to SVMs or other
optimization-based techniques. In principle, the approaches should
support arbitrary analysis methods implicitly working in the
kernel-induced feature space.
2. Various Kernel Types: The approaches should be applicable to various kernel functions. In particular, arbitrary distance-based or inner-product-based kernels should be supported, in contrast to methods which are only used with the Gaussian rbf.
3. Various Transformations: The methods should allow modelling both infinite and finite sets of transformations. The transformations should comprise group and non-group transformations, discrete and continuous ones. The modelled set T should possibly consist of exponentially many combinations of such base transformations.
4. Adjustable Invariance: The extent of the invariance should be explicitly adjustable from the non-invariant to the totally invariant case. In cases where the sets Tx form a partition of the pattern space, exact invariance should be possible.
5. Separability: Invariance alone is not a desirable property; e.g. any constant function is invariant, but cannot reasonably be used for pattern analysis. Equally important is the separability of the methods, i.e. the ability to discriminate between patterns. A corresponding improvement must be demonstrated.
6. Complexity: The computation of the kernels should be efficient, in particular cheaper than template matching, i.e. generating and comparing all transformed samples. The proposed methods should ideally not compute all transformed patterns explicitly, but rather avoid large parts of them. Additional acceleration by precomputation of certain pattern-specific quantities and by reuse of preceding results should be possible. The methods should result in adequately sized models.
7. Applicability: The methods should be applicable to real-world problems with respect to both computational demands and generalization performance. At least standard benchmark datasets must be treatable. The methods should compete with or outperform existing approaches.
The first goal will clearly be satisfied if we focus on the construction of kernel functions. Practical applications, however, will in the sequel only be demonstrated for SVMs. In Fig. 2.2 we indicate a general motivation for considering invariant kernels. They nicely interpolate between the well-known method of invariant features applied in ordinary kernels and the other extreme of template matching. Invariant kernels therefore conceptually cover but also extend these traditional methods. The main difference between the approaches lies in the distribution of the computational load, indicated by the area of the blue rectangles. The kernel computation process can be divided into a stage of sample-wise precomputations, e.g. the invariant features of two patterns, followed by the kernel evaluation, which uses the precomputed quantities. On the left it can be seen that the main contribution is the expensive computation of invariant features, whereas the kernel evaluation is quite cheap, usually requiring only a few arithmetic operations for the simple standard kernels. In the case of template matching on the right, few precomputations are possible; the computational load is concentrated on the kernel evaluation, which involves matching all transformed patterns against each other. Invariant kernels between these extremes allow some sample-wise precomputation of (possibly non-invariant) quantities and require somewhat more involved kernel evaluations than the simple kernels.
A more detailed discussion of the benefits of invariant kernel
functions and their relation to invariant features will follow in
the concluding Chapter 8.
2.6 Related Work
Few publications on invariance in kernel methods existed at the initial phase of the thesis, but meanwhile further methods have been proposed by other parties. Some of these methods are conceptually very interesting; however, none of them meets
Figure 2.2: Qualitative complexity comparison of invariant kernels
covering the whole spectrum from invariant feature extraction to
template matching. The computational weight is shifted from
sample-wise precomputations to the kernel evaluation.
all of the requirements listed in the previous section. Overviews of different techniques for combining transformation invariances with SV-learning are given, e.g., in [34, 112]. We give details of two methods described there, namely the virtual support vector and the jittering kernels method, as they are the most widely accepted and are therefore used for comparisons in the sequel. Further methods are briefly mentioned. In the following chapters, separate sections will be devoted to additional specific related work.
Virtual Support Vector Method: The virtual support vector (VSV) method modifies the training set. The starting point is the idea of generating virtual training data by applying a finite number n_T of transformations T = {t_i}_{i=1}^{n_T} to the training points and training on this extended set. This approach can be traced back to [87, 92].
As the size of this extended set is a multiple of the original one, training is computationally demanding, in particular if the training time scales polynomially with the number of training samples, as with SVMs. To reduce this burden, the VSV method proceeds in two steps: First, an ordinary SVM training is performed on the original training set and the set of resulting SVs, which usually has a largely reduced size, is extracted. These SVs are then multiplied by small transformations, and finally a second SVM training is performed on this extended set of samples.
The advantage of training set modification is that no kernel modification is required. All standard kernels can be applied; in particular, positive definiteness is guaranteed. This is the reason why the VSV method is the most widely accepted method for invariances in SVMs.
Problems are the two training stages and a highly increased number of SVs after the second stage, which leads to longer training and classification times [109]. Additionally, this method is only applicable for finitely many transformed samples.
The independent combination of different transformations results in an exponentially growing number of virtual SVs, which is prohibitive for several simultaneous transformations. So, this method has some problems with the goals 1, 3 and 6.
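For illustration only, a minimal sketch of the two-stage VSV procedure follows; it uses scikit-learn's SVC as a stand-in SVM solver, and transformations is an assumed list of small pattern transformations (e.g. image shifts). It is not the implementation used in the thesis.

    # Illustrative sketch of the two-stage VSV procedure (assumed setup, not thesis code).
    import numpy as np
    from sklearn.svm import SVC

    def vsv_train(X, y, transformations, C=10.0, gamma=0.02):
        # Stage 1: ordinary SVM training on the original data.
        svm1 = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
        X_sv, y_sv = X[svm1.support_], y[svm1.support_]   # extracted support vectors
        # Generate virtual SVs by applying each small transformation to the SVs only.
        X_virt = np.vstack([X_sv] + [np.array([t(x) for x in X_sv]) for t in transformations])
        y_virt = np.concatenate([y_sv] * (1 + len(transformations)))
        # Stage 2: second SVM training on the extended set of (virtual) SVs.
        return SVC(C=C, kernel="rbf", gamma=gamma).fit(X_virt, y_virt)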
Jittering Kernels: The so-called jittering kernels are also based on the idea of a finite number of small transformations T = {t_i}_{i=1}^{n_T} of the training points. Instead of performing these
transformations before training, they are performed during kernel
evaluation. Starting with an arbitrary kernel function k, the
computation of the jittered kernel kJ(x, x′) is done in two steps:
Firstly, determine the points minimizing the distance of the sets
Φ(Tx) and Φ(Tx′) in the kernel induced feature space by explicitly
performing all transformations and computing the squared distances.
Secondly, take the original kernel at these points:
(i, i′) := arg min_{i,i′=1,...,n_T} [ k(t_i(x), t_i(x)) − 2 k(t_i(x), t_{i′}(x′)) + k(t_{i′}(x′), t_{i′}(x′)) ],
k_J(x, x′) := k(t_i(x), t_{i′}(x′)).    (2.5)
This scheme nicely reflects the idea of operating in the
kernel-induced feature space. It is, however, only well-defined for
kernels with k(x, x) = const like distance-based kernels. For
inner-product-based kernels the definition is not proper. The
reason is that occasionally multiple minima (i, i′) of the
nonlinear optimization problem can yield different values if
inserted in the kernel function. So it is undecided which kernel value is to be used in this case. Additionally, the minimization of the distance between the so-called jittered example sets destroys the positive definiteness of the base kernel. The method is only
applicable to finitely many transformed patterns. So, this method
does not completely meet the goals 2, 3 and 6.
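As a small illustration (an assumption-laden sketch, not the thesis implementation), the jittered kernel of (2.5) can be computed by exhaustively comparing all transformed copies of both arguments; transformations is an assumed finite list containing the identity.

    # Illustrative sketch of a jittered kernel k_J following (2.5).
    import numpy as np

    def jittered_kernel(x, xp, base_kernel, transformations):
        Tx, Txp = [t(x) for t in transformations], [t(xp) for t in transformations]
        best_d2, k_best = np.inf, None
        for u in Tx:
            for v in Txp:
                # squared feature-space distance ||Phi(u) - Phi(v)||^2
                d2 = base_kernel(u, u) - 2.0 * base_kernel(u, v) + base_kernel(v, v)
                if d2 < best_d2:
                    best_d2, k_best = d2, base_kernel(u, v)
        return k_best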
Further Methods: In [16], invariance in kernel methods is studied theoretically and formalized for differentiable transformations. In our terms, the main idea can be expressed as requiring that the kernel function does not change in transformation directions, expressed by corresponding differential equations, i.e. for all x and x_0 it holds that ∇_x k(x_0, x) · ∂_p t(x, p) = 0. This yields partial differential equations, which must be solved to obtain invariant kernel functions. This turns out to be a highly nontrivial task, as many integrals have to be determined. For academic examples of cyclically translating a vector with few entries this seems to be applicable, as the nontrivial exact integrals solving the PDEs can be stated. This method, however, is far from being applicable to even small-sized image data.
The recent method of tangent vector kernels [93] mainly reuses the
motivation of the tangent distance kernels presented in Chap. 4,
where additional comments will follow.
The method called invariant hyperplane [110], and its nonlinear extension called invariant SVM [24], modify the method's specific optimization target. The hyperplane is altered in such a way that it globally fits all local invariance directions best. This turns out to be equivalent to a pre-whitening in feature space along directions that simultaneously fit all local invariance directions best. The advantage is that the original SVM training procedure can be used after pre-whitening the training data. The method is very elegant, as it nicely
can be interpreted interchangeably as a kernel modification or as a method changing the optimization target in terms of a tangent covariance matrix. However, this pre-whitening involves (kernel) PCA and appears to be computationally very hard in the nonlinear case. The method is therefore restricted to differentiable transformations with few parameters. Only experiments on fractions of USPS are reported for the general nonlinear case, in contrast to the whole USPS set as presented in Section 7.4.2. A similar approach has been applied to the kernel Fisher discriminant, which can also be reformulated using the tangent covariance matrix [82]. A conceptual argument against these methods might be that they effectively do not respect the local invariances: they rather use the average of the local invariance directions of all training samples in feature space. These average tangent covariance matrices per transformation parameter are then additionally averaged over multiple transformation directions.
Other conceptually appealing methods consist of sophisticated new optimization problems, which encode the sets of transformed samples. By enforcing separability constraints as in the SVM, new classifiers are produced. For instance, assuming polyhedral sets Tx results in the approach of the knowledge-based SVM [39]. If, instead, polynomial trajectories of the patterns are assumed, the resulting infinitely many constraints are condensed into a semi-definite programming (SDP) problem. This problem is more complex, but can still be solved for small sizes. However, it is again problematic for standard benchmark datasets such as the USPS digits [43].
Some further studies do not aim at total invariance in SVMs, but investigate the behaviour of an SVM under uniform global transformations. For instance, [62, Lemma 2] states that the SVM solution (the coefficients α_i and the decision function) is invariant with respect to the global addition of a real value c to the kernel function. The simultaneous rotation invariance of the Euclidean inner product and the additional translation invariance of the induced distance transfer to a similar transformation behaviour of the resulting SVM solutions with various distance or inner product kernels, cf. [1, 101].
Basically, [112] distinguishes between methods for introducing transformation knowledge into the object representation, the training set or the learning method itself. Although this is an intuitive categorization, certain methods can be assigned to several of these categories. For instance, as explained above, the invariant hyperplane method can be seen as modifying the object representations by a pre-whitening operation, but it was initially motivated as modifying the learning target of an SVM. In [73] it is argued that the extension of the training set is, in general, also equivalent to adding a suitable regularization term to the learning target.
Similar to the jittering kernels approach, we concentrate in the
following explicitly on methods modifying the learning method by
designing suitable kernel functions.
Chapter 3
Invariant Distance Substitution Kernels
This chapter proposes a method for incorporating general
transformation knowledge into kernel functions by suitable distance
functions. We first formalize a general method to construct kernel
functions from distances by distance substitution (DS) in common
distance-based or inner-product-based kernels. In particular, these DS-kernels do not explicitly require transformation knowledge. Additionally, they can be used for arbitrary distance measures. We investigate theoretical properties and comment on practical issues. Then, we focus on the main topic of the thesis, the
incorporation of transformation knowledge. We explain how many
existing distance measures represent transformation knowledge
according to Def. 2, and how such distance measures can be
constructed. Using such invariant measures in DS-kernels yields
invariant distance substitution kernels. In this part we will restrict ourselves to the theoretical framework, whereas the following
chapter and the experiments will give specific instances of the
resulting invariant kernels.
3.1 Distance Substitution Kernels
A general construction method for kernels involving arbitrary distances has been proposed in our study [49]. The resulting kernels are not only applicable to invariant distances, but can also be applied to various common dissimilarity measures. So, until Sec. 3.6 we do not explicitly require that the distance measure represents transformation knowledge. This makes the kernels very widely applicable. In particular, they offer the wide perspective of applying modern kernel methods to structural pattern recognition problems, where sophisticated distance measures, matching procedures etc. are given. Correspondingly, the experiments in Chap. 7 include both invariant and non-invariant distance measures.
In the literature, several studies construct SVM kernels from special distances d(x, x′) between objects by taking the Gaussian rbf-kernel and plugging in the problem-specific distance measure d. We give a short but possibly non-exhaustive list of these. Initial experimental results, mainly classification with certain Hilbertian metrics on histogram data, have been presented in [23]. In [8] another analysis problem is tackled, namely the segmentation of images by spectral partitioning. The Gaussian kernel with the χ2 and an
intervening contour distance is applied, the former proven to be pd, the latter stated to be indefinite. The Kullback-Leibler divergence between distributions has been used in SVMs [84] after making the kernel matrix positive definite. Recent examples of distances in kernels can be found in [57], where again SVM classification of histogram data was performed, and in [28], where DS-kernels were applied to semi-supervised learning problems. Further DS-kernel applications are available, e.g. M-estimators in [29]. We have demonstrated the applicability of transformation-invariant distances to online handwriting recognition [4] and optical character recognition [51].
The aim of this section with respect to the existing work is to give a common framework, to cover more kernel types including linear and polynomial kernels, and to discuss upcoming questions. We further explicitly allow general symmetric distances instead of regarding only Hilbertian metrics. For such an arbitrary symmetric distance measure d, the choice of an origin O ∈ X induces a generalization of an inner product by the function
⟨x, x′⟩_d^O := −(1/2) (d(x, x′)² − d(x, O)² − d(x′, O)²).    (3.1)
This notation reflects the idea that, in the case of d being the L2-norm in a Hilbert space X, ⟨x, x′⟩_d^O corresponds to the inner product in this space with respect to the origin O. Note that this is only a formal definition; usual properties like bilinearity are not satisfied. Still, a suitable interpretation as an inner product exists after suitable embeddings into pseudo-Euclidean spaces, cf. Chap. 6. With this notion we define the distance substitution kernels as follows.
Definition 9 (DS-Kernels). For any distance-based kernel k, i.e. k(x, x′) = k(‖x − x′‖), and symmetric distance measure d we call k_d(x, x′) := k(d(x, x′)) its distance substitution kernel (DS-kernel). Similarly, for an inner-product-based kernel k, i.e. k(x, x′) = k(⟨x, x′⟩), we call k_d(x, x′) := k(⟨x, x′⟩_d^O) its DS-kernel.
This notion in the case of inner-product-based kernels is
reasonable as in terms of (3.1) indeed distances are substituted.
Note that they are well-defined for these inner-product-based
cases in contrast to the jittering kernels, cf. the comments
following (2.5). In particular, for the simple linear,
negative-distance, polynomial and Gaussian kernels, we denote their
DS-kernels by
k^lin_d(x, x′) := ⟨x, x′⟩_d^O
k^nd_d(x, x′) := −d(x, x′)^β,  β ∈ [0, 2]    (3.2)
k^pol_d(x, x′) := (1 + γ ⟨x, x′⟩_d^O)^p,  p ∈ N
k^rbf_d(x, x′) := e^{−γ d(x, x′)²},  γ ∈ R⁺.
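As a small illustration (a sketch under the assumption of a user-supplied symmetric distance function d, not code from the thesis), the generalized inner product (3.1) and the simple DS-kernels (3.2) can be written as:

    # Illustrative sketch of the simple DS-kernels (3.2) for an arbitrary symmetric distance d.
    import numpy as np

    def ds_inner(d, x, xp, O):
        # <x, x'>_d^O := -1/2 (d(x,x')^2 - d(x,O)^2 - d(x',O)^2), cf. (3.1)
        return -0.5 * (d(x, xp) ** 2 - d(x, O) ** 2 - d(xp, O) ** 2)

    def k_lin(d, x, xp, O):
        return ds_inner(d, x, xp, O)

    def k_nd(d, x, xp, beta=1.0):                 # beta in [0, 2]
        return -d(x, xp) ** beta

    def k_pol(d, x, xp, O, gamma=1.0, p=2):
        return (1.0 + gamma * ds_inner(d, x, xp, O)) ** p

    def k_rbf(d, x, xp, gamma=1.0):
        return np.exp(-gamma * d(x, xp) ** 2)

    # With the Euclidean distance the standard kernels are recovered:
    d_eucl = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))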
Of course, more general distance- or dot-product-based kernels exist and corresponding DS-kernels can be defined, e.g. sigmoid, multiquadric, Bn-spline, exponential, rational quadratic or Matérn kernels [112, 40], etc. The motivation for this ad-hoc definition of the simple DS-kernels is twofold: In the case of Euclidean distances, this obviously recovers the standard kernels. Additionally, in the case of non-Euclidean distances, the kernels are expected to reflect the same behaviour as the standard kernels, in particular the nonlinearity and hence improved separability of the polynomial or the Gaussian rbf-kernel. Indeed, these properties can be validated experimentally. Further, the handling of the kernels in practice is expected to be similar, e.g. with respect to parameter selection.
A comment on the computational complexity is in order, in particular for the case of expensive distances. Equation (3.1) might indicate that the inner-product-based kernels are several times slower than the distance-based ones. For single kernel evaluations this is true, as indeed three distance computations are required instead of one. However, as is typical in kernel methods, a whole kernel matrix must be computed for the analysis of a dataset. If the distances of the data points to the origin O are computed in advance in linear time, the dominating operation is the O(n²) computation of the pairwise distances between the data points. So, in terms of distance computations the slowdown factor reduces from 3 to arbitrarily close to 1 for large enough datasets. Computationally, the simple DS-kernels in (3.2) can therefore be implemented with a comparable number of distance evaluations. A generally very convenient property of the DS-kernels is the possibility of precomputing the distance matrices. Usual model selection in kernel methods requires modification of basic kernel parameters such as γ, β, p for the simple kernels. These variations are completely independent of the used distances, which might be arbitrarily complex. So, DS-kernels allow fast kernel parameter variations and model selection.
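A minimal sketch of this precomputation argument (an illustration with assumed inputs, not thesis code): the pairwise distance matrix and the distances to the origin are computed once, and the kernel matrices for new parameter values then follow by cheap elementwise operations.

    # Illustrative sketch: precompute distances once, then vary kernel parameters cheaply.
    import numpy as np

    def distance_matrices(X, d, O):
        n = len(X)
        D = np.array([[d(X[i], X[j]) for j in range(n)] for i in range(n)])  # O(n^2) distances
        dO = np.array([d(x, O) for x in X])                                  # O(n) distances
        return D, dO

    def kernel_matrices(D, dO, gamma, beta, p):
        G = -0.5 * (D ** 2 - dO[:, None] ** 2 - dO[None, :] ** 2)   # <x_i, x_j>_d^O, cf. (3.1)
        return {"lin": G,
                "nd": -D ** beta,
                "pol": (1.0 + gamma * G) ** p,
                "rbf": np.exp(-gamma * D ** 2)}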
We continue with an investigation of the DS-kernels’ general properties. As the kernels provide a common framework for the existing experimental studies listed above, the conclusions of the following considerations are similarly valid for the cited studies.
3.2 Definiteness of DS-Kernels
The most interesting question posed about new kernels is whether they are (c)pd. In fact, for the DS-kernels given by (3.2) the definiteness can be summed up quite easily. The necessary tools are well known from early references [140, 104, 105, 81, 42], which have since been reformulated in various ways, cf. [11, 112]. Very nice and extensive characterizations for constructing pd and cpd functions from Hilbertian metrics are given in [104], based on the notion of completely monotone functions: a continuous f(t), t ≥ 0, is called completely monotone if (−1)^k f^(k)(t) ≥ 0 for all 0 < t < ∞ and k ∈ N₀, a criterion which can easily be checked for a given function, e.g. by mathematical induction.
Proposition 10 (Definiteness of General DS-Kernels). Let d be a
Hilbertian metric.
i) If k(t) is completely monotone, then k(d²) is pd.
ii) If −(d/dt) k(t) is completely monotone and k(0) = 0, then k(d²) is cpd.
Proof. i) is one direction of [104, Thm. 3], for which a very elegant proof has recently been given in [5]. ii) is a combination of one direction of [104, Thm. 6’] with a Hilbert space representation of cpd functions [11, p. 82].
These statements allow us to conclude that if d is a Hilbertian metric, then e.g. the kernels 1/(γ + d²) and e^{−γd²} are pd for γ > 0. Further statements about the definiteness of more general dot-product- or distance-based kernels are possible, e.g. by Taylor series arguments, cf. [11, p. 70, Cor. 1.14]. Additionally, known closure properties of the set of positive definite kernels can be applied, e.g. positive linear combinations of pd functions yield pd functions, products of pd functions are pd, etc. These result in further DS-kernels.
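As a brief worked illustration (added for clarity; not part of the original argument), the complete monotonicity check for the two kernels just mentioned reads as follows: for f(t) = e^{−γt} one has (−1)^k f^(k)(t) = γ^k e^{−γt} ≥ 0, and for f(t) = 1/(γ + t) one has (−1)^k f^(k)(t) = k!/(γ + t)^{k+1} ≥ 0, so both functions are completely monotone and Prop. 10 i) yields that e^{−γd²} and 1/(γ + d²) are pd for any Hilbertian metric d.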
For simple DS-kernels, the (c)pd-ness can be nicely concluded with equivalences in all combinations:
Proposition 11 (Definiteness of Simple DS-Kernels). The following statements are equivalent for a (nonnegative, symmetric, zero-diagonal) distance d:
i) d is a Hilbertian metric
ii) k^nd_d is cpd for all β ∈ [0, 2]
iii) k^lin_d is pd
iv) k^rbf_d is pd for all γ ∈ R⁺
v) k^pol_d is pd for all p ∈ N, γ ∈ R⁺.
Proof. i) implies ii): [112, Prop. 2.22] covers the case β = 2 and Prop. 10 ii) settles the statement for arbitrary β ∈ [0, 2], as −(d/dt)(−t^α) is completely monotone for 0 ≤ α ≤ 1. The reverse implication ii) ⇒ i) follows by [112, Prop. 2.24]. The equivalence of ii) and iii) is also a consequence of [112, Prop. 2.22]. [105, Thm. 1] states the equivalence of ii) and iv). Statement v) follows from iii), as the set of pd functions is closed under products and linear combinations with positive coefficients. The reverse can be obtained from the pd functions (1/γ) k^pol_d: with p = 1 and γ → ∞ these functions converge to ⟨x, x′⟩_d^O, hence the latter is also pd.
In practice, however, problem-specific distance measures often lead to DS-kernels which are not (c)pd. A practical criterion for disproving (c)pd-ness is the following: if d violates the triangle inequality, then it is in particular not a Hilbertian metric, so the resulting DS-kernel k^nd_d is not cpd and k^lin_d, k^rbf_d, k^pol_d are not pd due to Prop. 11. This can immediately be applied to kernels based on the tangent distance (TD) [51], the dynamic time warping (DTW) distance [4] or the Kullback-Leibler divergence [84]. Also, applying further common non-metric distance measures, e.g. [60], will therefore produce non-pd kernels. Note that for certain values of β, γ, p, the resulting DS-kernels are possibly (c)pd, but not for all parameters simultaneously. Recall also that the reverse is not true, i.e. metricity is not sufficient for (c)pd-ness of the resulting DS-kernels. In particular, the Lp-metrics for p > 2 can be shown to produce non-pd DS-kernels, in contrast to lower values of p, as discussed in the next section.
If a kernel is not positive definite in general, it can still be positive definite on a certain given dataset. Even if it is not positive definite on a given dataset, it can be applied in certain kernel methods, as long as they can work with indefinite matrices or tolerate a certain amount of negative eigenvalues, e.g. matrix centering, inversion, application of SVMs or the kernel perceptron. Details of the effects of indefinite kernels on SVMs will be elucidated in Chap. 6. In real-world problems, the indefinite kernel matrices often still offer a high degree of positivity. For instance, the krbf-DS-kernels will always generate positive semi-definite 2 × 2 matrices: the assumptions on the distance d imply a matrix diagonal with value 1 and off-diagonal elements between 0 and 1, which always yields two nonnegative eigenvalues. For larger matrices, the number of negative eigenvalues and the overall sum of negative eigenvalues turn out to be small.
We demonstrate this for the simple DS-kernels on various datasets
in Fig. 3.1. The datasets are denoted as proteins, kimia-1,
cat-cortex and unipen-1 and will be mainly used for the experiments
in Chap. 7. Details on the setup and the sources of the datasets
are given in Appendix A. Here it is sufficient to know that they consist of problem-specific distance matrices with two- or multiclass labellings. The distances used are a) an evolutionary distance between proteins, b) a modified Hausdorff distance between
Figure 3.1: Illustration of DS-kernel spectra for different
distance matrices and the resulting characteristics n−, n+ (number
of negative/positive eigenvalues) and the ratio r of the overall
negative and positive variance. a) knd on proteins, b) krbf on
kimia-1, c) klin on cat-cortex, d) kpol on unipen-1.
binary shape-images, c) a dissimilarity based on the connectivity
strengths of regions in a cat’s cerebral cortex and d) a dynamic
time warping distance between handwritten characters from the
UNIPEN dataset. In addition to the sorted eigenvalue spectra, we added some quantitative measures of the definiteness. These comprise n−, n+, the numbers of negative and nonnegative eigenvalues, and a measure of the negative variance contribution. As the eigenvalues of a kernel matrix can be interpreted as variances of appropriate data embeddings, the sums correspond to the amount of information (variance) captured in the negative and positive directions, respectively. So, we define the ratio r of the negative to the positive eigenvalue sum

r(K) := (∑_{λi<0} |λi|) / (∑_{λi>0} λi) ∈ [0, ∞].    (3.3)

Ideally, this quantity is 0 for a positive definite matrix, whereas it tends to infinity if a matrix has only negative eigenvalues. The upper row shows the spectra of the distance-based kernels. Plot a) contains the spectrum of the knd kernel matrix on the proteins dataset, b) the krbf kernel on the kimia-1 dataset. krbf nicely demonstrates how the Gaussian kernel is largely dominated by the positive eigenvalues; only
few negative values are observed. One single large negative
eigenvalue of knd would be acceptable, as cpd kernel matrices have
at most one (arbitrarily large) negative eigenvalue. The second row
depicts the spectra of the inner-product-based kernels obtained by
choosing the point with minimum squared-distance sum as the origin O. The reason for this choice will be given in Sec. 3.5. The klin kernel on the cat-cortex dataset in plot c) is also largely dominated by the positive eigenvalues. Finally, the kpol kernel on the unipen-1 data in d) demonstrates similar properties. These are not accidental patterns, but can be widely observed: even for kernel matrices based solely on random distances, these characteristics appear.
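A minimal sketch (with assumed inputs; not the code used for Fig. 3.1) of how these characteristics can be computed from a symmetric kernel matrix K:

    # Illustrative definiteness diagnostics: n-, n+ and the ratio r(K) from (3.3).
    import numpy as np

    def spectrum_characteristics(K, tol=1e-10):
        lam = np.linalg.eigvalsh((K + K.T) / 2.0)          # eigenvalues of the symmetrized matrix
        n_neg = int(np.sum(lam < -tol))                    # number of negative eigenvalues
        n_pos = int(np.sum(lam >= -tol))                   # number of nonnegative eigenvalues
        pos_sum = np.sum(lam[lam > 0])
        r = np.sum(np.abs(lam[lam < 0])) / pos_sum if pos_sum > 0 else np.inf
        return n_neg, n_pos, r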
3.3 Examples of Hilbertian Metrics
As argued in the preceding section, practical dissimilarity measures will frequently result in indefinite DS-kernels. Learning from such kernels is indeed possible, as we will argue in Chap. 6, but it is still largely uninvestigated. One purist attitude towards kernels is not to accept anything beyond the positive definite ones. Therefore, in this section we want to comment on why DS-kernels can be useful even from this purist view.
Actually, many cases exist where DS-kernels can be used completely safely, as the distances indeed are Hilbertian metrics. For some distance measures, the relation to an L2-norm is apparent. An example is the Hellinger distance H(p, p′) between probability distributions, which is defined by (H(p, p′))² := ∫ (√p − √p′)². This is obviously the usual L2-norm applied to the square roots of the distributions. However, the class of distances which are Hilbertian metrics is much wider than the obvious forms d = ‖x − x′‖₂.
For instance, many of the simple Lp-norms ‖·‖_p are related to Hilbertian metrics for p other than 2. The paper [6] concludes that −‖x − x′‖₁ is cpd, based on the older result from [105, 81] that −‖x − x′‖₂ is cpd. Further conclusions are that −‖x − x′‖_p^p is also cpd for p ∈ (0, 2]. This implies that ‖x − x′‖₁^{1/2} and, more generally, ‖x − x′‖_p^{p/2} are Hilbertian metrics. As fractional powers of nonnegative cpd functions may be taken, e.g. [11, p. 78], this implies that ‖x − x′‖_p^q is a Hilbertian metric for p ∈ [0, 2] and q ∈ [0, p/2].
binary random variables i, j and the frequen-
√
1 − S(i, j), where S is one of various suit- able similarity
measures [42]. For various choices of S = Sθ or S = Tθ, the
resulting dissimilarity will generate Euclidean distance matrices.
The reference proves this for parametric families of
dissimilarities
Sθ = A + D
A
A + θ(B + C) ,
if 0 < θ ≤ 1 for Sθ and 0 < θ ≤ 1/2 for Tθ. The same
exposition investigates several dissimilarities for real valued
data, and concludes their metric and Euclidean properties. Among
others, the measure
d(x,x′) := ∑
i
is proven to generate Euclidean distance matrices. Similarly, [8] proves very elegantly that k^rbf_d with d = √χ² is pd, where χ²(x, x′) := ∑_i (x_i − x′_i)² / (x_i + x′_i), i.e. that √χ² is a Hilbertian metric. Only looking at the χ²-distance, the corresponding Hilbert space is not apparent.
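For illustration (a sketch with assumed histogram inputs, not thesis code), both of the Hilbertian metrics mentioned above can be computed directly and substituted into the Gaussian DS-kernel:

    # Illustrative Hellinger distance and sqrt(chi^2) on nonnegative histograms p, p'.
    import numpy as np

    def hellinger(p, pp):
        p, pp = np.asarray(p, float), np.asarray(pp, float)
        return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(pp)) ** 2))

    def sqrt_chi2(p, pp, eps=1e-12):
        p, pp = np.asarray(p, float), np.asarray(pp, float)
        return np.sqrt(np.sum((p - pp) ** 2 / (p + pp + eps)))

    def k_rbf_d(d, x, xp, gamma=1.0):
        # Gaussian DS-kernel with substituted distance d; pd for Hilbertian d.
        return np.exp(-gamma * d(x, xp) ** 2)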
Other examples of Hilbertian metrics are given in a recent paper
[57]. The objects between which dissimilarities are constructed are
probability measures on an arbitrary set X . A parametric class of
Hilbertian metrics is constructed based on suitable integrals of Hilbertian metrics on R+. Interesting spec