
Legged locomotion: towards reinforcement learning dynamics interpretation

E.A.C.S. Peeters
DCT 2009.33

Master’s thesis

Coaches: prof. dr. ir. P.P. Jonker 1,2

dr. ir. D. Kostic 1

Supervisor: prof. dr. ir. P.P. Jonker 1,2

prof. dr. ir. H. Nijmeijer 1

Committee: Supervisor and coaches (see above)
ir. E. Schuitema 2

dr. K. Tuyls3

1 Technische Universiteit Eindhoven, Department of Mechanical Engineering, Dynamics and Control Technology Group, Eindhoven, The Netherlands

2 Technische Universiteit Delft, Department of Mechanical, Maritime and Materials Engineering, Delft, The Netherlands

3 Technische Universiteit Eindhoven, Department of Industrial Design, Eindhoven, The Netherlands

Eindhoven, April, 2009


Abstract

People have been dreaming of machines capable of human-like locomotion for decades. Legged locomotion seems straightforward, as it is part of our everyday life, but it turns out that balance, the inherent non-linearity, the enormous set of environments the machine should deal with and the diversity of possible tasks that a legged machine is expected to be capable of are still too challenging. It could be argued that some kind of rapid learning and generalization capability is required for balance, since not every task and environment can be precisely preprogrammed. This work focuses on learning and balance.

The most general definition of balance states that during any motion of the mechanism some fall-states may not be visited. These fall-states are characterized by having one or more undesirable contacts with the ground. Classical balance conditions like the zero-moment point are neither sufficient nor necessary to guarantee balance according to this general condition. Here, balance by reinforcement learning is explored to satisfy the general balance condition.

Previous works on balance of legged locomotion by reinforcement learning mostly use rather ad hoc reward structures consisting of a composition of many kinds of rewards. As a consequence, the meaning of what is being learned is often not clear. This work argues that the meaning of what is being learned can be controlled by the reward rule. It is demonstrated for some simple example systems that, by restricting the reward rule, the basin of attraction of some goal region can be reconstructed from the reinforcement learning data. This could be useful in many learning problems for dynamical systems.

The possibility to compute basins of attraction of goal regions is used in a second step to learn complex compound motions. A complex motion can be obtained by taking a learned motion, consisting of a goal region and a basin of attraction, and connecting it to another. The goal region of the first task must intersect the basin of attraction of the goal region of the second task in order to connect them. Thus a connectivity rule is available. These compound motions are demonstrated on an extended bouncing ball problem.

Lastly, it is argued that these compound motions are suitable to learn balance. To this end, the statement that some set of fall-states should be avoided is first transformed into the statement that a non-fall state should be reachable. Fall and non-fall states are separately collected in target regions which have distinct contact configurations. In this way the total balance learning problem is divided into a set of subtasks. Every subtask learns how to reach a particular next non-fall contact configuration. If some non-fall contact configuration is reachable, the motion is said to be balanced over one transition. By using the connectivity condition, balance of any sequence of transitions can be computed.


Samenvatting

For decades, walking machines, and humanoid robots in particular, have captured the imagination of many. Who would not want a personal robot assistant that can take over our tasks? Walking seems easy to us, yet the current state of science and technology falls short of letting a walking robot move through everyday environments. Balance, the inherent non-linearity and the multitude of environments and tasks are some of the open challenges. Because not every task and environment can be precisely preprogrammed, there are simply too many, ways are sought to let robots learn, generalize, anticipate, and so on. This work focuses on balance and learning.

The most general definition of balance states that a certain set of states, in which the robot makes contact with the ground with a part other than the foot, must be avoided at all times. Conventional balance conditions for walking machines, such as the zero-moment point, are neither sufficient nor necessary for balance. This work attempts to find, by means of reinforcement learning, a controller/condition that satisfies the general definition of balance.

Earlier works on learning balance with reinforcement learning mostly use ad hoc reward structures composed of several different kinds of rewards. As a consequence, the data generated by the learning process has no clear meaning. This work argues that the meaning of that data can be fixed by the definition of the reward structure. It is demonstrated for some simple example systems that, for instance, the domain of attraction belonging to an equilibrium point or goal region can thereby be reconstructed. This result can be useful for many systems.

The ability to determine domains of attraction of goal regions is used in a second step to compose more complex tasks. Two subtasks, each consisting of a goal region and the domain of attraction of that goal region, can be connected if the goal region of the first subtask lies within the domain of attraction of the goal region of the second subtask. The existence of a connectivity criterion between subtasks has the great advantage that the connections between subtasks do not have to be learned separately. Some compound motions are demonstrated in an example in which a bouncing ball can bounce on different surfaces.

Finally, we present an approach with which it may be possible to learn balance. To this end, the definition of balance is converted from avoiding all fall states into reaching a non-fall state. The fall and non-fall states are grouped separately into target regions that differ in their contact configuration. In this way, learning balance is divided into learning a number of subtasks. Every subtask learns how another contact configuration can be reached. By using the connectivity criterion it can be determined, for every sequence of transitions, each of which is balanced by definition, whether it is possible.


Contents

1 Introduction

2 Modeling and simulation of legged robotic locomotion
  2.1 Hybrid systems framework
  2.2 Complementarity Lagrangian Framework
      2.2.1 Multi-body dynamics
      2.2.2 Impact Law
      2.2.3 Friction Law
  2.3 Template models
      2.3.1 Compass Walker
      2.3.2 Four-link two-dimensional walker
      2.3.3 Seven-link two-dimensional walker
  2.4 Simulation
      2.4.1 Time-stepping simulation
      2.4.2 Time-stepping versus event detection
  2.5 Conclusion

3 Balance and balanceability of legged robotic locomotion
  3.1 Legged locomotion balance
      3.1.1 Tracking of balance-constrained trajectories
      3.1.2 Limit cycle walking
      3.1.3 The energy capture balance condition
      3.1.4 Reverse engineering of balance: learning
  3.2 Legged locomotion balanceability
      3.2.1 Compass walker example
  3.3 Conclusion

4 Reinforcement learning
  4.1 Reinforcement learning and classical control
  4.2 Reinforcement learning problem
  4.3 Q-learning
  4.4 Gridworld example
  4.5 Conclusion

5 Reinforcement learning and dynamics interpretation
  5.1 Domain of attraction interpretation
      5.1.1 Cubic-linear-system example
  5.2 Balance interpretation
      5.2.1 Compass walker example
  5.3 Conclusion

6 Compound reinforcement learning: dynamics interpretation
  6.1 Compound motions
      6.1.1 Simple connectivity
      6.1.2 Complex connectivity: compound RL
  6.2 Multi-surface bouncing ball example
      6.2.1 Bouncing ball
      6.2.2 Multi-surface bouncing ball
  6.3 Balance and compound motions
      6.3.1 Integration of balance control in a larger scheme
  6.4 Conclusion

7 Conclusions and recommendations
  7.1 Conclusions
  7.2 Recommendations

Bibliography

A Mathematical review
  A.1 Convexity
  A.2 Set-valued functions
  A.3 Normal cone, tangential cone and proximal point


Chapter 1

Introduction

People have been dreaming of machines capable of human-like locomotion for decades. What is more appealing than a personal assistant capable of doing every task people are capable of in their daily life? Legged locomotion seems straightforward as it is part of our everyday life, but it turns out to be one of the hard problems in science. Already in 1961 the first industrial robot, called UNIMATE, was installed in a General Motors factory, replacing human labor in car assembly. Autonomous cars and autonomous aircraft exist, but bipedal robots that can walk like ourselves in many kinds of environments have not yet been built. What is it that makes bipedal locomotion such a challenging problem?

The major challenges in legged locomotion turn out to be the requirement for balance, the inherent non-linearity, the enormous set of environments the machine should deal with and the diversity of possible tasks that a legged machine is expected to be capable of. It could be argued that these demands require some kind of rapid learning and generalization capabilities, since not every task and environment can be precisely preprogrammed. The robot must be able to select and plan motions according to observations of its environment, to learn new motions, to generalize, and to build on previous knowledge. This makes legged locomotion challenging.

Legged machines that lack these kinds of capabilities, and as a result can only perform a limited set of simple locomotion tasks, have existed for some time. In the eighties Tad McGeer suggested the concept of passive walking, and mechanisms were built that can descend a slope in a balanced manner [23] (e.g. the Museon Walker of TU Delft, Figure 1.1a). The gain in energy due to descending in a gravitational field must be precisely canceled by the energy loss due to friction. Consequently, only a very small set of motions is possible. In the seventies the zero moment point (ZMP) concept was developed by Vukobratovic [19], [24] and robots were built that could walk actively. Probably the most advanced and famous biped robot is the Honda Asimo [27] (Figure 1.1b); it can walk, run and climb stairs. However, it can only do this in ideal environments, it consumes a lot of energy and its motions do not appear natural. In conclusion, even the state-of-the-art controlled biped robots today exhibit a rather limited set of motions. Legged machines with four or more legs are easier to balance because they have a wider base compared to their height. An example of a four-legged machine is BigDog from Boston Dynamics (Figure 1.1c).

This work focuses on balance and learning because these are some of the most challenging issues in legged locomotion. An analysis of the available literature reveals that a rather decent definition of balance is available; however, a general approach to design motions such that this definition of balance is satisfied is lacking.


Figure 1.1: some examples of legged machines. (a) Museon walker, TU Delft; (b) Asimo, Honda; (c) BigDog, Boston Dynamics.

Typically, motions are designed to satisfy certain heuristic conditions, like static stability, the zero moment point/center of pressure [24], or limit cycle walking [9]. However, none of these conditions actually guarantees balance; therefore these conditions are not sufficient [11]. In addition, none of these conditions is strictly necessary to achieve balance. In this report a different approach is taken: it is argued that it might be possible to reverse engineer a controller that approximately satisfies a general balance condition by using reinforcement learning. Reinforcement learning has been applied to control simple walkers [25], [30]; however, to the best of our knowledge no detailed studies have been performed that focus on what is actually learned by these schemes and, in particular, how the resulting controllers can be understood in terms of the dynamics of the walker. Therefore our problem statement is:

How can what is being learned by reinforcement learning be interpreted in terms of dynamics for legged locomotion?

Gradually building up complexity, firstly the interpretation of what is being learned by reinforcement learning for a straightforward dynamical system is considered. While learning, a function named the action-value function is adapted. This function represents what is being learned. Our approach is to select a reward scheme such that the basin of attraction can be reconstructed from this action-value function. Secondly, it is demonstrated for a simple example system that reinforcement learning can extend the basin of attraction, reconstructed from the action-value function, from a part of the state space to the whole state space. This result might be useful in other application domains as well. Next, a similar approach is followed on a simple 2D walker. Instead of enlarging the domain of attraction of a certain equilibrium, the domain of attraction to a set of end-of-step postures is enlarged. The maximally obtainable region of attraction to this set of end-of-step postures is known from previous work [31]. In this work it appears that the learned controller enlarges the domain of attraction to the theoretical maximum.

The region of attraction to a set of end-of-step postures for the simple 2D walker approximates the region of attraction to balance in some sense, but it does not fully characterize balance according to the most general definitions. Firstly, it is possible that the region that can be reached from one transition to another shrinks.


For example, suppose the walker descends a slope, but the slope is not steep enough to let the walker continue walking, as it loses some energy every step. The walker walks a few steps, but at a certain moment it will not be possible to make a next step. It then happens that the resulting initial condition for the next step is outside the single-step attraction basin. Secondly, the simple walker could balance without actually completing a step, e.g. by stepping backwards again. If more complex walkers are considered, there are even more options to accomplish a step, for example flat-footed walking, toe-off walking or running. Therefore, there exists a need to further expand the approach to consider more complex motions. We demonstrate a first attempt and reformulate the balance problem from not falling into reaching a certain set of target regions. Based on an interpretation of the action-value function, a division of the walking motion into modes is proposed and an algorithm to determine whether a sequence of modes can be executed is developed. This is demonstrated on a bouncing ball example.

The report is organized as follows. Chapter 2 discusses modeling of legged machines. In Chapter 3 the concept of balance for legged locomotion is introduced. Chapter 4 discusses reinforcement learning. In Chapter 5 a dynamics and balance interpretation of the action-value function in reinforcement learning is presented. Chapter 6 introduces a way to split up a walking sequence into modes and establishes criteria for connectivity between those modes. The reader who is already familiar with modeling of walkers in the hybrid systems framework, the legged locomotion balance literature or reinforcement learning may skip Chapters 2, 3 or 4, respectively.


Chapter 2

Modeling and simulation of legged robotic locomotion

In the legged locomotion literature two approaches to modeling can be distinguished. The first approach is to take an example of a walker and only model the dynamics relevant for that walker. This approach is named the template model approach in the legged locomotion literature. The advantage of the template model approach is that much analysis which cannot be performed in the general framework can be performed for particularly simple two-legged machines. But although widely applied [11], this approach is ad hoc. Therefore, more general modeling frameworks have been proposed for legged robotic locomotion [3], [11]. Modeling legged locomotion by a general framework is the second approach. As a general framework by which almost all two-legged robots can be modeled, the hybrid systems framework has been introduced. [3] restricts itself to a subclass of hybrid dynamical systems named Lagrangian complementarity systems.

Both approaches are needed in this report, because a division into subtasks for control of legged locomotion is designed based on the general framework, but it is also demonstrated how a particular reward structure in reinforcement learning can provide insight into the dynamics of a simple compass walker for which analytic results are available. The compass walker unfortunately does not fit in the general complementarity Lagrangian framework, because it requires an extra mode to prevent ground scuffing when the legs are parallel. It does, however, fit in the more general hybrid systems framework.

This chapter is organized as follows. Section 2.1 introduces the hybrid systems framework. Section 2.2 details a subclass of hybrid dynamical systems named complementarity Lagrangian systems. Section 2.3 introduces template models, especially the compass walker template model. Section 2.4 discusses how the introduced models can be simulated.

2.1 Hybrid systems framework

Walkers exhibit complex behavior. Their dynamics are determined by the interaction of many interconnected bodies which are subject to a gravitational field. This fact alone already results in highly non-linear dynamics. But in order to locomote, the set of interconnected bodies must be propelled by pushing against the environment at closed contacts. These contacts can open and start sliding if a tangential force threshold with respect to the contact surface is exceeded. In order to locomote, multiple contacts must be opened and closed in a certain spatio-temporal pattern.


In addition to that, the walkers are subject to internal flexibilities, motor dynamics and a continuously changing environment. In this report such flexibilities and motor dynamics are omitted.

Every time the system enters a new contact arrangement, the dynamics of the walker changes. Therefore it is natural to think of all possible contact situations as separate modes, between which the system switches if some condition is met, i.e. if a contact is opened, closed or starts sliding. A general mathematical framework in which dynamical systems with multiple modes are considered is the framework of hybrid dynamical systems. A hybrid dynamical system is a combination of discrete models (finite state machines / automata) and continuous models. A general hybrid dynamical system can be expressed in the following way [8]: a hybrid automaton is a collection H = (Ω, X, f, Init, Inv, E, G, R), where

• Ω = {ω1, ω2, ..., ωN} is a finite set of modes (discrete states)

• X = R^n is the set of continuous states

• f : Ω × X → X is a vector field

• Init ⊆ Ω × X is a set of initial states

• Inv : Ω → P(X) describes the invariant

• E ⊆ Ω × Ω is the set of edges (transitions)

• G : E → P(X) describes the guard

• R : E → P(X × X) describes the reset condition

Such a general hybrid automaton can be represented by the graph depicted in Figure 2.1.

The meaning of Ω, X, f and Init is straightforward. The invariant Inv states, for every mode, the conditions on the states for which the system stays in that particular mode. The set of transitions E contains all possible transitions from one mode to another. Its elements are denoted by the departure mode and the arrival mode, e.g. E = {(ω1, ω2), (ω2, ω1)}. Every transition has a guard G that specifies when the transition happens. The reset map R specifies how the continuous states after the transition are related to the continuous states before the transition.
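The tuple above maps naturally onto a small data structure. The following Python sketch is not part of the thesis; it is an illustration under the assumptions of string-valued mode names and an explicit-Euler continuous flow, and it shows one way to encode the modes, invariants, guards and reset maps and to advance the automaton by one step.

from dataclasses import dataclass
from typing import Callable, Dict, Tuple
import numpy as np

State = np.ndarray

@dataclass
class HybridAutomaton:
    # H = (Omega, X, f, Init, Inv, E, G, R); Omega and E are implicit in the dictionary keys
    f: Dict[str, Callable[[State], State]]                  # vector field per mode
    inv: Dict[str, Callable[[State], bool]]                 # invariant per mode
    guard: Dict[Tuple[str, str], Callable[[State], bool]]   # guard per edge (src, dst)
    reset: Dict[Tuple[str, str], Callable[[State], State]]  # reset map per edge

    def step(self, mode: str, x: State, dt: float) -> Tuple[str, State]:
        # continuous evolution while the invariant holds (explicit Euler, for brevity)
        if self.inv[mode](x):
            return mode, x + dt * self.f[mode](x)
        # otherwise take the first enabled discrete transition out of this mode
        for (src, dst), g in self.guard.items():
            if src == mode and g(x):
                return dst, self.reset[(src, dst)](x)
        raise RuntimeError("blocking state: no continuous or discrete progress possible")

# example (commented): a bouncing ball as a single flight mode with a self-transition at impact
# ball = HybridAutomaton(
#     f={"flight": lambda x: np.array([x[1], -9.81])},
#     inv={"flight": lambda x: x[0] > 0 or x[1] > 0},
#     guard={("flight", "flight"): lambda x: x[0] <= 0 and x[1] <= 0},
#     reset={("flight", "flight"): lambda x: np.array([0.0, -0.5 * x[1]])},
# )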

Other important definitions in hybrid systems are the Reach set and the Out set. Loosely, the Reach set contains all state-mode combinations that are reachable from some other state-mode combination, and the Out set contains all states from where continuous evolution is impossible (a mode change may however occur, or no transition may be possible). Stated in another way, the Out set is the complement of the invariant.

Definition 2.1 (Reach). A state $(\omega, x) \in \mathrm{Reach}$ if there exists a finite execution $(\tau, \omega, x)$ with $\tau = \{[\tau_i, \tau_i']\}_{i=0}^{N}$ and $(\omega(\tau_N'), x(\tau_N')) = (\omega, x)$.

Definition 2.2 (Out). $\mathrm{Out} = \{(\omega_0, x_0) \in \Omega \times X \mid \forall \varepsilon > 0,\ \exists t \in [0, \varepsilon)\ \text{s.t.}\ x_{\omega_0, x_0}(t) \notin \mathrm{Inv}(\omega_0)\}$.

Using the above definitions, well-posedness of a hybrid dynamical system can be stated. Well-posedness means that, given an initial condition, a solution exists and is unique. First of all, inside the modes a sufficient condition for well-posedness is the Lipschitz continuity condition. Secondly, it must hold that the hybrid automaton is deterministic, i.e. each set of next states has 0 or 1 element. Thirdly, the hybrid automaton must be non-blocking, i.e. continuous or discrete progress must always be possible.


Figure 2.1: Schematic representation of a general hybrid system with 3 modes.

A hybrid automaton is deterministic if (1) all guards lie in the Out set (and thus not in the invariants), (2) there are no overlapping guards and (3) reset maps do not induce multiple possible next states. A hybrid automaton is non-blocking if there exists a transition for all reachable states in the Out set.

Modeling of most legged machines does not require the complete general hybrid systems framework. Legged machines have some special properties. Firstly, in all modes the dynamics are determined by Lagrangian or Eulerian dynamical equations of motion. Secondly, all events that occur relate to a change of contact situation and thus all events exhibit similar behavior. Thirdly, contact dynamics are typically specified by a set of unilateral constraints.

2.2 Complementarity Lagrangian Framework

A more specific framework that deals with the kind of dynamics typically present in legged locomotion is the complementarity Lagrangian framework ([7], [3]) with impact and friction laws. This framework can be represented by the following equations of motion:

\[ M(q)\ddot{q} + H(q, \dot{q}) = S F^{app} + W_N \lambda_N + W_T \lambda_T \qquad (2.1a) \]
\[ 0 \le g_N \perp \lambda_N \ge 0 \qquad (2.1b) \]
\[ \gamma_N^+ = -e_N \gamma_N^- \qquad (2.1c) \]
\[ \lambda_T = -\mu \lambda_N \,\mathrm{sign}(\gamma_T) \qquad (2.1d) \]


Herein $q$ is a vector containing the generalized coordinates, $M(q)$ is the mass matrix, $H(q, \dot{q})$ is a vector that contains the gravity, centrifugal and Coriolis forces, $S$ is a matrix that contains the directions of the applied forces, $F^{app}$ is the vector of non-conservative applied forces, $W_N$ is a matrix that represents the normal contact force directions, $W_T$ is a matrix that represents the contact friction force directions, $\lambda_N$ is the normal force vector, $\lambda_T$ is the tangential friction force vector, $g_N$ is the normal contact distance vector, $\gamma_N^-$ is the normal contact velocity vector just before impact, $\gamma_N^+$ is the normal contact velocity vector immediately after impact, $e_N$ is the restitution coefficient, $\mu$ is the friction coefficient and $\gamma_T$ is the tangential contact velocity vector. Constraints and uncertainties are omitted.

The next three subsections elaborate on the equations of motion, the impact law and the friction law.

2.2.1 Multi-body dynamics

One way to express the equations in an Euler-Lagrange format is:

\[ \frac{d}{dt}\left(\frac{\partial K}{\partial \dot{q}}\right) - \frac{\partial K}{\partial q} + \frac{\partial P}{\partial q} = F_{nc}. \qquad (2.2) \]

Herein $q$, $K$, $P$ and $F_{nc}$ are the generalized coordinates, the kinetic energy, the potential energy and the non-conservative forces, respectively. The equations of motion are rewritten in such a way that the parts needed to derive the equations of motion by an algorithm, given the forward position kinematics, can be identified. The forward position kinematics are:

\[ x = f_{fk}(q), \qquad (2.3) \]

linking the generalized coordinates $q$ to the positions of the links in Cartesian space. The kinetic energy can be written as:

\[ K = \tfrac{1}{2}\dot{q}^T M \dot{q}. \qquad (2.4) \]

The mass matrix $M$ is defined with respect to the generalized coordinates and can be computed from the mass matrix $M_{bc}$ in the following way:

\[ v = \dot{x} = \frac{\partial f_{fk}(q)}{\partial q}\frac{dq}{dt} = J\dot{q} \qquad (2.5) \]

\[ K = \tfrac{1}{2} v^T M_{bc} v = \tfrac{1}{2}\dot{q}^T \left(\frac{\partial f_{fk}(q)}{\partial q}\right)^{\!T} M_{bc}\,\frac{\partial f_{fk}(q)}{\partial q}\,\dot{q} = \tfrac{1}{2}\dot{q}^T J^T M_{bc} J \dot{q} = \tfrac{1}{2}\dot{q}^T M \dot{q}. \qquad (2.6) \]

If the expression (2.4) for the kinetic energy is substituted into (2.2), the result is:

\[ \frac{d}{dt}\left(M\dot{q}\right) - \tfrac{1}{2}\dot{q}^T\frac{\partial M}{\partial q}\dot{q} + \frac{\partial P}{\partial q} = F_{nc} \qquad (2.7) \]

\[ M\ddot{q} + \dot{M}\dot{q} - \tfrac{1}{2}\dot{q}^T\frac{\partial M}{\partial q}\dot{q} + \frac{\partial P}{\partial q} = F_{nc}. \qquad (2.8) \]


Equation (2.8) can be rewritten using

\[ H \triangleq \dot{M}\dot{q} - \tfrac{1}{2}\dot{q}^T\frac{\partial M}{\partial q}\dot{q} + \frac{\partial P}{\partial q} \qquad (2.9) \]

into

\[ M\ddot{q} + H = F_{nc}. \qquad (2.10) \]

All elements for Algorithm 2.1, whose purpose is to symbolically derive $M$ and $H$, have been specified.

Algorithm 2.1 Calculate the equations of motion
inputs: $x = f_{fk}(q)$ and $M_{bc}$
outputs: $M$ and $H$
1: $J = \partial f_{fk}(q) / \partial q$
2: $M = J^T M_{bc} J$
3: $\dot{M} = \frac{\partial M}{\partial q}\dot{q}$
4: $P = f_{fk}(q) M_{bc}\, g$
5: $H = \dot{M}\dot{q} - \tfrac{1}{2}\dot{q}^T \frac{\partial M}{\partial q}\dot{q} + \frac{\partial P}{\partial q}$

2.2.2 Impact Law

Basically, the impact dynamics model must specify 1) when impacts occur, 2) what happens at the impact times and 3) what the state after impact is. Before addressing these three issues we introduce Signorini's law, which describes normal contact between rigid bodies.

Let us denote the contact distance by $g_N$, the contact distance velocity by $\gamma_N$ and the contact force by $\lambda_N$. $g_N$ and $\lambda_N$ may never be negative, because the multi-body system may not penetrate the ground and the normal force may not point into the ground (no-stick assumption). Two situations can occur: (1) $g_N = 0$ and $\lambda_N \ge 0$ or (2) $g_N > 0$ and $\lambda_N = 0$, which can be summarized by the complementarity condition (2.11).

\[ g_N \ge 0, \quad \lambda_N \ge 0, \quad g_N \lambda_N = 0 \qquad (2.11) \]

Equivalently, this complementarity can be written as an orthogonality condition:

\[ 0 \le g_N \perp \lambda_N \ge 0. \qquad (2.12) \]

The drawback of this formulation is that it is an inequality combined with an orthogonality operator and not an equation that can be readily used. Therefore, alternatively, (2.12) can be expressed as

\[ -g_N \in N_{C_N}(\lambda_N), \quad C_N = \mathbb{R}^+, \qquad (2.13) \]

using the normal cone concept (see appendix A.3). Herein the set $C_N$ is the set of admissible contact forces. Or

\[ \lambda_N = \mathrm{prox}_{C_N}(\lambda_N - r g_N), \quad C_N = \mathbb{R}^+, \; r > 0, \qquad (2.14) \]

using the proximal point concept (see appendix A.3).


Especially the proximal point formulation is useful, as it is an implicit formulation that can be used directly to express orthogonality and positiveness of the contact forces. The shift $\lambda_N$ in the regularization must be solved for, and $r$ is a steepness parameter.
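For the sets used in this report both proximal points reduce to simple projections. The Python sketch below is illustrative only (it is not thesis code): the normal-contact projection corresponds to (2.14), the tangential counterpart anticipates the friction set of Section 2.2.3, and gN_of is a hypothetical function returning the contact distance for a trial force.

import numpy as np

def prox_normal(x):
    # proximal point (projection) onto C_N = R^+, as used in (2.14) and (2.17)
    return np.maximum(x, 0.0)

def prox_friction(x, mu, lam_N):
    # proximal point (projection) onto {lam_T : -mu*lam_N <= lam_T <= mu*lam_N}
    return np.clip(x, -mu * lam_N, mu * lam_N)

# fixed-point iteration of (2.14) for a single closed contact, with steepness r:
# lam_N = 0.0
# for _ in range(50):
#     lam_N = prox_normal(lam_N - r * gN_of(lam_N))   # gN_of: hypothetical contact-distance function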

Many impact laws, including the one used here, are defined on velocity level. Therefore, the normal cone and proximal point formulations on velocity level are also given. The contact distance velocity $\gamma_N$ can be obtained by

\[ \gamma_N = \frac{\partial g_N}{\partial q}\frac{dq}{dt} + \frac{\partial g_N}{\partial t} \triangleq W_N^T \dot{q} + w_N. \qquad (2.15) \]

Herein $W_N^T \triangleq \frac{\partial g_N}{\partial q}$ contains the generalized force directions in terms of the generalized coordinates $q$, and $w_N \triangleq \frac{\partial g_N}{\partial t}$. The normal cone and proximal point formulations on velocity level are

\[ -\gamma_N \in N_{C_N}(\lambda_N), \quad C_N = \mathbb{R}^+ \qquad (2.16) \]

and

\[ \lambda_N = \mathrm{prox}_{C_N}(\lambda_N - r\gamma_N), \quad C_N = \mathbb{R}^+, \; r > 0, \qquad (2.17) \]

respectively. These are equivalent to the formulations on position level, provided that the initial positions satisfy the conditions on position level.

Let us answer the three questions raised at the beginning of this section. 1) When does a contact occur? Let $I_C = \{1, 2, 3, \ldots, n_c\}$ be the set of all contact points; the set of all contact points that are closed is:

\[ I_{C_{cl}} = \{ i \in I_C \mid g_{N_i} = 0 \}. \qquad (2.18) \]

2) What happens at impact? At impact, $g_N = 0$, $\gamma_N = 0$ and $\lambda_N > 0$.

3) What is the state after impact? Newton's impact law, which relates the post-impact velocity $\gamma_N^+$ to the pre-impact velocity $\gamma_N^-$, states:

\[ \gamma_N^+ = -e_N \gamma_N^-. \qquad (2.19) \]

Herein $e_N$ is the restitution coefficient. If $e_N = 1$ the impact is elastic and $\gamma_N^+$ equals $-\gamma_N^-$, and if $e_N = 0$ the impact is completely inelastic ($\gamma_N^+ = 0$).

The impact law is reformulated on impulse level, because the forces at impact can go to infinity. This yields:

\[ \xi_N \triangleq \gamma_N^+ + e_N \gamma_N^- \qquad (2.20) \]

\[ 0 \le \xi_N \perp \Lambda_N \ge 0. \qquad (2.21) \]

Hereby the impulse $\Lambda_N$ and the variable $\xi_N$ are introduced. The variable $\xi_N$ is necessary to define a complementarity that holds for all $e_N$.

If an impact occurs ($\Lambda_N > 0$), then $\xi_N = 0$ and the impact law (2.19), $\gamma_N^+ + e_N\gamma_N^- = 0 \iff \gamma_N^+ = -e_N\gamma_N^-$, holds. Equation (2.20) can be expressed in normal cone and proximal point formulation:

\[ -\xi_N \in N_{C_N}(\Lambda_N), \quad C_N = \mathbb{R}^+, \qquad (2.22) \]

\[ \Lambda_N = \mathrm{prox}_{C_N}(\Lambda_N - r\xi_N), \quad C_N = \mathbb{R}^+, \; r > 0. \qquad (2.23) \]


2.2.3 Friction Law

The well-known Coulomb friction model

\[ \lambda_T \in -\mu \lambda_N \,\mathrm{Sign}(\gamma_T) \qquad (2.24) \]

for dry friction is used in this report to relate the normal force $\lambda_N$ and the sliding velocity $\gamma_T$ to the friction force $\lambda_T$ in the direction tangential to the contact surface. $\gamma_T$ can be obtained by

\[ \gamma_T = \frac{\partial g_T}{\partial q}\frac{dq}{dt} + \frac{\partial g_T}{\partial t} \triangleq W_T^T \dot{q} + w_T. \qquad (2.25) \]

Friction forces can only exist in closed contacts $I_{C_{cl}}$ (2.18).

The set of admissible friction forces $C_T$ is then:

\[ C_T = \{\lambda_T \mid -\mu\lambda_N \le \lambda_T \le \mu\lambda_N\} \qquad (2.26) \]

\[ \xi_T \triangleq \gamma_T^+ + e_T \gamma_T^- \qquad (2.27) \]

Since $e_T$ is almost always zero, the post-impact velocity in tangential direction is typically not related to the pre-impact velocity in tangential direction. In that case $\xi_T = \gamma_T^+$.

\[ \Lambda_T \in -\mu \Lambda_N \,\mathrm{Sign}(\xi_T) \qquad (2.28) \]

The normal cone and proximal point formulations are given by

\[ -\xi_T \in N_{C_T}(\Lambda_T), \quad C_T = \{\Lambda_T \mid -\mu\Lambda_N \le \Lambda_T \le \mu\Lambda_N\} \qquad (2.29) \]

and

\[ \Lambda_T = \mathrm{prox}_{C_T}(\Lambda_T - r\xi_T), \quad C_T = \{\Lambda_T \mid -\mu\Lambda_N \le \Lambda_T \le \mu\Lambda_N\}, \; r > 0, \qquad (2.30) \]

respectively.

2.3 Template models

Template models are models of particular walkers that have certain properties which allow analysis and controller design. There are many template models available in the literature [11], but in this report only the compass walker model is used. The compass walker is the most often studied walker example because it is the simplest possible walker. As a result, much insight has been obtained into the dynamics of this walker.

2.3.1 Compass Walker

The model used in this work corresponds with the model in [31], [5], which is recapitulated here. A schematic overview of the walker is presented in Figure 2.2.


Figure 2.2: schematic representation of the compass walker.

The walker descends a slope $\gamma$ and a control input $T$ can be applied to actuate the inter-leg angle $\phi$. The system is further simplified by the assumption that all mass is concentrated in the hip mass $m_M$. The equations of motion read:

\[ \ddot{\theta} = \sin(\theta - \gamma) \qquad (2.31) \]

\[ \ddot{\phi} = \sin(\theta - \gamma)\left(\dot{\theta}^2 - \cos(\theta - \gamma)\right) + \sin(\theta - \gamma) + T \qquad (2.32) \]

When a collision occurs it holds that $\phi = 2\theta$. The collision relations read:

\[ \dot{\theta}^+_{n+1} = \cos(2\theta_{n+1})\,\dot{\theta}^-_{n+1} \qquad (2.33) \]

\[ \dot{\phi}^+_{n+1} = \cos(2\theta_{n+1})\left(1 - \cos(2\theta_{n+1})\right)\dot{\theta}^-_{n+1} \qquad (2.34) \]

The equations of motion are integrated starting from an initial condition until: 1) the robot accomplishes a successful step,

\[ \theta < 0 \;\wedge\; \phi < 0 \;\wedge\; \|2\theta - \phi\| < \epsilon, \qquad (2.35) \]

2) the robot falls,

\[ \neg(-\pi/2 \le \theta \le \pi/2), \qquad (2.36) \]

or 3) the swing leg reaches too far,

\[ \neg(-\pi \le \phi \le \pi). \qquad (2.37) \]

2.3.2 Four-link two-dimensional walker

A model of a four-link walker has been made using the Lagrangian complementarity framework (Figure 2.3). The Denavit-Hartenberg convention [20] has been applied to define the coordinates. The model can be integrated with a time-stepping solver.


Figure 2.3: schematic representation of the four-link two-dimensional walker, Denavit-Hartenberg convention.

2.3.3 Seven-link two-dimensional walker

A simulation model of a seven-link walker has been made using the Lagrangian complementarity framework (Figure 2.4). The Denavit-Hartenberg convention [20] has been applied to define the coordinates. The model can be integrated with a time-stepping solver.


Figure 2.4: schematic representation of the seven-link two-dimensional walker, Denavit-Hartenberg convention.


2.4 Simulation

The integration algorithm somehow needs to deal with the hybrid nature of the walker. One way to handle the hybrid nature is to simply integrate the system using classical routines until an event occurs, and then let a separate function select the proper new initial conditions and dynamics for the subsequent part of the simulation based on the pre-event state. Next, the classical integration scheme integrates until another event occurs. When following this approach the time steps near the event should be carefully selected, as a too large time step will cause improper event detection. Many algorithms step backward in time once an event has happened if the error is outside some bound and decrease the step size until the error is within the bound. Event detection is infeasible when the hybrid system exhibits Zeno behavior, i.e. there are infinitely many events within a finite time interval. The compass walker template model is integrated with the Matlab ode45 solver over small fixed time steps. However, in general one might prefer an integration scheme capable of handling Zeno behavior which is still efficient if the number of impacts is huge. One candidate scheme capable of doing this is time-stepping, which is explained in more detail below.

2.4.1 Time-stepping simulation

Time-stepping methods for mechanical rigid body systems use a time discretization of the generalized positions and velocities. Forces acting on the system are taken into account in an integral way over every time step; as a result no distinction is made between impulsive and finite forces. Only increments of the positions and velocities are computed. The accelerations are not computed, as they are allowed to be infinite (due to impulses). Every time step it is first determined if, and which, contacts are closed at the midpoint of the proposed time step. If some contacts are closed, the positions and velocities at the end of the time step are found by solving the contact problem. The contact problem can for example be solved by an exact regularization / an augmented Lagrangian approach. In this report Moreau's algorithm with an augmented Lagrangian approach [22] is used.

We show a simulation wherein time-stepping is used to integrate a mechanical system with impacts: the two carts system shown in Figure 2.5.

Figure 2.5: two carts system schematic.


The two carts system dynamics are defined by

\[
\begin{bmatrix} m_1 & 0 \\ 0 & m_2 \end{bmatrix} \ddot{x} =
\begin{bmatrix} -k_1 x_1 + k_2 (x_2 - x_1) \\ -k_2 (x_2 - x_1) \end{bmatrix} +
\begin{bmatrix} \lambda \\ 0 \end{bmatrix}. \qquad (2.38)
\]

Herein $m_1$ and $m_2$ are the masses, $k_1$ and $k_2$ the stiffnesses of the springs and $\lambda$ is the Lagrange multiplier representing the contact force. Furthermore, the system is subject to impacts

\[ \dot{x}_1^+ = -e_N \dot{x}_1^-. \qquad (2.39) \]

Herein $e_N$ is the restitution coefficient. The parameter values and simulation settings are shown in Figures 2.6 and 2.7; $T_e$ therein represents the simulation end time. The results of the two carts simulation using time-stepping are presented in Figures 2.8, 2.9 and 2.10.

Figure 2.6: model parameters.

                 m1    m2    k1    k2    e_N
  value          1     1     1     10    0.5

Figure 2.7: initial values.

                 x1    x2    ẋ1    ẋ2    λ
  initial value  0.3   0.3   0     0     0

Figure 2.8 shows the evolution of the states $x_1$, $x_2$, $\dot{x}_1$ and $\dot{x}_2$ as a consequence of the initial conditions in Figure 2.7. The jumps in the velocity $\dot{x}_1$ when the first cart impacts the wall are clearly visible. In Figure 2.9 the impact wall is visible as the dense region right of the line $x_1 = 0$. In Figure 2.10 the impact times and corresponding impact impulses are visible.
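To make the procedure concrete, the following Python sketch implements a Moreau-style time-stepping step for the two carts system (2.38)-(2.39) with the parameter values of Figures 2.6 and 2.7. It is an illustration, not the augmented-Lagrangian implementation of [22]; the single normal contact is resolved with the proximal-point fixed-point iteration of (2.23), and the step size is an illustrative choice.

import numpy as np

# two-cart parameters (Figure 2.6) and initial values (Figure 2.7)
m1, m2, k1, k2, eN = 1.0, 1.0, 1.0, 10.0, 0.5
Minv = np.linalg.inv(np.diag([m1, m2]))
WN = np.array([1.0, 0.0])                          # the contact acts on cart 1 only, g_N = x_1

def spring_forces(q):
    x1, x2 = q
    return np.array([-k1 * x1 + k2 * (x2 - x1),    # right-hand side of (2.38), first row
                     -k2 * (x2 - x1)])             # second row

def moreau_step(q, u, dt, r=1.0, iters=20):
    q_mid = q + 0.5 * dt * u                       # midpoint positions
    u_free = u + dt * Minv @ spring_forces(q_mid)  # velocity update without contact impulse
    Lam = 0.0
    if q_mid[0] <= 0.0:                            # contact closed at the midpoint
        gamma_minus = WN @ u                       # pre-impact normal velocity
        for _ in range(iters):                     # prox fixed-point iteration, cf. (2.23)
            xi = WN @ (u_free + Minv @ (WN * Lam)) + eN * gamma_minus
            Lam = max(0.0, Lam - r * xi)
    u_plus = u_free + Minv @ (WN * Lam)
    return q_mid + 0.5 * dt * u_plus, u_plus, Lam

q, u = np.array([0.3, 0.3]), np.array([0.0, 0.0])
for _ in range(int(60.0 / 1e-3)):                  # simulate 60 s with an illustrative dt of 1e-3 s
    q, u, Lam = moreau_step(q, u, 1e-3)

Because there is only a single contact, the prox iteration converges almost immediately; with several coupled contacts the same loop becomes a genuine fixed-point iteration over the impulse vector.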

2.4.2 Time-stepping versus event detection

The time-stepping method is particularly efficient for integrating systems in which many events occur, because the events are taken into account in an integral way. Some systems exhibit Zeno behavior, i.e. infinitely many events can occur in a finite time span. In this case event detection is not possible, because infinitely many events would have to be computed. The time-stepping method computes the integral effect of all events and is in this way able to approximate the effect of an accumulation of events due to Zeno behavior. The presented legged locomotion model has Zeno behavior, though only in very specific situations. Imagine for example that all internal degrees of freedom are fixed and the walker falls: an impact law like $\gamma_N^+ = -e_N\gamma_N^-$ wherein $e_N$ is between 0 and 1 results in an infinite number of impacts in finite time. However, such a problem can be circumvented by assuming $e_N$ is 1 (fully elastic impact) or 0 (fully plastic impact).
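As a concrete, standard illustration of such an accumulation (not taken from this thesis), consider a point mass dropped from height $h$ onto a rigid floor with restitution coefficient $0 < e_N < 1$. The first flight lasts $\sqrt{2h/g}$ and each subsequent flight is $e_N$ times shorter, so the total time until the mass comes to rest is

\[
t_{\mathrm{stop}} = \sqrt{\frac{2h}{g}} + \sum_{k=1}^{\infty}\frac{2\,e_N^{\,k}\sqrt{2gh}}{g}
= \sqrt{\frac{2h}{g}}\left(1 + \frac{2e_N}{1-e_N}\right),
\]

which is finite even though infinitely many impacts occur. Event detection would have to resolve each impact separately, whereas a time-stepping scheme steps across the accumulation point.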

A disadvantage of most currently used time-stepping methods is that they integrate with fixed time steps. In some parts of the motion very small time steps are required to obtain a reasonably accurate solution, while in other parts large time steps can be used. Thus some parts of the simulation are unnecessarily slow. Another disadvantage is that some intermediate contact situations could be missed. This is a serious problem in our case, because a control scheme is proposed that steers the system from one contact arrangement to another. Therefore, event detection is eventually selected and used in the sequel of this report, even though considerable effort was put into the simulation of legged locomotion by time-stepping.


Figure 2.8: two carts time behavior ($v_1 = \dot{x}_1$ and $v_2 = \dot{x}_2$).

Figure 2.9: two carts phase plane ($v_1 = \dot{x}_1$ and $v_2 = \dot{x}_2$).

Figure 2.10: two carts Lagrange multipliers.


2.5 Conclusion

An important concern in perhaps any research is how to arrive at a proper model that explains the behavior of interest. In the literature there are basically two approaches to model legged locomotion. One approach is to model only the relevant dynamics of a particularly simple system and simplify it even further such that some analytical analysis can be performed. This approach is omnipresent in the literature of legged locomotion and the related models are called template models. The other approach is to list all possible physical processes that can affect a class of systems and put them in a general framework. Next, special properties of this general framework could be exploited. In the literature it has been pointed out that the hybrid systems framework, or more specifically the complementarity Lagrangian framework, is a candidate for a general framework to capture the dynamics of legged machines.

A template model of a compass walker has been selected from the literature. This is the benchmark system throughout this report, because relevant analytical results are available for this simple system. The hybrid systems framework, and more specifically complementarity Lagrangian systems, have been selected from the literature as general frameworks. Ideally this general framework helps to explain or frame statements about legged locomotion at a high, general level. In Chapter 6 it is pointed out that this is the case, as the natural distinction between modes in the general framework, based on contact situations, is used to redefine balance.

Numerical simulation is required to approximately compute some of the consequences of the models, as the consequences of interest often cannot be derived analytically. To this end the differential equations that describe the dynamics of the system need to be integrated. Legged machines however exhibit discontinuous behavior, that is impacts and stick-slip behavior, which complicates integration. To deal with these difficulties, two approaches of a very different nature have been selected from the literature: integration with event detection and time-stepping. In the remainder of this report event detection is used.


Chapter 3

Balance and balanceability of legged robotic locomotion

This chapter treats the stability analysis of legged locomotion. It is to some extent still an open problem how stability of legged locomotion should be characterized. In the legged locomotion field there exists much confusion about stability, as the same term is used for two distinct concepts. One concept relates to the (asymptotic) convergence to a particular motion. The other relates to not falling, or even more generally, being able to continue a behavior like executing a task. In order to contrast them unambiguously we name the latter kind of stability balance [18].

Balance is often achieved by tracking predefined motions that guarantee balance when tracked properly. Decent tracking requires stability of the tracking controller to attain certain robustness against disturbances. However, this kind of stability is relatively well understood, as it has been widely studied in the field of robotic manipulators. Robotic manipulators have been operational for many years in factories around the world. Balance however, which is central in legged locomotion, is to date still a poorly understood concept. It is, for example, not clear at all what kind of constraints on trajectories should be defined in order to guarantee balance. This chapter is about balance.

Section 3.1 introduces a definition of balance and some approaches to achieve balance. Section 3.2 discusses balanceability, the extent to which a controller can balance a legged machine. In particular it treats balanceability of walking of the compass walker.

3.1 Legged locomotion balance

A balanced gait has been explicitly defined by [21] as locomotion which does not contain any falling states, although they do not name it balance, but stability. Firstly, a fall is defined as:

Definition 3.1 (Fall). A biped is considered in fall when a point on the biped, other than a point of the feet of the biped, touches the ground.

Next, a balanced gait is defined as:

Definition 3.2 (Balanced gait). Let $\dot{x} = f(x, u)$, $x \in \mathbb{R}^n$ and $F \subset \mathbb{R}^n$. Denote with $F$ a set which contains all robot configurations that are defined as a fall. Let $B$ be the basin of fall, which is the subset of the state space that leads to a fall: $B \subset \mathbb{R}^n$, s.t. $\forall x(t) \in B$, $\exists \Delta t \ge 0$ s.t. $x(t + \Delta t) \in F$. A robot gait is balanced if and only if the state of the robot is not inside the basin of fall $B$.


We adopt this definition of balance, as it is very general and intuitively clear. Unfortunately, it is difficult to determine B exactly. Therefore, criteria that approximate balance in some way are often used. Most of these criteria were developed well before this proper definition of balance. Existing approaches in the literature to approximate balance can be classified into four major groups: 1) design of balanced trajectories plus a tracking controller; 2) stabilization of limit cycles (Poincaré sections); 3) hybrid systems reachability analysis; 4) learning algorithms that learn non-falling from experience. These are discussed in the sequel.
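Because B is difficult to determine exactly, one pragmatic fallback is a sampling-based approximation. The Python sketch below (hypothetical helper names; not part of the cited works) labels sampled initial states by simulating them over a finite horizon, which yields an inner approximation of the basin of fall of Definition 3.2:

import numpy as np

def estimate_basin_of_fall(simulate_falls, sample_state, n_samples=1000, horizon=10.0):
    # simulate_falls(x0, horizon) -> True if a state in F is reached within `horizon`
    # sample_state() -> a random initial state; both are assumed, user-supplied callables
    fall, safe = [], []
    for _ in range(n_samples):
        x0 = sample_state()
        (fall if simulate_falls(x0, horizon) else safe).append(x0)
    return np.array(fall), np.array(safe)

States classified as safe are only provisionally so: a fall may still occur after the chosen horizon, so lengthening the horizon can only move states from the safe set to the fall set.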

3.1.1 Tracking of balance-constrained trajectories

This method to balance legged locomotion requires a trajectory design and a tracking controller design. Constraints on the trajectories assure that the robot is balanced when tracking the designed trajectories. However, all sets of constraints designed thus far are overly restrictive and yet often not sufficient. They are approximations to achieve balance. Actually, it is still an open problem how to define proper trajectory constraints. The pioneers in legged locomotion often designed trajectories simply by trial and error or by recording them from humans. Later, criteria for balance were defined. The most commonly used criteria are mentioned below.

Center of Mass (CoM)

The CoM balance constraint basically states that the center of mass must stay above the convex hull spanned by the supporting feet. The stability margin belonging to this condition is the distance between the center of mass and the edge of the convex hull. This is a static stability condition. It is neither a necessary nor a sufficient condition for balance and one of the most conservative.

Zero Moment Point (ZMP)/ Center of Pressure (CoP)

The CoP and ZMP are largely equivalent, depending on the precise definitions. Their interpretation and computation however differ: the CoP is computed from the contact forces, while the ZMP is computed using the state and acceleration of the robot [24]. Consider Figure 3.1: the zero moment point is the point on the foot for which $\sum M_x = 0$ and $\sum M_y = 0$ hold [17]. It is not required that $\sum M_z = 0$ if friction forces prevent rotation about the z-axis.

Figure 3.1: zero moment point.

Definition 3.3 (Zero Moment Point (ZMP)). Let the tipping moment be the component of the moment that is tangential to the support surface. The ZMP is the point on the ground where the tipping moment acting on the robot, due to gravity and inertia forces, equals zero.

Definition 3.4 (Center of Pressure (CoP)). The forces normal to the robot sole can be replaced by a single resultant force exerted at the point where the resultant moment is zero. This point is the center of pressure (CoP).


The ZMP is only defined if more than two points contact the ground and all contact points lie in the same plane. In this case the robot gait is considered to be balanced if the ZMP lies in the convex hull spanned by the robot contacts. Actually, it only means that the foot does not rotate. This condition is neither necessary nor sufficient, because a robot can still fall even if its foot does not rotate and a robot can walk even if its foot rotates. According to the above definitions the ZMP and CoP are equal. However, the statement that the resultant moment must be zero in the definition of the center of pressure is sometimes omitted. In that case the CoP is still defined when the foot rotates, but the ZMP is not.
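When the contact forces are measured (or available from simulation), the CoP of Definition 3.4 reduces to a normal-force-weighted average of the contact point positions. The helper below is a hypothetical illustration of that computation, not a routine from this thesis:

import numpy as np

def center_of_pressure(points, normal_forces):
    # points: (n, 2) contact positions in the ground plane; normal_forces: (n,) normal force magnitudes
    points = np.asarray(points, dtype=float)
    fN = np.asarray(normal_forces, dtype=float)
    if fN.sum() <= 0.0:
        raise ValueError("no ground contact: the CoP is undefined")
    return (points * fN[:, None]).sum(axis=0) / fN.sum()

# example: four foot corners, loaded more heavily on one side
# cop = center_of_pressure([[0.0, 0.0], [0.2, 0.0], [0.2, 0.1], [0.0, 0.1]], [30.0, 60.0, 60.0, 30.0])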

Foot Rotation Indicator (FRI)

Definition 3.5 (Foot Rotation Indicator (FRI) [6]). The FRI point is the point on the ground where the net ground reaction force would have to act to keep the foot stationary, given the state of the biped and the acceleration of its joints.

If the foot is stationary, the FRI is equal to the ZMP/CoP. However, if the foot experiences a rotational acceleration, then the FRI is outside the support convex hull whereas the ZMP/CoP is on the edge of the support convex hull. The FRI is also named the fictitious zero moment point (FZMP). The relevance of the FZMP/FRI is however widely disputed, as the robot foot experiences a rotational acceleration when the FZMP/FRI does not coincide with the ZMP, so generally that case is considered unbalanced.

Selection between balanced trajectories

Balance conditions alone mostly do not yield a unique trajectory. A common approach in recent works to favor one trajectory over another is energy minimization [13], [28]. Legged machines, in contrast to manipulators, must deal with several kinds of environments or terrains, and often these terrains require different trajectories. Therefore, the set of trajectories which is needed to operate well is rather large. Trajectory selection and concatenation of subtrajectories is for example investigated by [13], [14]. Trajectory selection often requires rich input from which the environment can be reconstructed, e.g. computer vision input. Handling and processing this vast data stream in real time is perhaps one of the greatest challenges in robotics.

3.1.2 Limit cycle walking

Legged locomotion balance can be analyzed with the theory available on limit cycle stability, as cyclic motions can be defined which do not contain any falling states, i.e. a balanced cyclic motion. Convergence to these motions can, under certain conditions, be proven. The motions of the robot are in general not cyclic; cyclicity, however, has to be assumed in order to apply this theory. Note that with some creativity almost all motions can be defined as cyclic, including starting and stopping. Stable periodic gaits share the property that the robot will return to its periodic motion (limit cycle) after small disturbances. Typically, the stability of such periodic gaits / limit cycles is analyzed by Poincaré return maps. The Floquet multipliers of the return map must lie in the unit circle (one eigenvalue being one and the others smaller than one). The magnitude of the largest eigenvalue (except the eigenvalue on the unit circle) can be used as a stability margin. This theory is only valid for small disturbances. The method has been widely used in the literature and one of the first to apply it to walkers was [12]. A recent comprehensive work on limit cycle walking is [9].


3.1.3 The energy capture balance condition

A more recently developed balance condition defines legged locomotion to be balanced if it can capture its kinetic energy within a certain number of steps [21], i.e. if it can reach standstill without falling within a certain number of steps. This is more precisely defined by the following set of definitions. Firstly, a capture state is defined:

Definition 3.6 (Capture State). A capture state is a state in which the kinetic energy of the biped is zero and can remain zero with suitable joint torques.

Secondly, a safe feasible trajectory:

Definition 3.7 (Safe Feasible Trajectory (SAT)). A safe feasible trajectory is a trajectory through the state space which is consistent with the robot's dynamics, is achievable by the robot's actuators and does not contain any falling states.

The definitions of a capture state and SAT are used to define a capture point:

Definition 3.8 (Capture Point). For a biped in state x, a capture point p is a point on the ground such that if the biped covers p, either with its stance foot or by stepping to p in a single step, and then maintains its center of pressure on p, then there exists a safe feasible trajectory that ends in a capture state.

The capture point can be defined in a recursive way:

Definition 3.9 (N-Step Capture Point). The N-Step Capture Point is a point on the ground, p, such that if the biped swung its swing leg to cover p with its foot and maintained its center of pressure on p, then there exists a safe feasible trajectory such that at some state along the trajectory there exists an N − 1 step capture point.

By using these definitions, balance can be defined by the existence of a capture point up to some number of steps N. If there exists an infinite-step (N = ∞) capture point for some control law, this means that the walker will never fall using that control law. The difficulty lies in checking whether these criteria are satisfied. Zero-step capturability is relatively straightforward to check, one-step capturability is more difficult, two-step even more, etc.

In addition to the balance requirement, it is desirable to have some balance margin. By using an N-Step Capture Region,

Definition 3.10 (N-Step Capture Region). The N-Step Capture Region is the set of all N-step capture points.

[21] defines the N-Step Capture Margin:

Definition 3.11 (N-Step Capture Margin). The N-Step Capture Margin is the maximum distance from points in the N-step capture region to their nearest boundary of the reachable region if the N-step capture region is non-empty. Otherwise, it is the negative distance from the unreachable N-step capture region to the reachable region.

Basically, these definitions state that the robot must be able to reach a standstill within N steps. There is some set of states for which this is possible, and that set of states has an edge. The closest distance to that edge is a margin of balance.


This definition fits into the hybrid systems framework, because not falling is essentially equivalent to avoiding certain contact modes, and balanced locomotion to tracing a set of allowed contact modes. Reachability theory in the hybrid systems field is concerned with questions like what modes can be reached departing from a particular state. This suggests that theory from this field could be very useful for walker stability proofs. However, reachability analysis can currently only be performed analytically for very simple systems and not for the complex walkers.

Wisse [31] followed a bottom-up approach which explicitly aims to avoid falling modes. In this work a compass walker model is analyzed and requirements for not falling forward or backward have been identified. On the basis of these requirements a PD controller is selected as an example controller that prevents falling forward for a larger set of initial conditions. The simple example permits detailed analysis, such that theoretical bounds could be derived that separate falling from non-falling states for any controller. This could be interpreted as one of the first results concerning non-falling criteria. A practical outcome of this work is the statement that, for this simple system, in order to prevent falling forward the swing leg should be swung forward quickly and then kept there at a not too large angle, in order to prevent falling backward in the next step. Not falling backward relates to having enough energy to reach the apex. These are quite practical guidelines for controller design in order to satisfy the aforementioned conditions for this particular simple system. Later, in [10], robustness of a controller to unbalancing disturbances has been investigated in a more general way.

3.1.4 Reverse engineering of balance: learning

Recent approaches to keep robots upright concentrate on letting them detect when they fall and letting them associate a falling state with pre-fall states, such that they are able to avoid a fall next time; or, similarly, by detecting when a successful step takes place and associating this with pre-success states. In this way the robot ideally learns by itself how to stay outside the basin of fall and "stores" this in a tabular or associative structure [29], [25], [30], [15]. It finds a controller that avoids falling states or steers the system to some target states. Balance is not analyzed, but is reverse engineered to satisfy the most general definition. At the end of the process the stability criterion is automatically satisfied in some (approximate) sense.

However, to date the applicability of learning is limited due to computational burdens. Also, it has not yet been discovered what the best way is to define the learning process. Much research has been carried out to reduce the computational burdens of learning, e.g. hierarchical learning [4], multi-agent learning [30], variable resolution.

In this report we use reinforcement learning to find a balance controller for the compass walker in a more strict way than in most previous works. More strict in the sense that the learning algorithm only searches for balanced walks, not energy-efficient or natural-looking walks. The purpose of this experiment is to demonstrate by means of a visualization that such an algorithm can indeed fill in the maximally balanceable region and complies with the general balance definition. Some theoretical results on the compass walker system are available which make it a suitable benchmark system. They will be presented in Section 3.2.1. Our objective is to obtain some more insight into how to define the learning problem.


3.2 Legged locomotion balanceability

Controllability, a well-known concept, is concerned with the question what region of the state space can be reached using any possible input. Similarly, balanceability is concerned with the question in what region of the state space balance can be achieved by choosing any possible control input. Some walkers are not able to stand up, so if they fall, balance has been lost, but balance also cannot be restored by any control input. They could be named partly unbalanceable.

3.2.1 Compass walker example

In this example it is investigated from which initial conditions a simplified compass walker can successfully accomplish a step. We name this set of initial conditions the basin of attraction of walking. The compass walker has a theoretical limit to the largest possible basin of attraction to all walking motions achievable by any controller; this largest possible basin is the basin of balance. The extent to which a controller fills in this theoretical basin of balance is perhaps a decent measure for its performance. The basin of attraction to walking for the compass walker has been derived by [31].

The basin of attraction to walking of the simplified compass walker is bounded by a number of lines [31]. These lines separate walking from falling; or more specifically, from falling forward, falling backward and running (Figure 3.2). The slice in this figure represents the double stance phase where both feet contact the ground. The state φ is not relevant in this slice, because the legs create a closed triangle with the ground surface, such that φ and θ are related.

Figure 3.2: actuated compass walker basin of balance [5].

Falling backward will occur if the hip forward velocity is too low to reach the apex. Falling backward cannot be prevented by swing leg control, which is the only control available in this example, because an approximate case is studied where there is no mass in the feet (all mass is concentrated in the hip). The condition that the apex must be reachable can be expressed by:

(1/2) M (θ̇ l)^2 > M g l (1 − cos(θ)). (3.1)


Running will not occur when compressive foot contact is maintained. This no-running condition can be expressed by:

|θ̇| < √(cos(θ)). (3.2)

A third condition is that the walker should start with a positive interleg angle:

φ > 0. (3.3)
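The three conditions above can be checked numerically for a given double-stance state. The sketch below is only an illustration: the function name, the parameter values and the normalized units (g = l = 1, so that condition (3.2) takes exactly the form above) are our own assumptions, not taken from [31].

import numpy as np

def step_is_possible(theta, theta_dot, phi, M=1.0, l=1.0, g=1.0):
    """Check conditions (3.1)-(3.3) for the simplified compass walker.

    theta     : stance leg angle at the start of the step [rad]
    theta_dot : stance leg angular velocity [rad/s]
    phi       : interleg angle [rad]
    """
    # (3.1) enough kinetic energy to carry the hip mass over the apex
    reaches_apex = 0.5 * M * (theta_dot * l) ** 2 > M * g * l * (1.0 - np.cos(theta))
    # (3.2) no running: foot contact stays compressive (normalized units assumed)
    no_running = abs(theta_dot) < np.sqrt(np.cos(theta))
    # (3.3) positive interleg angle
    positive_interleg = phi > 0.0
    return reaches_apex and no_running and positive_interleg

print(step_is_possible(theta=0.2, theta_dot=0.7, phi=0.4))  # -> True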

3.3 Conclusion

In the field of legged locomotion the effort to keep a legged machine upright is often named stabilization. Recently, the term stability has been replaced by the term balance, which associates better with the effort to keep the robot upright. Therefore, this term has been adopted in this report. Stability has a rather different meaning in the field of dynamical systems, which is important in legged locomotion as well; so if it is not replaced by balance, it has a double meaning.

From the available literature on balance, the most general balance definition has been selected, which basically states that some predefined states, named the set of falling states, should never be visited during any execution of a task by the robot. This work argues, closely following [21], that none of the presently available balance conditions comply with this proper general definition. They are generally overly conservative and not sufficient. However, we envisage, and it has been envisaged by others, that it is perhaps possible to define a learning / optimization process such that, if convergence is reached, the general balance condition emerges. Many efforts in this report are a result of our curiosity whether and how this could indeed be achieved. The subsequent chapters continue this discussion.

A compass walker has been selected from the literature as a benchmark system. This system is a suitable benchmark to validate the ideas in this report, because analytical results are available for it. Most importantly, the maximal region of attraction to walking has been determined in previous research. An attempt to learn a controller by reinforcement learning that fills in the whole theoretical maximal region of attraction is made later in this report.


Chapter 4

Reinforcement learning

The reinforcement learning problem is the problem of learning from interaction to achieve a goal [26]. An agent acts in an environment and as a consequence of its actions its state changes. Every state, action or state change is valued in advance by the definition of the learning problem. The problem to be solved by the reinforcement learning algorithm is how to maximize the sum of all future rewards by choosing a sequence of control inputs. This chapter introduces the reinforcement learning problem and Q-learning, one of the most utilized algorithms to solve it.

This chapter is organized as follows. Section 4.1 briefly compares reinforcement learning with other control schemes. Section 4.2 introduces the reinforcement learning problem in more detail. Section 4.3 introduces the Q-learning algorithm, which is widely used to solve the reinforcement learning problem. Section 4.4 discusses one of the most straightforward example reinforcement learning problems and demonstrates how Q-learning solves it.

4.1 Reinforcement learning and classical control

How does reinforcement learning compare with other control schemes? Control is concerned with altering the behavior of an existing system such that it behaves according to the user's requirements. There exist several kinds of control. They could be classified in many ways; however, we classify them such that it is clear where learning stands. The most straightforward control is a fully predefined controller which does not alter itself in any way, even if the environment or system changes. A slightly more complex controller, called an adaptive controller, changes some parameter values of itself when the system changes a little, for example if a temperature change alters the geometry or friction. This kind of controller does not remember anything; it adapts to other situations, but the way it adapts is predefined. An even more complex controller can learn: it can adapt, but also store how it has adapted to some particular situation and access that knowledge. As a result, it can adapt faster when approximately the same situation occurs a second time. In this sense reinforcement learning controllers learn: they respond better when approximately the same situation occurs a second time.

Control beyond learning can have qualities such as anticipation, generalization, association and compression in addition to learning. Anticipation is taking an action based on an expected future state. Anticipation does not necessarily require learning, as it can be done with a static model and a measurement of the current state. Reinforcement learning can anticipate, because it is able to solve temporal credit assignment problems; that is, it can learn how to act in order to maximize some function over future time. In this sense reinforcement learning can also associate, as it relates future values to current actions. Some extended reinforcement learning algorithms are able to generalize, which means they can learn how to act in states that have not been visited before, based on experiences in other states. However, we do not use generalization techniques in this report. Compression is also possible in reinforcement learning, but this is really at the forefront of current research. Compression is concerned with ignoring information that is irrelevant for the function to be maximized, in order to accelerate learning and reduce memory requirements. It could be compared with the act of reducing the complexity of a simulation model because many modeled phenomena have marginal effects on the outcome of the simulation but increase complexity and computational requirements severely. We do not utilize extensions of reinforcement learning that compress.

Apart from being classified as a learning algorithm, reinforcement learning could also be perceived as a kind of optimal control [1] problem. Optimal control selects control actions such that some performance/cost function is maximized/minimized. Reinforcement learning maximizes expected returns over time, which could be cast in one equation to be maximized. However, that is not the way the reinforcement learning problem is typically solved.

4.2 Reinforcement learning problem

Reinforcement learning is defined as a learning problem. The reinforcement learning problem is the problem of learning from interaction to achieve a goal [26]. Interaction means that an agent acts in an environment and in return the environment reacts to the agent. The result is a transition to a new state. To every transition from one state s ∈ S to another s′ ∈ S using an action a ∈ A(s), a reward r is coupled by a reward rule. Thus the reward is a function of the state and the input, r = r(s, a). The reward rule is designed such that the reward is maximal when the system behaves as desired.

Thus, while interacting with the environment the agent receives rewards, and the goal of the agent is to maximize the received rewards over time. The reinforcement learning problem is the problem of finding the sequence of actions that maximizes all future rewards, and the methods that tackle the reinforcement learning problem are named reinforcement learning methods.

Before going into more detail, some definitions and notations from the field of reinforcement learning are introduced following Sutton and Barto [26] (another classical work is [1]) to formalize the discussion. A policy π assigns to each s ∈ S an a ∈ A. When the system transits from one state to another, a reward r_t is generated. R_t, the return, denotes the sum of future rewards discounted to time step t; so R_t is the present value of all future rewards. Future rewards departing from time step t are numbered r_{t+1}, r_{t+2}, ..., r_T, with T being the final time step. The final time step drives the system to a terminal state, ending the particular episode.

A reset mapping maps the terminal state back to the initial state of an episodic task. Non-episodic tasks continue forever, that is T = ∞. Due to this property, non-episodic tasks lead to infinite returns if future rewards are not discounted by a factor 0 ≤ γ < 1. Episodic tasks allow γ = 1.

Definition 4.1 (Return). R_t = γ^0 r_{t+1} + γ^1 r_{t+2} + γ^2 r_{t+3} + ... = ∑_{k=0}^{T−t−1} γ^k r_{t+k+1}
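As a small numerical illustration of Definition 4.1 (the reward sequence and discount factor below are our own example values):

def discounted_return(rewards, gamma):
    """Return R_t for a reward sequence r_{t+1}, r_{t+2}, ..., r_T."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Episodic task: three zero rewards followed by a terminal reward of 1.
print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # 0.9**3 = 0.729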

The return is often not known in advance; therefore the expected return is introduced. The expected return depends on the policy and the state the system is in. The state-value function V^π(s) denotes the expected return departing from state s and following policy π thereafter. The action-value function Q^π(s, a) denotes the expected return departing from state s, taking action a and following policy π thereafter. Pr denotes a probability and E an expected value.

Definition 4.2 (state-value function for policy π). (Sutton and Barto [26]) The value of a state s under a policy π, denoted V^π(s), is the expected return when starting in s and following π thereafter: V^π(s) = E_π{R_t | s_t = s}.

Definition 4.3 (action-value function for policy π). (Sutton and Barto [26]) The value of taking action a from state s under policy π, denoted Q^π(s, a), is the expected return starting from s, taking action a, and following policy π thereafter: Q^π(s, a) = E_π{R_t | s_t = s, a_t = a}.

The action-value function is used for action selection because it indicates which actions are probably beneficial to select in a certain state.

Learning by interaction can be perceived as an iterative process. Firstly, an action is executed and as a result the system arrives in a new state; secondly, a reward is collected corresponding to the state transition; thirdly, the estimates of the value functions are updated (the learning step); fourthly, the action selection rule is updated; lastly, a new action is taken and the interaction process repeats itself. The fourth step is optional and often interwoven with the third step, the value function update.

If the fourth step, a policy update, is present and interwoven with the value function updating process, the reinforcement learning method is named an on-policy method. On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In contrast to on-policy methods, off-policy methods separate the policy used to generate behavior, the behavior policy, from the policy whose values are estimated and improved, the estimation policy.

Convergence of learning methods is guaranteed as long as all states are visited infinitely many times. This happens if all states are observable and reachable and the action selection mechanism ensures that all states continue to be visited. This holds for finite Markov decision processes (finite MDPs) in combination with an action selection rule that assures that all states continue to be visited. In practice, all environments are only partially observable (named partially observable Markov decision processes, POMDPs) and generally only partly reachable. In addition, an action rule that keeps visiting all states is often not desired, since only some states are the goal states. Therefore, convergence can generally only be partially guaranteed.

4.3 Q-learning

The challenge for reinforcement learning methods is to acquire an accurate estimate of the action-value function Q^π(s, a) with as little interaction as possible. Systematically going through all possible sequences is generally too involved. Therefore, most reinforcement learning methods have some kind of bootstrapping technique that builds a new estimate on top of a previous estimate. A bootstrapping method which is often used is temporal difference learning (TD). TD updates the current estimate of the state-value function with the difference between a new estimate of the return, R^est_t, and the old estimate V(s_t), times a learning parameter α. TD methods are called bootstrapping methods because they generate a new estimate on the basis of a previous estimate.

V(s_t) ← V(s_t) + α [R^est_t − V(s_t)] (4.1)


Estimates of the return Rt can be generated in the following ways.

TD(0): R^est_t = r_{t+1} + γ V^π(s_{t+1})
TD(1): R^est_t = r_{t+1} + γ r_{t+2} + γ^2 V^π(s_{t+2})
...
TD(λ): R^est_t = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V^π(s_{t+n})

Here λ denotes the depth of the backup trace.
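A minimal sketch of the TD(0) update of Equation 4.1 on a tabular state-value function; the states are assumed to be integer indices and the names and numbers are our own illustration:

def td0_update(V, s, r_next, s_next, alpha, gamma):
    """One TD(0) update: V(s) <- V(s) + alpha * [r_{t+1} + gamma * V(s') - V(s)]."""
    V[s] += alpha * (r_next + gamma * V[s_next] - V[s])

V = [0.0] * 5                 # tabular state-value function for 5 states
td0_update(V, s=2, r_next=1.0, s_next=3, alpha=0.5, gamma=0.9)
print(V)                      # [0.0, 0.0, 0.5, 0.0, 0.0]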

One of the algorithms that utilizes TD learning to update the action-value function Q^π(s, a) is Q-learning. The update of Q(s_t, a_t) is calculated as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)] (4.2)

The TD learning rule is integrated in Q-learning as indicated in pseudocode 4.1.

Algorithm 4.1 Q-learning
1: Initialize Q(s, a) arbitrarily
2: repeat (for each episode)
3:   Initialize s
4:   repeat (for each step of the episode)
5:     Select a from s using a policy derived from Q
6:     Take action a, observe r, s′
7:     Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
8:     s ← s′
9:   until s is a terminal state of the episode
10: until all episodes passed

Line 5 in pseudocode 4.1 states: select a from s using a policy derived from Q. This is called the action selection rule or policy, and it selects an action on the basis of the information gathered in Q^π(s, a). An often used action selection rule is epsilon-greedy. This action selection rule selects the action with the highest expected return a fraction (1 − ε) of the time and a random action a fraction ε of the time (0 ≤ ε ≤ 1):

a = argmax_a Q^π(s, a),   p > ε (4.3)
a = random(A),   p ≤ ε

Here p is a randomly drawn number in the range [0, 1]. Section 4.4 provides a concrete example on which the described Q-learning method is demonstrated.
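A minimal sketch of the ε-greedy rule of Equation 4.3 for a tabular Q-function; the array layout and variable names are our own choices:

import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=np.random.default_rng()):
    """Select an action for state s: greedy w.p. 1 - epsilon, random w.p. epsilon.

    Q : array of shape (num_states, num_actions) with expected returns.
    """
    if rng.random() <= epsilon:
        return int(rng.integers(Q.shape[1]))   # exploratory (random) action
    return int(np.argmax(Q[s]))                # greedy action

Q = np.zeros((25, 4))
print(epsilon_greedy(Q, s=12, epsilon=0.1))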

4.4 Gridworld example

An instructive example which can be solved by reinforcement learning is a navigation task in a gridworld. In this example an agent must move from an initial state to a goal state G in a minimal number of moves (see Figure 4.1a). The agent is allowed to move in four directions, named North (N), West (W), South (S) and East (E), and the reward scheme in this problem is defined such that the agent receives a reward only when it arrives in the goal state (see Figure 4.1b). This is a discrete problem, as there is a fixed number of states. After every action the agent arrives by definition in a neighboring state, unless the agent chooses an action that would cause it to cross a boundary. In that case the agent by definition simply does not move as a result of the action.
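The sketch below implements tabular Q-learning on this gridworld as we understand it from the description above; the grid size follows Figure 4.1, but the learning parameters, reward value and helper names are our own choices, not taken from the report.

import numpy as np

N_ROWS, N_COLS, N_ACTIONS = 5, 5, 4        # 5x5 grid, four actions
GOAL = (2, 2)                              # goal state in the center
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}   # N, S, E, W

def step(state, action):
    """Apply an action; moves that would cross the boundary leave the state unchanged."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < N_ROWS and 0 <= nc < N_COLS):
        nr, nc = r, c
    reward = 1.0 if (nr, nc) == GOAL else 0.0   # reward only on arrival in the goal state
    return (nr, nc), reward

rng = np.random.default_rng(0)
Q = rng.uniform(0.0, 1e-3, size=(N_ROWS, N_COLS, N_ACTIONS))   # small random initialization
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(2000):
    s = (int(rng.integers(N_ROWS)), int(rng.integers(N_COLS)))  # random initial state
    while s != GOAL:
        a = int(rng.integers(N_ACTIONS)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update (Equation 4.2)
        Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])
        s = s_next

print(np.round(np.max(Q, axis=2), 2))      # maximum Q-value per state, cf. Figure 4.4a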

(a) The gridworld navigation problem. A robot must move to the goal state (G) in as few steps as possible. From every state (square) it can move in four directions: North (N), South (S), East (E) and West (W).

(b) Reward function of the gridworld problem. The robot receives a reward only when it arrives in the goal state (G):

0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0

Figure 4.1: gridworld problem.

In order to learn the shortest path from every initial condition to the goal state, the agent must interact with the environment, and what the agent learns must be stored. What is stored is an estimate of the expected return of taking action a in state s, the action-value function Q^π(s, a). So for every state-action pair an expected return is stored (Figure 4.3). Initially this memory structure is filled with random values much smaller than the minimum reward that the agent can collect. After the initialization has been completed, the learning process can start. In every episode the agent is set to an initial condition and must find its way to the goal. The agent moves from one state to another by selecting actions which are sometimes completely random, called exploratory actions, and sometimes based on the action-value function. When the actions are based on the action-value function, the action with the maximum expected return is selected.

Let us consider how exactly the action-value function is updated when the agent arrives in the goal state for the first time (Figure 4.2a). In this case the agent receives a reward r, which is used to update Q(s, a) in the following way:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)] (4.4)

Thus, since max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) is very small compared to r, effectively Q(s_t, a_t) ← Q(s_t, a_t) + α r_{t+1}. This example has been illustrated in Figure 4.2b, wherein α is set to one.

Now let us consider how the action-value function is updated when the agent arrives in the state that has been updated before, but wherein the reward is zero. In this case r_{t+1} is zero and Q(s_t, a_t) is small compared to max_{a_{t+1}} Q(s_{t+1}, a_{t+1}), so effectively Q(s_t, a_t) ← α γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}), where max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) is α times the previous reward.

After a large number of episodes the structure of the Q-table could look like Figure 4.4a. However, there is no limit to the absolute values in the table, as a reward can be collected infinitely many times, each time increasing the values in the table. The values have meaning relative to each other, which is what matters for action selection.

(a) Example of first reward retrieval: the expected return of the action leading into the goal state is updated from 0 to 0.8.

(b) Maximum Q-values after the first reward retrieval.

Figure 4.2: example of first reward retrieval.

Figure 4.3: schematic overview of the structure of the gridworld Q-table (one expected return per state for each of the actions N, S, E, W).

(a) Maximum Q-value at every gridworld state.

(b) Inputs corresponding to the maximum Q-values.

Figure 4.4: possible Q-table structure after learning is completed.


4.5 Conclusion

This chapter introduced the reinforcement learning problem and the well-known Q-learning algorithm to solve it. It has been demonstrated step by step how Q-learning solves a particularly simple gridworld example problem, in order to introduce the reader to the basics of reinforcement learning. In the remainder of the report Q-learning is applied to the reinforcement learning problems of our interest, and a number of restrictions and adjustments are proposed in order to obtain a clear interpretation of the action-value function.


Chapter 5

Reinforcement learning and dynamics interpretation

How should the action-value function be interpreted? Can the interpretation of the action-value function be altered? These questions are considered in this chapter.

How does the action-value function develop? The sensors produce a set of signals which is interpreted as the current state, and an action selection mechanism selects an action based on the current action-value function. An output is produced and the system arrives in a new state according to its dynamics, where it receives a reward. Finally, the reward is translated into the expected return and linked to the executed action as well as the system's previous state. This results in an update of the action-value function. In conclusion, the reinforcement learning algorithm stores and iteratively updates a mapping from state-action pairs to expected returns, the action-value function.

The action-value function is shaped by 1) the dynamics of the system and 2) the reward structure. The dynamics of the system determines the next state the system arrives in as a result of applying a certain input. This information is a property of the system and its environment (or simulation model). The reward function values a certain transition from one state to another. In traditional reinforcement learning, e.g. Q-learning, only the expected reward is stored; information about which state the system arrives in as a result of some action is not stored. So what is actually being learned is framed in terms of the expected returns, such that the action selection mechanism can select the action which has the highest expected return.

The physical interpretation of the return is determined by the precise design of the reward function. The freedom to design a reward structure should be utilized, as one can define it such that the action-value function gets a meaningful interpretation. This is exactly what is demonstrated in this chapter and the next: more specifically, how the reward function can be designed such that the action-value function gets an interpretation in terms of dynamics, that is, equilibrium sets and their domain of attraction. Choosing a decent reward structure is often considered more of an art than a structured process; this chapter shows that doing it in a structured way has some advantages.

This chapter is organized as follows. Section 5.1 discusses, by means of an example system, how the domain of attraction to a set of target states can be reconstructed from the action-value function. Section 5.2 discusses how the ability to reconstruct the domain of attraction to some target states can be utilized to reconstruct a domain of attraction to a particular walking motion.


5.1 Domain of attraction interpretation

This section demonstrates, by means of a cubic-linear system example, that the domain of attraction of a dynamical system can be visualized from the action-value function.

Let G be the desired goal region to which the system should ideally converge. In order to be able to reconstruct the basin of attraction later, the reward r(s, a) is restricted such that only states in the goal region receive a positive reward value, in the following way:

Definition 5.1 (Reward restriction condition). r(s, a) > ε if s ∈ G and r(s, a) = 0 if s ∉ G.

Herein ε has some positive value larger than the maximum random initialization of the action-value function. Let B be the collection of states from which eventually G will be reached. Or more precisely:

Definition 5.2 (Basin of G). B is the collection of states s(t) for which s(t + T) ∈ G holds as T goes to ∞.

Now the following conjecture can be made:

Conjecture 5.1 (Basin of attraction reconstruction conjecture). s ∈ B if max_a(Q(s, a)) > ε.

In this way the basin of attraction can be reconstructed by computing max_a(Q(s, a)) for every s. The following subsection demonstrates this by an example.
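Conjecture 5.1 translates into a one-line test on the learned Q-table. A minimal sketch (the array shapes, threshold value and names are our own illustration):

import numpy as np

def reconstruct_basin(Q, eps):
    """Boolean mask of states belonging to the basin B according to Conjecture 5.1.

    Q   : array with the actions on the last axis, e.g. shape (n_x1, n_x2, n_actions)
    eps : threshold above the maximum random initialization value
    """
    return np.max(Q, axis=-1) > eps

# Example with a randomly initialized table and one "rewarded" state.
Q = np.random.default_rng(1).uniform(0.0, 1e-3, size=(50, 50, 19))
Q[30, 25, 4] = 5.0                            # pretend this state-action pair collected reward
print(reconstruct_basin(Q, eps=1e-3).sum())   # -> 1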

5.1.1 Cubic-linear system example

Dynamics of the cubic-linear system

The dynamics of the cubic-linear system is defined by the following differential equations:

ẋ1 = −x1(x1 + a)(x1 − a) + u (5.1)

ẋ2 = −x2 (5.2)

The dynamics of this system is straightforward in case u is zero. The system has three equilibria: two stable nodes, (x1, x2) = (−a, 0) and (x1, x2) = (a, 0), and one saddle node, (x1, x2) = (0, 0). All trajectories initiated at x1 < 0 approach (x1, x2) = (−a, 0), all trajectories initiated at x1 > 0 approach (x1, x2) = (a, 0) and all trajectories at x1 = 0 approach (x1, x2) = (0, 0). In conclusion, the basins of attraction of all equilibria are known for this simple system.
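These statements are easy to verify numerically. The sketch below integrates the uncontrolled system (u = 0); SciPy's solve_ivp is used here as a substitute for the Matlab ode45 routine used in the report, and the initial conditions are our own example values.

import numpy as np
from scipy.integrate import solve_ivp

a = 1.0

def cubic_linear(t, x, u=0.0):
    """Right-hand side of Equations (5.1)-(5.2)."""
    x1, x2 = x
    return [-x1 * (x1 + a) * (x1 - a) + u, -x2]

for x1_0 in (-1.8, -0.1, 0.1, 1.8):
    sol = solve_ivp(cubic_linear, t_span=(0.0, 20.0), y0=[x1_0, 1.0], rtol=1e-8)
    print(f"x1(0) = {x1_0:+.1f}  ->  x1(20) = {sol.y[0, -1]:+.3f}")
# x1 converges to -a for negative initial x1 and to +a for positive initial x1,
# while x2 decays to zero in all cases.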

Furthermore, the system is controllable in the x1-direction, since the original dynamics can be fully canceled in that direction and any dynamics can be superimposed. The x2-direction cannot be controlled since u and x2 are not coupled, but the x2 dynamics is stable.

These properties imply that there exists a controller which globally stabilizes the system and that the system can be steered to any point on the line x2 = 0. The basin of attraction of such an equilibrium can consist of the whole state space. For example, the x1 dynamics of the equilibrium (x1, x2) = (a, 0) is shown in Figure 5.1a. The original domain of attraction is x1 > 0, but a suitable controller can enlarge this domain of attraction to fill the whole state space. An example of such a controller and the corresponding domain of attraction is given in Figure 5.1b.


(a) Basin of attraction of the (x1, x2) = (a, 0) equilibrium.

(b) Maximal basin of attraction of the (x1, x2) = (a, 0) equilibrium under control.

Figure 5.1: x1 dynamics of the cubic-linear system.

The dynamics of this system is well understood, which makes it suitable to investigate whether the dynamics can be identified from the action-value function that develops by reinforcement learning. In order to demonstrate that this is indeed possible, simulations are performed. Simulation 1 demonstrates how the basins of attraction become visible in the Q-table (action-value function) of Q-learning. Simulation 2 demonstrates how Q-learning enlarges the basin of attraction and how this is visible in the action-value function. Both simulations have (x1, x2) = (a, 0) as the desired equilibrium (goal) and the other equilibria as undesired equilibria (sink). The basins of attraction for the uncontrolled system have been determined by simulation in order to check whether the implementation is correct. The basins of attraction are shown in Figures 5.2a and 5.2b for a = 1.

(a) Basin of attraction of the (x1, x2) = (−a, 0) equilibrium investigated by simulation (x1 and x2 between −2 and 2).

(b) Basin of attraction of the (x1, x2) = (a, 0) equilibrium investigated by simulation (x1 and x2 between −2 and 2).

Figure 5.2: basins of attraction of (x1, x2) = (−a, 0) and (x1, x2) = (a, 0).

In contrast to the gridworld example in Chapter 4, this system is a continuous dynamical system.


The continuous dynamics must be discretized in order to apply reinforcement learning and in order to perform numerical integration. Not only is this discretization critical for the accuracy of the solution, but generally also for the stability of the integration and the convergence of learning. Matlab's ode45 solver has been used to integrate the system and a fixed linear discretization has been selected for the reinforcement learning scheme.
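A fixed discretization maps the continuous state to a table index. The sketch below illustrates the kind of binning we mean, with our own grid boundaries and bin counts rather than the ones used in the experiments.

import numpy as np

# Bin edges for (x1, x2); anything outside the range is clipped to the border bins.
x1_edges = np.linspace(-2.5, 2.5, 26)   # 25 discrete states for x1
x2_edges = np.linspace(-2.5, 2.5, 26)   # 25 discrete states for x2

def discretize(x):
    """Map a continuous state (x1, x2) to a pair of table indices."""
    i1 = np.clip(np.digitize(x[0], x1_edges) - 1, 0, len(x1_edges) - 2)
    i2 = np.clip(np.digitize(x[1], x2_edges) - 1, 0, len(x2_edges) - 2)
    return int(i1), int(i2)

print(discretize((0.97, -0.03)))   # -> (17, 12)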

Simulation 1: random initialization, no action selection

The purpose of the first simulation is to demonstrate that the basin of attraction can be visualized from the action-value function and matches the theoretical basin of attraction. In this simulation we are only interested in the passive dynamics, so there is no control input. Still, reinforcement learning makes sense in this simulation, not for control purposes, but to demonstrate that only states from which the goal region will eventually be reached get an update if the only nonzero reward is given when the system arrives in a goal state. The system receives a reward when it arrives in some neighborhood of one of the known equilibria, namely (x1, x2) = (a, 0).

The result of the simulation is shown in Figure 5.3a. As expected, all states from which the goal state, corresponding to the equilibrium (x1, x2) = (a, 0), can be reached obtain a max(Q)-value larger than the maximum random initialization value. States closer to the equilibrium receive the largest max(Q)-value. So, with this reward scheme, the Q-table readily contains information about the basin of attraction of the equilibrium, because all entries having a high max(Q)-value are members of the domain of attraction and all others are not. More generally, it can be stated that the Q-table contains information about the basin of attraction to the rewarded states. This is a very straightforward observation, not new, but utilized here in a valuable way.

Simulation 2: random initialization, action selection

The second simulation differs from the first in that action selection is applied. The purpose of this simulation is to demonstrate that Q-learning enlarges the region of attraction and that, also when control is applied, the action-value function can be utilized analogously to visualize the basin of attraction. The dynamics and the reward scheme remain the same, but in this simulation the system has the possibility to enlarge the basin of attraction of the equilibrium (x1, x2) = (a, 0) by control. The result of the simulation is shown in Figure 5.3b. The basin of attraction is enlarged in correspondence with the theory. If the allowed control inputs are sufficiently large, the basin of attraction can be made arbitrarily large.

Some observations and implications

This simulation shows that for this system it is indeed possible 1) to reconstruct the basin of attraction to some predefined goal states from the action-value function and 2) that the learning controller makes these goal states equilibria of the controlled system. This means that the desired controlled dynamics could be specified in the reward scheme by the definition of the goal states. In this system convergence to the desired dynamics was possible. However, there are systems for which this is not possible, for example if part of the state space is uncontrollable or unobservable.

The advantage of rewarding only the goal states is that the dynamics can be observed from the action-value function. A disadvantage of this approach is that, especially early in the learning process, progress of learning is slow, because the goal state first has to be reached before any learning takes place. The probability of reaching the goal state is generally low. Therefore, mostly, a huge number of trials has to be executed before learning becomes effective. Fortunately, this drawback can easily be overcome. Often it is possible to drive the system to the goal state with some heuristic controller. This controller can drive the system to the goal state a few times, such that the action-value function fills and the learning controller can proceed. Another approach is to use a second action-value function in parallel, having a different reward scheme. The first action-value function and reward scheme is then only used to monitor the dynamics and the effectiveness of the second reward scheme. It is important to note that it is impossible to observe the dynamics from the action-value function when multiple kinds of rewards are allowed to build one action-value function, because information about the source of the reward is not stored, only the value. Therefore, a kind of separation between reward schemes is required.

(a) Basin of attraction of the (x1, x2) = (a, 0) equilibrium without control (action-value function max(Q)).

(b) Basin of attraction of the (x1, x2) = (a, 0) equilibrium with control (action-value function max(Q)).

Figure 5.3: experiments 1 and 2 of the cubic-linear system.

A different observation, which is not related to the topic of reconstructing dynamics but might be worth mentioning, is that in some parts of the state space it is harder to compensate for the natural dynamics in order to achieve some goal than it is in other parts (see Figure 5.1a). In the example, the probability that reinforcement cancels the natural dynamics is smallest at the most negative values of x1, because there the set of inputs that achieves compensation of the natural dynamics is smallest. The number of action sequences that propel the system through this funnel is simply very small compared to the total number of possible action combinations. Most likely this kind of combinatorial bottleneck is present in many kinds of systems. In order to compensate for them more quickly, it could be helpful if sometimes a sequence of high inputs, all pointing in the same direction, is selected as exploratory action.

5.2 Balance interpretation

In this section it is shown that the domain of attraction to balance can be approximately reconstructed from the action-value function if a suitable reward scheme is defined. The discussion is very similar to the interpretation of the domain of attraction of goal states as discussed in Section 5.1.


5.2.1 Compass walker example

This subsection continues with the compass walker from Section 3.2.1 [31] (Figure 2.2). It is demonstrated that the largest domain of attraction to walking can be reconstructed from the action-value function, in a similar way as the domain of attraction to some goal states can be visualized for the cubic-linear system example. The availability of knowledge about the maximally achievable domain of attraction for this system enables some kind of verification of our results.

The equations of motion of the compass walker system (Equations 2.31 to 2.34) are integrated by Matlab's ode45 integration scheme. In order to assure the Markov condition, which is required for convergence of reinforcement learning, the routine is called at fixed time steps and the impact conditions are checked after every time step. The episode is terminated if the goal region has been reached or if the walker falls. The Matlab ode45 algorithm with standard event detection sometimes executes backward steps in the neighborhood of events, which we do not allow. The system receives a reward when it accomplishes a step; more precisely, when it arrives in some predefined end-of-step states. As a result, all states that obtain a maximum Q-value higher than the maximum random initialization value are states from which completion of a step is possible, and the basin of attraction of walking can be reconstructed (Conjecture 5.1).
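The structure of such an episode can be summarized as follows. This is only a schematic sketch of the rollout: the toy dynamics and the goal/fall tests at the bottom stand in for the walker model and its impact conditions, which are not repeated here, and solve_ivp replaces ode45.

import numpy as np
from scipy.integrate import solve_ivp

def run_episode(f, x0, in_goal, has_fallen, dt=0.01, t_max=2.0):
    """Fixed-time-step rollout: integrate over [t, t + dt], then check the terminal
    conditions, so every learning step spans the same time interval (Markov condition).
    Returns the visited states and the single end-of-episode reward."""
    x, t, states = np.asarray(x0, float), 0.0, [np.asarray(x0, float)]
    while t < t_max:
        sol = solve_ivp(f, (t, t + dt), x, rtol=1e-6)
        x, t = sol.y[:, -1], t + dt
        states.append(x)
        if has_fallen(x):
            return states, 0.0          # fall: episode ends without reward
        if in_goal(x):
            return states, 1.0          # end-of-step region reached: reward
    return states, 0.0                  # time out

# Toy stand-ins for the walker dynamics and its terminal conditions (illustration only).
f = lambda t, x: [x[1], -0.5]           # decelerating point mass
in_goal = lambda x: x[0] >= 1.0         # "end of step" region
has_fallen = lambda x: x[1] <= 0.0      # "fall": velocity reversed
states, reward = run_episode(f, [0.0, 1.5], in_goal, has_fallen)
print(len(states), reward)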

Because the compass walker has four degrees of freedom (two angles and two angular velocities), only part of the state can be displayed in one image. Fortunately, there exists a meaningful 2D intersection of initial conditions through which the walker by definition always passes while it performs its limit cycle motion: the double stance. This intersection has been used in [31] as well to study the domain of attraction of walking. Here this intersection is adopted such that our results are comparable to theirs and can be visualized. Figure 5.4 shows how the angles are linked. As a result of this linkage only one angle needs to be specified, e.g. θ. Additionally, the post-impact angular velocity is only a function of the pre-impact angular velocity θ̇, thus only θ̇ is required.

Figure 5.4: at post-impact double stance only two degrees of freedom are left; the two angles are related by φ = 2θ and the two angular velocities are related as well.

Just like in the cubic-linear example, the problem needs to be discretized. The size of the discretization is restricted, as a high number of states significantly increases computation time. θ is divided into 24 states ranging from −7π/12 to 7π/12, φ into 24 states ranging from −7π/6 to 7π/6, and θ̇ and φ̇ are each divided into 17 states ranging from −1.5 to 1.5. The control input is divided into 19 levels between −15 and 15. The distribution of discrete states is nonlinear, in order to decently detect collisions and to limit the number of required discrete states. A larger number of linearly distributed states was out of reach of the current learning scheme (only because of computational intensity). It is important to notice what kind of consequences the current discretization has. The result is a state-action space of 17² · 24² · 19 entries (3.163 · 10⁶). Assume a step of the walker takes one second and a control input is applied every 0.1 second; then there exist (3.163 · 10⁶)¹⁰ possible combinations (1.002 · 10⁶⁵). This has to be repeated for a large number of initial conditions, yielding even more possibilities. The problem of dimensionality is obvious and there clearly exists a need for a bootstrapping learning method, as deep search cannot be performed within a decent time span.
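The counts quoted above follow directly from the discretization (a quick arithmetic check):

table_entries = 17**2 * 24**2 * 19               # 17^2 velocity bins, 24^2 angle bins, 19 inputs
print(table_entries)                             # 3162816, i.e. about 3.163e6
print(f"{float(table_entries)**10:.3e}")         # about 1.002e+65 sequences of length 10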

The two kinds of simulation which were performed with the cubic-linear system are repeated: one simulation without control, which involves about 10000 10-step episodes (Figure 5.5a (theory) and Figure 5.5b (reinforcement learning result)), and one simulation with control, which involves about 5000000 10-step episodes (Figure 5.6a (theory), Figures 5.6b and 5.7a (Q-table reinforcement learning result)). The colored regions in these graphs indicate the highest expected reward (max(Q)).

(a) Passive compass walker basin of attraction (source: [31]).

(b) Passive compass walker basin of attraction reconstructed from the action-value function.

Figure 5.5: experiment 1, compass walker.

(a) Actuated compass walker maximum basin of attraction (source: [31]).

(b) Actuated compass walker maximum basin of attraction reconstructed from the action-value function.

Figure 5.6: experiment 2, compass walker.

All non-blue colored states in the figure have received a reward during the learning process; therefore the goal region is reachable from these states. Most non-blue regions lie in the theoretical maximal basin of attraction. However, the boundaries of these regions are crossed. This cannot be the result of an intermediate learning state, since such solutions can never be found. It can, however, be at least partly explained by the observation that many discrete regions lie both inside and outside the largest basin of attraction to walking. These discrete regions may correctly receive a reward, leading to some overlap. To make things worse, during some subsequent trial the system might arrive in such a state from an initial condition outside the basin of walking, and as a result some other state incorrectly receives a reward. Due to this process, expected returns might diffuse through the boundary. This could be prevented if a priori knowledge about the system were used, but that is highly undesirable, as learning is about exploring the system, including its imperfections, without prior knowledge.

The learning curve shown in Figure 5.7b seems to approach some limit, which is an indication of convergence. Often this limit is known in advance; in our case it is only approximately known, because the goal state cannot be reached from every initial condition. The maximum average reward per episode must be estimated, because for some initial conditions no steps are possible and some random actions continue to be executed. Probably the maximum average reward per episode will be about 150. Therefore, learning has probably not yet completed after these 500000 10-step episodes. This simulation already took five days to complete.

(a) Actuated compass walker maximum basin of attraction from the Q-table (3D surface plot).

(b) Learning curve (average reward per set of episodes) after 500000 episodes.

Figure 5.7: compass walker 3D plot of the Q-space double stance slice and the learning curve.

An interesting question is how the reinforcement learning controller compares to PD control as applied in [31]. In Figures 5.8a and 5.8b the performance of PD control for three different gains, as found by [31], can be compared with the results of the reinforcement learning scheme. The conclusion is that the reinforcement learning controller fills in a larger part of the maximally obtainable region of attraction, especially for small gains. In principle, when learning is completed, it should be able to fill in the whole region of attraction. What is interesting to mention is that the way the basin of attraction is filled in during learning largely corresponds to the way the basin of attraction expands using increasingly higher gains in PD control. This corresponds with the expectations, because large control inputs are relatively rare; thus the number of control inputs that are suitable to control the walker in the high-velocity region is smaller, which makes them harder to find by learning. This is the same phenomenon observed in the cubic-linear example, where also in some part of the state space only a limited number of control inputs could steer the system to the desired goal state. Furthermore, initial conditions corresponding to large steps are found earlier, because small steps require higher control inputs (and thus fewer suitable combinations). Again this corresponds to the PD controller controlling the large-step trajectories first, as can be seen in Figures 5.8a and 5.8b. This corresponds to the expectations, because a higher control gain generally results in higher control inputs in some part of the motion.

(a) PD controller basin of attraction for (critically damped) (1) K = 25, (2) K = 50 and (3) K = 100 (source: [31]).

(b) Actuated compass walker maximum basin of attraction reconstructed from the action-value function.

Figure 5.8: experiment 2 compass walker.

5.3 Conclusion

In previous works on reinforcement learning, the interpretation of the action-value function generally gets little attention. Also, the majority of the literature focuses on discrete systems. This chapter has shown how the reward function can be designed such that the action-value function gets an interpretation in terms of dynamics, that is, equilibrium sets and their domain of attraction. To this end it is required that the reward function is designed according to some strict rules.

Firstly, it has been demonstrated for an example system that it is indeed possible to reconstruct the basin of attraction to some predefined goal states and that these goal states become equilibria of the controlled system. The desired closed-loop dynamics can be specified in the reward scheme, i.e. via the closed-loop goal states. Our conjecture that the basin of attraction can be reconstructed holds for this example system. In general it depends on the nature of the system whether learning will converge to the desired dynamics. More precisely, the task the system should perform must be achievable. For example, if part of the state space is uncontrollable or unobservable, convergence to the target dynamics may be impossible.

Secondly, a meaningful interpretation of the action-value function in terms of the domain ofattraction to a balanced motion has been introduced. Hereto some restrictions constrain thereward function. The resulting controller largely fills is the maximal basin of attraction, whichwas available from the literature, to a set of end-of-step postures. However, part of the results


This mismatch is probably caused by discretization regions lying both inside and outside the maximal basin of attraction to balance, which enables diffusion of expected returns through the boundary of the maximal region of attraction.

Still, there are other ways to locomote than walking. Other balanced motions are, for example, running or just standing. Chapter 6 presents a way to compute a more complete set of balanced motions.


Chapter 6

Compound reinforcement learning: dynamics interpretation

In Chapter 5 it has been demonstrated that the domain of attraction to some set of target states can be reconstructed from reinforcement learning. Also, in Chapter 5 it has been demonstrated that the domain of attraction to some walking motion of the compass walker can be reconstructed by translating walking into reaching a particular end-of-step target region. However, the usage of only one target region has many limitations. Therefore this chapter introduces a multi-target-region method.

In the literature, subtask methods are named modular or hierarchical reinforcement learning. An important theoretical work on hierarchical reinforcement learning is [4]. Earlier works on hierarchies for wheeled soccer robots are [16], [2]. The development of hierarchies for legged locomotion is still a very young field of research, but [30] has developed a multi-agent approach, which might be interpreted as a kind of hierarchy. Dividing large tasks into a sequence of smaller subtasks reduces the learning effort in several ways. Firstly, subtasks can be learned in a lower-dimensional subspace, or in a certain part of the state space. Secondly, tasks can be assembled from the subtasks without the need to relearn the larger new task as a whole. Thirdly, cyclic motions can be designed.

The principle of splitting up larger tasks into a sequence of smaller tasks is not new. However, our approach, compound RL, differs somewhat from other approaches and has three particularly relevant properties. Firstly, our approach enables reconstruction of the basin of attraction of the individual tasks. The reward scheme within every subtask is restricted such that the basin of attraction to the individual target regions can be reconstructed as in Chapter 5. Secondly, this basin-of-attraction interpretation allows computation of the connectivity between the subtasks. Most other methods require every combination of subtasks to be learned explicitly. Thirdly, our method has the property that the target regions can be reduced without the need to relearn the whole task. These properties make compound RL potentially interesting for learning balance in legged locomotion. We propose a way to define subtasks such that it may become possible to learn balance in a very general way, and we test the principle on a bouncing ball system.

This chapter is organized as follows. Section 6.1 explains compound motions in detail. Section 6.2 discusses compound RL by means of an example system. Section 6.3 explains how the balance problem can be framed in terms of reaching a set of goal regions and how this is related to compound RL.


6.1 Compound motions

Figure 6.1a shows schematically two example compound motions. Herein B is the basin of attraction and G the goal region. Both the basin of attraction and the goal region depend on the active controller. The subscript denotes the active subtask, n ∈ N. The system is initialized in some initial state and the first controller drives the system to the first goal region. When the first goal region is reached, the second controller is activated, which drives the system to the second goal region, etc. It is possible that in some region of the state space multiple subtasks are achievable. This situation is illustrated in Figure 6.2.

(a) simple compound motion. (b) cyclic compound motion.

Figure 6.1: schematic representation of compound motions.


Figure 6.2: the state is in the domain of attraction of two out of three subtasks.
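The switching scheme described above can be summarized in a short Matlab sketch. It is a hypothetical fragment, not the implementation used in this work; the function-handle arguments subtask_policy, step_dynamics and goal_reached are assumed placeholders for the learned subtask controllers, the system dynamics and the goal-region membership tests.

    % Sketch of executing a compound motion: the active subtask controller is
    % switched as soon as its goal region is reached.
    function s = run_compound_motion(s0, N, subtask_policy, step_dynamics, goal_reached)
        s = s0;                            % initial state, assumed to lie in B_1
        n = 1;                             % index of the active subtask
        while n <= N
            u = subtask_policy(n, s);      % action proposed by the controller of subtask n
            s = step_dynamics(s, u);       % advance the system one (discrete) time step
            if goal_reached(n, s)          % the state has entered goal region G_n
                n = n + 1;                 % activate the controller of the next subtask
            end
        end
    end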

These compound motions use the result from Chapter 5, wherein a particular strict reward scheme was used such that the basin of attraction of some goal region can be reconstructed from the action-value function. This enables us to determine the (largest possible) basin of attraction of very complex acyclic and cyclic motions. Meanwhile, a controller is learned that fills it. Also, it enables computation of the connectivity between the subtasks, as will be explained in Subsections 6.1.1 and 6.1.2.


To the best of our knowledge, other methods in the literature require the designer to specify the connectivity between the modules or to learn it.

6.1.1 Simple connectivity

Given the goal region and the basin of attraction of two subtasks, connectivity is verified by checking whether the goal region of the first subtask lies within the basin of attraction of the second subtask. Subtasks can be connected if:

Definition 6.1 (Subtask connectivity condition). Gn ⊆ Bn+1.

The subtask connectivity condition can be verified by systematically checking whether all elements of Gn are also elements of Bn+1.
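For discretized regions this is a plain set-inclusion test. The fragment below is a minimal sketch under the assumption (made here only for illustration) that Gn and Bn+1 are represented as vectors of discrete state indices.

    % Subtask connectivity condition (Definition 6.1): G_n must be a subset of B_{n+1}.
    function connected = subtask_connected(G_n, B_next)
        connected = all(ismember(G_n, B_next));   % true iff every goal state lies in B_{n+1}
    end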

In many cases, the subtask connectivity condition is not satisfied, as illustrated in Figure 6.3. Also, while the system has not yet completed learning, this condition may not be satisfied even though it will be satisfied when learning is finished. In these cases it can still be possible to connect the subtasks if the goal regions can be reduced such that, after the reduction, the subtask connectivity condition is satisfied. However, computing these reduced subsets requires some additional effort. Subsection 6.1.2 explains how this can be done.


Figure 6.3: compound motion which has a goal region partly outside the basin of attraction of the next goal region.

6.1.2 Complex connectivity: compound RL

In general, the subtask connectivity condition does not hold for every subtask connection, and additional effort is required to investigate whether some sequence of subtasks can be executed. Our approach basically is to find, for every connection, a reduced goal region for which the condition does hold. Consider the situation depicted in Figure 6.3. A reduced goal region Gn^reduced can be found simply as the intersection of Bn+1 with Gn:

Gn^reduced = Bn+1 ∩ Gn. (6.1)

This reduced goal region Gn^reduced in turn has a reduced basin of attraction Bn^reduced. This affects all other connections earlier in the sequence.


In order to compute Bn^reduced, it is necessary to know which expected return relates to which part of the goal region. The classical action-value function does not incorporate this information. Therefore, an adaptation is proposed. Instead of a reward value, we propose to store the state in which a reward is received. As there can be multiple goal states, every entry in the action-value function then consists of a list containing the goal states that can be reached. The action-value function hereby loses its original meaning; we therefore name it the M-table instead. After every action an from state sn, the list in the M-table at (sn, an) is extended with all elements of the lists contained at sn+1, for all an+1. The new reduced basin of attraction does not need to be relearned, but can be determined by only enabling those states in the M-table that have elements of the reduced goal region Gn^reduced in their list.
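A minimal sketch of this bookkeeping is given below. It assumes, for illustration only, a tabular representation in which M is a cell array over discrete states and actions, each cell holding the identifiers of reachable goal states; the function handle goal_ids is an assumed placeholder that returns the goal-state identifiers contained in a state (empty if the state is not a goal state).

    % Sketch of the proposed M-table update after observing transition (s, a) -> s_next.
    function M = update_M(M, s, a, s_next, goal_ids)
        reachable = goal_ids(s_next);             % s_next itself may contain goal states
        for a_next = 1:size(M, 2)                 % collect everything reachable from s_next
            reachable = union(reachable, M{s_next, a_next});
        end
        M{s, a} = union(M{s, a}, reachable);      % extend the list stored at (s, a)
    end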

As the basin of attraction Bn^reduced is always smaller than or equal to Bn, it is possible that earlier connections lose the subtask connectivity condition. Connections later in the sequence are not affected. Therefore, only connections earlier in the sequence have to be checked. An algorithm that checks all connections should begin with the latest subtask in the sequence and go backward until all connections are reduced and checked. If any Gn^reduced becomes empty, the connection between the subtasks is impossible. Otherwise, when all connections are checked and no Gn^reduced is empty, the compound motion is possible. Algorithm 6.1 follows this procedure. If cyclic tasks are considered, it can be concluded that the cyclic motion is possible if during one cycle all connections satisfy the subtask connectivity condition (possibly after computing some reduced subtasks), see Algorithm 6.2. Therefore, in many cases there is no need to check an infinite number of connections in cyclic motions.


Figure 6.4: the sequence connectivity algorithm should check the conditions backward, beginning with the last subtask, because the reduced basins of attraction only have an effect on earlier connections.

Algorithm 6.1 subtask connectivity check algorithm, non-cyclic motions

popout = false
n = N
while ¬popout do
    n = n − 1
    Gn^reduced = Bn+1 ∩ Gn
    if n ≤ 0 then
        popout = true
        output: all connections possible
    end if
    if Gn^reduced = ∅ then
        popout = true
        output: subtask connection not possible
    end if
end while


Algorithm 6.2 subtask connectivity check algorithm, cyclic motions

popout = false
k = 0
ConnectionVector = ZeroVector(N × 1)
while k < K do
    k = k + 1
    n = N
    if SumOfElements(ConnectionVector) = N then
        output: all connections possible
        exit program
    end if
    ConnectionVector = ZeroVector(N × 1)
    while ¬popout do
        n = n − 1
        Gn^reduced = Bn+1 ∩ Gn
        if Gn^reduced = ∅ then
            popout = true
            output: subtask connection not possible
            exit program
        end if
        if Gn^reduced = Gn then
            ConnectionVector(n, 1) = 1
        end if
    end while
end while

The additional memory requirement of this method can be quite severe, due to the need for a separate extended action-value function (M-table) for every subtask. Therefore, it could be questioned whether it is really worth the effort. But, to the best of our knowledge, there exists no better way to solve this issue. Some might argue that it is possible not to care about connectivity and simply to learn connectivity by trial and error. However, learning connectivity faces the same problem, namely that connectivity critically depends on all preceding and future subtasks, and thus requires knowledge of all other subtasks. Section 6.2 considers an example system on which the proposed compound RL is demonstrated.

6.2 Multi-surface bouncing ball example

This section considers the example of a bouncing ball which can bounce on three surfaces. The compound RL approach can be applied to this example, and it can be demonstrated how complex connectivity works. Before proceeding with the more complex multi-surface bouncing ball system, a simple bouncing ball is considered to show how the attraction domain and goal region depend on the restitution coefficient (Figure 6.5a).


6.2.1 Bouncing ball

The dynamics of the bouncing ball (Figure 6.5a) are given by:

ẍ = −g ,    ¬(x = 0 ∧ ẋ ≤ 0)    (6.2)
ẋ+ = −ε ẋ− ,    x = 0 ∧ ẋ ≤ 0.    (6.3)

Herein + denotes a post-impact value and − denotes a pre-impact value. The goal region is the half-line:

G = {(x, ẋ) ∈ R2 | x = 0, ẋ > c},    (6.4)

where the constant c > 0 and the restitution coefficient ε ≥ 0. Figures 6.5b to 6.5d show the regions of attraction and the goal regions corresponding to different values of ε. If ε < 1, part of the goal region G lies outside the basin of attraction B. Consequently, it is not guaranteed that a sequence of subtasks (just bouncing, without control) can be repeated a number of times, even if the initial condition was in the basin of attraction B. Actually, it is known for this simple system that for any c > 0 there will always exist an instant after which the goal region cannot be reached anymore. This situation can be compared with passively walking down a slope which is not steep enough: every step the walker loses some energy, and eventually the walker will fall and come to a standstill.

(a) schematic representation of the bouncing ball system. (b) ε < 1. (c) ε = 1. (d) ε > 1.

Figure 6.5: bouncing ball and three cases with different ε, on which connectivity depends. B is the domain of attraction of goal region G.


6.2.2 Multi-surface bouncing ball

This subsection proceeds with a system which has a set of different subtasks, to demonstrate the complex connectivity case of compound motions.

The system

Here the example of a bouncing ball is considered again, but this time there are three different surfaces S beneath the ball, between which a control input u can switch whenever ẋ = 0 (and x > 0). The control input is quantized: it can only switch between surfaces. The dynamics are:

ẍ = −g ,    ¬(x = 0 ∧ ẋ ≤ 0)    (6.5)
ẋ+ = −ε(S) ẋ− ,    x = 0 ∧ ẋ ≤ 0    (6.6)
Ṡ = 0 ,    ¬(x > 0 ∧ ẋ = 0)    (6.7)
S+∗ = S−∗ + u(x, ẋ, S) ,    x > 0 ∧ ẋ = 0.    (6.8)

Herein + denotes a post-impact value, − a pre-impact value, +∗ a post-switch value and −∗ a pre-switch value. The surfaces have distinct restitution coefficients:

ε = e1 = 0.8 ,    S = 1
ε = e2 = 1 ,    S = 2
ε = e3 = 1/0.8 ,    S = 3.

The extent to which the ball can be steered by the control input is restricted and depends on the height at the switch instant ẋ = 0 (and x > 0) in the following way:

u ∈ {0} ,    x < L1
u ∈ {−1, 0, 1} ,    L1 ≤ x < L2
u ∈ {−2, −1, 0, 1, 2} ,    x ≥ L2.

Herein L1 and L2 denote certain height levels. In addition, u is bounded such that min(S) = 1 and max(S) = 3. Thus the height is divided into a fixed number of levels; depending on the height relative to these levels, the surface may be shifted to the left or to the right by the input u. Figure 6.6 schematically shows the system.


Figure 6.6: schematic representation of the multi-surface bouncing ball system.
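The height-dependent, quantized input set can be expressed compactly. The fragment below is an illustrative sketch only (the function name admissible_inputs is an assumption, not taken from the implementation used in this work):

    % Admissible control inputs at a switch instant, given the apex height x and the
    % current surface S; the bound min(S) = 1, max(S) = 3 is enforced afterwards.
    function U = admissible_inputs(x, S, L1, L2)
        if x < L1
            U = 0;
        elseif x < L2
            U = -1:1;
        else
            U = -2:2;
        end
        U = U(S + U >= 1 & S + U <= 3);   % keep only shifts that stay on surfaces 1..3
    end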


Objective

The objective of the system is defined as tracing a predefined sequence of surfaces. This is called the root task. The root task is divided into a number of subtasks M, being: go from surface 1 to surface 2 (M1→2), go from surface 3 to surface 1 (M3→1), etc. In this example system there are in total 9 of these subtasks. So in every subtask the controller learns how to reach a particular surface. The goal regions of these subtasks are essentially the whole state space above the next surface. However, the controller can only act when the ball reaches its maximal height during a bounce (levels L1, L2). This restricts the domain of attraction. It is clear that these elementary subtasks can be concatenated to achieve any root task, independent of its complexity, although some sequences might never be possible because the ball will drop below L1. If these subtasks were not defined, learning would have to start again for every root task. Therefore, especially when there are many or large root tasks, it is beneficial to divide the root task into combinable subtasks.

The question which sequences are possible has our special interest, as that is what Algorithm 6.1 computes. Essentially we want to know, given any sequence, whether that sequence can be executed. In addition, we want to know the attraction domain of that sequence using some particular controller or any controller. Some mode sequences will not be possible because the ball will drop below L1; other mode sequences can be repeated forever, for example if the restitution coefficients are chosen such that their product is 1.

The example sequence

In the remainder one particular sequence is studied: from S = 2 to S = 3 to S = 1 and to S = 2 again (see Figure 6.7), or, stated differently, M2→3-M3→1-M1→2. This sequence can be repeated many times. The levels L1 and L2 have values 3 and 6, respectively. M2→3 and M1→2 require a single shift, M3→1 requires a double shift.


Figure 6.7: example of a desired sequence (from S = 2 to S = 3 to S = 1 and repeat).

Simulation of the example sequence

Three cycles of the example mode sequence have been simulated, as shown in Figure 6.8. The Matlab ode45 routine with event detection is used to solve the dynamics. The control law is u = Sdes+∗ − S−∗, wherein Sdes+∗ is the target surface. The product of the restitution coefficients is 1. As a result, if initiated at a sufficient height, the motion can continue forever.
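The structure of such an event-driven simulation is illustrated below for the single-surface ball of Subsection 6.2.1. It is a minimal sketch with assumed parameter values (g = 9.81, ε = 0.8, initial height 5) and it omits the surface switching of the multi-surface example: the flight phase is integrated with ode45, the impact event x = 0 is detected, and the restitution map is applied.

    function bouncing_ball_sketch()
        % Event-driven simulation of a single-surface bouncing ball using ode45.
        g = 9.81; eps_r = 0.8;               % assumed gravity and restitution coefficient
        s = [5; 0];                          % initial state [height; velocity]
        t0 = 0; t_end = 10; T = []; X = [];
        opts = odeset('Events', @impact_event);
        while t0 < t_end
            [t, x, te, xe] = ode45(@flight, [t0 t_end], s, opts);
            T = [T; t]; X = [X; x];
            if isempty(te), break; end       % no further impact before t_end
            s = [0; -eps_r * xe(end, 2)];    % impact map: velocity reversal with restitution
            t0 = te(end);
        end
        plot(T, X(:, 1)); xlabel('t'); ylabel('x');

        function ds = flight(tt, y)
            ds = [y(2); -g];                 % flight phase: the ball is in free fall
        end
        function [value, isterminal, direction] = impact_event(tt, y)
            value = y(1);                    % impact when the height crosses zero
            isterminal = 1;                  % stop the integration at the impact
            direction = -1;                  % detect only downward crossings
        end
    end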



Figure 6.8: bouncing ball simulated multi-surface motion (from S = 2 to S = 3 to S = 1 and repeat).

Connection properties of the example system

In a first step the modules are learned independently. M2→3 and M1→2 require a single shift, M3→1 requires a double shift. This means that the height which must be reachable from any state within these modules is L1 for M2→3 and M1→2, and L2 for M3→1. This is visible in the basins of attraction (solid lines in Figures 6.12a, 6.12b, 6.12c). However, if the subtasks are connected in a particular sequence, it may happen that the minimum heights are actually higher as a result of some next subtask. The example system exhibits this behavior, as depicted in Figure 6.9. In order to reach L2 in M3→1, the minimum heights at the switch instants in M2→3 and M1→2 are 4.8 instead of L1 (= 3.0). Thus the basin of attraction is different. This situation is the complex connectivity case from Subsection 6.1.2.


Figure 6.9: bouncing ball multi-surface motion (from S = 2 to S = 3 to S = 1 and repeat): connection properties.


Obtaining Greduced and Breduced

In order to deal with complex connectivity, Q-learning as applied in Chapter 5 must be modified to enable computation of the reduced basin of attraction B^reduced. The simple rule 5.1 proposed in Chapter 5 is not sufficient, because it cannot decide which large Q-value belongs to which part of the goal region. It is required to know which part of the goal region the expected return is related to. This is depicted in Figure 6.10. The goal region G spans a range of ẋ-values. All grey colored values were feasible in the disconnected subtask. However, in this particular connection only the darker grey values are feasible, yielding a G^reduced.


Figure 6.10: M1→2 module with goal region schematic.

The compound motion algorithm stores the feasible values in a list at every entry (x, ẋ, u) of the M-table. This is depicted in Figure 6.11. This M-table, representing B, is filtered by the values in G^reduced to obtain B^reduced. How the basins reduce in the example sequence is shown in Figures 6.12a, 6.12b and 6.12c.


Figure 6.11: grid of the M-table, schematic; each entry (x, ẋ, u) contains an array of reachable ẋ-values.
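A sketch of this filtering step is given below, under the same assumed tabular representation of the M-table as before (a cell array over discrete states and actions); it marks a state as belonging to B^reduced if at least one action still reaches the reduced goal region.

    % Obtain the indicator of B_reduced by filtering the M-table with G_reduced.
    function in_B_reduced = filter_M_table(M, G_reduced)
        [numStates, numActions] = size(M);
        in_B_reduced = false(numStates, 1);
        for s = 1:numStates
            for a = 1:numActions
                if ~isempty(intersect(M{s, a}, G_reduced))
                    in_B_reduced(s) = true;    % some action still reaches the reduced goal
                    break;
                end
            end
        end
    end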

The reduced basin of attraction Bn^reduced can thus be determined from the reduced goal region. By doing this systematically according to Algorithm 6.1 or 6.2, it can be predicted whether a certain sequence of modules can indeed be executed, as follows. Firstly, starting from the last subtask, or from an arbitrary one in case of a cyclic motion, we remove the part of the goal region that lies outside the domain of attraction of the next goal region. Secondly, we recompute the region of attraction of the smaller goal region. This process is repeated for the next connection (backward) in the sequence until the whole sequence has been checked.


(a) M2→3. (b) M3→1. (c) M1→2.

Figure 6.12: borders of the theoretical basins of attraction. The solid line denotes the unconnected case, the dashed line the connected case. B denotes the basin of attraction, not B the region outside the basin of attraction.

If none of the goal regions becomes empty, the motion is possible; if one region becomes empty, the motion is not possible. In the case of cyclic motions, however, it is impossible to check the whole sequence, as it is an infinite sequence. Fortunately, if in one cycle of adjustments no region needs to be shrunk, the cycle can continue forever, because all target regions lie within the basins of attraction. Again, if one target region becomes empty, the root task is not possible. The connection check algorithm has been programmed in Matlab. Given a particular sequence, the algorithm outputs true, the subtasks can be connected, or false, they cannot be connected. In addition to the feasibility of the connection, the domain of attraction B^reduced in every subtask can be determined explicitly. This has been done for the example sequence and is shown in Figures 6.13a to 6.13f. At negative velocities some error is present; this is the result of the Matlab ode45 event detection. The event instant is not always accurately detected, which sometimes leads to a higher velocity after impact. Even a small error leads to an incorrect update of the M-table, which can propagate in backward time (thus into the negative velocity region). The positive velocity regions show no significant deviations from the theoretical result.


(a) M2→3. (b) M2→3. (c) M3→1. (d) M3→1. (e) M1→2. (f) M1→2.

Figure 6.13: borders of the basins of attraction determined by simulation. The left figures represent the disconnected case and the right figures the connected case. The white line denotes the theoretical boundary of the attraction domain; the brown colored regions indicate the attraction domain (here ∃u s.t. ¬(M(x, ẋ, u) = ∅)).


6.3 Balance and compound motions

Section 5.2 has shown for the compass walker example that the basin of attraction of a set of end-of-step postures can be reconstructed and that the reinforcement learning controller can steer the system to that set. However useful, this result is still quite far from the general definition of balance. The general balance condition states that some set of states, which correspond to a fall, must be avoided (see Definition 3.2). This section bridges the gap between the two and derives a subtask division for legged locomotion.

We convert the problem of avoiding a set of falling states (balance Definition 3.2) into a problem of reaching a finite set of goal regions. Firstly, we argue that falling states can reasonably be defined as the closure of some particular contacts, because in general, when something falls, undesirable contact with the ground is made by some part of the object (Definition 3.1). In the hybrid systems framework and the complementarity Lagrangian framework, modes are distinguished based on whether or not contacts are closed. Therefore, this framework provides the information whether or not the system is in a fall mode. Secondly, we argue that avoiding a set of fall modes equals being able to reach a non-falling mode within a single mode transition. As the number of contact modes defined in the hybrid system (Section 2.1) is finite, balance can be completely defined in this way. The total number of contact modes equals the sum of the fall contact modes and the non-fall contact modes.

More formally, let Qf be the collection of all falling modes qf_i, Qg the collection of all non-falling modes qg_j, and Q the set of all modes. It holds that Qf ∪ Qg = Q. Avoiding Qf can also be defined as:

Proposition 6.1. If the state is in some s(t) ∈ qg, qg ∈ Qg, and every subsequent mode transition leads to a state s(t + T) ∈ qg, qg ∈ Qg, then the state will never arrive in Qf.

This statement is useful as it allows translating the problem of avoiding certain modes into a problem of reaching, within a single mode transition, any mode out of a finite number of modes. The goal regions G equal the guards as defined in the hybrid system framework, as they specify the mode transitions. All guards belonging to transitions from non-falling to non-falling modes are goal regions.

Consider, for example, the compass walker in Figure 6.14, which has three contacts. Let a fall be defined as the closure of contact three. The walker can be in 8 modes, of which 4 involve the closure of contact three. So, being in any non-falling mode, there are three options to go to a new target mode, in addition to the option to remain in the current mode. All these options together characterize balance, because if the current state is in at least one of the basins of attraction of the different target modes, or if the current mode can be maintained, a balanced transition is possible. The situation is illustrated in Figure 6.15. Note that in this figure only 4 regions can be displayed that are connected to all other regions in 2D. Therefore, all modes that involve the closure of contact 3 are gathered in the Fall region. These fall modes can be reached from any other region. The transitions to and from these fall modes are depicted by dots and crosses, representing arrows out of and into the plane.

Being in mode xxo, double stance, by these compound motions the walker is in principle not only able to make a simple step xxo oxo xxo, but also, for example, a run which includes a flight phase, xxo oxo ooo oxo xxo, or a backward step; in theory any kind of motion which does not include the closure of contact three. In practice, computational intensiveness keeps us from actually calculating all these possible motions. However, it is quite likely that the computation is achievable, because the learning of every subtask can be executed independently and therefore in parallel on multiple platforms, which can afterwards share their learned subtasks.


It has been demonstrated in the previous chapter that one subtask could be learned, although it did not incorporate the additional computation required to establish connectivity. Connectivity calculations for a simple system were performed and discussed in Section 6.2.

mode    contact 1   contact 2   contact 3
1       x           x           x
2       x           x           o
3       x           o           x
4       x           o           o
5       o           x           x
6       o           x           o
7       o           o           x
8       o           o           o

Figure 6.14: compass walker contact definitions. x denotes a closed contact and o an open contact.


Figure 6.15: schematic example of single-step balancability. The system begins in mode 2: xxo. Transition between all modes is possible. All four falling modes have been collected in Fall. Transitions to and from fall states are denoted with dots and crosses, representing arrows out of and into the plane.

In total 8 × 8 kinds of mode transitions are possible (including staying in the same mode). However, there are only 4 × 4 mode transitions between non-falling modes, which are all possible balanced transitions. Therefore, we define the set of subtasks as the set of mode transitions between non-falling modes; there are 16 of these in this example. The target regions are defined as the whole state space of the subsequent mode, and the basin of attraction to that mode lies in the current mode. Note that no target regions overlap, because every mode occupies a separate part of the state space. Also, not every other mode is reachable from every state in some mode. Therefore, the subtask connectivity condition will in general not be satisfied.
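For this three-contact example the bookkeeping of modes and subtasks is easily enumerated; the fragment below is a small illustrative sketch (a fall is taken to be the closure of contact three, as above):

    % Enumerate the 2^3 contact modes of the compass walker and count the subtasks
    % (transitions between non-falling modes).
    modes = dec2bin(0:7) == '1';       % 8-by-3 logical matrix; column j: contact j closed
    nonfall = ~modes(:, 3);            % non-falling modes have contact 3 open
    numSubtasks = nnz(nonfall)^2;      % 4 x 4 = 16 transitions between non-falling modes
    fprintf('%d non-falling modes, %d balanced subtasks\n', nnz(nonfall), numSubtasks);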

The basin of attraction computed within each subtask to the target region establishes whether a single transition is possible, because the basin of attraction within one module is only related to reaching the next mode. It can be verified which of the 4 available subtasks are feasible, given any initial condition in any mode, by simply checking whether the initial condition lies in one of the basins of attraction of the controlled systems that steer the system to a next mode (the situation is analogous to Figure 6.2 if, for example, two subtasks are possible).


If the state is in the basin of attraction of one of the goal regions, only single-transition balance is guaranteed. It does not guarantee that balance can be maintained after that transition, unless all target regions themselves lie completely within the basins of attraction of all possible next feasible transitions, which will in general not be the case. It is likely that some sequences of transitions will be possible, while others will not. Therefore, it would be very interesting indeed if we could compute which mode sequences are possible and which are not, without the need to learn this for every possible sequence.

6.3.1 Integration of balance control in a larger scheme

Balance is not the only requirement and objective of legged machines. Balance could be perceived as a constraint on the design of motions. Therefore, this section discusses some ways in which it could be integrated into the control scheme as a whole.

Let Q be any action-value function obtained by Q-learning. The M-tables, representing for example balance, can be used to constrain action selection in the following way:

a = arg max_a Q(s, a), a ∈ A s.t. ¬(M(s, a) = ∅),    if (p > ε) ∧ (∃a ∈ A s.t. ¬(M(s, a) = ∅))
a = arg max_a Q(s, a), a ∈ A,    if (p > ε) ∧ ¬(∃a ∈ A s.t. ¬(M(s, a) = ∅))
a = random(A),    if (p ≤ ε).    (6.9)

This is similar to epsilon-greedy. As in epsilon-greedy, this action selection rule selects the action with the highest expected return a fraction (1 − ε) of the time and a random action a fraction ε of the time (0 ≤ ε ≤ 1); p is a randomly drawn number in the range [0, 1]. This action selection rule guarantees that, in case of a greedy action, only balanced motions are selected from the active Q-table whenever possible. Also, it guarantees that all states continue to be visited.
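A sketch of this action selection rule is given below. It assumes, for illustration, that Qrow contains the Q-values of the current state, that Mrow is a cell array with the corresponding M-table lists, and that epsilon is the exploration rate; these names are assumptions and not taken from the implementation used in this work.

    % Balance-constrained epsilon-greedy action selection, cf. rule (6.9).
    function a = select_action(Qrow, Mrow, epsilon)
        numActions = numel(Qrow);
        balanced = ~cellfun(@isempty, Mrow);      % actions whose M-list is non-empty
        if rand <= epsilon
            a = randi(numActions);                % exploratory random action
        elseif any(balanced)
            Qmasked = Qrow;
            Qmasked(~balanced) = -inf;            % restrict the greedy choice to balanced actions
            [qmax, a] = max(Qmasked);
        else
            [qmax, a] = max(Qrow);                % no balanced action available: plain greedy
        end
    end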

In the case of simple connectivity, no M-table is required and instead there is only a Qbalance. In this case all states that are in the current basin could be favored slightly in a total compound action-value function which is shaped by a variety of reward criteria. For example:

Q(s, a)tot = Q(s, a)balance + Q(s, a)other.    (6.10)

Herein Q(s, a)tot is the total compound action-value function, Q(s, a)balance the action-value function which evolved from the strict balance reward scheme, and Q(s, a)other all other action-value functions corresponding to any other reward scheme. Action selection can take place according to epsilon-greedy. In this way all states continue to be visited, which is required for convergence.

Another way to utilize what has been learned by the balance subtasks is to verify whether a trajectory proposed by a designer is balanced and what control inputs are required to trace it. This verification could be done in two steps. Firstly, the trajectory consists of a series of states, and any next discrete state on the trajectory must be reachable by a control input from the previous state. Secondly, every state must lie within the computed basin of balance.

A third way to utilize what has been learned is to counteract large disturbances such that at least balance is maintained, for example by switching to a greedy policy using the action-value function corresponding to the balance subtasks if the tracking error of some other controller exceeds some limit.


In conclusion, there are many possibilities by which balance can be integrated. It is important to note that, with the proposed definition of the reward scheme, balance emerges, which is already interesting in itself. However, balance alone, or executing a sequence of contact modes, is generally not specific enough to be practically useful, just like a stable system is not interesting in itself, but rather the specific kinds of motions that can be executed with it. Therefore, proper integration is an important issue.

6.4 Conclusion

A method, called compound reinforcement learning, has been developed to divide a complex task into subtasks. Every subtask terminates in a goal region; this enables, with an adaptation of Q-learning, verification of whether a sequence of subtasks can be executed, without the need to relearn. An algorithm has been proposed which checks the connectivity of subtasks. In the literature, methods to break down the learning problem into subtasks are available; however, to the best of our knowledge, the connectivity of subtasks is either not considered or learned explicitly. Additionally, our approach allows reconstruction of the attraction domain.

It has been explained how the introduced subtask division could be used, in a preliminary way, to balance legged locomotion. Hereto, firstly, the most general definition of balance has been adopted, not closing certain contacts, and, crucially, balance has been redefined as reaching a new contact mode from a finite set of contact situations or maintaining the current contact mode. By connecting the defined subtasks, many types of locomotion can be performed: walking, running, starting, stopping, stepping backward, etc. Also, all these types of actions are directly accessible to recover from a large disturbance. In practice, however, the computational requirements are extremely high. It has been explained how balance can be computed for legged machines, but the connectivity algorithm has been implemented on a simple multi-surface bouncing ball system as a demonstration of the principle.

In the future, the proposed division into subtasks could be verified for legged locomotion by simulation of a walker and formalized. Furthermore, it may be established by proof whether this division into subtasks indeed closely resembles the balance condition.


Chapter 7

Conclusions and recommendations

7.1 Conclusions

In this report it has been investigated how the reinforcement learning action-value function can be interpreted in terms of dynamics for legged locomotion. The first important observation is that the interpretation of the action-value function depends on the design of the reward scheme. The reward scheme can be designed freely, and this offers the opportunity to design a reward scheme such that the action-value function has an interpretation in terms of dynamical systems or in terms of legged locomotion balance. The property that the reward scheme can be designed freely has been exploited in this work.

It has been validated on a qubic-linear system example that it is indeed possible to interpret the action-value function in terms of dynamical systems if a suitable reward scheme is chosen; more precisely, in terms of a basin of attraction to a certain region in the state space. Additionally, it has been demonstrated on a compass walker example that it is possible to choose a reward scheme such that the basin of attraction to a set of end-of-step postures can be approximately reconstructed from the action-value function. The basin of attraction has been visualized for both systems and has been compared to the analytical results.

Furthermore, it has been proposed that the obtained interpretation of the action-value function is useful for the design of a subtask division, called compound reinforcement learning. More specifically, connectivity conditions between controllers that execute subtasks have been established, which in other works had to be specified by the designer or learned by trial and error. The advantage of using subtasks is that complex root tasks can be constructed from the subtasks without the need to relearn the individual subtasks (or the new task as a whole). The balance problem, generally defined as non-falling, can be reformulated as the problem of reaching any target region out of a set of target regions.

The first contribution of this work, although seemingly straightforward, is that it has been demonstrated that the domain of attraction to basically any kind of predefined motion of a dynamical system can indeed be computed from the action-value function. This basic knowledge could prove useful to derive design guidelines for reward schemes, to visualize dynamical behavior and to improve existing reinforcement learning algorithms. The second important contribution is that the balance problem has been reformulated as a problem of reaching any target region out of a set of target regions. The third important contribution is that an algorithm has been proposed to check the connectivity of a sequence of subtasks and to compute the basin of attraction to that sequence of subtasks without the need to relearn.


This enables parallel learning.

7.2 Recommendations

The work described in this report leaves many questions and challenges for future research. Firstly, some conjectures have been made for which it would be very interesting to find a mathematical proof and to formalize them.

• The basin of attraction to a certain target region can be reconstructed from the action-value function. Establish by theoretical proof that, for some classes of systems, in the limit of infinitely fine discretization the region of attraction as reconstructed from the action-value function indeed equals the theoretical region of attraction.

• Prove the computational benefit of using the proposed subtasks.

• Prove that the subtask connectivity conditions are indeed valid.

Secondly, reinforcement learning algorithms store the dynamics in terms of a reward structure. As a result, every time a new reward structure is imposed, previous knowledge cannot be reused, or only in a limited way through hierarchical reinforcement learning. It could be beneficial to learn the causal dynamics, i.e. which actions lead to which state transitions, separately from the reward structure. In this way, once the causal dynamics have been found, then, given any reward structure, the corresponding controller can be computed without a new, time-consuming interaction process (with the physical world or a simulation of the physical world).
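As a rough illustration of this idea (not part of the present work), a tabular transition model could be recorded during interaction, after which a value function for an arbitrary reward vector R can be computed offline by value iteration. The sketch below assumes discrete states and actions, visit counts stored in a numStates-by-numActions-by-numStates array counts, and a state reward vector R; all of these are assumptions made for illustration.

    % Sketch: learn the causal dynamics (a tabular transition model) separately from
    % the reward; a value function for any reward R then follows without new interaction.
    function V = value_from_model(counts, R, gamma)
        numStates = size(counts, 1);
        numActions = size(counts, 2);
        P = bsxfun(@rdivide, counts, max(sum(counts, 3), 1));   % empirical P(s'|s,a)
        V = zeros(numStates, 1);
        for it = 1:200                                          % value iteration
            Q = zeros(numStates, numActions);
            for a = 1:numActions
                Q(:, a) = R + gamma * squeeze(P(:, a, :)) * V;  % one-step lookahead
            end
            V = max(Q, [], 2);
        end
    end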

Thirdly, the proposed division in subtasks aimed to balance legged locomotion should be verifiedby simulation and experiments with legged machines.


Bibliography

[1] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996. ISBN: 1-886529-10-8.

[2] J.W. Brandenburg. Hierarchical reinforcement learning for a kicking soccer robot. Master's thesis, Delft University of Technology, 2007.

[3] B. Brogliato. Modeling, stability and control of biped robots - a general framework. Automatica, 40:1647–1664, 2004.

[4] T.G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

[5] M. Garcia, A. Chatterjee, A. Ruina, and M.J. Coleman. The simplest walking model: stability, complexity and scaling. ASME Journal of Biomechanical Engineering, 120(2):281–288, 1998.

[6] Ambarish Goswami. Postural stability of biped robots and the foot-rotation indicator (FRI) point. The International Journal of Robotics Research, 18(6):523–533, 1999.

[7] W.P.M.H. Heemels and B. Brogliato. The complementarity class of hybrid dynamical systems. European Journal of Control, 9, 2003.

[8] W.P.M.H. Heemels and B. de Schutter. Lecture notes: Modeling and control of hybrid dynamical systems, 2007.

[9] D.G.E. Hobbelen. Limit Cycle Walking. PhD thesis, Delft University of Technology, 2008.

[10] D.G.E. Hobbelen and M. Wisse. A disturbance rejection measure for limit cycle walkers: the gait sensitivity norm. IEEE Transactions on Robotics, 23(6):1213–1224, 2007.

[11] Philip Holmes, Robert Full, Dan Koditschek, and John Guckenheimer. The dynamics of legged locomotion: models, analyses and challenges. SIAM Review, 48:207–304, 2006.

[12] Y. Hurmuzlu and G.D. Moskowitz. Role of impact in the stability of bipedal locomotion. International Journal of Dynamics and Stability of Systems, 1(3):217–234, 1986.

[13] J. Denk and G. Schmidt. Synthesis of a walking primitive database for a humanoid robot using optimal control techniques. Proceedings of the IEEE-RAS International Conference on Humanoid Robots, pages 319–326, 2001.

[14] J. Denk and G. Schmidt. Synthesis of walking primitive databases for biped robots in 3D environments. Proceedings of the IEEE International Conference on Robotics and Automation, 2003.


[15] Yasuo Kuniyoshi and Shinji Sangawa. Early motor development from partially ordered neural-body dynamics: experiments with a cortico-spinal-musculo-skeletal model. Biological Cybernetics, 95:589–605, 2006.

[16] J. Kuznetsov. Towards scalable and modular reinforcement learning on a RoboCup robot. Master's thesis, Delft University of Technology, 2005.

[17] M. Vukobratovic, B. Borovac, and V. Potkonjak. ZMP: a review of some basic misunderstandings. International Journal of Humanoid Robotics, 3:153–176, 2006.

[18] M. Vukobratovic, B. Borovac, and V. Potkonjak. Towards a unified understanding of basic notions and terms in humanoid robotics. Robotica, 25:87–101, 2007.

[19] M. Vukobratovic and Y. Stepanenko. On the stability of anthropomorphic systems. Mathematical Biosciences, 15:1–37, 1972.

[20] M.W. Spong, S. Hutchinson, and M. Vidyasagar. Robot Modeling and Control. John Wiley & Sons Inc, 2006. ISBN: 9780471649908.

[21] J.E. Pratt and R. Tedrake. Velocity based stability margins for fast bipedal walking. Proceedings of the First Ruperto Carola Symposium on Fast Motions in Biomechanics and Robotics: Optimization and Feedback Control, pages 1–45, 2005.

[22] R. Leine and H. Nijmeijer. Dynamics and Bifurcations of Non-Smooth Mechanical Systems. Springer, 2004. ISBN: 978-3-540-21987-3.

[23] S. Collins, A. Ruina, R. Tedrake, and M. Wisse. Efficient bipedal robots based on passive-dynamic walkers. Science, 307:1082–1085, 2005.

[24] Philippe Sardain and Guy Bessonnet. Forces acting on a biped robot. Center of pressure - zero moment point. IEEE Transactions on Systems, Man, and Cybernetics - Part A, 34(5):630–637, 2004.

[25] E. Schuitema. Hierarchical reinforcement learning. Master's thesis, Delft University of Technology, 2006.

[26] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[27] Toru Takenaka. The control systems for the Honda humanoid robot. Age and Ageing, 2006.

[28] Yuichi Tazaki and Jun-ichi Imura. Graph-based model predictive control of a planar bipedal robot. 2006.

[29] R. Tedrake, T.W. Zhang, and H.S. Seung. Learning to walk in 20 minutes. Proceedings of the Fourteenth Yale Workshop on Adaptive and Learning Systems, Yale University, New Haven, CT, 2005.

[30] S. Troost. Multi-agent reinforcement learning within individual robots. Master's thesis, Delft University of Technology, 2008.

[31] M. Wisse, A.L. Schwab, R.Q. van der Linde, and F.C.T. van der Helm. How to keep from falling forward: elementary swing leg action for passive dynamic walkers. IEEE Transactions on Robotics, 21(3):393–401, 2005.


Appendix A

Mathematical review

A.1 Convexity

A convex set contains all line segments between any two points in the set.

Definition A.1 (Convex Set). A set C ⊂ Rn is convex if ∀x ∈ C and ∀y ∈ C also (1 − q)x + qy ∈ C for arbitrary q ∈ (0, 1).

A.2 Set-valued functions

A single-valued function f : R → R assigns a single function value f(x) to any element x of its domain. Discontinuous functions do not fit well in the class of single-valued functions, because at a discontinuity one may want to associate multiple values with a single element of the domain. A set-valued function assigns a set of values F(x) to any element x of its domain. Discontinuous functions belong to this more general class of set-valued functions.

A.3 Normal cone, tangential cone and proximal point

Definition A.2 (Cone). A cone is a subset C ⊂ Rn for which it holds that λx ∈ C for all x ∈ C and all λ > 0.

Definition A.3 (Normal Cone). The normal cone of C at x ∈ C is the set of vectors y that are normal to C at x. That is:

NC(x) = {y | yT(x∗ − x) ≤ 0, ∀x∗ ∈ C}, x ∈ C.

Definition A.4 (Indicator function). Let C be a convex set. Then the indicator function of C is defined as:

ψC(x) = 0 ,    x ∈ C
ψC(x) = +∞ ,    x ∉ C.

Definition A.5 (Proximal Point). The proximal point of a convex set C to a point z is the closest point x ∈ C to z. That is:

proxC(z) = arg min{‖z − x∗‖ : x∗ ∈ C} ,    z ∈ Rn.
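As a simple illustration, for a box C = {x | lb ≤ x ≤ ub} (a convex set) the proximal point has a closed form, namely a componentwise clamp. The sketch below is an assumed example, not taken from the report:

    % Proximal point (Euclidean projection) of z onto the box C = [lb, ub], componentwise.
    function x = prox_box(z, lb, ub)
        x = min(max(z, lb), ub);   % clamp each coordinate of z into [lb(i), ub(i)]
    end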
