
Studying Evolving Software Ecosystems Inspired by Ecological Models


DESCRIPTION

Research in progress presented by Tom Mens and Maelick Claes (Software Engineering Lab, University of Mons) at the SATToSE 2013 software evolution research seminar at the University of Bern, 9 July 2013


Page 1: Studying Evolving Software Ecosystems Inspired by Ecological Models

Studying Evolving Software Ecosystems inspired by ecological models

Tom Mens, Maelick Claes
Service de Génie Logiciel

Philippe Grosjean
Service d’écologie numérique des milieux aquatiques

[email protected]/genlog/projects/ecos

Page 2: Studying Evolving Software Ecosystems Inspired by Ecological Models

9 July 2013 - SATToSE, Bern

Collaborators


?

Page 3: Studying Evolving Software Ecosystems Inspired by Ecological Models

Long-term goals

• Determine the main factors that drive the success or failure of OSS projects within their ecosystem

• Investigate new techniques and mechanisms to predict and improve survivability of OSS projects
  – Inspired by research in biological ecology

• Use these insights to help
  – the developer community to improve upon their practices
  – companies and users to compare and adopt OSS projects

Page 4: Studying Evolving Software Ecosystems Inspired by Ecological Models

Standing on the shoulders of giants

Lamarck

Lotka

Volterra

Darwin

Pareto

Page 5: Studying Evolving Software Ecosystems Inspired by Ecological Models

Terminology: Biological ecosystem

Definitions

• Ecology: the scientific study of the interactions that determine the distribution and abundance of organisms

• Ecosystem: the physical and biological components of an environment considered in relation to each other as a unit
  – combines all living organisms (plants, animals, micro-organisms) and physical components (light, water, soil, rocks, minerals)

Example: coral reefs

• High biodiversity: polyps, sea anemones, fish, mollusks, sponges, algae

Page 6: Studying Evolving Software Ecosystems Inspired by Ecological Models

Terminology: Software ecosystem

Business-oriented view

• “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.” (Jansen et al. 2009)

Examples

• Eclipse
• Android and iOS app store

Page 7: Studying Evolving Software Ecosystems Inspired by Ecological Models

Software ecosystem

Development-centric view

• “a collection of software products that have some given degree of symbiotic relationships.” (Messerschmitt & Szyperski 2003)

• “a collection of software projects that are developed and evolve together in the same environment.” (Lungu 2008)

Examples

• Gnome, KDE

• Debian, Ubuntu

• R’s CRAN

• Apache

Page 8: Studying Evolving Software Ecosystems Inspired by Ecological Models


Comparison

Page 9: Studying Evolving Software Ecosystems Inspired by Ecological Models

Biological evolution

AND  BY  A  DUMMY

Page 10: Studying Evolving Software Ecosystems Inspired by Ecological Models

Ecological theories of evolution of species

• Lamarckism
  - animal organs and behaviour can change according to the way they are used
  - those characteristics can be transmitted from one generation to the next to reach a greater level of perfection

• Example
  - giraffes’ necks have become longer while trying to reach the upper leaves of a tree

Jean-Baptiste Lamarck (1744–1829)

Page 11: Studying Evolving Software Ecosystems Inspired by Ecological Models

Ecological theories of evolution of species

• Darwinism
  - all species of life have descended over time from common ancestors
  - this branching pattern of evolution resulted from natural selection, similar to artificial selection in selective breeding

• Example
  – 13 types of Galapagos finches, same habits and characteristics, but different beaks

Charles Darwin (1809–1882)

Page 12: Studying Evolving Software Ecosystems Inspired by Ecological Models

Ecological theories of evolution of species

Hologenome theory

• The unit of natural selection is the holobiont: the organism together with its associated microbial communities, that live together in symbiosis.

• The holobiont can adapt to changing environmental conditions far more rapidly than by genetic mutation and selection alone.

Competition vs cooperation

• While Darwin’s theory emphasises competition (survival of the fittest), hologenome theory also includes cooperation (through symbiosis)

Page 13: Studying Evolving Software Ecosystems Inspired by Ecological Models

Evolution History: Ecology

Darwin (1837)

• Evolution history of species can be represented by a phylogenetic tree.

• Describes the evolutionary relationships among species assuming that they share a common ancestor.

Page 14: Studying Evolving Software Ecosystems Inspired by Ecological Models

Evolution History: Ecology

Reticulate evolution

• Unlike in the Darwinian model, evolution history is represented using a graph structure

• When reticulation of species occurs, two or more evolutionary lineages are combined at some level of biological organization.

• Causes
  – hybrid speciation (two lineages recombine to create a new one)
  – horizontal gene transfer (genes are transferred across species)

Page 15: Studying Evolving Software Ecosystems Inspired by Ecological Models

Evolution History: Software

Page 16: Studying Evolving Software Ecosystems Inspired by Ecological Models

Trophic web (food chain) in natural ecosystems

Page 17: Studying Evolving Software Ecosystems Inspired by Ecological Models

Trophic web in software ecosystems

Producer-consumer relation

Users

Peripheral developers

Core developers

Onion model

TOP-DOWN: change requests & bug reports

BOTTOM-UP: changes in core projects and architecture

Page 18: Studying Evolving Software Ecosystems Inspired by Ecological Models

Core Architecture - or Why developers are polyps

Coral reef ecosystem

• Scleractinian coral polyps are responsible for creating the coral reef structure

• This coral reef is required for the other species of the ecosystem to thrive.

Software ecosystem

• Core developers are responsible for creating the core software architecture

• Based on this core architecture, other developers and third parties can create other projects, services, and so on.

Page 19: Studying Evolving Software Ecosystems Inspired by Ecological Models

Ecosystem Dynamics

Predator-prey relationship

• An instance of the consumer-resource relationship

• Predators (hunting animals) feed upon their prey (attacked animals)

Dynamic model

• Two mutually dependent parametric differential equations (Lotka-Volterra 1925/1926)
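As an illustration of the dynamic model named above, here is a minimal sketch of the classic Lotka-Volterra equations, dx/dt = αx − βxy for the prey and dy/dt = δxy − γy for the predators, integrated with a fourth-order Runge-Kutta step. The function names, the parameter values and the step size are assumptions chosen for illustration; they do not come from the slides.

```python
def lotka_volterra_step(prey, predators, alpha, beta, delta, gamma, dt):
    """Advance the predator-prey model by one RK4 step of size dt.

    dx/dt = alpha*x - beta*x*y     (prey grow, and are eaten)
    dy/dt = delta*x*y - gamma*y    (predators grow by eating, and die)
    """
    def deriv(x, y):
        return alpha * x - beta * x * y, delta * x * y - gamma * y

    k1x, k1y = deriv(prey, predators)
    k2x, k2y = deriv(prey + 0.5 * dt * k1x, predators + 0.5 * dt * k1y)
    k3x, k3y = deriv(prey + 0.5 * dt * k2x, predators + 0.5 * dt * k2y)
    k4x, k4y = deriv(prey + dt * k3x, predators + dt * k3y)
    new_prey = prey + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
    new_predators = predators + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
    return new_prey, new_predators


def simulate(prey0, pred0, steps, dt=0.01,
             alpha=1.0, beta=0.1, delta=0.075, gamma=1.5):
    """Return the list of (prey, predators) states, starting state included.

    The default parameters are illustrative only (hypothetical values).
    """
    trajectory = [(prey0, pred0)]
    for _ in range(steps):
        prey, predators = trajectory[-1]
        trajectory.append(
            lotka_volterra_step(prey, predators, alpha, beta, delta, gamma, dt)
        )
    return trajectory


if __name__ == "__main__":
    # Print every 500th state of a short run to show the oscillation.
    for prey, predators in simulate(10.0, 5.0, 2000)[::500]:
        print(round(prey, 2), round(predators, 2))
```

With these assumed parameters the model has a fixed point at prey = γ/δ and predators = α/β; starting anywhere else produces the familiar oscillation of the two populations around that point.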

Page 20: Studying Evolving Software Ecosystems Inspired by Ecological Models

Software Ecosystem Dynamics

Analogies based on predator-prey relationship

• Debuggers are predators, software defects are prey
  [Calzolari et al. “Maintenance and testing effort modeled by linear and nonlinear dynamic systems,” Information and Software Technology, 43(8): 477–486, 2001]

• Developers are predators, the information they seek is prey
  [Lawrance et al. “Scents in programs: Does information foraging theory apply to program maintenance?” VL/HCC 2007, pp. 15–22]

Page 21: Studying Evolving Software Ecosystems Inspired by Ecological Models

Software Ecosystem Dynamics

Analogies based on predator-prey relationship

• Dual views in a software ecosystem
  – Developers are predators, the projects they work on are prey
  – Projects are predators that feed upon the cognitive resources of their developers

Bipartite developer-project graph

project  1

project  2

project  3
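The bipartite developer-project graph can be sketched as two adjacency maps, one per view. This is only an illustration of the data structure: the function name and the developer names are hypothetical, and the slides do not say how the graph was actually built.

```python
from collections import defaultdict


def build_bipartite(contributions):
    """Build both views of a developer-project graph.

    contributions: iterable of (developer, project) pairs.
    Returns (projects_of, developers_of), two dicts of sets.
    """
    projects_of = defaultdict(set)    # developer -> projects they work on
    developers_of = defaultdict(set)  # project -> developers working on it
    for developer, project in contributions:
        projects_of[developer].add(project)
        developers_of[project].add(developer)
    return projects_of, developers_of


if __name__ == "__main__":
    # Hypothetical contribution records, for illustration only.
    edges = [
        ("dev_a", "project 1"),
        ("dev_a", "project 2"),
        ("dev_b", "project 2"),
        ("dev_c", "project 3"),
    ]
    projects_of, developers_of = build_bipartite(edges)
    print(sorted(projects_of["dev_a"]))
    print(sorted(developers_of["project 2"]))
```

Keeping both maps makes the two dual views equally cheap to query: `projects_of` answers "what does this developer spread their effort over", and `developers_of` answers "whose effort does this project draw on".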

Page 22: Studying Evolving Software Ecosystems Inspired by Ecological Models

Other desirable ecosystem characteristics

• Stability: the capacity to maintain an equilibrium over longer periods of time

• Resistance: the ability to withstand environmental changes without too much disturbance of its biological communities

• Resilience: the ability to return to an equilibrium after a disturbance

• Higher biodiversity favours these characteristics

Defining and Evaluating a Measure of Open Source Project Survivability

Uzma Raja, Member, IEEE Computer Society, and Marietta J. Tretter

Abstract: In this paper, we define and validate a new multidimensional measure of Open Source Software (OSS) project survivability, called Project Viability. Project viability has three dimensions: vigor, resilience, and organization. We define each of these dimensions and formulate an index called the Viability Index (VI) to combine all three dimensions. Archival data of projects hosted at SourceForge.net are used for the empirical validation of the measure. An Analysis Sample (n = 136) is used to assign weights to each dimension of project viability and to determine a suitable cut-off point for VI. Cross-validation of the measure is performed on a hold-out Validation Sample (n = 96). We demonstrate that project viability is a robust and valid measure of OSS project survivability that can be used to predict the failure or survival of an OSS project accurately. It is a tangible measure that can be used by organizations to compare various OSS projects and to make informed decisions regarding investment in the OSS domain.

Index Terms: Evaluation framework, external validity, open source software, project evaluation, software measurement, software survivability.

1 INTRODUCTION

Open Source Software (OSS) projects are developed and distributed for free, with full access to the project source code. Recently there has been a significant increase in the use of these projects. Some OSS projects have earned themselves a high reputation and corporate sponsorships. Large corporations (e.g., IBM, SUN microsystems) are becoming involved with the OSS movement in various capacities. Projections indicate that the corporate interest in OSS projects will grow stronger in the future [1] and these projects will see integration in enterprise architecture [2]. This increased use of OSS projects creates the need for better project evaluation measures.

Traditionally, software projects are evaluated by conformance to budget, schedule, and user requirements [3], [4], [5], [6], [7], [8]. These measures, however, are difficult to map to OSS projects, which are developed through a network of volunteer participants, with no defined budget, schedule, or customer. Although there is a surge in the investment in OSS projects [1], research indicates that a large number of OSS projects fail [9], [10]. Some have questioned the operational reliability and quality of OSS projects [11]. Since there are no contractual or legal bindings for providing OSS updates or maintenance services, businesses investing human or financial capital on adoption of OSS projects need the ability to evaluate whether the project will continue to exist or not [12]. Development teams need to measure project survivability to control and improve performance. Individual and corporate users need a measure of project survivability to compare the available OSS projects before making decisions regarding project adoption.

In this paper, we define and validate a new multidimensional measure of OSS project survivability, called Project Viability. OSS projects provide access to their development archives, thereby providing a unique opportunity to conduct empirical research [13] and develop reliable measures [14], [15]. In the following sections, we define, formulate, and validate project viability. Section 2 provides a brief overview of the existing empirical research in OSS and the background of project survivability. Section 3 defines the dimensions of project viability and formulates an index to measure it. Section 4 discusses the empirical evaluation framework and validates the new measure using OSS project data. Discussion of the results is presented in Section 5 and conclusions are given in Section 6.

2 BACKGROUND

A large number of OSS projects are available for use. However, the failure rate of these projects is high [9]. The evaluation of OSS projects is different than Commercial Software Systems (CSS) [16]. The adopters of OSS projects need a mechanism to compare the chances of failure or survival of the available projects. This would allow better decisions regarding corporate resource investment.

A range of measures has been used in prior research to evaluate OSS projects. Godfrey and Tu [17] examined the evolution of the Linux kernel and its growth pattern in one of the first empirical studies in the OSS domain. They used the Source Lines of Code (SLOC) to compare the growth pattern of Linux to CSS projects and found evidence that OSS growth rates are significantly high compared to CSS projects. Paulson et al. [18] compared OSS and CSS projects using a diverse sample of OSS projects and found no

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 38, NO. 1, JANUARY/FEBRUARY 2012

• U. Raja is with the Department of Information Systems, Statistics and Management Science, The University of Alabama, Box #870226, 300 Campus Drive, Tuscaloosa, AL 35487. E-mail: [email protected].

• M.J. Tretter is with the Department of Information and Operations Management, Texas A&M University, Mail Stop #310D, Wehner Building, College Station, TX 77840. E-mail: [email protected].

Manuscript received 30 Oct. 2009; revised 14 June 2010; accepted 21 Aug. 2010; published online 1 Apr. 2011. Recommended for acceptance by R. Jeffery. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TSE-2009-10-0294. Digital Object Identifier no. 10.1109/TSE.2011.39.

0098-5589/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society

Page 23: Studying Evolving Software Ecosystems Inspired by Ecological Models

Ecosystem Biodiversity

• Biodiversity: The degree of variation of species within a given ecosystem

Measuring diversity

Based on Shannon’s notion of entropy and 2nd law of thermodynamics

Species diversity
X = set of n distinct species x_i
p(x_i) = proportion of all individuals that belong to species x_i

\[ H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i) \]

• Interpretation
  - Maximum diversity if all species have same number of individuals
  - Low diversity if a particular species dominates the others
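The diversity measure H(X) above can be computed directly from per-species counts. The following is a minimal sketch; the function name and the sample counts are assumptions made for illustration.

```python
import math


def shannon_diversity(counts):
    """Shannon entropy H(X) = -sum p(x_i) ln p(x_i), in nats.

    counts: number of individuals observed for each species.
    Species with a count of zero contribute nothing to the sum.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


if __name__ == "__main__":
    even = [25, 25, 25, 25]      # hypothetical: four equally common species
    dominated = [97, 1, 1, 1]    # hypothetical: one species dominates
    print(shannon_diversity(even), math.log(4))
    print(shannon_diversity(dominated))
```

This matches the interpretation on the slide: for n equally abundant species the value reaches its maximum, ln n, and it falls towards 0 as a single species comes to dominate.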

Page 24: Studying Evolving Software Ecosystems Inspired by Ecological Models

Evolution of diversity in open source software: Econometric indices

• Gini and Theil are measures of inequality in a distribution

Figure 3. Box plot of Gini coefficients across all analyzed systems.

[6], a new managed language for the .NET platform, was announced. Like in Java, C# programs are compiled into a machine-independent, language-appropriate representation defined by the Common Language Infrastructure [21]. Moreover, C# and Java are very closely related and we therefore asked ourselves whether programs written in C# exhibit distribution profiles similar to the ones we observed in Java. Unfortunately, the number of freely-available, open-source systems developed in C# framework that met our selection criteria is rather limited. So, we began our study with systems that were originally written in Java and had been ported to the .NET platform in order to take advantage from the knowledge gained in the analysis of their respective Java counterparts.

For the .NET metrics extraction, we used CLI [18], an assembly reader library that provides access to both the metadata and byte code. We added a small wrapper for the computation of the Gini coefficients and stored the resulting data in a text file for further processing with JSeat.

We collected metrics data from four .NET systems: iTextSharp, NHibernate, SharpDevelop, and NAnt. The analysis of our 10 measures produced Gini coefficients equivalent to the ones determined for Java systems. However, there were also exceptions. We observed a shift exceeding 0.4 (i.e., individual Gini coefficients doubled in value) for almost all measures in NAnt version 0.8.3-rc1. The Gini coefficients stayed high until version 0.84-rc1, where they assumed “normal” values again. An inspection of the developer logs provided an explanation: in version 0.8.3-rc1, the NAntContrib project was integrated into the NAnt distribution. This project defines a number of utilities whose metrics exhibit very uneven distribution profiles caused by a centralization of event handling in a few classes. In version 0.84-rc1, the developers removed NAntContrib from NAnt resulting in a change by −0.4, returning the Gini coefficients for NAnt to their previous values.

4. Discussion

4.1 The Value of the Gini Coefficient

We discovered in our analysis that Gini coefficients normally change little between adjacent releases. However, changes do happen and may result in significant fluctuations in Gini coefficients that warrant a deeper analysis (see Figure 4 showing selected Gini profiles for 51 consecutive releases of the Spring framework). But why do we see such a remarkable stability of Gini coefficients?

Figure 4. Selected Gini profiles in Spring.

Developers accumulate system competence over time. Proven techniques to solve a given problem prevail, where untested or weak practices have little chance of survival. If a team has historically built software in a certain way, then it will continue to prefer a certain approach over others. Moreover, we can expect that most problems in a given domain are similar, hence the means taken to tackle them would be similar, too. Tversky and Kahneman coined the term “decision frame” [27] to refer to this principle in which decision-makers proactively organize their solutions within well-established and strong boundaries defined by cultural environment and personal preferences. These boundaries


Comparative Analysis of Evolving Software Systems Using the Gini Coefficient

Rajesh Vasa, Markus Lumpe, Philip Branch
Faculty of Information & Communication Technologies
Swinburne University of Technology
P.O. Box 218, Hawthorn, VIC 3122, AUSTRALIA
{rvasa,mlumpe,pbranch}@swin.edu.au

Oscar Nierstrasz
Institute of Computer Science
University of Bern
Bern, CH-3012, SWITZERLAND
[email protected]

Abstract

Software metrics offer us the promise of distilling useful information from vast amounts of software in order to track development progress, to gain insights into the nature of the software, and to identify potential problems. Unfortunately, however, many software metrics exhibit highly skewed, non-Gaussian distributions. As a consequence, usual ways of interpreting these metrics (for example, in terms of “average” values) can be highly misleading. Many metrics, it turns out, are distributed like wealth, with high concentrations of values in selected locations. We propose to analyze software metrics using the Gini coefficient, a higher-order statistic widely used in economics to study the distribution of wealth. Our approach allows us not only to observe changes in software systems efficiently, but also to assess project risks and monitor the development process itself. We apply the Gini coefficient to numerous metrics over a range of software projects, and we show that many metrics not only display remarkably high Gini values, but that these values are remarkably consistent as a project evolves over time.

1. Introduction

What is the inherent nature of software? Do software systems form “perfect” societies with an equal distribution of responsibilities, or are they polarized, where some parts have to shoulder most of the load and others are just simple service providers? These are questions of more than passing interest. By understanding what typical and successful software evolution looks like, we can identify anomalous situations and perhaps take action earlier than might otherwise be possible. However, we are only beginning to understand how change and distribution of functionality affect evolving software systems [23, 29, 30].

A standard technique [8, 15, 16] to answer these questions is to identify a number of characterizing properties, collect corresponding software metrics, and render the obtained data into meaningful information that can assist both developers and project managers in their decision making [13, 27]. Unfortunately, software metrics data are, in general, heavily skewed [7, 12, 30], which makes precise interpretation with standard descriptive statistical analysis difficult. Summary measures like “average” or “mean” assume a Gaussian distribution to capture the central tendency in a given data set. However, when applied to non-Gaussian distributions, central tendency measures become increasingly more unreliable the greater the distance is between a given distribution and a normal distribution.

The shortcomings of central tendency measures are amplified when we wish to compare skewed distributions. Any meaningful comparison requires additional effort to fit the distributions in question to a specially-designed third model distribution [1, 26]. This transformation is not only cumbersome but also expensive and may not yield the desired result. Moreover, additional problems may arise due to changes in both the degree of concentration of individual values and the total value of a distribution. Consider, for example, the high-performance text search engine library Lucene. The median of the heavily-skewed distribution for cyclomatic complexity [19] at class level increased from 5 to 8 as new classes were added to the system. The change in the median suggests that the overall cyclomatic complexity of Lucene increased significantly. But this interpretation is incorrect. The newly added classes had actually the opposite effect. What made the median increase was the growing population size (i.e., the number of classes in the system), which resulted in a new middle value for cyclomatic complexity.

Interestingly, an approach to cope with and meaningfully interpret unevenly-distributed data sets has already been widely adopted in the field of economics. In 1912, the Italian statistician Corrado Gini proposed the so-called Gini coefficient, a single numeric value between 0 and 1, to measure the inequality in the distribution of income or

978-1-4244-4828-9/09/$25.00 2009 IEEE Proc. ICSM 2009, Edmonton, Canada


Gini  =  A/(A+B)
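The ratio A/(A+B) refers to the Lorenz curve: B is the area under the curve of cumulative shares, and A is the area between that curve and the line of perfect equality. A minimal sketch of that computation follows, assuming non-negative values and a piecewise-linear Lorenz curve; the function name and the sample data are illustrative and not taken from the cited paper.

```python
def gini(values):
    """Gini coefficient as A / (A + B) from a piecewise-linear Lorenz curve.

    values: non-negative measurements (e.g. commits per developer).
    B is the area under the Lorenz curve, A the area between the curve
    and the diagonal of perfect equality, so that A + B = 0.5.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cumulative = 0.0
    area_b = 0.0
    for x in xs:
        previous = cumulative
        cumulative += x / total
        # Trapezoid of width 1/n between two consecutive Lorenz points.
        area_b += (previous + cumulative) / (2 * n)
    area_a = 0.5 - area_b
    return area_a / (area_a + area_b)


if __name__ == "__main__":
    print(gini([5, 5, 5, 5]))    # hypothetical: activity shared equally
    print(gini([0, 0, 0, 20]))   # hypothetical: one contributor does everything
```

For a finite sample of n values the maximum is (n − 1)/n rather than exactly 1, which is worth keeping in mind when comparing projects with very different numbers of contributors.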

Page 25: Studying Evolving Software Ecosystems Inspired by Ecological Models

Evolution of diversity in open source software: Econometric indices

Commits sent
E-mails sent
Bug reports modified

Evince

- Theil index: corresponds to Shannon’s notion of entropy
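The correspondence between the Theil index and Shannon entropy can be made concrete. Using the standard definition T = (1/n) Σ (x_i/μ) ln(x_i/μ), where μ is the mean, one gets T = ln n − H, with H the Shannon entropy of the shares x_i/Σx. The sketch below assumes that definition; the slides do not state which variant of the index was used, and the function names are illustrative.

```python
import math


def theil(values):
    """Theil index T = (1/n) * sum (x_i / mean) * ln(x_i / mean).

    values: non-negative measurements; zeros contribute nothing
    (the limit of t * ln t as t goes to 0 is 0).
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    mean = total / n
    return sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n


def entropy_of_shares(values):
    """Shannon entropy (in nats) of the shares x_i / sum(x)."""
    total = sum(values)
    return -sum((x / total) * math.log(x / total) for x in values if x > 0)


if __name__ == "__main__":
    activity = [1, 2, 3, 4]  # hypothetical activity counts per contributor
    print(theil(activity))
    print(math.log(len(activity)) - entropy_of_shares(activity))
```

The two printed values agree up to rounding, so the Theil index can be read as the gap between the maximum possible entropy and the observed entropy: 0 for a perfectly even distribution, ln n when one contributor accounts for all activity.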

Evidence for the Pareto principle in Open Source Software Activity

Mathieu Goeminne and Tom Mens
Institut d’Informatique, Faculté des Sciences
Université de Mons – UMONS
Mons, Belgium
{ mathieu.goeminne | tom.mens }@umons.ac.be

Abstract: Numerous empirical studies analyse evolving open source software (OSS) projects, and try to estimate the activity and effort in these projects. Most of these studies, however, only focus on a limited set of artefacts, being source code and defect data. In our research, we extend the analysis by also taking into account mailing list information. The main goal of this article is to find evidence for the Pareto principle in this context, by studying how the activity of developers and users involved in OSS projects is distributed: it appears that most of the activity is carried out by a small group of people. Following the GQM paradigm, we provide evidence for this principle. We selected a range of metrics used in economy to measure inequality in distribution of wealth, and adapted these metrics to assess how OSS project activity is distributed. Regardless of whether we analyse version repositories, bug trackers, or mailing lists, and for all three projects we studied, it turns out that the distribution of activity is highly imbalanced.

Index Terms: software evolution, activity, software project, data mining, empirical study, open source software, GQM, Pareto

I. INTRODUCTION

Numerous empirical studies aim to understand and model how open source software (OSS) evolves over time [1]. In order to gain a deeper understanding of this evolution, it is essential to study not only the software artefacts that evolve (e.g. source code, bug reports, and so on), but also their interplay with the different project members (mainly developers and users) that communicate (e.g., via mailing lists) and collaborate in order to construct and evolve the software.

In this article, we wish to understand how activity is spread over the different members of an OSS project, and how this activity distribution evolves over time. Our hypothesis is that the distribution of activity follows the Pareto principle, in the sense that there is a small group of key persons that carry out most of the activity, regardless of the type of considered activity. To verify this hypothesis, we carry out an empirical study based on the GQM paradigm [2]. We rely on concepts borrowed from econometrics (the use of measurement in economy), and apply them to the field of OSS evolution. In particular, we apply indices that have been introduced for measuring distribution (and inequality) of wealth, and use them to measure the distribution of activity in software development.

The remainder of this paper is structured as follows. Section II explains the methodology we followed and defines the metrics that we rely upon. Section III presents the experimental setup of our empirical study that we have carried out. Section IV presents the results of our analysis of activity distribution in three OSS projects. Section V discusses the evidence we found for the Pareto principle. Section VI presents related work, and Section VII concludes.

II. METHODOLOGY

A. GQM paradigm

To gain a deeper understanding of how OSS projects evolve, we follow the well-known Goal-Question-Metric (GQM) paradigm. Our main research Goal is to understand how activity is distributed over the different stakeholders (developers and users) involved in OSS projects. Once we have gained deeper insight in this issue, we will be able to exploit it to provide dedicated tool support to the OSS community, e.g., by helping newcomers to understand how the community is structured, by improving the way in which the community members communicate and collaborate, by trying to reduce the potential risk of the so-called bus factor¹, and so on.

To reach the aforementioned research goal, we raise the following research Questions:

1) Is there a core group of OSS project members (developers and/or users) that are significantly more active than the other members?

2) How does the distribution of activity within an OSS community evolve over time?

3) Is there an overlap between the different types of activity (e.g., committing, mailing, submitting and changing bug reports) the community members contribute to?

4) How does the distribution of activity vary across different OSS projects?

As a third step, we need to select appropriate Metrics that will enable us to provide a satisfactory answer to each of the above research questions. For our empirical study, we will make use of basic metrics to compute the activity of OSS project members, and aggregate metrics that allow us to compare these basic metric values across members (to understand how activity is distributed), over time (to understand how they

¹ The bus factor refers to the total number of key persons (involved in the project) that would, if they were to be hit by a bus, lead the project into serious problems

Brasero

Page 26: Studying Evolving Software Ecosystems Inspired by Ecological Models

Software Ecosystem Biodiversity

• Uses notion of biodiversity to measure developer activity focus and module activity focus
  Cf. bipartite author-module graph

• Based on notion of relative entropy

• More details: see results of hackathon.
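Relative entropy (Kullback-Leibler divergence) compares one distribution against a reference distribution. As a rough sketch of how it can express focus, the code below compares each developer's share of activity per module with the overall share of activity per module. This is an assumption-laden illustration: the commit counts and names are hypothetical, and it does not reproduce the normalised measures of developer and module focus defined in the cited paper.

```python
import math


def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in nats.

    p, q: discrete distributions over the same items.
    Returns math.inf if p puts mass where q has none.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    divergence = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i == 0:
                return math.inf
            divergence += p_i * math.log(p_i / q_i)
    return divergence


if __name__ == "__main__":
    # Hypothetical commits per module (three modules) for two developers.
    commits = {"dev_a": [9, 1, 0], "dev_b": [3, 3, 4]}
    per_module = [sum(column) for column in zip(*commits.values())]
    overall = normalise(per_module)
    for developer, counts in commits.items():
        print(developer, relative_entropy(normalise(counts), overall))
```

In this toy example the developer who concentrates on one module diverges more from the overall activity profile than the one who spreads work across all modules. Transposing the matrix gives the dual, module-centred view.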

Dual Ecological Measures of Focus in Software Development

Daryl Posnett†, Raissa D’Souza, Premkumar Devanbu†, and Vladimir Filkov†
University of California Davis, USA

†{dpposnett,ptdevanbu,vfilkov}@ucdavis.edu,[email protected]

Abstract: Work practices vary among software developers. Some are highly focused on a few artifacts; others make wide-ranging contributions. Similarly, some artifacts are mostly authored, or “owned”, by one or few developers; others have very wide ownership. Focus and ownership are related but different phenomena, both with strong effect on software quality. Prior studies have mostly targeted ownership; the measures of ownership used have generally been based on either simple counts, information-theoretic views of ownership, or social-network views of contribution patterns. We argue for a more general conceptual view that unifies developer focus and artifact ownership. We analogize the developer-artifact contribution network to a predator-prey food web, and draw upon ideas from ecology to produce a novel, and conceptually unified view of measuring focus and ownership. These measures relate to both cross-entropy and Kullback-Leibler divergence, and simultaneously provide two normalized measures of focus from both the developer and artifact perspectives. We argue that these measures are theoretically well-founded, and yield novel predictive, conceptual, and actionable value in software projects. We find that more focused developers introduce fewer defects than defocused developers. In contrast, files that receive narrowly focused activity are more likely to contain defects than other files.

I. INTRODUCTION

Developers are the lifeblood of open source software, OSS,and their contributions are vital for OSS to thrive. Ratherthan being assigned tasks by management, OSS developers aregenerally free to choose the style, focus, and breadth of theircontributions. Some might be quite focused, working on onespecific subsystem; others may contribute to many differentsubsystems. An device driver expert, for example, may con-tribute very specialized knowledge to an open source project,focusing on only a few files or packages. His contributions to asmall subset of modules1 may be his only contribution duringhis tenure with the project. In contrast, a project leader maywork on a variety of different tasks touching many moduleswithin a project. While OSS developers are free to choosetheir contribution styles, such choices are not inconsequential,especially to the central issue of software quality.

A dominant theme emerging from previous work in this area is module ownership [1], [2], [3]. Low ownership of a module, i.e., too many contributors, can adversely impact code quality. There is, however, an entirely different perspective, developer’s attention focus, which is relatively unexplored. Human attention and cognition are finite resources [4]. When different tasks are simultaneously engaged, they can compete for mental resources and task performance can suffer [5]. A developer engaged in many different tasks carries a greater cognitive burden than a more focused developer. Interestingly, the developer and module perspectives are conceptually symmetric, dualistic views of focus. From a module’s perspective, strong ownership indicates a strong focused contribution. We refer to this as module activity focus, or MAF, a measure of how focused the activities are on a module. Symmetrically, we refer to the developer’s attention focus, or DAF, a measure of how focused the activities of a particular developer are.

¹ We use modules to mean either packages or files, depending on the context.

A surprising but natural analogy for MAF and DAF is the predator-prey food web from ecology. In a sense, modules are predators that “feed upon” the cognitive resources of developers. As the number of developers contributing to a module increases, the diversity of cognitive resources upon which the module “feeds” also increases; likewise, a developer is a “prey” whose limited cognitive resources are spread over the modules that “prey” upon her.

Ecosystem diversity is of great interest to ecologists. Williams and Martinez call the roles complexity and diversity play “[o]ne of the most important and least settled questions in ecology” [6]. This diversity has two symmetric perspectives: a prey’s perspective and a predator’s perspective. Ecologists have developed sophisticated symmetric measures of predator-prey relationships, drawing upon ideas such as entropy and Kullback-Leibler divergence, that simultaneously capture both perspectives. We adapt these measures for software engineering projects into the metrics MAF and DAF.

In this work, we employ the methodology presented by El Emam to validate our measures [7]. In particular, we show that the DAF and MAF measures succeed in distinguishing important cases that extant measures don’t capture. We make the following contributions:

• We adapt terminology and motivation from ecology, based on bipartite graphs;

• We incorporate and generalize previous results on developer and artifact diversity;

• We provide easy to compute measures of focus, MAF and DAF, normalized to facilitate comparison within and across projects;

• We show these measures more precisely capture outcomes relevant to software researchers and practitioners.

This novel analysis simultaneously considers focus both from the artifact perspective and the author perspective. Researchers can use our MAF and DAF metrics to more

978-1-4673-3074-9/13/$31.00 © 2013 IEEE

ICSE 2013, San Francisco, CA, USA

Page 27: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems

Gnome case study

Page 28: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems

• How do software projects evolve?
– Analogy to “gene transfer” in reticulate evolution

• Transfer of knowledge
– Sharing or migration of contributors across projects

• Transfer of code
– Copy-paste reuse and code cloning
– Branching and merging of code repositories

Page 29: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems: Gnome case study

[Figure: LOC (y-axis) against number of files (x-axis) per programming language: C, Java, Objective C, Python, Lisp, JS, ASP.Net, C/C++ Header, C++, Perl, yacc, C#, IDL, Haskell, Objective C++, lex, Assembly, Visual Basic, PHP, Ruby, Tcl/Tk]

• 16 years of activity

• > 1400 projects

• > 5800 contributors (> 4300 coders)

• > 1.3M commits (> 0.6M code commits)

• > 12M file touches (> 6M code file touches)

Page 30: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems: Gnome case study

Hierarchical clustering of projects

Developers tend to collaborate more if they use the same programming language

Page 31: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems: Gnome case study

Page 32: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems: Gnome case study

[Figure: number of joiners (y-axis, 0 to 35) over time (x-axis, 1997 to 2013) for three Gnome projects: Evolution, Gimp, GTK+]

28 Tom Mens, Maelick Claes, Philippe Grosjean and Alexander Serebrenik

project that were not active in this project during the preceding 6-month period, but that were involved in some activity in other GNOME projects instead. Global joiners are incoming coders in the considered project that were not active in any of the GNOME projects during the preceding period. A similar definition holds for the local and global leavers. Formally, the metrics are defined as follows. Let p be a GNOME project, t a 6-month activity period (and t − 1 the previous period), c a coder, Gnome the set of GNOME’s code projects, and isDev(c, t, p) a predicate which is true if and only if c made a code commit in p during t:

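\[
\mathit{localLeavers}(p,t) = \{\, c \mid \mathit{isDev}(c,t-1,p) \land \lnot\mathit{isDev}(c,t,p) \land \exists p_2\,(p_2 \in \mathit{Gnome} \land \mathit{isDev}(c,t,p_2)) \,\}
\]

\[
\mathit{globalLeavers}(p,t) = \{\, c \mid \mathit{isDev}(c,t-1,p) \land \forall p_2\,(p_2 \in \mathit{Gnome} \Rightarrow \lnot\mathit{isDev}(c,t,p_2)) \,\}
\]

\[
\mathit{localJoiners}(p,t) = \{\, c \mid \mathit{isDev}(c,t,p) \land \lnot\mathit{isDev}(c,t-1,p) \land \exists p_2\,(p_2 \in \mathit{Gnome} \land \mathit{isDev}(c,t-1,p_2)) \,\}
\]

\[
\mathit{globalJoiners}(p,t) = \{\, c \mid \mathit{isDev}(c,t,p) \land \forall p_2\,(p_2 \in \mathit{Gnome} \Rightarrow \lnot\mathit{isDev}(c,t-1,p_2)) \,\}
\]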
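The four set definitions above can be sketched with plain set operations. This is only an illustrative sketch, not the authors' tooling: the input format (a list of `(coder, period, project)` code commits), the use of consecutive integers for 6-month periods, and the function names `build_index`, `active_anywhere` and `migration_metrics` are all assumptions made for the example. It also assumes that the project p itself belongs to the set of Gnome projects.

```python
from collections import defaultdict


def build_index(commits):
    """Map (period, project) to the set of coders with a code commit there.

    `commits` is an iterable of (coder, period, project) tuples (assumed format).
    A coder c is in index[(t, p)] exactly when isDev(c, t, p) holds.
    """
    index = defaultdict(set)
    for coder, period, project in commits:
        index[(period, project)].add(coder)
    return index


def active_anywhere(index, period, projects):
    """Coders with a code commit in at least one of `projects` during `period`."""
    return set().union(*(index.get((period, q), set()) for q in projects))


def migration_metrics(index, projects, p, t):
    """Local/global leavers and joiners of project p for period t."""
    prev = index.get((t - 1, p), set())   # isDev(c, t-1, p)
    curr = index.get((t, p), set())       # isDev(c, t, p)
    active_now = active_anywhere(index, t, projects)
    active_before = active_anywhere(index, t - 1, projects)
    return {
        "localLeavers": (prev - curr) & active_now,
        "globalLeavers": prev - active_now,
        "localJoiners": (curr - prev) & active_before,
        "globalJoiners": curr - active_before,
    }


# Toy data (hypothetical coders; periods 1 and 2 are consecutive).
commits = [
    ("ann", 1, "gtk+"), ("ann", 2, "gimp"),  # moves from gtk+ to gimp
    ("bob", 1, "gtk+"),                      # leaves Gnome altogether
    ("cat", 2, "gimp"),                      # arrives from outside Gnome
    ("dan", 1, "gimp"), ("dan", 2, "gimp"),  # stays in gimp
]
projects = {"gtk+", "gimp"}
index = build_index(commits)
gimp = migration_metrics(index, projects, "gimp", 2)
gtk = migration_metrics(index, projects, "gtk+", 2)
```

In this toy data, ann is a local joiner of gimp and a local leaver of gtk+, cat is a global joiner of gimp, and bob is a global leaver of gtk+.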

[Figure: three panels (evolution, gtk+, gimp); x-axis: Time (1997 to 2013); y-axis: Joiners (0 to 30)]

Fig. 1.11 Historical evolution (timeline) of the number of local (black solid) and global (red dashed) joiners (y-axis) for three GNOME projects.

We did not find any general trend; the patterns of intake and loss of coders are highly project-specific. Figure 1.11 illustrates the evolution of the number of local and global joiners for some of the more important GNOME projects (the figures for leavers are very similar). For some projects (e.g., evolution) we do not observe a big difference between the number of local and global joiners. These projects seem to attract new developers both from within and outside of GNOME. Other projects, like gimp, attract most of their incoming developers from outside GNOME. A third category of projects attracts most of its incoming developers from other GNOME projects. This is the case for gtk+, glib and libgnome, which can be considered as belonging to the core of GNOME. This observation seems to suggest that libraries, toolkits and auxiliary projects attract more inside developers, while projects that are well-known to the outside world (such as GIMP, a popular

Timeline (6-month intervals) of joiners to Gnome projects

- Black = local joiners from other Gnome projects
- Red = global joiners from outside of Gnome
- Blue = stayers

Page 33: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems: Gnome case study

[Figure: number of leavers (y-axis, 0 to 35) over time (x-axis, 1997 to 2013) for three Gnome projects: Evolution, Gimp, GTK+]

- Black = local joiners from other Gnome projects
- Red = global joiners from outside of Gnome
- Blue = stayers


Timeline (6-month intervals) of leavers from Gnome projects

Page 34: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems: Gnome case study

[Figure: number of leavers (y-axis, 0 to 35) over time (x-axis, 1997 to 2013) for three Gnome projects: Evolution, Gimp, GTK+]

CF(p) = collaboration factor for Gnome project p
= percentage of coders in p having contributed to other Gnome projects

CF(Gimp) = 65.3% (low collaboration)

CF(GTK+) = 94.8% (very high collaboration)

CF(Evolution) = 85.1% (high collaboration)
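A minimal sketch of how the collaboration factor defined above could be computed. The input format (a dictionary from project name to its set of coders) and the function name `collaboration_factor` are assumptions for illustration; the slide does not say whether "having contributed" refers to code commits only or to any activity, so the sketch simply takes whatever coder sets it is given. The toy data below is invented and does not reproduce the percentages on the slide.

```python
def collaboration_factor(coders_by_project, p):
    """Percentage of coders of project p who also appear in another project.

    `coders_by_project` maps each project name to its set of coders
    (assumed format). Returns 0.0 for a project without coders.
    """
    coders = coders_by_project[p]
    if not coders:
        return 0.0
    # Union of the coder sets of all projects other than p.
    elsewhere = set().union(
        *(cs for q, cs in coders_by_project.items() if q != p)
    )
    return 100.0 * len(coders & elsewhere) / len(coders)


# Toy data with hypothetical coder identifiers.
coders_by_project = {
    "gimp": {"a", "b", "c", "d"},
    "gtk+": {"a", "e"},
    "evolution": {"b", "e", "f"},
}
cf_gimp = collaboration_factor(coders_by_project, "gimp")
cf_gtk = collaboration_factor(coders_by_project, "gtk+")
cf_evolution = collaboration_factor(coders_by_project, "evolution")
```

Here two of the four gimp coders also work elsewhere, both gtk+ coders do, and two of the three evolution coders do.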

Page 35: Studying Evolving Software Ecosystems Inspired by Ecological Models

Migration in software ecosystems: Gnome case study

Page 36: Studying Evolving Software Ecosystems Inspired by Ecological Models

Some references

To appear in 2013 in Springer’s Empirical Software Engineering journal – manuscript No. (will be inserted by the editor)

On the variation and specialisation of workload – A case study of the Gnome ecosystem community

Bogdan Vasilescu · Alexander Serebrenik · Mathieu Goeminne · Tom Mens

DOI: 10.1007/s10664-013-9244-1

Abstract Most empirical studies of open source software repositories focus on the analysis of isolated projects, or restrict themselves to the study of the relationships between technical artifacts. In contrast, we have carried out a case study that focuses on the actual contributors to software ecosystems, being collections of software projects that are maintained by the same community. To this aim, we defined a new series of workload and involvement metrics, as well as a novel approach, T̃-graphs, for reporting the results of comparing multiple distributions. We used these techniques to statistically study how workload and involvement of ecosystem contributors vary across projects and across activity types, and we explored to what extent projects and contributors specialise in particular activity types. Using Gnome as a case study we observed that, next to coding, the activities of localization, development documentation and building are prevalent throughout the ecosystem. We also observed notable differences between frequent and occasional contributors in terms of the activity types they are involved in and the number of projects they contribute to. Occasional contributors and contributors that are involved in many different projects tend to be more involved in the localization activity, while frequent contributors tend to be more involved in the coding activity in a limited number of projects.

Keywords open source · software ecosystem · metrics · developer community ·case study

B. Vasilescu and A. Serebrenik
MDSE, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands
Tel.: +31-40-2473595 Fax: +31-40-2475404
E-mail: {b.n.vasilescu | a.serebrenik}@tue.nl

M. Goeminne and T. Mens
COMPLEXYS Research Institute, Université de Mons, Place du Parc 20, 7000 Mons, Belgium
Tel.: +32-65-373453 Fax: +32-65-373459
E-mail: {mathieu.goeminne | tom.mens}@umons.ac.be

Chapter 10
Studying Evolving Software Ecosystems based on Ecological Models

Tom Mens, Maelick Claes, Philippe Grosjean and Alexander Serebrenik

Research on software evolution is very active, but evolutionary principles, models and theories that properly explain why and how software systems evolve over time are still lacking. Similarly, more empirical research is needed to understand how different software projects co-exist and co-evolve, and how contributors collaborate within their encompassing software ecosystem.

In this chapter, we explore the differences and analogies between natural ecosystems and biological evolution on the one hand, and software ecosystems and software evolution on the other hand. The aim is to learn from research in ecology to advance the understanding of evolving software ecosystems. Ultimately, we wish to use such knowledge to derive diagnostic tools aiming to analyse and optimise the fitness of software projects in their environment, and to help software project communities in managing their projects better.

Tom Mens and Maelick Claes and Philippe Grosjean
COMPLEXYS Research Institute, University of Mons, Belgium
e-mail: tom.mens,maelick.claes,[email protected]

Alexander Serebrenik
Eindhoven University of Technology, The Netherlands
e-mail: [email protected]

This work has been partially supported by F.R.S-F.N.R.S. research grant BSS-2012/V 6/5/015 (author’s stay at the Université de Mons, supported by the F.R.S-F.N.R.S. under the grant BSS-2012/V 6/5/015) and ARC research project AUWB-12/17-UMONS-3, “Ecological Studies of Open Source Software Ecosystems”, financed by the Ministère de la Communauté française - Direction générale de l’Enseignement non obligatoire et de la Recherche scientifique, Belgium.

UMONS
Faculté des Sciences
Département d’Informatique

Understanding the Evolution of Socio-technical Aspects in Open Source Ecosystems: An Empirical Analysis of GNOME

Mathieu Goeminne

A dissertation submitted in fulfillment of the requirements of the degree of Docteur en Sciences

Advisor
Dr. TOM MENS, Université de Mons, Belgium

Jury
Dr. XAVIER BLANC, Université de Bordeaux 1, France
Dr. VÉRONIQUE BRUYÈRE, Université de Mons, Belgium
Dr. JESUS M. GONZALEZ-BARAHONA, Universidad Rey Juan Carlos, Spain
Dr. TOM MENS, Université de Mons, Belgium
Dr. ALEXANDER SEREBRENIK, Technische Universiteit Eindhoven, The Netherlands
Dr. JEF WIJSEN, Université de Mons, Belgium

June 2013

A historical dataset for GNOME contributors

Mathieu Goeminne, Maelick Claes and Tom Mens

Software Engineering Lab, COMPLEXYS research institute, UMONS, Belgium

Abstract: We present a dataset of the open source software ecosystem GNOME from a social point of view. We have collected historical data about the contributors to all GNOME projects stored on git.gnome.org, taking into account the problem of identity matching, and associating different activity types to the contributors. This type of information is very useful to complement the traditional, source-code related information one can obtain by mining and analyzing the actual source code. The dataset can be obtained at https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge.

I. INTRODUCTION

The historical and empirical study of open source software (OSS) ecosystems is a relatively recent but fast-growing research domain. An important characteristic of such ecosystems, at least according to our definition [15], is the fact that they are made up of a set of software projects sharing a community of users and contributors. A well-known example is GNOME. Its constituent software projects are designed to work together in order to constitute a complete software desktop environment. The GNOME projects are developed by a developer community that is spread across the world. We have observed that it is not uncommon for a contributor to be actively involved in many projects at a time [16]. In addition to this, the type of activity a contributor is involved in may change from one person to another. For example, a very important activity involves internationalization (localization and translation), which is globally managed via the web application Damned Lies¹ for all GNOME translation teams.

Many tools and datasets have been proposed to analyse a software project’s history, but few are available at the level of the ecosystem because of the additional level of difficulty involved. It does not suffice to simply consider the union of all project histories belonging to the same ecosystem. Because some projects may have contributors in common, and some contributors may be involved in different projects over time, this information needs to be explicitly represented at the ecosystem level. The same is true for the types of activity of an ecosystem’s contributor, and how this varies over time, and over the different projects he is involved in.

¹ http://l10n.gnome.org

In this paper, we present the process we have used to create a dataset containing the historical information related to contributors to the GNOME ecosystem. Our database and the tools and scripts used to create it can be found on a dedicated Bitbucket repository².

In contrast to many other datasets, we do not focus on source code, since a significant number of files committed to GNOME’s project repositories do not even contain code (e.g., image files, web pages, documentation, localization and many more). This type of information is often ignored in MSR research while it is very relevant to understand which types of activities contributors are involved in. For GNOME we observed, for example, that a significant fraction of the community is working on internationalization instead of code [16].

II. MOTIVATION

The creation of a historical dataset for analysing contributors to the GNOME ecosystem was motivated in large part by the many OSS repository mining studies that have used GNOME as a case study [2], [13]. In 2009 and 2010, GNOME was part of the MSR Mining Challenge, which led to many contributions [1], [5], [8], [9], [11], [12], [14].

Of specific interest, in the context of software ecosystem research, are the social interactions in the community of contributors. Following a holistic approach, [7] estimated effort and studied developer co-operation and co-ordination in GNOME, based on the version control repositories and mailing lists. Similarly, [4] developed an advanced measure of individual developer contribution based on the source code repository, mailing lists and bug tracking systems, and applied the measure to a number of GNOME projects. [6] studied six GNOME projects in order to understand how contributors join, socialize and develop within GNOME. [10] studied relations between the GNOME contributors by means of social network analysis.

In our own previous work [15], [16] we used the dataset presented in this article to statistically analyse the specialization of workload and involvement of GNOME contributors across projects and activity types, and we

² https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge

@  MSR  2013

[to  appear  in  2014]