99
IDEal : E�icient and Precise Alias-aware Dataflow Analysis
JOHANNES SPATH, Fraunhofer IEM
KARIM ALI, University of Alberta
ERIC BODDEN, Heinz Nixdorf Institut, Universitat Paderborn and Fraunhofer IEM
Program analyses frequently track objects throughout a program, which requires reasoning about aliases.
Most data�ow analysis frameworks, however, delegate the task of handling aliases to the analysis clients,
which causes a number of problems. For instance, custom-made extensions for alias analysis are complex
and cannot easily be reused. On the other hand, due to the complex interfaces involved, o�-the-shelf alias
analyses are hard to integrate precisely into clients. Lastly, for precision many clients require strong updates,
and alias abstractions supporting strong updates are o�en relatively ine�cient.
In this paper, we present IDEal
, an alias-aware extension to the framework for Interprocedural Distributive
Environment (IDE) problems. IDEal
relieves static-analysis authors completely of the burden of handling
aliases by automatically resolving alias queries on-demand, both e�ciently and precisely. IDEal
supports
a highly precise analysis using strong updates by resorting to an on-demand, �ow-sensitive, and context-
sensitive all-alias analysis. Yet, it achieves previously unseen e�ciency by propagating aliases individually,
creating highly reusable per-pointer summaries.
We empirically evaluate IDEal
by comparing TSf
, a state-of-the-art typestate analysis, to TSal
, an IDEal
-
based typestate analysis. Our experiments show that the individual propagation of aliases within IDEal
enables TSal
to propagate 10.4× fewer data�ow facts and analyze 10.3× fewer methods when compared to
TSf
. On the DaCapo benchmark suite, TSal
is able to e�ciently compute precise results.
CCS Concepts: •So�ware and its engineering→ So�ware defect analysis;
Additional Key Words and Phrases: static analysis, data�ow, aliasing
ACM Reference format:
Johannes Spath, Karim Ali, and Eric Bodden. 2017. IDEal
: E�cient and Precise Alias-aware Data�ow Analysis.
PACM Progr. Lang. 1, OOPSLA, Article 99 (October 2017), 28 pages.
DOI: 10.1145/3133923
1 INTRODUCTION
Tracking object states helps static analyses derive information about the quality and security of
a given so�ware. For example, typestate analysis (Alur et al. 2005; Fink et al. 2008; Naeem and
Lhotak 2008; Udrea and Lumezanu 2006; Whaley et al. 2002; Yahav and Ramalingam 2004) detects
programming errors that drive an object into an undesirable state. Shape analysis (Ghiya and
Hendren 1996; Sagiv et al. 1999) proves program and data-structure invariants, particularly by
reasoning about links between objects that are allocated on the heap. �is information is useful for
�nding so�ware bugs such as memory leaks (Dor et al. 2000) or detecting security vulnerabilities
by tracking the propagation of taints (i.e., private data) (Arzt et al. 2014).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for pro�t or commercial advantage and that copies bear this notice and
the full citation on the �rst page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permi�ed. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior speci�c permission and/or a fee. Request permissions from [email protected].
© 2017 ACM. 2475-1421/2017/10-ART99 $
DOI: 10.1145/3133923
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:2 Johannes Spath, Karim Ali, and Eric Bodden
1 File a = new File();
2 b = a;
3 b.open();
4 a.close();
Fig. 1. An example that depicts the importance of handling aliases in a static dataflow analysis.
Programs can access and manipulate instantiated objects through di�erent pointers, some of
which may not be visible in the current analysis scope. To compute correct results, a static analysis
must therefore properly handle aliasing. For example, the code snippet in Figure 1 allocates a File
object (line 1) and assigns it to the variable a. It then copies a to b (line 2), causing a and b to alias.
�e code then calls open on b (line 3) and close on a (line 4). If a typestate analysis that checks for
unclosed �les does not properly handle aliases, it will imprecisely report that b might not be closed
by the end of the program. A precise typestate analysis should instead report that b is closed as
soon as close is called on a, because b and a are aliases to the same File object. On the other hand,
aliasing is also a requirement for sound static analyses. �at is, the analysis shall not miss data�ows
that may occur at runtime. A key challenge of static analysis is to design analyses that are both
precise and sound at the same time, and, ideally, implemented in an e�cient manner. We identify
aliasing as a major component of the challenge and recognized a number of key de�ciencies in the
way that most static analyses currently handle aliasing.
De�ciency 1: Custom-made extensions for alias analysis are complex and cannot easily be reused. �e
implementation of static analyses are greatly simpli�ed by the use of data�ow analysis frameworks.
However, using such frameworks does not simplify the handling of aliasing: Virtually all existing
data�ow analysis frameworks (Padhye and Khedker 2013; Reps et al. 1995; Sagiv et al. 1996) o�oad
the burden of handling aliases to the client analysis, e.g., typestate, shape, or taint analysis, forcing
the designers of those static-analysis clients to consider, design, and implement their own aliasing
solutions. A common and intuitive solution is to embed the alias relationships into the client
analysis’ data�ow functions. However, this solution makes the alias analysis heavily intertwined
with the client analysis, preventing it from being reused for other clients.
De�ciency 2: O�-the-shelf alias analyses are hard to integrate well into client analyses. Because
of the problems mentioned above, static-analysis designers instead o�en resort to the use of an
o�-the-shelf alias analysis. �is option comes with its own set of problems though. In particular, it is
highly non-trivial to integrate an o�-the-shelf analysis in such a way that the resulting analysis as a
whole is maximally precise and e�cient (Spath et al. 2016). For optimal precision and performance,
client and alias analyses should integrate in a �ow-sensitive, context-sensitive, and demand-driven
manner such that aliases are only computed when necessary and only for calling contexts and
program locations in which they ma�er to the client analysis.
De�ciency 3: Current precise alias abstractions, if supporting strong updates, are relatively ine�cient.
When designing and implementing an analysis, the key to good performance and precision is the
choice of an appropriate heap model (Kanvar and Khedker 2016). To make a client analysis precise,
it typically requires so-called strong updates, which discard relationships that no longer hold at the
current program location. Performing a strong update requires must-alias information about all
aliases of a given pointer dereference (Lhotak and Chung 2011). In Figure 1, for instance, to infer
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:3
that the call to close (line 4) closes not just a but also b, the analysis must know that both variables
must-alias. To track must-alias information, instead of tracking pointers individually, current
static analyses (Fink et al. 2008; Naeem and Lhotak 2011) track entire sets of pointers through the
program. Unfortunately, the resulting powerset-abstraction is relatively ine�cient, causing current
must-alias analyses to scale badly.
To overcome these de�ciencies, we present IDEal
, an alias-aware extension to the existing
data�ow analysis framework Interprocedural Distributive Environment (IDE) (Sagiv et al. 1996). IDEal
transparently extends IDE with a reusable, precise, and e�cient heap model that includes handling
of aliasing in an optimal integrated manner. To optimize for precision, IDEal
supports strong
updates as well as a �ow-sensitive and context-sensitive analysis. To optimize for performance,
IDEal
computes alias relationships in a demand-driven, client-context-dependent manner. Further,
to the best of our knowledge, IDEal
is the �rst framework that resolves must-alias relationships
without resorting to expensive-to-compute powerset abstractions. With IDEal
, we show that it
is indeed possible to compute must-aliasing correctly even when e�ciently tracking pointers
individually, not as sets of pointers.
IDEal
can be useful wherever objects or their �elds’ contents must be tracked, in particular to
implement any kind of �eld and �ow-sensitive analysis. We speci�cally showcase IDEal
’s practical
relevance, its precision and soundness by instantiating an IDEal
-based typestate analysis (TSal
)
and empirically evaluate it in comparison to TSf, a highly precise and e�cient state-of-the-art
typestate analysis by Fink et al. (2008). Our experiments show that the heap model in IDEal
reduces
the amount of relevant data�ow facts to be propagated by a factor of 10.4×. In line with that, TSal
analyzes 10.3× fewer methods compared to TSf. Despite IDE
alcomputing aliases on-demand, and
TSf
relying on a pre-computed whole-program pointer analysis, TSal
is as e�cient and precise as
TSf
even in the worst case, where it is applied to large benchmarks as a whole.
In a case study we discuss IDEal
as the underlying framework of an analysis that detects
incorrect usages of the Java Cryptographic Architecture (JCA) library. �e JCA is a frequently
used cryptographic library that ships with any standard Java installation. �e analysis instantiates
IDEal
as both a taint and a typestate analysis. In experiments on 200 Android applications, the
IDEal
-based analysis terminates within two minutes per application on average, showing that
IDEal
can be used to implement also feature-rich static analyses e�ciently. Our results further
con�rm earlier studies showing that few applications use cryptography correctly (Egele et al. 2013).
In summary, this paper makes the following contributions:
● We present IDEal
, an extension to the IDE framework that resolves aliases automatically,
e�ciently, and precisely.
● We provide a practical solution to perform sound strong updates, while propagating aliases
individually.
● We discuss our implementation of an IDEal
-based typestate analysis, and experimentally
evaluate its precision and recall compared to the state of the art, and assess its performance
at runtime.
● We present a full implementation of a client analysis, a crypto usage-checker, which uses
IDEal
both for taint and typestate analysis.
2 BACKGROUND
�is section presents background information about the main building blocks of IDEal
: the IDE
framework, its be�er-known historical predecessor IFDS, and the alias analysis Boomerang.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:4 Johannes Spath, Karim Ali, and Eric Bodden
2.1 The Original IFDS Algorithm
�e algorithm for solving Interprocedural Finite Distributive Subset (IFDS) problems (Reps et al.
1995) is an e�cient �xed-point algorithm that can be used to de�ne a �ow- and context-sensitive
data�ow analysis. IFDS has three inputs, a �nite data�ow-domainD, a supergraph (an interprocedural
control-�ow graph), and �ow functions f ∶ S × D → P(D) that transform a single data�ow fact
d ∈ D before a statement s ∈ S to a set of facts that hold a�er the statement. In IFDS, �ow functions
are of four di�erent types. Normal-�ow functions specify the transformation of data�ow facts at
non-call statements. At call sites, the call-to-return-�ow function propagates data�ow facts at the
side of the caller. �e call-�ow function maps data�ow facts from the caller’s scope to those of
the potential callees. �e return-�ow function maps data�ow facts at exit points of a callee to the
successor statements of the original call site. Internally, IFDS transforms the data�ow analysis
into a reachability problem over an exploded supergraph ESG. A node in ESG comprises a data�ow
fact d ∈ D and its related statement s . �roughout the paper, we use ⟨s,d⟩ to refer to a node with
statement s and data�ow fact d in ESG.
Distributive �ow functions are key to the e�ciency of IFDS. For any two data�ow sets A,B ⊆ Dand �ow function f , f (A∪ B) = f (A) ∪ f (B) must hold. �is property makes it sound and precise
to propagate facts d ∈ D individually. Non-distributive frameworks instead must always propagate
entire �ow setsA ⊆ D. IFDS is particularly e�cient because distributivity enables storing point-wise
procedure summaries, one for each abstract input to a function, that can be reused as soon as the
matching individual fact is seen again at another call site.
2.2 The Original IDE Algorithm
Sagiv et al. (1996) extended IFDS to Interprocedural Distributive Environments (IDE). In addition to
the ESG of IFDS, IDE computes environments, functions env ∶D → L, where L is a bounded-height
la�ice, and D is the �nite data�ow-domain of the extended IFDS instance. �e environments
encode mappings of data�ow elements of D to values in the la�ice L. IDE expects environment
transformers for each statement to compute the environments. �e transformers are functions
t ∶Env(D,L)→ Env(D,L), where Env(D,L) is the set of all environments. Similar to �ow functions,
the environment transformers describe the e�ect of a statement on the la�ice value for a particular
data�ow fact. IDE requires those environment transformers to be distributive: (t(⊓ienvi))(d) =⊓i(t(envi))(d) for any d ∈ D and env1,env2, ... ∈ Env(D,L). �is property allows an environment
transformer to be split and represented by edge functions, functions of the form f ∶L → L that
are assigned to the edges of the ESG. Evaluation of the edge functions than computes the same
environment as the environment transformer. Using of identity function as edge functions on every
edge of the ESG, IDE can be used to solve IFDS problems.
Similar to IFDS, IDE is a �xed-point algorithm. During the ESG construction, the corresponding
edge functions are composed, met and propagated, once the construction of the ESG is done, the
resulting edge functions are evaluated to yield the �nal la�ice values associated with each node of
the ESG, i.e. the environment. �e la�er process is known as Phase 2 of IDE.
Figure 2 shows an example that uses IDE to perform linear constant propagation (Sagiv et al.
1996). �e �gure depicts the ESG that IDE generates during its �xed-point iteration. For this
example, D is the set of local variables and the graph represents the �ow functions on D as straight
edges. Each data�ow fact d (the local variable) is shown at the top of the column where the node is
drawn. Nodes are placed between two statements, because each node represents a fact that holds
a�er and before a statement. To uniquely identify a node, we refer to the statement that is before it.
Constant propagation starts at assignments of constant integers to variables, here at line 6. �e
assignment v = u (line 7) transfers the value of u to v. �e data�ow fact v that holds before the
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:5
5 foo(){
6 int u = 0;
7 int v = u;
8 int w = bar(v);
9 }
u v w
0
+1
10 bar(int a){
11 b = a + 1;
12 return b;
13 }
Flow
Summarized Flow
IDE Edge FunctionFUNC
a b
+1
Fig. 2. Linear constant propagation modeled in IDE.
call to bar (line 8) �ows into callee bar. Upon termination of the analysis of bar, IDE stores the
summary information that a �ows to a and b. �is summary is then applied at the call site to bar.
In the �gure, this is highlighted by the squiggled edges labeled as summarized �ows. To complete
the linear constant propagation, IDE adds edge functions to compute the �nal environment. �e
environment of a linear constant propagation associates to each variable a set of integer values. To
the statement u = 0, IDE assigns the constant edge function λv .0, here denoted just by 0. Within
bar, the �ow from a to b at the statement b = a + 1 (line 11) receives the edge function λv .v + 1,
denoted by +1. �is edge function simply increases every incoming la�ice value by one. IDE
then promotes the edge function +1 to a summarized �ow of bar. �e summary of bar states that
value of variable a �ows to b and additionally increases the la�ice value by one. Eventually, IDE
composes the edge functions along each �ow, making use of the summaries where appropriate. In
the example, the linear constant propagation computes the environment that maps the variables u,
v and a to the value 0, and b and w to the value 1.
2.3 Boomerang: An All-Aliases Analysis
In any language involving pointer variables, data�ow can occur also indirectly, through reads and
writes from/to aliased memory locations. In Java this involves �eld reads and writes. To be sound
and precise, static analyses must handle such aliasing, i.e., they should resolve which pointers may
point to which memory locations. In this paper, we show that one can o�oad the entire tracking
of aliases from the IDEal
framework using an all-aliases analysis. To resolve aliases internally,
IDEal
builds on top of Boomerang (Spath et al. 2016), a demand-driven, context-sensitive and
�ow-sensitive pointer analysis. In contrast to traditional points-to and alias analyses (Sridharan
et al. 2005; Yan et al. 2011), for a given variable v , Boomerang determines the points-to set of v ,
and e�ciently computes all variables in the current scope that may point to the objects that this
points-to set refers to. Boomerang thus computes all aliases of v . Boomerang �rst computes a
backward pass to determine the points-to set of v , followed by a forward pass to determine all
aliasing variables (in the same scope as v , optionally �ltered by some calling context).
3 OVERVIEW OF IDEAL
We now provide an overview of IDEal
before explaining the internal details of the framework in
Section 4. �e work�ow of IDEal
consists of two consecutive phases: the object-�ow propagation
(Phase OF) and the value-�ow propagation (Phase VF). Both phases execute an instance of IDE, they
only di�er in their propagated edge functions. �e propagations of both phases start at the same
seed, a client-de�ned source node of the ESG. �e seed is a node ⟨s,p⟩ of an abstract pointer p at a
statement s .
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:6 Johannes Spath, Karim Ali, and Eric Bodden
3.1 Phase OF: Object-Flow Propagation
�e purpose of Phase OF is to follow the �ow of an object through the program. Instead of statically
abstracting an object by its allocation site, IDEal
uses access graphs (Ge�en et al. 2014; Khedker
et al. 2007). An access graph represents a local variable that is potentially followed by a regular
expression of �eld accesses (e.g., v.f+.g). IDEal
uses the fully quali�ed �eld names to distinguish
�elds with the same names but declared in di�erent classes. �e data-�ow domain D of IDEal
is
�xed to the set of all access graphs, referred to by A.
Given an access graph p ∈ A from the seed statement s , IDEal
derives all other access graphs qat any control-�ow successor t of s such that q at t is an alias to p at s . In other terms, the access
graph q may access the same object as p is pointing to. Tracking this �ow is challenging, because a
statement can introduce aliasing pointers directly and indirectly. A�er an assignment statement
of the form x = y, the pointers x and y alias directly. However, at �eld-write and call statements,
indirectly aliased access graphs may be introduced. To detect those indirectly aliased pointers,
Phase OF keeps track of points of aliasing (POAs), each of which executes a pointer query. From
the result of the pointer queries, IDEal
derives and propagates indirect pointer �ows. �e pointer
query is required to compute all aliases of the pointer p at the given statement s , which IDEal
uses Boomerang to compute. Phase OF uses IDE and models �ows of directly aliased pointers
using standard pointer-�ow functions (Section 4.1) that operate on A. We modi�ed the original
IDE algorithm to keep track of POAs that model indirectly aliased �ows automatically. As edge
functions, this phase �xes all functions to be the identity edge function.1
Due to this fact, Phase 2
of IDE is not executed in Phase OF.
By construction of our modi�ed IDE algorithm, all nodes ⟨t ,q⟩ that are propagated may point-to
the same object as the seed node ⟨s,p⟩. We refer to the collection of all those nodes as the object-
�ow graph (OFG). �e edges of the OFG originate from (1) the standard-�ow functions and (2)
the indirect �ows at the POAs. �e OFG is the ESG of our modi�ed IDE problem with the �xed
standard �ow functions. In this work, we only use the OFG for visualization purposes but it is
never constructed explicitly by IDEal
. A�er the �xed-point of IDE has been found, Phase OF is
complete and IDEal
executes Phase VF.
3.2 Phase VF: Value-Flow Propagation
�e goal of Phase VF is to compute IDE environments. I.e. associate a value of a client-de�ned
�nite-height la�ice to each of the nodes of the OFG. As each node of the OFG is a pair of an access
graph and a statement, Phase VF computes the la�ice value associated to the access graph at the
statement. For Phase VF, IDEal
triggers a second round of IDE, this time also Phase 2 of IDE is
executed. Phase 2 of IDE remains unchanged in IDEal
. For Phase VF, IDEal
uses the same standard
pointer �ow functions and the POAs that are used in Phase OF. In addition to the seed, Phase VF
requires the user to supply distributive environment transformers in the form of IDE edge functions
that describe the transformations of the la�ice values along the interprocedural path(s) of the OFG.
Note, the environment transformers can only be represented as edge functions, if the environment
transformers are distributive. �is holds in general for IDE and must also be guaranteed when
instantiating an analysis in IDEal
.
Clients that have already been shown to obey the distributivity property and therefore are
instantiable in IDEal
are linear constant propagation (Sagiv et al. 1996), more precise data�ows
in the presence of correlated calls (Rapoport et al. 2015) or the analysis of so�ware product
lines (Bodden et al. 2013).
1Phase OF simpli�es to an IFDS problem. For the sake of the presentation, we present it as an IDE problem.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:7
I O C
INIT
OPEN CLOSE
CLOSE
Fig. 3. The finite state machine for File objects.
During Phase VF, all POAs are known and their queries are computed. �e results of the pointer
queries at each POA determines whether or not a bypassing pointer’s value is strongly updated.
Even though both IDEal
phases perform an IDE �xed-point computation that is only distinguished
by the edge functions, IDEal
has to compute the IDE �xed-point twice. A necessity when one wants
to perform strong updates but propagate aliased pointers separately. We discuss this in Section 3.3,
where we also elaborate on a typestate analysis example instantiated in IDEal
.
3.3 Typestate Analysis in IDEal
In this section we show how to instantiate an IDEal
-based typestate analysis by de�ning a concrete
la�ice L and the respective environment transformers for a typestate analysis.
3.3.1 Instantiation. Typestate properties can be encoded in �nite state machines (FSM). �e
FSM for an object of type File is drawn in Figure 3. At the end of the lifetime of a File object, its
typestate must be in an accepting state C or I. Formally, a FSM for an object of type T has the form
A ∶= (Σ,S, s0,δ , F). Σ is the set of methods that may be invoked on an object of typeT changing the
state of the object, S the set of all possible states, s0 is the initial state, δ is the transition function
δ ∶S × Σ → S and F is the set of accepting states of the �nite state machine. �e la�ice for the
typestate analysis in IDEal
is the la�ice LA ∶= (P(S),∪), the powerset of the set of states S ordered
by set union. Concretely, we map each node of the OFG to a set of states of the FSM.
�e data�ow domain D in IDEal
is �xed to the set of all access graphs A and the environment
transformers concretizes to t ∶Env(A,LA)→ Env(A,LA). We are now able to provide the environ-
ment transformers and show their distributivity. �e state of an object of typeT may only change at
a call site that invokes a methodm ∈ Σ on a receiver of typeT . For any other statement, the environ-
ment transformer is identity. Assume the base variable on which the call is invoked to be a. For such
a call a.m, the environment transformer ta .m has the form λenv .env[a ↦ {δ(s,m) ∣ s ∈ env(a)}]. 2
�at is, the set of states env(a) associated with the access graph a before the statement is replaced
by the set of states that is constructed by performing the transitionm on each of the states within
env(a). Any la�ice value associated with any other access graph than a is maintained.
We now show that the environment transformer for each statement is distributive. Assume
env1,env2, ... ∈ Env(A,LA) are concrete environments and d ∈ A. Without loss of generality,
one can restrict the environment transformer to be of the form ta .m and d to be the access graph
representing the local variable a. In all other cases, the transformers are the identity functions that
obey distributivity. �en it holds that:
⊓i(ta .m(envi))(d) = ⊓i{δ(s,m) ∣ s ∈ envi(d)} = ∪i{δ(s,m) ∣ s ∈ envi(d)}
= {δ(s,m) ∣ s ∈ ∪ienvi(d)} = {δ(s,m) ∣ s ∈ ⊓ienvi(d)} = (ta .m(⊓ienvi))(d)
�is proves that the environment transformers are distributive and can be represented by edge
functions f ∶ LA → LA on top of the OFG edges. Hence, in the remaining of the paper we will refer
to the edge functions instead of the environment transformers. It is important to mention that the
2�e notation λenv .env[a ↦ S] means, that the la�ice value associated to the access graph a is the set of states S and
any other access graph maintains its values.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:8 Johannes Spath, Karim Ali, and Eric Bodden
14 foo(){
15 File a = new File();
16 b = a;
17 b.open();
18 a.close();
19 }
Direct Flow
Strongly Updated Flow
Indirect Flow
Point of Aliasing
IDE Functions
P
FUNC
OFG
a b
P
P
CLOSE 15↦ {a, b}
15↦ {a, b}
Fig. 4. An example illustrating how an IDEal -based typestate analysis checks for unclosed files.
environment transformers for the typestate analysis change only the typestate of the access graph
on which the call is invoked, but do not change the typestate of any other aliasing access graph
that refers to the same object. �is su�ces because IDEal
associates the resulting la�ice values
with all those aliasing access graphs automatically, by solving the appropriate POAs.
3.3.2 Example. Figure 4 shows an example of applying an IDEal
-based typestate analysis that
checks for unclosed �les within the program, i.e. the FSM drawn in Figure 3. �e typestate analysis
starts o� at the seed node ⟨17, b⟩, a potential source of a typestate violation as the File object may
remain open. Based on the seed, IDEal
starts the propagation in Phase OF and generates all the
nodes associated with b in the OFG. During this propagation, IDEal
registers ⟨17, b⟩ as the �rst POA.
IDEal
then issues an all-aliases query to Boomerang for b at line 17, yielding {15↦ {a, b}}. �is
result means that b at line 17 points to the object allocated at line 15, and that {a, b} are the only
pointers in scope at line 17 that point to this object. In other words, a is an alias to b. �erefore,
IDEal
adds the indirect-�ow edge from b to a to the OFG. �is newly created node unveils a new
part of the OFG that contains the two nodes associated with a.
IDEal
then discovers another POA ⟨18, a⟩ on the return from the call3
to close (line 18). �e
results of the pointer query is the same as before and the second indirect-�ow edge from ⟨18, a⟩ to
⟨18, b⟩ is added to the graph. Since ⟨18, b⟩ has already been propagated, IDEal
does not discover
any new nodes in the OFG, and it concludes the computations in Phase OF. In the �nal OFG, the
node ⟨18, a⟩ encodes that, a�er statement 18, a may point to the same object that b points to at
statement 17. IDEal
now executes Phase VF to propagate user-speci�c values from a la�ice L along
the �nal OFG.
IDEal
expects an initial la�ice value that is associated with the seed node for Phase VF. In the
example in Figure 4, we assign the singleton set with the state {O} of the FSM with the node ⟨17, b⟩.�e object referred to by b is in an open state at this point. �e typestate analysis assigns the edge
function f18 = CLOSE to the OFG edge from ⟨17, a⟩ to ⟨18, a⟩. �is edge function holds the CLOSE
transitions {I→ C, O→ C} of the FSM in Figure 3. Eventually, IDEal
composes and evaluates the
edge functions along all possible interprocedural paths of the OFG. �e initial la�ice element {O}at the seed node �ows via the indirect-�ow edge from b to a into the function f18. Applying f18
to the input {O} yields the state {C}. �e la�ice value {C} then �ows via the indirect-�ow edge
back to b, and IDEal
infers that b may be closed. To determine that the object will in fact be closed,
IDEal
must prevent the opened la�ice element represented by {O} from bypassing the call to close.
3For simplicity, we do not show the code of close() here, but IDE
alwill analyze it.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:9
Since IDEal
can infer that a and b must-alias, the �ow from ⟨17, b⟩ to ⟨18, b⟩ that bypasses the call
to a.close() is removed in Phase VF. �e killed �ow is shown by the green dashed edge in the
OFG of Figure 4. We say the �ow is strongly updated. �e strong update prevents the propagation
of the stale information that b remains opened. We will discuss how IDEal
determines must-alias
relations in Section 4.3. A�er the call a.close(), the la�ice value associated with ⟨17, b⟩ and ⟨18, a⟩is the singleton set {C}. Since {C} is an accepting state of the FSM, both a and b point to objects
that are eventually closed.
IDE is a chaotic �xed-point iteration, and the order in which nodes of the ESG are created is
non-deterministic. In Phase VF of IDEal
, �ows are killed depending on the existence of other nodes
in the graph. In the example, the value on variable b that bypasses the call a.close() is strongly
updated only if the POA ⟨18, b⟩ has been detected. �erefore, the iteration order is relevant for
IDEal
, which explains our design decision to execute two consecutive IDE phases.
4 FRAMEWORK DESIGN
In this section, we explain the details of the main algorithm in IDEal
. We de�ne the standard �ow
functions in IDEal
, and describe how we handle points of aliasing, perform sound strong updates,
and support context-sensitive alias queries.
As explained in the previous section, IDEal
uses a modi�ed version of IDE that keeps track
of POAs. To achieve that, IDEal
intercepts the IDE �xed-point iteration to detect indirect �ows
of pointers at points of aliasing, which are special nodes of the OFG. Depending on the node,
additional indirect �ow edges are created and injected into the worklist of IDE.
We base our description on the pseudo-code provided for the standard IDE algorithm by Sagiv et al.
(1996, p. 147). �e IDE algorithm maintains and extends path edges that describe the (summarized)
intraprocedural realizable �ows within the OFG. We write a path edge as d1 → ⟨s,d2⟩, the data�ow
fact d1 is the intraprocedural data�ow element that the OFG node ⟨s,d2⟩ originates from and is
called the source fact.
�e �xed-point iteration of IDE is controlled by a function called Propagate (Sagiv et al. 1996).
Figure 5 depicts the Propagate function and highlights the required changes for IDEal
. Due to
space restriction, we omit the full IDE algorithm here. In addition to the main worklist called
PathWorkList, IDE stores jump functions in the map JumpFn. �e jump functions map each path
edge to its edge function that describes how a la�ice value is transformed along the path edge. IDE
relies on the jump functions to decide if its �xed-point has been reached.
Within Propagate, we introduce two changes. First, a node of the OFG can be a point of aliasing.
In such a case, the necessary aliases are computed by a pointer analysis. �e call to computeAliases
(line 3) abstracts this construction, because the concrete aliases depend on the type of POA. For each
alias d3 of d2, a path edge with the same source fact d1 is created and propagated along with the
jump function f that reached node ⟨s,d2⟩. �e second change (lines 8-10) a�ects the propagation
during Phase VF. Some of the points of aliasing introduce strong updates. If a data�ow fact is
strongly updated, i.e., it receives the value of another aliased pointer, the �ow is killed and the
subsequent propagation is prevented.
Strong updates can only be performed in Phase VF, once all necessary alias queries have already
been performed (in Phase OF) and their results are known. �is enables IDEal
to propagate aliasing
pointers in a distributive manner, while maintaining strong updates.
4.1 Standard Flow Functions
IDEal
uses a set of �xed standard �ow functions to model the direct pointer �ow at statements.
�e interprocedural-�ow functions are straightforward. For each access graph whose base variable
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:10 Johannes Spath, Karim Ali, and Eric Bodden
1: procedure Propagate(d1 → ⟨s,d2⟩ ,f )
2: if isPointOfAliasing(⟨s,d2⟩) then
3: aliases = computeAliases(d1 → ⟨s,d2⟩)
4: for d3 ∈ aliases ∧ d3 ≠ d2 do
5: Propagate(d1 → ⟨s,d3⟩,f ) ▷ Indirect �ows through aliases
6: end for
7: end if
8: if Phase VF ∧ receivesStrongUpdate(d1 → ⟨s,d2⟩) then
9: return ▷ Strong update in Phase VF kills �ow.
10: end if
11: f ′ = f ⊓ JumpFn(d1 → ⟨s,d2⟩)
12: if f ′ ≠ JumpFn(d1 → ⟨s,d2⟩) then
13: JumpFn(d1 → ⟨s,d2⟩) = f ′
14: PathWorkList.add(d1 → ⟨s,d2⟩)
15: end if
16: end procedure
Fig. 5. The required changes in IDE for IDEal .
⟦x ← new⟧(⟪v, sEt⟫) =
⎧⎪⎪⎨⎪⎪⎩
∅ if v = x
{⟪v, sEt⟫}(1)
⟦x ← y⟧(⟪v, sEt⟫) =
⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩
{⟪x , sEt⟫,⟪v, sEt⟫} if v = y
∅ if v = x
{⟪v, sEt⟫}
(2)
⟦x ← y. f ⟧(⟪v, sEt⟫) =
⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩
{⟪v, sEt⟫} ∪ {⟪x ,G⟫ ∣G ∈ tail(sEt)} if v = y ∧ s = f
∅ if v = x
{⟪v, sEt⟫}
(3)
⟦x . f ← y⟧(⟪v, sEt⟫) =
⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩
{⟪v, sEt⟫} ∪ ⟪x , cons(f , sEt)⟫ if v = y
∅ if v = x ∧ s = f
{⟪v, sEt⟫}
(4)
Fig. 6. The intraprocedural normal-flow functions of IDEal .
matches an actual parameter of the call, including the call receiver, the call-�ow function simply
replaces the base by its corresponding formal parameter. Similarly, at return sites, the return-�ow
function maps the base of an access graph back to the corresponding actual argument, while also
replacing the returned variable with the assigned variable (if applicable).
IDEal
operates on a three-address code, an intermediate representation where statements contain
at most one �eld dereference. Table 1 lists the statements that IDEal
handles, and Figure 6 provides
the de�nitions of the corresponding intraprocedural normal-�ow functions. Each function ⟦w⟧(α)maps an access graph α at statement w to a set of access graphs that hold a�er statement w . We
represent an access graph as ⟪x , sEt⟫, where the base x is a local variable, and sEt is a �eld graph.
A �eld graph is a directed graph whose nodes are �elds and is uniquely identi�ed by the edge set Eand two special nodes, the �rst access s and the last access t .
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:11
Table 1. Three-addressed code that IDEal handles.
Statement Notation
Allocation site x ← new
Assign statement x ← y
Field read statement x ← y.f
Field write statement x.f ← y
If statement if(x ≠ null) goto l
Call site m(p)
Equation (1) in Figure 6 shows the �ow function for an allocation site x ← new . IDEal
kills each
access graph of the form ⟪x , ⋅⟫, due to the re-assignment of the base x .
Equation (2) describes the �ow of access graphs for a local-assignment statement x ← y. For
access graphs of the form ⟪y, sEt⟫, IDEal
generates the aliasing access graph ⟪x , sEt⟫, which is the
result of assigning y to x . Similar to allocation sites, IDEal
kills access graphs of the form ⟪x , ⋅⟫.
IDEal
preserves the �ow for all other cases.
Equation (3) de�nes the �ow function for a �eld-read statement x ← y. f . �e �rst case handles
access graphs of the form ⟪y, f Et⟫, where the edge set E and �eld t are arbitrary. For each edge
(f ,д) ∈ E, for some �eld д, IDEal
uses the tail operation to compute the �eld graph дEt , where
E = E ∖ {(f ,д)}, generating the access graph ⟪x ,дEt⟫. IDEal
kills access graphs of the form ⟪x , ⋅⟫,
as shown in the second case of the equation. For all other cases, IDEal
maintains the access graph
⟪y, f Et⟫.
Equation (4) handles a �eld-write statement x . f ← y. IDEal
calls cons(f , sEt) to update each
access graph ⟪y, sEt⟫ that reaches the �eld-write statement by making the �eld f its head. Formally,
the cons operation computes the �eld graph f Et , where E = E ∪{(f , s)}. IDEal
then constructs and
propagates the new access graph ⟪x , cons(f , sEt)⟫, as well as ⟪y, sEt⟫. At the �eld-write statement,
IDEal
kills any access graph ⟪x , f Et⟫ with an arbitrary edge set E and a last �eld access t , because
y overwrites the �eld f of x .
�e described standard pointer �ow functions are used as defaults in IDEal
, but they can be
adjusted by the client. For example, a taint analysis may model �ows through string concatenations.
4.2 Points of Aliasing
IDEal
resolves points of aliasing by issuing alias queries to Boomerang (Spath et al. 2016) and uses
the result: all aliasing access graphs for a given pointer dereference. We denote a points-to query
that IDEal
issues to Boomerang by BR(s,α), and it comprises a statement s and an access graph α .
�e result of the query is a set of pointer elements. A pointer element has the form (t ,At ), where tis an allocation site and At is the set of all access graphs β for which t ∈ points-to(β) holds at s . �e
access graph α , for which the query is issued, is an element of the set At . We write β ∈ BR(s,α)when α and β may alias at s . �at means, there exists a pointer element (t ,At ) such that β ∈ At .
We now discuss which nodes of the OFG are points of aliasing and the aliases they compute. A
POA can either be [Write], [Return], or [Null], and depending on this type, the computed aliases
di�er. In terms of the algorithm of Figure 5, we describe the result of the function computeAliases.
In the following, let us assume d1 → ⟨s,d2⟩ to be the path edge that is currently propagated, i.e., the
argument to computeAliases.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:12 Johannes Spath, Karim Ali, and Eric Bodden
20 foo(){
21 A a = new A();
22 A b = a;
23 bar(a);
24 File d = b.h;
25 d.close();
26 }
27 bar(A c){
28 File f = new File();
29 c.h = f;
30 f.open();
31 }
OFG
a.h b.h d
R
R
CLOSE
28↦ {a.h, b.h, d}
28↦ {a.h, b.h}
Direct Flow
Strongly Updated Flow
Indirect Flow
[Return] POA
IDE Functions
R
FUNC
OFG
f c.h
R
28↦ {c.h, f}
Fig. 7. An example illustrating how IDEal resolves multiple [Return] POAs and performs sound strongupdates.
4.2.1 [Write]. A [Write] is a node ⟨s,d2⟩ in the OFG, where s is a �eld-write statement x . f ← y,
and d2 is an access graph ⟪x , f Eд⟫ with arbitrary E and last access д. Each alias of x generates
an aliasing access graph. IDEal
computes those aliases by �rst issuing the query BR(s,⟪x ,∅⟫) to
compute all aliases of the access graph ⟪x ,∅⟫ before s . IDEal
then concatenates the �eld graph
f Eд to each aliasing access graph ⟪z,vFw⟫ ∈ BR(s,⟪x ,∅⟫). �is creates the new access graphs
d3 = ⟪z, conc(vFw, f Eд)⟫. �e operation conc yields the �eld graphvEд, where E = E∪F∪{(w, f )}.
�e set of aliases that is returned by computeAliases for a [Write] is the set of all d3s. For each
d3, IDEal
then injects the new path edge d1 → ⟨s,d3⟩. When we draw the OFG, we highlight the
injection of such an edge as an indirect-�ow edge from ⟨s,d2⟩ to ⟨s,d3⟩.
4.2.2 [Return]. A [Return] is a node ⟨s,d2⟩ in the OFG, where s is a method callm(p), and d2
is an access graph ⟪p, f Eд⟫. When the callee returns the access graph to a call site, there might be
local aliasing access graphs in the caller’s scope that point-to the same abstract object. For example,
an alias to p may exist prior to the call site s , and instead of using ⟪p, f Eд⟫ to access the object of
interest, the aliased pointer may be used.
Similar to the treatment of [Write], IDEal
computes all aliases to the pointer d2 = ⟪p, f Eд⟫,
denoted by d3 ∈ BR(s,d2). IDEal
then adds indirect-�ow edges from ⟨s,d2⟩ to each node ⟨s,d3⟩
by propagating d1 → ⟨s,d3⟩. As the algorithm in Figure 5 shows, the same edge function that is
propagated to ⟨s,d2⟩ is also used for this indirect aliasing path edge with target node ⟨s,d3⟩. By
doing so, the additional indirect-�ow edges in Phase VF maintain the la�ice value �ow from d2
and propagate the value to each alias d3 a�er the call. In the case of a typestate analysis, this
value �ow transfers a possible state change on the access graph ⟪p, f Eд⟫ within the callee to all of
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:13
its aliasing access graphs within the caller. For example, in Figure 7, at the call to open (line 30),
IDEal
starts tracking the pointer f that references a File object and the pointer then �ows into the
callee open. At the return to bar, IDEal
registers a [Return] that yields [28 ↦ {c.h, f}] and adds
an indirect-�ow edge from ⟨30, c.h⟩ to ⟨30, f⟩. Since c is a parameter of bar, the access graph c.h
escapes to foo via the call site at line 23. Since c.h corresponds to a.h in foo, IDEal
creates the
node ⟨23, a.h⟩ and registers it as another [Return] that yields [28↦ {a.h, b.h}]. IDEal
then adds
the appropriate indirect-�ow edge to the OFG. Without this indirect-�ow edge, IDEal
misses that
the tracked object is loaded from b.h into to d in the next statement. A�er the call to close on d
(line 25), IDEal
registers another [Return] that yields [28↦ {a.h, b.h, d}] and adds indirect-�ow
edges to the nodes ⟨25, a.h⟩ and ⟨25, b.h⟩.
4.2.3 [Null]. A [Null] is a node in the OFG that corresponds to a comparison to null in the
program. To avoid throwing a NullPointerException in Java, it is common to check for nullness of
variables before accessing them. �ese checks lead to branches in the control �ow. IDEal
handles
[Null] to avoid imprecise propagations along the branches where IDEal
knows that the object can
never be null. We discuss this in an example in the next subsection.
4.3 Sound Strong Updates
During Phase VF, IDEal
performs sound strong updates for the client’s analysis information. To
achieve that, data�ows are killed in Phase VF, as shown by the early termination of the Propagate
that depends on the call to receivesStrongUpdate (Figure 5, lines 8-10). A created node of the
OFG receives a strong update i� there is an indirect �ow to that node from a POA and the POA’s
pointer query returned a unique pointer allocation. In such situations, the access graphs in question
can only point to a single abstract object, and thus must point to the same one. For example, in
Figure 7, IDEal
labels the initial seed node ⟨30, f⟩ with the initial la�ice element {O}, because the
tracked object is initially in an OPEN state. �is la�ice element �ows along all the OFG paths (as
they are labeled by the identity edge function) and reaches the call to close at line 25. At this call,
it is sound to perform a strong update for the aliases a.h and b.h, because [Return] at ⟨25, d⟩ yields
a single allocation site. Since a.h and b.h are not involved in the call d.close(), but abstract the
same object, IDEal
kills the call-to-return �ows in Phase VF, i.e. those path edges receive a strong
update. �is correctly prevents the O state from bypassing the call. Instead, IDEal
transfers the
typestate information from ⟨25, d⟩, via the indirect-�ow edges, to the nodes ⟨25, a.h⟩ and ⟨25, b.h⟩,
and an IDEal
-based typestate analysis will report that the File object pointed to by a.h and b.h is
eventually closed.
�e condition to perform a strong update (“singleton points-to set”) is insu�cient in cases where
the allocation site is within a loop or a recursive part of the program. Such re-entrant allocation sites
may represent multiple concrete runtime objects by a single abstract object. To ensure soundness
for these cases, IDEal
con�gures the underlying pointer analysis to treat assignments of the form x
= null as allocation sites as well. Due to the semantics of Java, a null-assignment outside the loop
must exist, and the pointer analysis will pick it up, causing IDEal
to avoid an otherwise incorrect
strong update. Figure 8 shows a code snippet with an allocation site inside a loop (line 35). At
line 36, the �le is opened by a call to a.open(). In the if branch, the �le is assigned to b. When the
branches join again, b.close() is called. When b returns from the call to close, IDEal
registers a
[Return], computing all aliases to b at line 40: [33↦ {b}, 35↦ {a, b}]. �e pointer b may point to
two di�erent abstract objects, making a strong update of the pointers to these objects unsound.
IDEal
therefore does not strongly update the call-to-return �ow to node ⟨40, a⟩. Following the �ow
path and tracking the object states, the typestate analysis computes that the object that b points
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:14 Johannes Spath, Karim Ali, and Eric Bodden
32 foo(){
33 File b = null;
34 while(...){
35 File a = new File();
36 a.open();
37 if(...){
38 b = a;
39 }
40 b.close();
41 }
42 }
Direct Flow
Not Strongly Updated Flow
Indirect Flow
[Return] POA
IDE Functions
R
FUNC
OFG
a b
R
R
CLOSE
35↦ {a}
33↦ {b}35↦ {a, b}
Fig. 8. An example illustrating that IDEal does not perform unsound strong updates in loops.
to is de�nitely closed. However, the object pointed to by a is either closed or opened. �is result
matches the possible runtime execution where b may refer to an object from an earlier iteration of
the loop and a currently points to a di�erent object that is not closed.
Strong updates are also used at [Null] to prevent �ows along invalid branches. If IDEal
creates
an OFG node ⟨s,⟪x ,∅⟫⟩ with a statement s ∶ if(x ≠ null), it registers a [Null] and issues an alias
query BR(s,⟪x ,∅⟫). If the access graph ⟪x ,∅⟫ has only one allocation site that di�ers from null,
all computed aliasing access graphs that bypass the condition check are also di�erent from null.
�erefore, IDEal
avoids the propagations of these pointers to the branch where x is null. In Phase
VF, IDEal
strongly updates the path edges with targets ⟨t , β⟩, where t is the false-successor of the
if statement and β ∈ BR(s,⟪x ,∅⟫). We showcase this behavior in the example in Section 5.
4.4 Context-Sensitive Alias�eries
While IDEal
inherits its context-sensitivity from IDE, Boomerang is a context-sensitive pointer
analysis by design. To achieve optimal precision and performance, IDEal
matches, for each alias
query, the contexts it provides to Boomerang to its own IDE contexts. IDE achieves context-
sensitivity by maintaining the source fact for each node created in the OFG, which is why IDE
propagates path edges instead of the plain OFG nodes. IDEal
consults the source fact and the
incoming relationship of IDE/IFDS (Naeem et al. 2010) to match the call site(s) once an OFG node
reaches a return site. IDEal
supplies Boomerang with the currently considered IDE contexts to
limit its alias queries to these given contexts. For example, when IDEal
issues an alias query for
an access graph that originates from some parameter to the current method, it limits the scope of
Boomerang to search for allocation sites only within the callers of the current method that the
IDE solver visited, but no other callers. Considering other callers may yield spurious aliases that are
impossible along any actual runtime execution path of the program. In the following section we
give an example of the explained context-matching.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:15
43 foo(){
44 A b = new A();
45 A c = b;
46 File a = new File();
47 a.open();
48 bar(a,b,c);
49 }
50 bar(File u, A v, A w){
51 v.f = u;
52 File x = w.f;
53 if(x != null){
54 x.close();
55 }
56 }
OFG
a b.f c.f
R
46↦ {a}
Direct Flow
Strongly Updated Flow
Indirect Flow
[Return] POA
[Write] POA
[Null] POA
IDE Functions
R
W
N
FUNC
u v.f w.f x
W
N
R
CLOSE
44↦ {v, w}
46↦ {u, v.f, w.f, x}
46↦ {u, v.f, w.f, x}
Fig. 9. A complete run of IDEal .
5 COMPLETE RUN
Figure 9 shows an example of a complete run of IDEal
. In the example, the File object that is
opened at line 47 represents the seed that the client provides. �is seed �ows as an argument
to bar (line 48) and is assigned to v.f (line 51), registering a [Write] at the node ⟨51, v.f⟩. �e
source fact of this POA is u, because it is the intraprocedural origin of the propagation. IDEal
then
uses Boomerang to compute the aliases of v at line 51, but limited to the calling contexts that
initiated the propagation. �e source fact u of the POA is a parameter of bar and during the pointer
query for variable v a calling context of bar is requested by Boomerang. IDEal
uses the incoming
relationship of IDE to limit the scope of Boomerang to only the call site 48, the call site where the
abstract pointer u entered bar. �erefore, the result of the alias query is [44↦ {v, w}]. �is �ltering
of calling contexts avoids computing aliases if there were any other callers of bar.
Due to the new aliasing pointer w, an indirect �ow to w.f occurs at the �eld write, and IDEal
adds
an indirect-�ow edge from ⟨51, v.f⟩ to ⟨51, w.f⟩ to the OFG. �e derived transitive �ows discover
two more POAs: a [Null] at node ⟨52, x⟩ and a [Return] at node ⟨54, x⟩. �ese POAs trigger two
all-alias queries for variable x. For both queries, the source fact is again u, and when Boomerang
requests more calling context, it is still the call to bar at line 48. Eventually, both queries yield the
result [46 ↦ {u, v.f, w.f, x}]. �e [Return] creates the do�ed indirect-�ow edges that originate
from ⟨54, x⟩ in Figure 9. Since in both cases, for [Return] and for [Null], the size of the points-to
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:16 Johannes Spath, Karim Ali, and Eric Bodden
set is one, IDEal
performs a strong update. All other �ows with the same target OFG nodes are
killed in Phase VF.
Finally, in Phase VF, IDEal
computes the la�ice values for all OFG nodes. Following all existing
paths in the constructed OFG, IDEal
deduces that the last action performed on each pointer is the
call to close. �erefore, the abstract object allocated at line 46, which is accessible via a, b.f, and
c.f in foo and u, v.f, w.f, and x in bar, is always closed at the end of its lifetime. �is matches the
runtime behaviour. If IDEal
did not handle [Null] POAs, it would have used the identity function
for the else branch within bar, and incorrectly deduced that the tracked File object may remain
open. However, IDEal
treats the null check precisely, inferring that the �le object cannot possibly
bypass the call to close.
6 EVALUATION
We evaluate IDEal
through a typestate client analysis. We refer to this client by TSal
and compare it
to the typestate analysis by Fink et al. (2008), referred to by TSf. TS
fis a staged analysis where less
precise stages restrict the overhead for more sophisticated analyses in later stages. All stages are
based on the IFDS framework. �e last stage of TSf
has a similar precision to TSal
, which enables
us to directly compare the IFDS propagation statistics of TSf
to that of TSal
. Another important
di�erence between the two analysis is their heap model. To group aliased pointers and track their
shared typestate, TSf
uses a powerset abstraction as their data�ow domain. Our experiments
empirically evaluate the characteristics of IDEal
by addressing the following research questions:
● RQ1: What is the e�ect of the heap models TSal
and TSf
on the generated ESG?
● RQ2: How does TSal
perform on large programs when compared to TSf
in terms of
computation time?
● RQ3: How precise are the analysis results reported by TSf
and TSal
?
● RQ4: What impact do aliasing and strong updates have on TSal
?
In addition to those research questions, this section completes with a case study that shows how
IDEal
can be used to realize an e�cient and precise analysis that detects insecure cryptographic
API usages in Android applications.
6.1 Setup
Our framework IDEal
, and hence TSal
, is based on Soot4
and relies on the IDE solver Heros5. �e
analysis TSf
is publicly available6
and is based on WALA7.
Both typestate analyses can verify typestate properties that are encoded as state machines.
Table 2 lists the typestate properties that we want to enforce. �e corresponding state machines
are available as part of TSf
analysis. TSal
maps them to the respective edge functions for IDEal
.
For all experiments, the analyses pre-compute a 0-1-CFA call graph for the given programs. For
each program, we also analyze its library dependencies, including the Java Runtime Library. Given
that TSf
is built on top of WALA and TSal
is built on top of Soot, we carefully con�gured Soot and
WALA to enable a fair comparison for both analyses. All experiments are run on a modern laptop
computer with a 2.9 GHz Intel Core i7 processor running Java 1.8.0 73. For each run, we allocated
a maximum of 6 GB of JVM heap space.
4h�ps://github.com/Sable/soot
5h�ps://github.com/Sable/heros
6h�ps://github.com/tech-srl/safe
7h�p://wala.sourceforge.net/
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:17
Table 2. Typestate properties represented in the micro benchmark by Fink et al. (2008).
Name Description
Vector Never try to retrieve an element of an empty vector.
Iterator Always call hasNext on an iterator before receiving the next element.
URL Never set options on an already connected URLConnection.
IO Do not read or write to a closed Stream or Writer.
KeyStore Always initialize a KeyStore before using it.
Signature Always follow the phases of initialization of a Signature.
Table 3. Comparing the e�iciency of TSf2, TSf
3and TSal in terms of IFDS propagations and the number of
visited methods.
ESG Nodes Visited Methods
Typestate # Programs TSf2
TSf3
TSal
TSf2
TSf3
TSal
Vector 30 3,451 2,058 580 66 49 13
Iterator 17 507 456 171 25 20 8
URL 2 8,510 8,488 115 308 287 4
IO 14 3,161 1,005 283 96 39 8
KeyStore 3 2,732 92 45 91 8 3
Signature 6 35,175 94,537 356 1,130 1,112 8
�e default con�guration of TSf
performs three analysis stages. During our initial experiments,
we discovered that the �rst stage does not report any typestate violation, preventing any computa-
tion of subsequent stages. We consulted with the authors of TSf
who con�rmed this behaviour and
were unable to �x the problem. �erefore, to compute meaningful results for TSf, we deactivated
the �rst stage. �e remaining two stages are the Unique Veri�er (TSf2) and the APFocus (TS
f3) (Fink
et al. 2008) that we base our evaluation on.
6.2 RQ1: Heap Models
In the heap model of TSf3, each data�ow element consists of (1) the allocation site of the tracked
instance, (2) a set of must-aliased pointers to that allocation site, (3) a completeness �ag indicating
whether the set of must-aliased pointers is complete, and (4) the state the object is currently in.
To inspect the di�erences between the heap models of TSf
and TSal
, we ran both analyses on
a set of micro benchmarks that consists of 72 sample programs. �ese programs ship with the
implementation of TSf. �e micro benchmarks contain typestate violations with aliasing and strong
update scenarios that challenge typestate analysis. Hence, it is a good baseline for the comparison.
We have applied both TSf
and TSal
to check for the typestate properties in Table 2 in the
micro-benchmark programs. Table 3 summarizes our �ndings. In the table, the columns for ESG
nodes gather the number of OFG nodes in TSal
, respectively the number of nodes in the exploded
supergraph for TSf. Visited Methods is the number of di�erent methods those nodes belong to, i.e.
visited by the underlying solvers. All numbers are arithmetic means taken over all input programs
(column # Programs) of the micro benchmark. Table 3 shows the statistics for TSf
according to the
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:18 Johannes Spath, Karim Ali, and Eric Bodden
57 foo(){
58 A a = new A();
59 b = a;
60 B c = new B();
61 a.f = c;
62 B d = b.g;
63 bar(d);
64 }
Direct Flow
Indirect Flow
[Write] POAW
TSal
OFG
c a.f b.f
W
TSf3
⟨60,{c}, true⟩
⟨60,{c, a.f}, false⟩
⟨60,{c, a.f}, false⟩
⟨60,{c, a.f}, false⟩
Fig. 10. An example illustrating the di�erences between TSf3and TSal with respect to the structure of dataflow
facts.
two analysis stages TSf2
and TSf3. If TS
f2
proves that the program is error free, TSf3
is not invoked.
On the benchmark, in 14 out of the 72 programs TSf3
was not invoked.
Since TSf3
is the most precise stage in TSf, we only compare TS
alto this stage in the following
discussion. �e numbers for TSf2
are similar. Across all the micro-benchmark programs, TSf3
requires a geometric mean of 10.4× more nodes, and analyzes approximately 10.3× more methods
compared to TSal
.
For the typestate property KeyStore, stage TSf3
only creates 92 nodes. For two of the three
KeyStore programs, the stage TSf2
proves the absence of a typestate violation and the stage TSf3
is never executed. For the typestate properties URL and Signature, TSal
requires less than 1.4%
and 0.4% of the propagations that TSf3
requires, respectively. For URL, TSal
starts from the call
sites to the method connect of any URLConnection object and reports an error once a method that
sets an option on the object is invoked. In contrast, TSf3
starts earlier at the allocation site of the
URLConnection itself. TSf3
records aliases only during forward propagation, while TSal
gets the
automatic support from IDEal
to detect aliases before the seeds by issuing the appropriate alias
queries to Boomerang. IDEal
evaluates those queries on demand, requiring fewer propagations.
�e same reasoning applies to the typestate property Signature.
Table 3 shows that there is a big di�erence in the number of created nodes in TSf
and TSal
. �is
di�erence originates from the heap models that the underlying framework of each analysis uses.
Each element of the data�ow domain in IDEal
consists of an access graph. �e access graph’s base
is a local variable associated with a declaring method. An access graph has only to be propagated
within that declaring method. �e data�ow abstraction used by TSf
cannot make use of this
additional information. Figure 10 illustrates an example. We ignore the propagated typestate
property to simplify the example. Assume both TSal
and TSf3
track the object that is created at
line 60. A�er line 61, TSf3
propagates the abstraction ⟨60,{c, a.f}, false⟩. �e completeness �ag is set
to false because a�er the �eld write statement, the tracked object is also accessible via the pointer
b.f, which is not in the set of musted-alias pointers ({c, a.f}). Since the representation does not
explicitly store b.f, TSf3
has to assume that the tracked object could also be accessed in method
bar (called at line 63), although no appropriate pointer ever escapes to the method. �erefore, TSf3
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:19
Table 4. Analysis time for running TSal and TSf3on the DaCapo benchmark programs.
Vector IO Iterator
antlr
bloat
chart
eclipse
luindex
lusearch
pmd
antlr
bloat
chart
eclipse
luindex
lusearch
bloat
pmd
Seeds 2 6 1 10 12 1 10 1 12 4 1 1 1 14 1
Average Time
per Seed (s)
TSf3
Boomerang
TSal
5.3
18.4
7.4
2.7 5.8
5.9
16.7
4.9
11.3
2
4.2 7
30
30
3.7
27.7
29.3
3.5
4
5.2
1.3
23.3
30
2.9
2.5
2.7
20.5
5.9
15.2
1.2
Timeouts- - 2 2 - - 1 - 1 - 1 - 10 2 - - 9 8 4 3 - - - - - - 7 6 - -
propagates the data�ow fact ⟨60,{c, a.f}, false⟩ into bar, needlessly increasing the number of ESG
nodes. On the other hand, TSal
does not propagate any data�ow facts into bar, as the object of
interest cannot be accessed from variable d, the only variable that escapes to bar. �e OFG drawn
in Figure 10 shows that no node for variable d is ever created. TSal
completely skips the analysis of
bar.
When compared to TSf, the more precise data�ow facts of IDE
aland their individual
propagation of aliased pointers enables TSal
to analyze smaller, yet relevant, parts of the
program.
6.3 RQ2: Performance
In this research question, we compare the analysis time performances of TSf
and TSal
on larger,
real world programs. We setup both analyses with programs from the 2006 DaCapo benchmark
suite (Blackburn et al. 2006) and run all of the typestate properties from RQ1 on the programs. We
exclude the benchmark fop from our evaluation as it lacks some dependencies that are required to
perform static analysis. On the remaining benchmarks, we executed both analyses and compare
the results on basis of the exact same seeds. Given the di�erent analysis frameworks TSf
and TSal
depend on, the call graphs they use are not completely identical, and not exactly the same parts of
the program are reachable. As seeds, both analysis only consider allocation statements of objects of
interest that are reachable in the underlying call graph. We report results only on the set of seeds
that are consistent in both analyses. For the veri�cation of each seed, we limit the execution time
of the analysis to 30 seconds. �is is necessary as we aim for analyzing the complete program,
and in some cases, data�ow occurs through Java collections, such as HashSet or List, or massive
Visitor pa�erns that remain hard to analyze.
For a fair comparison with TSf, we con�gured TS
alto always start at allocation sites of objects.
�is neglects the full potential of IDEal
. For the typestate property IO for instance, it is su�cient to
begin the analysis only at calls to close and report an error once a subsequent read is invoked on a
FileReader. Based on the result from RQ1 and the fact that any of the preceding stages of TSf
can
also be used for TSal
, in RQ2 we limit the comparison to stage TSf3.
Table 4 reports the result of this experiment. We cannot report any results for the properties URL,
KeyStore, and Signature, because the DaCapo benchmark has no uses of KeyStore, occurrences
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:20 Johannes Spath, Karim Ali, and Eric Bodden
Table 5. Comparing the reported errors of TSf and TSal .a) Correct dataflow but no typestate violation.
True Positives False Positives False Negatives
Benchmark TSf
TSal
TSf
TSal
TS−SU
TSf
TSal
TS−al
Micro Benchmark 42 48 1 1 +3 6 0 +2
DaCapo 11a)
11a)
0 0 0 0 0 +6
of Signature, or seeds to URL. �e table shows the total number of seeds per benchmark program
per typestate property. �e table also contains a bar plot, above its bars, we depict the arithmetic
means over the analysis times per seed. �e reported times are averaged across �ve independently
repeated runs. Additionally, below the bar chart, we report the number of seeds that timed out.
Except for 5 out of the 15 analyzed combinations, TSal
is faster than TSf. TS
altends to be slower
when verifying the Vector property. On geometric average across the Vector typestate, TSal
is 1.1×
slower than TSf. On the other two typestate properties IO and Iterator, TS
alis faster by a ratio
of 1.5× and 2.6×, respectively. �e analysis times do not re�ect the results from RQ1 as the two
analyses di�er in a major point. TSf
relies on a points-to analysis that has been computed for the
call graph. On the other hand, TSal
, or rather IDEal
, uses the on-demand alias analysis Boomerang
to �nd aliases. �e time required to process the alias queries is included in the analysis time of
TSal
. Across all programs, the time spent to compute the alias queries represent a major part of the
total analysis time (55%). In the bar chart in Table 4, we highlight the overhead time from using
Boomerang in TSal
. Nevertheless, including the overhead of the demand-driven alias analysis
time, across the DaCapo benchmarks and all typestate properties, TSal
is 1.3× faster than TSf.
On the DaCapo benchmark programs, including all their library dependencies, a typestate
analysis based on our framework IDEal
is slightly more e�cient than a state-of-the-art
typestate analysis.
6.4 RQ3: Precision
�e authors of TSf
have annotated all the programs of the micro-benchmark from RQ1 with the
ground truth, stating which results a precise and sound typestate analysis should report. Based on
that ground truth, in Table 5, we show the three results: true positives, the valid expected typestate
errors; false positives, the spurious reports that do not re�ect typestate errors; and false negatives,
missed typestate errors that should have been detected.
On the micro-benchmark from RQ1, TSf
reports one false positive on the Vector typestate. For
this program, the �ow-insensitive alias analysis in TSf
computes an alias set that has a spurious
element, leading to imprecise results. In addition to that false positive, TSf
misses some �ows,
because it does not detect some of the typestate violations for the typestate properties Vector,
Iterator, and IO. Further investigation shows that TSf
incorrectly handles array accesses and �ows
from thrown exceptions to the corresponding catch clause. However, we believe that this is more
of an implementation issue in TSf
rather than a conceptual problem.
TSal
reports one false positive on the micro-benchmark, but does not miss any report from the
ground truth results. �e false positive stems from a hasNext call within a synchronized block. �e
control-�ow graph that Soot creates contains edges from each statement within the synchronized
block to a catch block outside the block. Due to those exceptional edges, the hasNext call might be
missed and an error is reported.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:21
On the DaCapo benchmark suite, we manually inspected the reported errors from the analyses
performed in RQ2 to assess precision and recall. We restricted our inspection to the seeds for which
TSal
reported errors, but also to those that �nished within the given time budget of 30 seconds. In
total, 11 typestate violations are reported. All the errors are reported for the typestate property
Vector: one is reported in eclipse, the other 10 are located in luindex. Further investigation
shows that all of the identi�ed and tracked objects may be in an error state, i.e. an element of
the Vector may be accessed before any element was added to the Vector. However, our manual
inspection unveiled that the accesses to the elements are guarded by appropriate size checks on
the Vector. �is means, at runtime, a typestate violation cannot occur. �is shortcoming is due to
the weakness of the expressiveness of the typestate pa�ern: the size checks cannot be modeled
within the typestate machine, and is not an artifact of the over-approximations of IDEal
nor its
data�ow domains. In other words, there are potentially valid data�ow connections that lead to
those reports, but they are overcome by additional information that the analysis is not designed
to track. �erefore, we classify the reported errors as true positives (with respect to the tracked
typestate pa�ern).
An IDEal
-based typestate analysis is as precise as an existing typestate analysis.
6.5 RQ4: E�ect of Aliasing and Strong Updates
In addition to comparing both analyses, we now discuss the impact of handling aliasing and strong
updates for the typestate client analysis TSal
. Based on the same setup used in the earlier research
questions, we run TSal
with two additional con�gurations. For one run, we ignore the alias queries
completely (denoted by TS−al
). In the other con�guration, typestate updates are not strongly
updated (denoted by TS−SU
). Both con�gurations allow us to compare the e�ect of aliasing and
strong updates on TSal
. In general, turning o� strong updates introduces additional false positives,
but does not impact the number of false negatives. Strong updates only kill spurious typestate
�ows. In contrast, in the con�guration TS−al
, when alias queries are disabled, false negatives are
expected as the data�ow path(s) may not be complete.
Table 5 additionally depicts the analysis when run in the con�gurations TS−al
and TS−SU
.
Column TS−SU
lists the false positives that are added when strong updated are disabled, and TS−al
lists the false negatives that are missed when aliasing is disabled. In addition to the original false
positive that TSal
reports, disabling strong updates causes TS−SU
to report three additional false
positives for the micro-benchmark. On the micro-benchmark, aliasing is only required for the
Vector programs where data�ow occurs through �elds. �erefore, two of the ground truth results
require the typestate analysis to soundly handle aliasing.
�e results of this experiment on DaCapo di�er slightly. When we disable aliasing for TSal
, more
than 50% of the errors are not reported anymore. For all of those cases, the tracked objects are
stored inside �elds of other objects and are accessed indirectly within other methods through the
�elds. Such data�ows require aliasing information about the parent object, the object that the �eld
resides within, information that is missing when aliasing is disabled. On the other hand, disabling
strong updates on DaCapo programs does not report any false positives on the inspected seeds.
While strong updates marginally improve precision on the micro-benchmarks, on the
DaCapo programs, handling aliases has a higher in�uence and is required to obtain sound
results.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:22 Johannes Spath, Karim Ali, and Eric Bodden
65 public class Encrypter{66 private SecretKey key;67 private int keyLength = 448;68
69 public Encrypter(){70 KeyGenerator keygen = KeyGenerator.getInstance("Blowfish");71 keygen.init(this.keyLength);72 this.key = keygen.generateKey();73 }74
75 public byte[] encrypt(String plainText){76 Cipher cipher = Cipher.getInstance("AES");77 cipher.init(Cipher.ENCRYPT_MODE, this.key);78 byte[] encText = cipher.doFinal(plainText.getBytes());79 return encText;80 }81 }
Fig. 11. An example of a broken cryptographic implementation.
6.6 Case Study: A CryptoAnalyzer built on top of IDEal
In the preceding research questions, we evaluated IDEal
by a qualitative comparison of an IDEal
-
based typestate analysis to the analysis TSf
implemented in WALA. We further want to demonstrate
the strength of IDEal
by applying it to another concrete client analysis, an analysis that veri�es
the correct usage of the Java Cryptographic Architecture (JCA). Cryptography is a complex topic
on its own. In combination with an ambiguous API design of the JCA, multiple researchers have
shown that cryptography is commonly implemented incorrectly (Egele et al. 2013; Nadi et al. 2016).
Improperly used cryptography frequently leads to data leaks.
Figure 11 demonstrates an example of a misuse of a cryptographic cipher that originates from
the javax.crypto.Cipher class of the JCA. �e code snippet shows a class that is supposed to be
used for encryption. When an object of type Encrypter is instantiated, the constructor generates a
SecretKey of size 448 for the Blow�sh algorithm. �en in line 72, the generated key is stored as �eld
key of the constructed instance. Calling the encrypt method creates a Cipher object in line 76, but
for the AES algorithm. �e next line initializes the algorithm’s mode to encryption and supplies the
SecretKey that is stored in the �eld. �e doFinal operation in the next line performs the encryption
of the plainText.
�ere are three issues in this code example. First, the generated key and the encryption cipher
do not match (AES vs. Blow�sh). Second, and related, the key length of 448 is not suitable for an
AES algorithm that expects a size of 128. �ird, depending on the crypto provider, AES is used
with electronic codebook mode (ECB) which results in low entropy of the bytes of the encoded
ciphertext encText. While the �rst two misuses throw runtime exceptions, and thus are likely
detected during development, the la�er silently leads to insecure code.
Using IDEal
, we implemented a static-analysis tool that detects those and similar issues for the
Java cryptographic API. On a high level, the analysis is composed of the typestate analysis TSal
,
additional queries to an extended version of Boomerang which extract String and int parameters
on-the-�y, an IDEal
-based taint analysis (i.e., edge functions are identity) and a constraint-solving
analysis. In the example code, the crypto analysis starts at the two calls to getInstance and ensures
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:23
102
103
104
105
106
Constraint
Boomerang
Taint
Typestate
Call Graph
Analysis Time (ms)
Speci�cation # Misuses
MessageDigest 139
Cipher 8
PBEKeySpec 3
PBEParameterSpec 1
Fig. 12. The distribution of the analysis times separated into the di�erent subanalyses of our crypto analysisand the classes ranked by the number of misuses.
with TSal
that the init and generateKey, respective init and doFinal methods are invoked in that
order on the seed objects referenced by the variables keygen and cipher. �e integer value 448 for
this.keyLength is extracted through a query to Boomerang.
�e call sequence on the KeyGenerator object keygen satis�es the required typestate automaton.
�e analysis further infers that an abstract object, a SecretKey for the algorithm Blow�sh with a key
length of 448 has been generated at line 72. �e object �ows (taint analysis) to the �eld this.key
and eventually into the init method of the Cipher. �e Cipher object also satis�es the typestate,
but combining the algorithm parameter of the Cipher object, namely AES and the corresponding
parameter of the SecretKey, Blow�sh, leads to a contradiction when the crypto analysis tries to
solve the encoded constraints.
In addition to the aforementioned classes, we speci�ed rules for a total of 17 classes of the JCA that
drive the crypto analysis. �ose rules include classes like KeyStore, PBEKeySpec for password-based
encryption and SecretKey, their interactions and suggested choice of parameters.
We run the analysis on a set of 200 Android applications that we downloaded from Andro-
Zoo (Allix et al. 2016). All applications within the set were distributed via the Google PlayStore
and received an update in 2016 or 2017. Out of the 200 applications, a total of 144 use classes of
the JCA. We report our results within this subset. Figure 12 summarizes the analysis results. We
divide the analysis time into the di�erent subanalyses. �e box plot shows, in logarithmic scale:
the median, the lower and upper quartile (the box) that limit 50% of the data around the median,
and the lower and upper whiskers (i.e., the minimal and maximal analysis times). �e analysis
times vary vastly across the di�erent applications because they vary greatly in size.
Using call graphs constructed by FlowDroid (Arzt et al. 2014), the number of methods that are
call-graph reachable range from 524 to 16,282, with an average of 3,715. Across the 144 applications,
our crypto analysis requires an average (arithmetic mean) of 123.8 seconds to terminate. �e box
plot shows that most of the analysis time is spent in call graph construction (on average 63.1%
of the total analysis time), followed by the typestate-analysis time. �e number of seeds for the
typestate analysis varies from 2 to 46 across all 144 applications.
Despite the fact that the chosen 144 applications have been recently updated, our analysis �nds
only 5 applications that correctly use the JCA API. �e table in Figure 12 shows that most of the
misuses are related to MessageDigest. �e class is misused in 139 applications, mainly due to using
either MD5 or SHA-1, both of which are no longer recommended for the purpose of cryptography.
We manually veri�ed the correctness of some of the �ndings reported by the analysis. In
particular, we discovered security-related typestate errors in three applications. A�er the usage
of a PBEKeySpec, the developers should call clearPassword() to overwrite the internal copy of
the password. Our analysis correctly identi�es the missing calls. �e issue that is listed for
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:24 Johannes Spath, Karim Ali, and Eric Bodden
PBEParameterSpec in the table in Figure 12 concerns the number of times a hashing algorithm is
applied to a password in a password-based encryption. �e chosen iteration count is set to 100
but is supposed to be higher (our speci�cation expects at least 1000). Additionally, we found some
so�ware-quality related typestate errors. For example, in one application, a developer consecutively
calls the method init three times with the same parameters on a retrieved Cipher object. In a
couple of other applications, we discovered that developers (unnecessarily) reset MessageDigest
objects a�er their initialization.
Our �ndings con�rm previous studies that have reported that 88% of Android applications do
not correctly make use of the JCA API (Egele et al. 2013). Our case study shows that IDEal
can be
used to implement practical, feature-rich client analyses e�ciently in a broader context.
6.7 Threats to Validity
�e evaluation we performed is restricted to a typestate analysis as it is the most related work that
we could �nd an implementation for. �ere is a variety of other clients, such as linear constant
propagation or analysis of correlated calls, that IDEal
can be used for but we have not evaluated
them yet. It is interesting to see the performance of IDEal
on other clients, such as a linear constant
propagation, but we consider it as future work. �erefore, the results give only an idea of IDEal
’s
general performance. Finally, TSf
and TSal
are based on two di�erent analysis frameworks, WALA
and Soot, that do not operate on the exact same program representations, and have di�erent
implementations for the same standard points-to and call graph analyses.
7 RELATEDWORK
In this section, we discuss existing data�ow frameworks, most of which expose the problem of
aliasing to the client analysis, as well as solutions to aliasing that client analyses may apply. For a
more extensive discussion of the state-of-the-art alias analyses for object-oriented programs, we
refer the reader to the survey by Sridharan et al. (2013).
7.1 Dataflow Analysis Frameworks
Apart from IFDS (Reps et al. 1995) and IDE (Sagiv et al. 1996), there exists a wide range of data�ow
frameworks that simplify the implementation of interprocedural static data�ow analyses. For
example, TVLA (Sagiv et al. 1999) uses abstract predicates that evaluate to a three-valued logic. In
addition to 0 (false) and 1 (true), three-valued logic maintains a third value (1/2) that represents
unknown or maybe evaluations. Using predicates enables TVLA to infer aliasing automatically.
TVLA has been extended later to support interprocedural analysis (Gotsman et al. 2006; Jeannet
et al. 2004). While TVLA propagates sets of aliasing pointers, IDEal
propagates aliasing pointers
independently, which drastically reduces the size of the analysis domain.
Separation logic de�nes another technique for data�ow analysis (Reynolds 2002). In separation
logic the heap is modeled explicitly. Each instruction directly operates on the model of the heap,
analog to an actual execution. Separation logic extends standard Hoare logic by adding a separating
conjunction. �e conjunction splits the heap into disjunct regions. At a callsite, for instance, the
heap can be divided into the region accessed within the callee and the region’s complement. With
an extension called bi-abduction, Calcagno et al. (2009) managed to design a scalable shape analysis
based on separation logic. IDEal
propagates abstract pointers (to heap cells) instead of a model
of the cells on the heap, and it is not required to split the heap. Calcagno et al. (2009) achieve
scalability by performing a compositional analysis. Similar bene�ts are expected for IDEal
by
pre-computing summaries, as described by Arzt and Bodden (2016).
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:25
Blackshear et al. (2015) propose a goal-directed approach, called Hopper, to analyze programs that
are based on event-driven systems, such as Android. �e novelty lies in jumping along control-�ow
feasible paths, once a data�ow �ows into system code, e.g. Android. Instead of analyzing the
system’s code, the �ow directly jumps to the respective point in the non-system part of the program.
Hopper relies on separation logic, i.e. explicitly models the heap. �e authors report a signi�cant
performance boost through jumping. In future work, we want to investigate how IDEal
can make
use of a similar functionality.
Ferrara (2014) proposes an abstract-interpretation-based generic framework to value-�ow anal-
ysis that includes heap-reasoning. �e authors formally prove that heap and value analyses can
be combined into one analysis, similar to IDEal
. Two types of analyses that can be instantiated
within their framework are numerical and shape analysis. Opposed to their work, IDEal
computes
context-sensitive results. �is, however, comes by the cost of restricting the value domain from a
in�nite to a �nite height la�ice, as IDEal
does not support a widening operator.
Madsen and Møller (2014) describe a sparse data�ow analysis framework for JavaScript code. A
sparse data�ow analysis operates on def-use chains that it constructs from the control-�ow graph
of a given program. �is approach is di�erent from a traditional data�ow analysis that processes
every statement in the program. Sparse data�ow analyses leverage the fact that def-use chains are
typically more sparse than the control-�ow graph, thus the analysis requires fewer propagations of
data�ow facts. IDEal
follows a similar approach by operating on the object-�ow graph that is a
sparse abstraction of the control-�ow graph.
7.2 Solutions to Aliasing
We have already compared to the typestate analysis by Fink et al. (2008) in detail and skip its
discussion here.
Yahav and Ramalingam (2004) propose a typestate analysis on top of TVLA, but the authors
report later that TVLA does not scale well to large programs (Fink et al. 2006). An interesting
contribution of their work, however, is separation, as they report a huge bene�t in separating the
typestate analysis into subproblems. We use a simple version of separation in TSal
by invoking
IDEal
per tracked object.
Naeem and Lhotak (2008) show how to perform a typestate analysis (TSn), using property
speci�cations called tracematches (Bodden et al. 2008). Tracematches are a language extension to
AspectJ8, and allow the analysis to select program points using declarative pa�erns called pointcuts.
TSn
uses a coarse-grained �eld-insensitive abstraction for the objects that are allocated on the heap.
In contrast to TSf
and TSal
, TSn
can check for pa�erns that detect buggy interactions of multiple
objects (e.g., updating a list while an iterator iterates over it). TSn
implements this by tracking
all objects that are allocated in the input program, which is a signi�cant limitation to e�ciency.
Naeem and Lhotak (2011) later overcome this limitation by generating �ow-insensitive callee and
caller summaries. �e summaries are constructed by pre-analyzing the methods where pointcuts
have no matches. TSn
then plugs in those summaries at the appropriate call sites. We plan to extend
IDEal
to use a similar approach to synchronize information about multiple interacting objects.
Tripp et al. (2013) propose Andromeda, an IFDS-based taint analysis that handles aliasing by
propagating access paths individually. Similar to IDEal
, Andromeda resolves aliases in a context-
sensitive and �ow-sensitive fashion through an on-demand backward analysis. Unlike IDEal
, due
to propagating aliases individually, Andromeda does not support strong updates. Once tainted,
Andromeda does not untaint an object if it is sanitized through an alias. FlowDroid (Arzt et al.
2014) takes a similar approach to alias resolution.
8h�ps://eclipse.org/aspectj/
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:26 Johannes Spath, Karim Ali, and Eric Bodden
8 CONCLUSION
We have presented IDEal
, an extension to the IDE framework that automatically handles aliases in
a precise and e�cient manner. One of the key features in IDEal
is performing sound strong updates,
which require must-alias information, while propagating aliases individually. �is individual
propagation of pointers enables IDEal
to reuse �ne-grained procedure summaries, improving its
e�ciency. IDEal
relieves static-analysis authors of the burden of encoding alias information in the
data�ow domain. Using IDEal
enables static-analysis authors to easily implement a wide range
of data�ow analyses that track object �ows, such as typestate analysis and property inference by
simply de�ning the IDE edge functions, while aliases are handled internally by IDEal
.
Our empirical evaluation shows that an IDEal
-based typestate analysis is as e�cient and precise
as a state-of-the-art typestate analysis, despite the overhead time of the underlying on-demand alias
analysis. IDEal
achieves that by the use of a more precise abstraction of data�ow facts that avoids
unnecessary over-approximations. �is results in analyzing only relevant parts of the program. In
our experiments, an IDEal
-based typestate analysis requires 10.4× fewer data�ow propagations
than a state-of-the-art typestate analysis. On larger programs, the fewer propagations lead to a
1.3× faster typestate analysis.
ACKNOWLEDGMENTS
�is research was supported by a Fraunhofer A�ract grant as well as the Heinz Nixdorf Foundation.
�is material is also based upon work supported by the National Sciences and Engineering Research
Council of Canada.
REFERENCES
Kevin Allix, Tegawende F. Bissyande, Jacques Klein, and Yves Le Traon. 2016. AndroZoo: collecting millions of Android
apps for the research community. In Proceedings of the 13th International Conference on Mining So�ware Repositories, MSR
2016, Austin, TX, USA, May 14-22, 2016. 468–471. DOI:h�p://dx.doi.org/10.1145/2901739.2903508
Rajeev Alur, Pavol Cerny, P. Madhusudan, and Wonhong Nam. 2005. Synthesis of interface speci�cations for Java classes.
In Symposium on Principles of Programming Languages (POPL). 98–109.
Steven Arzt and Eric Bodden. 2016. StubDroid: automatic inference of precise data-�ow summaries for the android
framework. In International Conference on So�ware Engineering (ICSE). 725–735.
Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien
Octeau, and Patrick McDaniel. 2014. FlowDroid: precise context, �ow, �eld, object-sensitive and lifecycle-aware taint
analysis for Android apps. In Programming Language Design and Implementation (PLDI). 259–269.
Stephen M. Blackburn, Robin Garner, Chris Ho�mann, Asjad M. Khan, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan,
Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony L. Hosking, Maria Jump, Han Bok Lee,
J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanovic, �omas VanDrunen, Daniel von Dincklage, and Ben Wiedermann.
2006. �e DaCapo benchmarks: Java benchmarking development and analysis. In Object-Oriented Programming Systems,
Languages and Applications (OOPSLA). 169–190.
Sam Blackshear, Bor-Yuh Evan Chang, and Manu Sridharan. 2015. Selective control-�ow abstraction via jumping. In
Object-Oriented Programming Systems, Languages and Applications (OOPSLA). 163–182.
Eric Bodden, Reehan Shaikh, and Laurie J. Hendren. 2008. Relational aspects as tracematches. In International Conference on
Aspect-Oriented So�ware Development (AOSD). 84–95.
Eric Bodden, Tarsis Toledo, Marcio Ribeiro, Claus Brabrand, Paulo Borba, and Mira Mezini. 2013. SPLLIFT
: statically
analyzing so�ware product lines in minutes instead of years. In ACM SIGPLAN Conference on Programming Language
Design and Implementation, PLDI ’13, Sea�le, WA, USA, June 16-19, 2013. 355–364. DOI:h�p://dx.doi.org/10.1145/2491956.
2491976
Cristiano Calcagno, Dino Distefano, Peter W. O’Hearn, and Hongseok Yang. 2009. Compositional shape analysis by means
of bi-abduction. In Symposium on Principles of Programming Languages (POPL). 289–300.
Nurit Dor, Michael Rodeh, and Shmuel Sagiv. 2000. Checking Cleanness in Linked Lists. In International Symposium on
Static Analysis (SAS). 115–134.
Manuel Egele, David Brumley, Yanick Fratantonio, and Christopher Kruegel. 2013. An empirical study of cryptographic
misuse in android applications. In 2013 ACM SIGSAC Conference on Computer and Communications Security, CCS’13,
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
IDEal : E�icient and Precise Alias-aware Dataflow Analysis 99:27
Berlin, Germany, November 4-8, 2013. 73–84.
Pietro Ferrara. 2014. Generic Combination of Heap and Value Analyses in Abstract Interpretation. In Veri�cation, Model
Checking, and Abstract Interpretation (VMCAI). 302–321.
Stephen J. Fink, Eran Yahav, Nurit Dor, G. Ramalingam, and Emmanuel Geay. 2006. E�ective typestate veri�cation in the
presence of aliasing. In International Symposium on So�ware Testing and Analysis (ISSTA). 133–144.
Stephen J. Fink, Eran Yahav, Nurit Dor, G. Ramalingam, and Emmanuel Geay. 2008. E�ective typestate veri�cation in the
presence of aliasing. ACM Transactions on So�ware Engineering and Methodology (TOSEM) 17, 2 (2008).
Manuel Ge�en, Hannes Sa�rich, and Peter �iemann. 2014. Precise Interprocedural Side-E�ect Analysis. In International
Colloquium on �eoretical Aspects of Computing (ICTAC). 188–205.
Rakesh Ghiya and Laurie J. Hendren. 1996. Is it a Tree, a DAG, or a Cyclic Graph? A Shape Analysis for Heap-Directed
Pointers in C. In Symposium on Principles of Programming Languages (POPL). 1–15.
Alexey Gotsman, Josh Berdine, and Byron Cook. 2006. Interprocedural Shape Analysis with Separated Heap Abstractions.
In International Symposium on Static Analysis (SAS). 240–260.
Bertrand Jeannet, Alexey Loginov, �omas W. Reps, and Shmuel Sagiv. 2004. A Relational Approach to Interprocedural
Shape Analysis. In International Symposium on Static Analysis (SAS). 246–264.
Vini Kanvar and Uday P. Khedker. 2016. Heap Abstractions for Static Analysis. ACM Computing Surveys (CSUR) 49, 2 (2016),
29:1–29:47.
Uday P. Khedker, Amitabha Sanyal, and Amey Karkare. 2007. Heap reference analysis using access graphs. ACM Transactions
on Programming Languages and Systems (TOPLAS) 30, 1 (2007).
Ondrej Lhotak and Kwok-Chiang Andrew Chung. 2011. Points-to analysis with e�cient strong updates. In Symposium on
Principles of Programming Languages (POPL). 3–16.
Magnus Madsen and Anders Møller. 2014. Sparse Data�ow Analysis with Pointers and Reachability. In International
Symposium on Static Analysis (SAS). 201–218.
Sarah Nadi, Stefan Kruger, Mira Mezini, and Eric Bodden. 2016. Jumping through hoops: why do Java developers struggle
with cryptography APIs?. In Proceedings of the 38th International Conference on So�ware Engineering, ICSE 2016, Austin,
TX, USA, May 14-22, 2016. 935–946.
Nomair A. Naeem and Ondrej Lhotak. 2008. Typestate-like analysis of multiple interacting objects. In Object-Oriented
Programming Systems, Languages and Applications (OOPSLA). 347–366.
Nomair A. Naeem and Ondrej Lhotak. 2011. Faster Alias Set Analysis Using Summaries. In Compiler Construction (CC).
82–103.
Nomair A. Naeem, Ondrej Lhotak, and Jonathan Rodriguez. 2010. Practical Extensions to the IFDS Algorithm. In Compiler
Construction (CC). 124–144.
Rohan Padhye and Uday P. Khedker. 2013. Interprocedural data �ow analysis in Soot using value contexts. In International
Workshop on State Of the Art in Java Program analysis, (SOAP). 31–36.
Marianna Rapoport, Ondrej Lhotak, and Frank Tip. 2015. Precise Data Flow Analysis in the Presence of Correlated Method
Calls. In International Symposium on Static Analysis (SAS). 54–71.
�omas W. Reps, Susan Horwitz, and Shmuel Sagiv. 1995. Precise Interprocedural Data�ow Analysis via Graph Reachability.
In Symposium on Principles of Programming Languages (POPL). 49–61.
John C. Reynolds. 2002. Separation Logic: A Logic for Shared Mutable Data Structures. In Symposium on Logic in Computer
Science (LICS). 55–74.
Shmuel Sagiv, �omas W. Reps, and Susan Horwitz. 1996. Precise Interprocedural Data�ow Analysis with Applications to
Constant Propagation. �eoretical Computer Science 167, 1&2 (1996), 131–170.
Shmuel Sagiv, �omas W. Reps, and Reinhard Wilhelm. 1999. Parametric Shape Analysis via 3-Valued Logic. In Symposium
on Principles of Programming Languages (POPL). 105–118.
Johannes Spath, Lisa Nguyen �ang Do, Karim Ali, and Eric Bodden. 2016. Boomerang: Demand-Driven Flow- and Context-
Sensitive Pointer Analysis for Java. In European Conference on Object-Oriented Programming (ECOOP). 22:1–22:26.
Manu Sridharan, Satish Chandra, Julian Dolby, Stephen J. Fink, and Eran Yahav. 2013. Alias Analysis for Object-Oriented
Programs. In Aliasing in Object-Oriented Programming. Types, Analysis and Veri�cation. 196–232.
Manu Sridharan, Denis Gopan, Lexin Shan, and Rastislav Bodık. 2005. Demand-driven points-to analysis for Java. In
Object-Oriented Programming Systems, Languages and Applications (OOPSLA). 59–76.
Omer Tripp, Marco Pistoia, Patrick Cousot, Radhia Cousot, and Salvatore Guarnieri. 2013. Andromeda: Accurate and
Scalable Security Analysis of Web Applications. In International Conference on Fundamental Approaches to So�ware
Engineering (FASE). 210–225.
Octavian Udrea and Cristian Lumezanu. 2006. Rule-Based Static Analysis of Network Protocol Implementations. In USENIX
Security Symposium. 193–208.
John Whaley, Michael C. Martin, and Monica S. Lam. 2002. Automatic extraction of object-oriented component interfaces.
In International Symposium on So�ware Testing and Analysis (ISSTA). 218–228.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.
99:28 Johannes Spath, Karim Ali, and Eric Bodden
Eran Yahav and G. Ramalingam. 2004. Verifying safety properties using separation and heterogeneous abstractions. In
Programming Language Design and Implementation (PLDI). 25–34.
Dacong Yan, Guoqing (Harry) Xu, and Atanas Rountev. 2011. Demand-driven context-sensitive alias analysis for Java. In
International Symposium on So�ware Testing and Analysis (ISSTA). 155–165.
PACM Progr. Lang., Vol. 1, No. OOPSLA, Article 99. Publication date: October 2017.