# Precise Symbolic State Merging

Why we need nondeterministic value summaries for state merging, and a time dimension to reconstruct execution traces

## Summary

State merging is a standard technique for mitigating path explosion in symbolic execution. Nondeterministic value summaries are guarded sets of values with overlapping guards. Without those, we cannot perform fully precise state merging in all situations (do-while loops, unstructured code, nondeterministic programs). Reconstructing symbolic execution traces from states merged with nondeterministic value summaries requires some notion of time or origin.

## Table of Contents

## Symbolic Execution and Path Explosion

Testing and debugging are techniques appreciated by any reasonable programmer. They help to rule out major problems in an implementation and require little specialized knowledge. After the project runs for most inputs and passes the test suite, those techniques, however, reach their limits, as they can only ever show the presence, but not the absence of bugs; in other words: It’s likely you miss some border case that can make your program crash or more generally fail.

Automatic test case generation and, quite prominently, fuzzing techniques help to cover many missed inputs without user assistance. This is still testing, though. In general, it’s still possible (and for complex programs quite likely) that some equivalence class of inputs remains untested.

This is where static analysis enters the stage. If high confidence levels into
the program under test need to be established, the program has to be analyzed
independently of concrete inputs. One such analysis technique is *Symbolic
Execution*. The principles are simple: Instead of calling a method with
concrete inputs like 17 and 42, call it with symbolic inputs like $a_0$ and
$b_0$. The consequences are more difficult: Most prominently, concrete upper
bounds for loops and recursive calls are usually missing, so that mitigation
strategies such as loop invariants and method contracts have to be used to
ensure termination of the analysis. Automatic discovery of such abstractions
is a hot research topic, but not the topic of this article.

Symbolic execution maintains Symbolic (Execution) States consisting of a *path
condition* updated by case distinctions and a *symbolic store* updated by
assignments. Starting from an empty state $(\emptyset, –)$, an assignment
statement `i = 17;`

updates this state to $(\emptyset, i\mapsto17)$. A
subsequent if statement

```
if (k>0) {
i=2*k;
} else {
i=42;
}
```

induces a case distinction (since the truth value of the symbolic guard is unknown at this point) and produces two output states: $(\{k>0\}, i\mapsto17)$ and $(\{k\leq0\}, i\mapsto17)$. Executing the respective branches updates those states to $(\{k>0\}, i\mapsto2\cdot{}k)$ and $(\{k\leq0\}, i\mapsto42)$.

## Stopping Explosion: Symbolic State Merging

From there on, these branches are traditionally followed independently,
forming a tree structure. Any code after the if statement is thus executed
twice, which leads to an exponential blow-up^{1}. A well-known approach to
avoid this explosion is *Symbolic State Merging*^{2} ^{3}, which
brings back together branches after a split. The resulting data structure
resembles the CFG of the executed code although usually, cycles are broken up
by introducing abstraction (we get a directed acyclic graph).

For example, after the if statement above, we know that i attains the value
$2\cdot{}k$ if k is strictly positive and the value 42 otherwise. Using a
conditional expression^{3} ^{4} we can directly express this as
$(\emptyset,
i\mapsto\mathit{if}\,(k>0)\,\mathit{then}\,(2\cdot{}k)\,\mathit{else}\,(42))$.
From this state on, we can continue executing any statements that come after
the conditional, avoiding executing them twice.

Using conditional expressions for state merging is an example of *fully
precise* state merging. Alternatively, one could use data abstractions as in
Abstract Interpretation^{5}. For example, since k is positive in the *then*
branch, we know that the final value of i after the if statement is also
positive. A possible, abstracted symbolic state would thus be
$(\{c>0\},i\mapsto{}c)$ for some fresh constant $c$.

## The Limits of Conditionals

The strength of symbolic execution is its high precision. Ideally, one would only introduce abstractions if really necessary (as for loop invariants). Furthermore, merging with conditional expressions works fully automatically, while abstract domains for data abstractions have to be chosen upfront.

There is however one limitation to using conditional expressions. Consider the following code with a loop:

```
do {
i--;
} while (i > 0)
```

Assume that we want to repeatedly execute the loop symbolically, maybe with the aim to discover an invariant. When first entering the do loop, we’re in the symbolic state $(\emptyset, i\mapsto{}i_0)$ (where $i_0$ is the initial value of i). After one iteration, the resulting state is $(\{i_0>0\},i\mapsto{}i_0-1)$.

Can we merge those states using conditionals?

An empty path condition is equivalent to truth (the state is reached by all paths), so the result when merging the first and the second states is

$$(\{\mathit{true}\vee{}i_0>0\},i\mapsto\mathit{if}\,(true)\,\mathit{then}\,(i_0)\,\mathit{else}\,(i_0-1)).$$

This is, though, equivalent to $(\emptyset,i\mapsto{}i_0)$! If we merge the states the “other way round”, we obtain a different result:

$$(\{\mathit{true}\vee{}i_0>0\},i\mapsto\mathit{if}\,(i_0>0)\,\mathit{then}\,(i_0-1)\,\mathit{else}\,(i_0)).$$

This cannot be what we want, and does of course not meet soundness criteria of
symbolic execution (“exhaustiveness” and “precision”^{4}). Sound state
merging is always symmetric (and associative). Indeed, the merge method using
conditional expressions may only be applied if the input states are
“distinguishable”^{4}, i.e., if only ever *one* path condition can hold
true at the same time.

## Design Options for Precise Merging

Of course, we could resort to abstraction—or use a different kind of
summation operator. *Value Summaries* have been proposed by Sen et
al.^{2} for precise symbolic state merging. A value summary is a guarded
collection of values: $i\mapsto\{(k>0\vert2\cdot{}k),(k\leq0\vert{}42)\}$.
The way Sen et al. use them suggests that they are basically thinking of
conditionals, since path conditions are distinguishable as in the example.

This restriction, however, is not necessary. The full strength of summaries
becomes apparent when relaxing them to *Nondeterministic Value Summaries*: If
more than one guard is satisfiable by an initial assignment, the summary can
attain either of the guarded values. In the extreme case, nondeterministic
value summaries are sets: $\{(true\vert17),(true\vert42)\}$ is equivalent to
$\{17,42\}$.

Applied to our loop example from above, we can summarize the initial state and the state after one iteration to

$$(\{\mathit{true}\vee{}i_0>0\},i\mapsto\{(i_0>0\vert{}i_0-1),(\mathit{true}\vert{}i_0)\})$$

or equivalently

$$(\emptyset,i\mapsto\{(i_0>0\vert{}i_0-1),i_0\})$$

using a nondeterministic value summary.

This strategy is symmetric and sound. It can be applied without restrictions, and is yet equivalent to merging with conditionals when path conditions are distinguishable. Furthermore, repeated merging yields “flatter” terms which are nicer to look at.

It is not strictly required to add value summaries as a new data type for
right-hand sides of symbolic stores; alternatively, they could be encoded
using fresh constants constrained by path conditions^{4}, similarly to
how we proceeded for data abstraction above. The downsides of this approach
are, among others, the introduction of the constants, which can be troublesome
for some constraint solvers, and the fact that path conditions no longer only
consist of decisions taken during execution, but also of “artificial”
constraints on the store.

## The Problem with the Traces and the Time Dimension

Yet, how can I call this nondeterministic summary “fully precise”? Indeed,
when regarding the resulting state in isolation, we do not know if i attains
its original value or that value decreased by one, even if we know that the
initial value is strictly positive. The notion “precise” stems from the fact
that exactly all concrete states represented by the merged inputs are still
represented by the output with the summary. So, the “problem” is a different
one: If we’re in the (symbolic) initial state where $i_0$ could be either
strictly positive or smaller or equal than 0, we cannot retrieve the exact
value of i from the summary. In other words, we cannot reconstruct the
(symbolic) execution *trace* from the summarized state.

By adding a “time dimension”, we can rescue this capability. Assume that a variable t tracks the number of times a statement has been the execution target. Adding this to the guards of the value summary yields

$$(\{t\doteq0\vee{}(t\doteq1\wedge{}i_0>0)\},i\mapsto\{(t\doteq1\vert{}i_0-1),(t\doteq0\vert{}i_0)\})$$

The resulting states are distinguishable again, such that we even could use a
conditional for merging. To reconstruct symbolic traces when using state
merging with nondeterministic value summaries, we—*generally*—need some
(abstract) notion of time. Usually, a properly used loop counter suffices.

## Implementations of Value Summaries

Value summaries (used deterministically) have been proposed^{2} and
implemented^{6} by Sen et al. The implementation does not seem to be
maintained any more. More recently, value summaries are used for fully
precise, automatic symbolic state merging by the *pluggabl* tool^{7}.
pluggabl is a symbolic execution engine for JVM bytecode which automatically
performs state merging at all merge points, aligning the analysis result to
the CFG structure.

Precise merging with conditional expressions has been implemented in the KeY
tool^{8}. State merging in KeY is not done automatically, but requires
either point-and-click interaction with the symbolic execution tree or merge
annotations in source code.

## Conclusion

Value summaries are guarded sets of values. In deterministic value summaries,
all constraints are mutually exclusive (“distinguishable”); *non*deterministic
summaries allow for overlapping guards. They are needed in certain situations,
like do-while loops, where path conditions of input states are not necessarily
distinguishable. This is also useful for unstructured programming languages or
nondeterministic code. To be able to perform *fully precise* symbolic state
merging in all situations, nondeterministic value summaries are
indispensable, be it as special data types or encoded into path conditions.
Their nondeterministic nature only becomes a problem if we want to be able to
reconstruct execution traces from merged states. Then, we need a notion of
*time* (for deterministic programs) or some general notion of *origin* (for
nondeterministic programs with several entry points).

Cristian Cadar, Koushik Sen (2013): Symbolic Execution for Software Testing: Three Decades Later. Commun. ACM

**56**(2), 82–90 ↩︎Koushik Sen, George C. Necula, Liang Gong, Wontae Choi (2015): MultiSE: Multi-Path Symbolic Execution Using Value Summaries.

*ESEC/SIGSOFT FSE*: 842-853 ↩︎- A General Lattice Model for Merging Symbolic Execution Branches.(2016). ↩︎
*Proc. 18th Intern. Conf. on Formal Engineering Methods (ICFEM)*. - Abstract Execution: Automatically Proving Infinitely Many Programs. TUPrints, Technical University of Darmstadt, Department of Computer Science.(2020). ↩︎
Patrick Cousot, Radhia Cousot (1977): Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints.

*POPL 1977*: 238-252 ↩︎https://www.key-project.org/2017/05/03/new-feature-state-merging-in-key/ ↩︎