Synsh Internals: Shell Pipeline Synthesis with Beam Search

Synsh is a program synthesis tool for shell pipelines: given an example of an input text and the desired output, Synsh attempts to generate a shell pipeline that transforms the example input into the example output. For example, suppose you give Synsh the following example input describing the members of Talking Heads:

Vocals David
Guitar David
Guitar Jerry
Bass Tina
Drums Chris

as well as the following example output, which extracts the names (the second field), sorts them alphabetically, and removes duplicates:

Chris
David
Jerry
Tina

Then Synsh will generate the following shell pipeline, preceded by comments describing what each command in the pipeline does:

# 1. Select field 2, separated by " ".
# 2. Sort lines by alphanumeric value.
# 3. Output unique lines.
cut -d ' ' -f 2 | sort | uniq

The process Synsh uses to find a pipeline that transforms the example input into the example output, at a high level, is roughly this:

1. Start with the example input text as the only candidate.
2. Apply every relevant command to each candidate text, producing new candidates.
3. Keep only the handful of candidates closest, by edit distance, to the example output.
4. Repeat from step 2 until some candidate equals the example output, or give up after a fixed number of steps.

The remainder of this post makes this description precise, shows examples, and discusses a few optimizations on top of the basic algorithm.

Finding Pipelines with Tree Search

A simple way to search for pipelines is to begin with the example input and apply all the commands at your disposal to the input. If the result of applying any of those commands is the example output, you’ve found a one-command pipeline that maps the example input to the example output. Otherwise, take all the outputs from the first step and apply every available command to those, and so on, until you get the desired output or give up.
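
To make the brute-force search concrete, here's a minimal sketch in Python. (This is an illustration, not Synsh's actual implementation; the COMMANDS list and run_command helper are assumptions made for the sketch.)

import subprocess

# A hypothetical command set; each entry is a shell command applied via a pipe.
COMMANDS = ["sort", "tr -d '\\n'", "uniq", "cut -d ' ' -f 2"]

def run_command(cmd: str, text: str) -> str:
    """Run a single shell command on the given input text."""
    return subprocess.run(
        cmd, shell=True, input=text, capture_output=True, text=True
    ).stdout

def brute_force(example_input: str, example_output: str, max_depth: int):
    """Enumerate pipelines breadth-first, level by level."""
    frontier = [([], example_input)]  # (pipeline-so-far, current text)
    for _ in range(max_depth):
        next_frontier = []
        for pipeline, text in frontier:
            for cmd in COMMANDS:
                result = run_command(cmd, text)
                if result == example_output:
                    return pipeline + [cmd]  # found a pipeline
                next_frontier.append((pipeline + [cmd], result))
        frontier = next_frontier
    return None  # gave up

As written, the frontier grows by a factor of len(COMMANDS) at each level, which is exactly the exponential blowup the rest of this post works to avoid.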

This process generates a tree:

Thus, we can view pipeline synthesis as a tree search problem. To find a pipeline that maps an example input to an example output, we start from the root, which is labeled with the example input, and search for a path to a node that’s labeled with the example output. If we find such a path, the edges along the path constitute a pipeline that can be used to transform the example input into the example output. (Of course, we may not find a pipeline, for a variety of reasons. For example, it may not be possible to produce the example output from the example input using the commands we have available.)

As an illustration, consider that we have the following two commands available:

- sort, which sorts the lines of its input.
- tr -d '\n', which deletes the newline characters from its input.

Suppose that we give the following example input text:

c
a
b

and want to produce the following example output text:

abc

That is, we want to sort the letters and collapse the text onto a single line. Here’s a possible search tree for this setup:

[Figure: from the root c\na\nb, sort yields a\nb\nc and tr -d '\n' yields cab; from a\nb\nc, tr -d '\n' yields the goal, abc, while sort is a no-op; from cab, both commands are no-ops.]
Search tree for example input c\na\nb and example output abc, using the commands sort and tr -d '\n'. The goal node containing the example output is shown with a thick border. Commands which have no effect on their inputs are shown with a dashed line; we do not need to explore these edges during search.

Given this search tree, we’re able to find the desired pipeline, sort | tr -d '\n', by following the path from the node labeled with the example input to the node labeled with the example output.

Note that we can prune the tree, and thus optimize the search, by removing edges corresponding to commands that don’t have any effect on their input text. For example, the tr -d '\n' command removes newlines from its input. There are no newlines in the text cab, so the tr -d '\n' command labeling the dashed edge leading from the node labeled cab can be eliminated from the search tree, and we do not need to consider pipelines generated by that subtree.

In Synsh, this pruning is implemented by generating a list of relevant commands for each input and only applying those commands, instead of all available commands, at each step of the search. For example, for a text containing both newlines and commas but not spaces or colons, tr commands deleting newlines and commas are relevant; tr commands deleting spaces and colons are not relevant, and will not be applied during search. To further control the search space, some commands, like tr, are only applied with parameters from a fixed set — for example, only common punctuation and whitespace characters for tr, and only common delimiter characters for cut.
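
As a rough sketch of this relevance filtering (the candidate character lists and the exact no-op checks here are assumptions for illustration, not Synsh's actual lists):

def relevant_commands(text: str) -> list[str]:
    """Propose only commands that can actually change this text."""
    commands = []
    lines = text.splitlines()
    if lines != sorted(lines):
        commands.append("sort")  # sorting already-sorted text is a no-op
    # tr -d is relevant only for characters that occur in the text;
    # candidates are paired with their shell-quoted spellings.
    for ch, quoted in [("\n", r"'\n'"), (",", "','"), (" ", "' '"), (":", "':'")]:
        if ch in text:
            commands.append(f"tr -d {quoted}")
    return commands

For a text containing newlines and commas but not spaces or colons, this proposes tr -d '\n' and tr -d ',' (plus sort, if the lines aren't already sorted), matching the description above.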

Narrowing the Search Space: Beam Search with Edit Distance

Having defined pipeline synthesis as a tree search problem, we can apply standard tree search algorithms to find pipelines. However, we have to take care with our choice of search algorithm: the search tree is infinite and, even if we truncate it to a finite depth, it still grows exponentially at each level. For example, if there are eight commands that can be run at each node, truncating the tree to ten levels leads to 8^10 = (2^3)^10 = 2^30, or approximately one billion, nodes to explore. Thus, we rule out standard depth-first and breadth-first search.

Instead, Synsh uses beam search. Beam search is similar to breadth-first search but, instead of visiting the children of all nodes at each level, it visits the children of only a limited number of nodes, called the beam width, b, choosing those with the lowest "distance" to the goal, where distance is defined in a problem-specific way. Thus, for a given beam width b and search depth d, beam search is guaranteed to visit only O(b·d)-many nodes; contrast this with breadth-first search, which, for a tree whose nodes have w children, where w is typically much larger than b, must visit O(w^d)-many nodes.

Note that beam search requires that we define an order among nodes to determine which nodes to visit at each step. In Synsh, our goal is to transform an example input text into an example output text; thus, we order nodes by the Levenshtein distance, also commonly known as “edit distance,” between the node’s text and the example output text. That is, at each search step, we choose to expand the nodes whose texts are “closest” to the example output in that their corresponding texts have the lowest numbers of insertions, deletions, and substitutions compared to the example output.
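
For reference, Levenshtein distance has a standard dynamic-programming formulation. A compact two-row version (a textbook sketch; Synsh's implementation may differ):

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting i characters turns a[:i] into ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]

For example, levenshtein("cab", "abc") is 2: delete the leading "c" and append one at the end.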

Finally, since there may be no pipeline that maps the example input to the example output, or such a pipeline may exist but be impossible to find in a reasonable amount of time, we terminate the search after reaching a pre-specified maximum depth.
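
Putting the pieces together, the core loop might look like the following sketch, reusing the run_command, relevant_commands, and levenshtein helpers from the earlier sketches (the parameter defaults are placeholders, not Synsh's tuned values):

def beam_search(example_input: str, example_output: str,
                beam_width: int = 16, max_depth: int = 6):
    """Search for a pipeline mapping example_input to example_output."""
    beam = [([], example_input)]  # (pipeline-so-far, current text)
    for _ in range(max_depth):
        candidates = []
        for pipeline, text in beam:
            for cmd in relevant_commands(text):
                result = run_command(cmd, text)
                if result == example_output:
                    return pipeline + [cmd]  # goal reached
                candidates.append((pipeline + [cmd], result))
        # Keep only the beam_width candidates closest to the goal.
        candidates.sort(key=lambda c: levenshtein(c[1], example_output))
        beam = candidates[:beam_width]
    return None  # no pipeline found within max_depth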

Example: Field Projection

Suppose that we want to transform this comma-separated example input:

Arthur,Margaret,Octavia,Ted

into this comma-separated example output:

Margaret,Ted

We’ll assume that we have only one form of command available, comma-separated field projection with cut:

 cut -d , -f [comma-separated list of fields to extract]

To illustrate this command, on our example input, cut -d , -f 1,3 produces the following:

$ echo "Arthur,Margaret,Octavia,Ted" |
    cut -d , -f 1,3
Arthur,Octavia

Before we can use this command in our pipeline search, we’ll make an important performance optimization. Note that, if there are n fields in the input, the number of different subsets of fields we can project from the input is 2^n - 1. (The minus-one is there because it doesn’t make sense to project zero fields out of the input.) Thus, we cannot apply all possible projections at every step of the search — the search space would be too large. To keep the search space manageable while still allowing Synsh to find all possible field projections, we only allow cut commands to remove a single field at a time; projections that remove more than one field are achieved by chaining together multiple projections that each remove a single field.
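
A sketch of how the leave-one-out restriction keeps the number of candidate projections linear in the number of fields (the field counting here assumes every line has the same number of fields):

def leave_one_out_cuts(text: str, delim: str = ",") -> list[str]:
    """For an n-field input, propose n projections, each dropping
    one field, instead of all 2^n - 1 possible projections."""
    n = text.splitlines()[0].count(delim) + 1
    commands = []
    for dropped in range(1, n + 1):
        kept = ",".join(str(i) for i in range(1, n + 1) if i != dropped)
        commands.append(f"cut -d {delim} -f {kept}")
    return commands

Applied to Arthur,Margaret,Octavia,Ted, this proposes four commands, cut -d , -f 2,3,4 through cut -d , -f 1,2,3, one for each field that could be dropped.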

Given the restriction to only remove a single field at a time, one possible way of rewriting our previous example is as follows, where we first remove the fourth field, then remove the second:

$ echo "Arthur,Margaret,Octavia,Ted" |
    cut -d , -f 1,2,3 |
    cut -d , -f 1,3
Arthur,Octavia

(Note that Synsh actually produces a single-command pipeline for this example. Internally, Synsh uses the leave-one-out strategy during search, but has logic to collapse chains of leave-one-out commands into a single command for output. This will be covered in a future post.)

Continuing with our example, we show the full search tree for a pipeline that maps example input Arthur,Margaret,Octavia,Ted to Margaret,Ted using leave-one-out projection commands and a beam width of two:

[Figure: from Arthur,Margaret,Octavia,Ted, the four leave-one-out projections are generated; the two with the lowest edit distances to the goal, Arthur,Margaret,Ted (distance 7) and Margaret,Octavia,Ted (distance 8), are expanded, and from either one more leave-one-out projection reaches Margaret,Ted.]
Search tree for example input Arthur,Margaret,Octavia,Ted and example output Margaret,Ted, with a beam width of two, using only commands of the form cut -d , -f [list of fields to extract]. Each node is labeled by the text produced by the pipeline up to that point, along with the text's Levenshtein distance from the example output. Nodes that are chosen for further expansion, as well as the nodes corresponding to the goal state, are highlighted with thicker outlines. Note that, as our beam width is two, the only children of the root that we choose for further expansion are the nodes with the two lowest edit distances from the goal.

Note that we reach the goal in two steps, and that choosing a beam width of two halved the number of nodes expanded at the second level compared to a full breadth-first expansion.
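
As a sanity check, the levenshtein sketch from earlier reproduces the distances shown in the tree:

# Distances from the first-level candidates to the goal, Margaret,Ted.
assert levenshtein("Margaret,Octavia,Ted", "Margaret,Ted") == 8  # delete "Octavia,"
assert levenshtein("Arthur,Margaret,Ted", "Margaret,Ted") == 7   # delete "Arthur,"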

Optimizations

Below, we describe some optimizations to the search algorithm that allow us to narrow the search space or avoid repetitive computations.

Eliminating Redundant Paths

During the search, we may find that the same text is produced along multiple paths. For example, given the text a b c, the following two commands both produce the same output:

$ echo 'a b c' | cut -d ' ' -f 1,3
a c
$ echo 'a b c' | awk '{ print $1, $3 }'
a c

Keeping multiple paths that produce the same text both wastes time and, since we only explore a fixed number of paths, risks duplicate paths crowding out novel paths that may lead to the goal. Thus, for each set of paths that produce the same text, we keep only one path (pipeline) that produces that text. To decide which path to keep, we track the “cost” of each command along every path and keep the path with the lowest total cost. This cost is informally defined to capture the intuitive run-time and conceptual “complexity” of a command; for example, a cut command is intuitively simpler than the equivalent awk command, so we assign it a lower cost.
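
One way to implement this deduplication is to key candidates by their output text and keep the cheapest pipeline per text. (The cost table below is a made-up illustration; Synsh's actual costs differ.)

# Hypothetical per-command costs; simpler commands get lower costs.
COMMAND_COSTS = {"cut": 1, "sort": 1, "uniq": 1, "tr": 2, "awk": 5}

def pipeline_cost(pipeline: list[str]) -> int:
    """Total cost of a pipeline: the sum of its commands' costs."""
    return sum(COMMAND_COSTS.get(cmd.split()[0], 3) for cmd in pipeline)

def deduplicate(candidates):
    """Keep one lowest-cost pipeline per distinct output text."""
    best = {}  # output text -> (pipeline, text)
    for pipeline, text in candidates:
        if text not in best or pipeline_cost(pipeline) < pipeline_cost(best[text][0]):
            best[text] = (pipeline, text)
    return list(best.values())

In the beam search loop sketched earlier, deduplicate would be applied to candidates before sorting and truncating to the beam width.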

Iterative Widening

Beam search requires a beam width parameter, which introduces a tradeoff: setting the width too small causes search to fail in more cases where a pipeline could have been found with a larger width, while setting the width too large results in unnecessary work being done in cases where a smaller beam width would have sufficed. To address this tradeoff, Synsh uses iterative widening: the beam search is first run with a width of w, then a width of 2w, and so on, up to a pre-specified maximum width of W = w·2^m. Since ∑_{i=0..m} w·2^i = w·∑_{i=0..m} 2^i = w·(2^{m+1} - 1) < w·2^{m+1} = 2·w·2^m = 2W ∈ O(W), the total number of nodes expanded by the search is linear in the maximum beam width, W. In particular, we’ll expand at most twice as many nodes as necessary.
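
In code, naive iterative widening is a simple outer loop around the beam search sketched earlier (restarting from scratch at each width; the parameter defaults match the benchmark configuration described below):

def widening_search(example_input: str, example_output: str,
                    initial_width: int = 16, max_width: int = 512):
    """Run beam search with doubling widths until a pipeline is found."""
    width = initial_width
    while width <= max_width:
        result = beam_search(example_input, example_output, beam_width=width)
        if result is not None:
            return result
        width *= 2  # nothing found at this width; double it and retry
    return None

For example, with w = 16 and W = 512, the widths tried are 16, 32, 64, 128, 256, and 512, which sum to 1,008, less than 2W = 1,024, as the bound promises.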

Note that naive iterative widening causes some redundant expansions early in the search, some of which are quite easy to eliminate. To illustrate, we’ll consider how iterative widening explores the first two levels of a search tree. (Note that we do not count the root as a level, so the first level begins with the root’s successors.) First, we show the whole first two levels, without any beam width restriction:

[Figure: tree with root R; R's successors are A (distance 5 from the goal) and B (distance 6); A's successors are C (distance 3) and D (distance 4); B's successors are E (distance 1) and F (distance 2).]
First two post-root levels of a fully-expanded search tree.

Suppose we apply iterative widening with a beam width of one, then a beam width of two. In the first iteration, when the beam width is one, the search proceeds as follows:

1. Compute the successors of the root: A, at distance 5 from the goal, and B, at distance 6.
2. Since the beam width is one, select only the lowest-distance successor, A, and expand it, producing C (distance 3) and D (distance 4).
3. If the search continued to a third level, only the lowest-distance node at the second level, C, would be selected for expansion.

The first two levels of the search are illustrated below.

[Figure: the same tree expanded with a beam width of one; at the first level, only A is expanded, and at the second, only C would be.]
First two post-root levels of the same tree, expanded with a beam width of one. Nodes chosen for expansion are marked with thick borders; if the search were to continue to the third level, node C would be expanded.

When the beam width is two, the search proceeds as follows:

1. Compute the successors of the root, A and B, again.
2. Since the beam width is two, expand both A and B, producing C (distance 3), D (distance 4), E (distance 1), and F (distance 2).
3. If the search continued to a third level, the two lowest-distance nodes, E and F, would be selected for expansion.

This is illustrated below.

[Figure: the same tree expanded with a beam width of two; at the first level, both A and B are expanded, and at the second, E and F would be.]
First two post-root levels of the same tree, expanded with a beam width of two. Note that, no matter the beam width, we always need to compute the root's successors.

There are two sources of redundancy here. First, and most obviously, the successors of the root are computed at each iteration. At the first level, the only thing that differs between searches of different widths is how many of the root node’s successors with the lowest distances we select for further expansion. Thus, we compute the root node’s successors once and sort them by increasing distance from the example output. At each widening iteration, rather than computing the root’s successors from scratch, we simply take the width-many lowest-distance nodes from our precomputed, sorted list of root node successors.
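
A sketch of this first-level caching, again reusing the earlier helpers (the cache structure is an assumption; Synsh's implementation may differ):

class RootCache:
    """Compute and sort the root's successors once; reuse them across widths."""
    def __init__(self, example_input: str, example_output: str):
        self._input = example_input
        self._output = example_output
        self._sorted = None

    def top(self, width: int):
        """Return the width lowest-distance successors of the root."""
        if self._sorted is None:
            succs = [([cmd], run_command(cmd, self._input))
                     for cmd in relevant_commands(self._input)]
            succs.sort(key=lambda s: levenshtein(s[1], self._output))
            self._sorted = succs
        return self._sorted[:width]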

The second source of redundancy is that, as we include more of the root node’s successors in the widening search, the set of successors of the first-level nodes that we need to consider expands monotonically. In our example, in the first iteration, we only needed to consider the successors of A, which are C and D. In the second iteration, we needed to consider the successors of A, as well as the successors of B, which are E and F. This set will monotonically increase with the search width. However, since the set depends on the nodes that are expanded on the first level, it’s less straightforward to maintain this set across iterations; this optimization is not implemented in Synsh yet and is left for future work.

(Note that there’s no guarantee of redundancy at the third level of the search. In our example, in the first widening iteration, with a beam width of one, the third level of the search would consider the successors of node C. In the second widening iteration, with a beam width of two, the third level of the search would consider the successors of nodes E and F. Thus, in the example, there is no overlap at the third level between iterations with different widths.)

Below, we test iterative widening on Synsh’s benchmark set (compiled to native code), consisting of 31 examples. For runs with iterative widening, we use an initial beam width of 16 and a maximum beam width of 512; runs without iterative widening use a fixed beam width of 512. The maximum width was chosen as the smallest value that succeeds on the most complex benchmark; the initial width was chosen as the smallest value that produces the same pipelines for all benchmarks as a fixed beam width of 512. The maximum depth is 6, again chosen as the minimum for which all benchmarks succeed.

[Figure: per-benchmark speedups from iterative search widening.]
Total time without iterative search widening (ms)    2,666 ± 11
Total time with iterative search widening (ms)       1,482 ± 7
Mean speedup                                              4.6
25th-percentile speedup                                   1.0
50th-percentile speedup                                   1.0
75th-percentile speedup                                   1.4
Maximum speedup                                          45.7
Maximum slowdown                                          2.2
Speedups from iterative search widening, measured as the median of 100 runs, on a 2017 iMac (Intel i5-7500 at 3.4GHz). Total times are given ± the median absolute deviation.

In the results above, benchmarks annotated “(infeasible)” are examples where the example output cannot be produced from the example input using the available commands. These infeasible examples exercise the search failure case, which matters a great deal in practice: if the user encounters a search failure, they’ll need to refine the example, perhaps several times, and the user’s iteration speed will be limited by the time it takes for search to fail.

From the distribution of speedups, we can see that most benchmarks are unaffected by this optimization: the 25th- and 50th-percentile speedups are both 1.0. However, a few examples are sped up significantly. For example, “sort reorder and translate,” a nontrivial benchmark whose pipeline sorts by a field, reorders the fields, and changes the field separator, is sped up 45.7x, from around one second, which is perceptibly slow, to around 20ms, which is fast enough to appear instantaneous. The slowdowns are primarily in the “infeasible” cases, which hover around or over 100ms to begin with; in these cases, we’re making an already-slow case a bit slower, which is regrettable but an acceptable tradeoff for the significant speedups on other cases.

(Disclaimer: these benchmarks are informal and are not, for example, run on a fully-isolated machine nor robust against spurious performance changes caused by factors like environment size or code layout, as described in Producing Wrong Data Without Doing Anything Obviously Wrong! and STABILIZER: Statistically Sound Performance Evaluation.)

Notes on General Approach

Synsh is a bottom-up enumerative program synthesizer: it searches by enumerating valid pipelines, beginning with the atomic unit of a pipeline, a single command, and iteratively extending the current results with new commands. See Armando Solar-Lezama’s program synthesis lecture notes for related background.

As an enumerative synthesizer, Synsh doesn’t use, or even have access to, any knowledge about the semantics of the commands it’s searching over; it’s only able to provide commands with inputs and observe their outputs. This is in contrast to systems like Flash Fill, which use knowledge of the semantics of atomic operations to guide search. Synsh’s black-box approach makes it trivial to extend with new commands, since it requires no special support beyond what’s needed for ordinary use by an end user. On the other hand, Synsh’s search could be made more efficient if it had a richer representation of commands that allowed it to prune the search space more intelligently.

For simplicity, Synsh doesn’t use any machine-learned components. At a minimum, it would likely be beneficial to use learned functions to guide Synsh’s search, which currently relies on manual heuristics. At the extreme end, pipeline generation can be viewed as a reinforcement learning or sequence modeling problem, and much of Synsh’s logic could be replaced entirely. However, using learned functions brings complications in the form of generating appropriate training and testing data, which are not required in Synsh’s current incarnation. Further, Synsh pipelines are correct by construction, barring errors in the implementation of the commands themselves, while pipelines generated by a learned system would likely need to be validated by an external process. Given the tradeoffs, we’ve decided to take the purely-search-based, heuristically-guided approach as far as it will go.

Next Post: Generating Concise Pipelines

In this post, we showed how to generate basic pipelines of multiple commands using a simple algorithm based on beam search, using Levenshtein distance to focus the search on candidate pipelines that make the most progress in mapping the example input to the example output. While functional, the pipelines generated by the algorithm shown so far are suboptimal in both run-time performance and readability. For instance, our example field projection pipeline, which projects two fields out of a four-field input, chained together two cut commands, though only one command is necessary. In the next post, we’ll discuss a small modification to the search algorithm that enables it to generate the single-command pipeline.


Thanks to Evan Andrew Rose and Nathan Goulding-Hotta for their helpful feedback on drafts of this post.