This page hosts the accepted artifact for our PLDI 2017 paper on MemSynth:
Artifact
- Paper (PDF, 280 kB)
- VirtualBox image (OVA, 1.2 GB)
Artifact overview
We have provided a VirtualBox image containing the implementation of MemSynth, which is based on Racket and Rosette. The image also contains the implementations of the experiments in the paper’s evaluation (Section 6).
After installing VirtualBox and the VirtualBox extension pack, import our VirtualBox image (File > Import Appliance).
When booted, the virtual machine will log in automatically, open a terminal to the ~/memsynth directory containing MemSynth, and open a copy of this instruction page. Should you need credentials, the virtual machine user is memsynth with password memsynth.
Directory layout
The ~/memsynth directory contains the source code for MemSynth and the case studies described below. It is laid out as follows:
- case-studies contains the implementations of the case studies described below.
- frameworks contains implementations of two different memory model frameworks (defined in Section 3 of the paper).
  - alglave implements a framework from the paper Fences in Weak Memory Models, by Jade Alglave, Luc Maranget, Susmit Sarkar, and Peter Sewell, published at CAV 2010 (described in Section 3.2 of our paper).
  - madorhaim implements a framework from the paper Litmus Tests for Comparing Memory Consistency Models: How Long Do They Need to Be?, by Sela Mador-Haim, Rajeev Alur, and Milo M. K. Martin, published at DAC 2011 (described in Section 6.2.1 of our paper).
- litmus contains a small DSL for defining litmus tests.
  - tests contains several collections of litmus tests, including tests from Intel’s x86 manual (x86.rkt) and 768 tests from previous work on PowerPC (ppc-all.rkt).
- memsynth contains the implementation of the MemSynth engine.
- ocelot contains the Ocelot language, our embedding of bounded relational logic in Rosette.
Running via Docker
Some of our experiments (notably the uniqueness experiments in the synthesis section) take several hours, and so may not be convenient to run locally in a virtual machine. We provide smaller versions of these experiments where possible, but we also offer a Docker image, identical to the virtual machine, that can be used to run these experiments on bare-metal hardware or remotely (e.g., on EC2).
After installing Docker on a system in the usual way, pull our image from Docker Hub:
$ sudo docker pull memsynth/memsynth
and then run it:
$ sudo docker run -ti --name memsynth memsynth/memsynth bash
This command will drop you into a bash shell in the Docker container. The container’s file system is identical to the virtual machine’s (i.e., the code is based in ~/memsynth), so the instructions below will work on either platform. You may find Docker’s attach support useful for detaching from and re-attaching to the container without stopping its work.
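For example, with the container started via docker run -ti as above, the standard Ctrl-p Ctrl-q key sequence detaches from the shell while leaving the container (and any long-running experiment) alive; re-attach later with:
$ sudo docker attach memsynth
If the container has stopped (for instance, because you exited the shell), restart it first with sudo docker start memsynth.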
Case studies
The artifact addresses all three research questions in the paper’s main evaluation (Section 6):
- Can MemSynth scale to real-world memory models such as PowerPC and x86? (synthesis experiments)
- Does MemSynth provide a basis for rapidly building useful automated memory model tools? (repair experiments)
- Does MemSynth outperform existing relational solvers and memory model tools? (performance experiments)
Model synthesis
Section 6.1 of the paper demonstrates that MemSynth scales to real-world memory models by using it to synthesize specifications for PowerPC and x86. The results also show that MemSynth can identify ambiguities and redundancies in litmus tests and documentation for these architectures.
PowerPC
Section 6.1.1 applies MemSynth to the PowerPC architecture. We show that we can synthesize a model for PowerPC in 16 seconds, and automatically identify several ambiguities in that model.
To begin, enter the directory for this case study:
$ cd ~/memsynth/case-studies/synthesis/ppc
$ ls
compiled ppc0.rkt ppc0-unique.rkt ppc1.rkt sketch.rkt
The sketch.rkt file contains the sketch we use to synthesize PowerPC models. Most notably, the sketch includes the following definition for ab (note that ab is called fences in the paper for readability):
(define ab (+ (^ ab-sync) (^ ab-lwsync)))
As the paper notes, this definition captures the insight that Fence and LWFence (i.e., sync and lwsync) do not interact.
Claim: MemSynth synthesizes a model, which we call PPC0, that agrees with the hand-written model by Alglave et al. on all 768 tests. The synthesis takes 16 seconds, and … uses only 18 of the 768 tests
To reproduce:
$ racket ppc0.rkt
This program will first print some information about the litmus tests and size of the search space, and then print a message of the form
Synthesis complete!
time: 20177 ms
tests used: 18/768
followed by the synthesized model. This result says synthesis completed in 20 seconds, and only 18 of the 768 litmus tests were actually used by the incremental synthesis algorithm.
It also verifies that the model agrees with Alglave et al.’s model on all 768 tests:
Verifying solution...
Verified 768 litmus tests
Claim: We apply MemSynth’s uniqueness query to enlarge the set [of 768 tests] until it identifies a single model … MemSynth finds 10 new tests to add to the set
To reproduce (takes ~6 hours; see below for a smaller version):
$ racket ppc0-unique.rkt
This program first synthesizes PPC0 again, then outputs some information about the symbolic litmus test being used for uniqueness. It will then begin identifying new tests, which it outputs in the form:
New distinguishing test [#1]:
Test disambig0
=============================
P0 | P1
-----------------------------
r1 <- [B] | r2 <- [A]
sync | r3 <- r2 ^ r2
[A] <- 1 | [B+r3] <- 1
=============================
r1=1 /\ r2=1
Allowed by oracle? #t
This output says a new test was found with two threads, and that the test is allowed by the oracle (Alglave et al.) model. The unusual r3 <- r2 ^ r2 instruction (an XOR operation that always evaluates to 0) introduces a data dependency between the store to B and the read from A.
Due to changes since paper submission, and solver non-determinism, this experiment finds 9 tests rather than the 10 stated in the paper.
Note that the program does not produce any output between discovering distinguishing tests (and there can be over an hour between each such test). Pass the -v argument to the program to see more frequent verbose output tracking the progress of the metasketch solving (as described in the “Concretization” paragraph of Section 5.3).
Smaller version: run
$ racket ppc0-unique.rkt -s
to execute a version of the experiment with smaller bounds on the symbolic litmus test. This version completes in 10 minutes, but finds only 2 distinguishing litmus tests rather than the 9 found by the full experiment.
Claim: MemSynth is able to use the results of [hardware] experiments as an oracle, and synthesizes a new model PPC1 in 22 seconds; the resulting model is not equivalent to PPC0.
To reproduce:
$ racket ppc1.rkt
This program will first synthesize PPC0 (with the same output as for ppc0.rkt above), and then synthesize PPC1. The synthesis for PPC1 uses fewer tests than for PPC0, because some litmus tests are inconclusive when run on the original hardware.
The synthesis for PPC1 again prints a message of the form:
Synthesis complete!
time: 16383 ms
tests used: 15/588
Then, to prove that PPC0 and PPC1 are not equivalent, the program synthesizes a new test that distinguishes the two:
Test equiv0
=============================
P0 | P1
-----------------------------
r1 <- [B] | r2 <- [A]
sync | r3 <- r2 ^ r2
[A] <- 1 | [B+r3] <- 1
=============================
r1=1 /\ r2=1
Allowed by PPC_0: #f
Allowed by PPC_1: #t
This test is allowed by PPC1, but not by PPC0, proving the two models are not equivalent.
Experimenting: we encourage experimentation with the PowerPC synthesis case study. The file ppc0.rkt defines the set of tests used, drawn from the set of all PowerPC tests in ~/memsynth/litmus/tests/ppc-all.rkt. This collection can be modified to, for example, exclude all tests containing certain instructions (see ~/memsynth/litmus/tests/ppc.rkt for examples, and the sketch below).
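As a minimal illustration of such a modification, the following Racket sketch filters out every test that uses lwsync. The list all-ppc-tests and the accessor litmus-test-instructions are assumed names for this example, not necessarily the identifiers actually provided by ~/memsynth/litmus:
;; Hypothetical sketch: keep only tests that never use lwsync.
;; `all-ppc-tests` and `litmus-test-instructions` are assumed names.
(define tests-without-lwsync
  (filter (lambda (t)
            (not (member 'lwsync (litmus-test-instructions t))))
          all-ppc-tests))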
The sketch in sketch.rkt can also be modified. For example, the numeric arguments to make-ppo-sketch etc. control the depth of the expression sketch used. The ab sketch could also be modified to not contain the domain knowledge about the two barriers being independent (though this would require adding ^ to the list of non-terminals in the grammar), as sketched below.
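For instance, assuming the expression-sketch helper used in the repair case study below also applies here (with the same depth/arity/operators/terminals signature), the hand-written definition of ab might be replaced by a sketch like:
;; A sketch over the barrier relations, with ^ (transitive closure)
;; added as an operator; the depth and operator choices are illustrative.
(define ab (expression-sketch 2 2 (list + & ^)
                              (list ab-sync ab-lwsync)))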
x86
Section 6.1.2 applies MemSynth to the x86 architecture. We show that we can synthesize a model for x86 in 2 seconds using litmus tests from the Intel manual, and automatically identify several tests missing from that documentation.
To begin, enter the directory for this case study:
$ cd ~/memsynth/case-studies/synthesis/x86
$ ls
compiled redundancy.rkt sketch.rkt tso0.rkt tso0-unique.rkt
The sketch.rkt file contains the sketch we use to synthesize x86 models. Unlike for PowerPC, we do not model fences for x86, and so the sketch is much simpler.
Claim: MemSynth synthesizes a formalization TSO0 that is correct on the Intel manual’s 10 litmus tests in under two seconds.
To reproduce:
$ racket tso0.rkt
As with the PowerPC experiments, this program prints a message of the form
Synthesis complete!
time: 2363 ms
tests used: 6/10
followed by the synthesized model, and verifies that the model agrees with the documentation:
Verifying solution...
Verified 10 litmus tests
Claim: MemSynth’s uniqueness query determines that at least one other memory model also satisfies all 10 tests, while disagreeing with TSO0 on a new distinguishing test … repeating the uniqueness query after adding this new test finds 2 more distinguishing tests.
To reproduce (takes ~4 hours; see below for a smaller version):
$ racket tso0-unique.rkt
As with the PowerPC uniqueness experiment above, this program first synthesizes TSO0 again, and then begins identifying new tests, which it outputs in the form:
New distinguishing test [#1]:
Test disambig0
=============================
P0 | P1
-----------------------------
r1 <- [B] | LOCK [A] <- 1
r2 <- [A] | LOCK [B] <- 1
=============================
r1=1 /\ r2=0
Allowed by oracle? #f
The search finds a total of 3 such tests.
As with the PowerPC uniqueness experiment, no output is produced between distinguishing tests. Pass the -v argument to the program to see more frequent verbose output.
Smaller version: run
$ racket tso0-unique.rkt -s
to execute a version of the experiment with smaller bounds on the symbolic litmus test. This version completes in 4 minutes and finds the same 3 distinguishing tests as the full version, but the uniqueness claim is weaker because fewer candidate tests are considered.
Claim: Sarkar et al. [32] write that “P8 may be redundant” … we found that if we omit [the] two tests [associated with P8] … the ambiguity experiment rediscovers them, suggesting they are needed to uniquely identify x86’s memory model.
To reproduce (takes ~4 hours; see below for a smaller version):
$ racket redundancy.rkt
This program first synthesizes an x86 model without the tests in question, and then searches for new distinguishing tests. While the previous x86 uniqueness experiment finds 3 such tests, this experiment finds 5. The two additional tests are exactly the two that were omitted.
As with the other TSO uniqueness experiment, no output is produced between distinguishing tests. Pass the -v argument to the program to see more frequent verbose output.
Smaller version: run
$ racket redundancy.rkt -s
to execute a version of the experiment with smaller bounds on the symbolic litmus test. This version completes in 4 minutes and finds the same 5 distinguishing tests as the full version, but the uniqueness claim is weaker because fewer candidate tests are considered.
Framework repair
Section 6.2 of the paper uses MemSynth to identify and repair a bug in a paper by Mador-Haim et al.
To begin, enter the directory for this case study:
$ cd ~/memsynth/case-studies/repair
$ ls
allowed.rkt axioms.rkt compiled enumerate.rkt repair.rkt
The file axioms.rkt contains an implementation of the Mador-Haim memory model framework, but with a hole (called X) standing in for the precondition of the suspect “Ignore local” axiom. Running repair.rkt tries various completions of this hole, and then synthesizes a completion that makes the model correct.
To reproduce:
$ racket repair.rkt
This program will produce output, described below, that addresses all the claims in Section 6.2.2. The program itself is quite readable, and we encourage reviewing it to determine exactly how it addresses the claims.
Claim: The paper states there should be 82 distinct models, but our implementation found only 12 distinct models
The first output from repair.rkt tries using the version of the axiom written in the paper, and reports:
Original axiom, as stated in paper: 12 distinct models
Claim: Omitting this axiom from our implementation gave 86 distinct models, not 82 as expected
The next output tries omitting the axiom (by setting its precondition to false), and reports:
With axiom omitted: 86 distinct models
Claim: Our implementation reported this test [test mh/L2] to be disallowed by all 90 memory models.
The program reports:
Models that allow test/madorhaim/L2: 0
Models that forbid test/madorhaim/L2: 90
Claim: We replaced [the precondition] with an expression sketch of depth 3, and synthesized a completion that gave the correct outcomes on 9 litmus tests from the original paper on both TSO and RMO memory models. We were able to synthesize the … repair in 15 seconds
The actual expression sketch is on lines 45–46 of repair.rkt:
(define X (expression-sketch 3 2 (list + - & -> SameAddr ~)
(list MemoryEvent Reads Writes po rf)))
The output from repair.rkt first reports the correct outcomes for the 9 litmus tests from the original paper, as determined using the Alglave et al. framework. Then, it synthesizes a completion for X using those outcomes as the specification, and reports:
Found a repair: (& (~ (- po rf)) (~ (-> Writes Reads)))
Time: 3268 ms
This repair is equivalent to the one stated in the paper after simplification (lifting ~ outside &), and due to changes since submission, the performance is much improved.
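The simplification is just the standard identity that transpose distributes over intersection ((A ∩ B)~ = A~ ∩ B~), so the synthesized repair can be rewritten with a single transpose; the rewriting below is hand-derived from that identity:
;; Synthesized form:
(& (~ (- po rf)) (~ (-> Writes Reads)))
;; Equivalent form with ~ lifted outside &:
(~ (& (- po rf) (-> Writes Reads)))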
Claim: With the repaired axiom, our pairwise comparison results produce 82 distinct models, identical to the original paper.
The final output from repair.rkt states:
Produces 82 distinct models.
Models that allow test/madorhaim/L2: 36
Models that forbid test/madorhaim/L2: 54
These results confirm 82 distinct models, and that some models now allow the test mh/L2 that was forbidden by every model when using the original axiom.
Performance
Section 6.3 characterizes the performance of the MemSynth engine compared to existing memory model and relational logic tools.
Verification
To begin, enter the directory for this experiment:
$ cd ~/memsynth/case-studies/performance/verification
$ ls
alloy alloy.sh herd herd.sh memsynth.sh
The verification experiment compares the performance of verifying whether the Alglave et al PowerPC model allows or forbids the 768 litmus tests used in the PowerPC synthesis experiments above.
The paper compares MemSynth to two different tools: Alloy (in the alloy directory) uses a PowerPC model exported from MemSynth, while Herd (in the herd directory) is a custom-built memory model tool by Alglave and Maranget that already supports PowerPC.
Claim: The results show that MemSynth outperforms Alloy by 10×, and is comparable to herd’s custom decision procedure for memory models.
To run each tool on the 768 tests, execute its respective shell script:
$ ./memsynth.sh
Verifying with Memsynth...
real 0m2.769s
user 0m2.592s
sys 0m0.164s
$ ./herd.sh
Verifying with Herd...
real 0m2.538s
user 0m2.332s
sys 0m0.108s
$ ./alloy.sh
Verifying with Alloy...
real 0m49.119s
user 1m19.312s
sys 0m1.556s
These results show that MemSynth and Herd have very similar performance, while Alloy is at least 10× slower than both tools.
Equivalence
To begin, enter the directory for this experiment:
$ cd ~/memsynth/case-studies/performance/equivalence
$ ls
alloy alloy.sh memsynth memsynth.sh models plot.py
The equivalence experiment compares the performance of synthesizing distinguishing litmus tests for each pair of 10 candidate PowerPC models (which were generated during synthesis of the PowerPC model above).
The paper compares MemSynth to Alloy*, a higher-order version of Alloy. The experiment tries to synthesize a distinguishing test for all 45 pairs of models using both tools, and compares the distribution of synthesis times for each tool.
Claim: MemSynth outperforms Alloy* on most of these queries: MemSynth can solve twice as many queries in under one second, and the hardest problem takes 25 seconds for MemSynth versus 18 min for Alloy*.
Both experiments are run by shell scripts, which output the performance for each pair of candidate models to a CSV file in the current directory.
To run the experiment for MemSynth (takes ~3 min):
$ ./memsynth.sh
To run the experiment for Alloy* (takes ~15 min):
$ ./alloy.sh
Finally, to plot and analyze the results:
$ ./plot.py
This command outputs the maximum time taken to compare two models by both MemSynth and Alloy*, and opens a plot that corresponds to Figure 8(b) in the paper. Note that for tractability, we set a lower timeout for the Alloy* experiment (2 minutes here versus 1 hour in the paper), so the x-axis is shorter than in the paper, the maximum time is lower, and Alloy* does not successfully complete all 45 pairs.
Synthesis
To begin, enter the directory for this experiment:
$ cd ~/memsynth/case-studies/performance/synthesis
$ ls
alloy alloy.sh memsynth memsynth.sh
The synthesis experiment compares the performance of synthesizing a memory model from a trivial sketch. The experiment uses three litmus tests from the Intel manual, and tries to synthesize a model from a sketch that simply consists of a choice between two or three existing models.
The paper compares MemSynth to Alloy* on the synthesis task. For both tools, we first try to synthesize a model from a sketch that chooses between sequential consistency (SC) and total store order (TSO). Then, in a second experiment, we add partial store order (PSO) to the set of choices.
Claim: When given {SC, TSO}, both MemSynth and Alloy* return in under a second. However, when given {SC, TSO, RMO}, MemSynth still returns in under a second, but Alloy* times out after one hour.
To run the experiment for MemSynth:
$ ./memsynth.sh
The output shows both models can be synthesized extremely quickly (under 20 milliseconds).
To run the experiment for Alloy*:
$ ./alloy.sh
The output shows that the choice between SC and TSO can be made rapidly, but choosing between SC, TSO, and PSO times out after 5 minutes (we set the timeout lower than in the paper for tractability).
Note that, for simplicity of export to Alloy*, the experiment uses PSO rather than the RMO mentioned in the paper; the results are the same.