Fray: The CMU Tool That Finds Concurrency Bugs Your Tests Miss — and Replays Them

Eleftheria Drosopoulou | March 16th, 2026 | Last Updated: March 13th, 2026


Race conditions and deadlocks are famous for appearing once in production and never again in tests. Carnegie Mellon’s Fray is built specifically to change that — by controlling the scheduler itself and writing down exactly what it found.

You have almost certainly seen this before: a test fails once, you re-run it, it passes. You mark it flaky, add a retry, and move on. Three weeks later the same intermittent failure lands in production — silently corrupting state at 3 AM on a Friday. Concurrency bugs are not random. They are deterministic failures waiting for exactly the right interleaving of threads, which normal test execution almost never produces. Fray, a new tool from Carnegie Mellon University’s PASTA lab, takes a fundamentally different approach: instead of hoping your threads will collide in the right way, Fray controls the scheduler itself and deliberately steers execution toward the interleavings most likely to trigger a bug.

Better still, when Fray finds something, it writes down exactly how it found it. You can replay the failure deterministically, every time, until you fix it. The tool was formally published at OOPSLA 2025 and is already available on Maven Central and as a Gradle plugin, starting at version 0.7.3.

1. Why Ordinary Tests Don’t Catch Concurrency Bugs

To understand why Fray exists, it helps to understand what makes concurrency bugs so persistent. When a multi-threaded program runs, the operating system’s scheduler decides which thread runs at any given moment. On a modern machine, there are millions of possible orderings — interleavings — for even a short test. The specific ordering that triggers a race condition or deadlock might have a probability of, say, one in ten thousand. Run your test suite ten thousand times and you will probably see it. Run it once or twice in CI, as most teams do, and you almost certainly will not.
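
To put rough numbers on that intuition, here is a small sketch. It assumes every run is an independent trial with the same one-in-ten-thousand chance of hitting the bad interleaving, which is a simplification: real schedulers are not uniform, and the figure itself is only the illustrative one used above.

```java
final class DetectionOdds {
    /** Chance of seeing the bug at least once in `runs` independent runs. */
    static double atLeastOnce(double perRunProbability, int runs) {
        // Complement of "missed it on every single run".
        return 1.0 - Math.pow(1.0 - perRunProbability, runs);
    }

    public static void main(String[] args) {
        double p = 1.0 / 10_000;  // illustrative per-run probability from the text
        System.out.printf("2 runs:      %.4f%n", atLeastOnce(p, 2));       // about 0.0002
        System.out.printf("10,000 runs: %.4f%n", atLeastOnce(p, 10_000));  // about 0.6321
    }
}
```

Under these assumptions, even ten thousand runs give only about a 63% chance of seeing the failure once, while two CI runs give about 0.02%.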

Traditional approaches to this problem fall into two camps. The first is data-race detection, which looks for unsynchronised reads and writes to shared memory; tools such as the -javaagent-based ThreadSanitizer port for Java fall here. Useful, but limited: many concurrency bugs occur in perfectly well-synchronised code that nonetheless has a logical ordering error. The second approach is exhaustive model checking: tools like Java PathFinder (JPF) try to enumerate all possible interleavings. Sound in theory, but in practice JPF cannot cope with contemporary Java; the OOPSLA paper notes that JPF threw internal errors on every one of the 2,655 real-world tests the authors attempted.
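
A minimal sketch of what “well-synchronised code with a logical ordering error” can look like. The class names are hypothetical and not taken from Fray or the paper. Every field access below happens under a lock, so a data-race detector has nothing to report, yet an increment can still be lost. The second class walks through the bad interleaving by hand on a single thread so the outcome is easy to follow.

```java
final class SynchronizedCounter {
    private int value;

    synchronized int get() { return value; }
    synchronized void set(int newValue) { value = newValue; }

    /** Each call is individually locked, but the read-modify-write as a whole is not atomic. */
    void incrementBroken() { set(get() + 1); }

    /** The whole read-modify-write happens under one lock. */
    synchronized void incrementSafe() { value++; }
}

final class LostUpdateDemo {
    /** Replays, on one thread, the interleaving in which two incrementBroken() calls overlap. */
    static int replayBadInterleaving() {
        SynchronizedCounter counter = new SynchronizedCounter();
        int seenByT1 = counter.get();   // T1 reads 0
        int seenByT2 = counter.get();   // T2 reads 0 before T1 has written
        counter.set(seenByT1 + 1);      // T1 writes 1
        counter.set(seenByT2 + 1);      // T2 also writes 1: T1's update is lost
        return counter.get();           // 1, although two increments ran
    }
}
```

This is an atomicity violation rather than a data race, which is the class of bug a scheduler-controlling tool is meant to expose.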

Fray takes a third path. Rather than detecting data races or exhaustively enumerating states, it performs controlled concurrency testing: it takes over thread scheduling inside the JVM at runtime and deliberately reruns the test many times, each time choosing a different interleaving using one of several search algorithms. It is probabilistic, not exhaustive, but the algorithms are designed to maximise the chance of hitting rare, bug-inducing orderings in a small number of iterations.

2. How Fray Works: Shadow Locking Explained

The central technical innovation in Fray is a mechanism called shadow locking. This is worth understanding because it is what allows Fray to work on real, production-grade code where previous tools consistently failed.

Earlier tools that tried to control thread scheduling on the JVM fell into one of two traps. Some replaced Java’s concurrency primitives — synchronized, ReentrantLock, CountDownLatch, and so on — with mock implementations that the tool could control. This sounds appealing, but it breaks in practice: the moment your code interacts with any third-party library or JDK class that uses its own synchronisation internally, the mocks stop being accurate models of what would actually happen in production. Others intercepted threads at the operating system level (like Mozilla’s rr), which is powerful but extremely heavyweight and difficult to run in a typical CI environment.

Shadow locking avoids both problems. Instead of replacing concurrency primitives, Fray instruments your bytecode to wrap them. Each synchronisation point — a lock acquisition, a thread start, a wait/notify call — gets an additional shadow lock injected around it. The shadow lock is always held initially by Fray’s own scheduler thread. When the scheduler decides it is time for thread T to proceed, it releases T’s shadow lock, allowing T to run to its next synchronisation point. In this way, Fray controls the ordering of every meaningful concurrency event in the program without changing the semantics of any individual primitive. The real lock still does exactly what it always did; Fray simply decides when each thread is allowed to reach it.

1. Bytecode instrumentation at load time: Fray intercepts class loading and injects shadow lock callbacks around every concurrency event — lock acquire, thread start, wait, notify, volatile read/write, and atomic operations.

2. Scheduler takes full control: A dedicated Fray scheduler thread holds all shadow locks at startup. Threads can only proceed when the scheduler releases their specific shadow lock, so at most one thread makes progress at any time.

3. Search algorithm steers the interleavings: At each scheduling decision point, the algorithm (Random, PCT, POS, or SURW) selects which thread to unblock next. Each iteration of the test explores a different schedule, widening coverage of the interleaving space.

4. Bug found: schedule is serialised and saved: When an assertion failure, uncaught exception, or deadlock occurs, Fray writes the exact sequence of scheduling decisions to a replay file. The bug can then be reproduced deterministically on demand.
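
The four steps above can be approximated in a toy model. This is not Fray’s implementation: there is no bytecode instrumentation, a java.util.concurrent.Semaphore stands in for each shadow lock, and the class and method names are invented for illustration. What it does show is the core idea: every worker blocks on its own gate, the scheduler opens one gate at a time, and the list of choices doubles as a replayable recording.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Semaphore;

final class ToyShadowScheduler {
    /**
     * Runs one worker per map entry; each worker performs the given number of "events".
     * The schedule names the worker allowed to perform its next event at each step, so it
     * must mention every worker exactly as many times as that worker has events.
     */
    static List<String> run(Map<String, Integer> eventsPerWorker, List<String> schedule) {
        Map<String, Semaphore> gates = new HashMap<>();   // one closed gate per worker
        Semaphore eventDone = new Semaphore(0);            // worker-to-scheduler hand-back
        List<String> trace = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();

        eventsPerWorker.forEach((name, events) -> {
            Semaphore gate = new Semaphore(0);             // zero permits: worker starts blocked
            gates.put(name, gate);
            workers.add(new Thread(() -> {
                for (int i = 0; i < events; i++) {
                    gate.acquireUninterruptibly();         // wait until the scheduler picks us
                    trace.add(name + "#" + i);             // stand-in for a real lock/notify/start
                    eventDone.release();                   // give control back to the scheduler
                }
            }, name));
        });

        workers.forEach(Thread::start);
        for (String next : schedule) {
            gates.get(next).release();                     // let exactly one worker take one step
            eventDone.acquireUninterruptibly();            // wait until that step has finished
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining workers", e);
            }
        }
        return new ArrayList<>(trace);
    }
}
```

Calling run(Map.of("A", 2, "B", 2), List.of("A", "B", "B", "A")) always yields the trace A#0, B#0, B#1, A#1. Feeding the same schedule back in reproduces the same order, which is the property the replay file relies on.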

Fray makes two key assumptions to back its guarantees: the target code should be data-race free (i.e., you are not writing to the same field from two threads without synchronisation), and external non-determinism such as randomness or networked I/O should be minimal. Both assumptions can be relaxed, and Fray will still run, but the guarantees weaken.

3. The Search Algorithms: More Than Random Luck

One of Fray’s practical strengths is that it is not limited to a single scheduling strategy. Because the search algorithm is cleanly separated from the concurrency control mechanism, Fray can plug in different algorithms depending on the kind of bug you are hunting. The current version ships with four, and adding a new one reportedly takes around 200 lines of code — the SURW algorithm, published at ASPLOS 2025, was integrated by one author in a single day.

Algorithm	How It Schedules	Best For	Guarantee
Random Walk	Uniformly picks any enabled thread at each step	General-purpose; good first pass	None, but surprisingly effective
PCT (probabilistic)	Assigns random priorities; lowers the running thread’s priority at d−1 randomly chosen points	Bugs of depth d (d ordering constraints)	P(bug) ≥ 1/(n·k^(d−1)) per iteration, for n threads and k steps
POS (best performer)	Reassigns random priorities whenever a thread competes for a resource	Atomicity violations, order violations	Probabilistic; exposed bugs in 363 tests in the evaluation
SURW (newest)	Weights threads by number of “interesting” events remaining	Bugs near specific synchronisation points	Selectively uniform coverage
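
To make the priority-based row more concrete, here is a simplified sketch of a PCT-style scheduler. It is not Fray’s code. It assumes every thread stays enabled until it runs out of steps (no blocking), takes the step counts as input instead of discovering them, and uses hypothetical names throughout. The highest-priority thread always runs, and at depth−1 random steps the running thread is demoted below everyone else.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class PctSketch {
    /** Returns the order in which thread indices run; stepsPerThread[i] is thread i's step count. */
    static List<Integer> schedule(int[] stepsPerThread, int depth, long seed) {
        int n = stepsPerThread.length;
        int totalSteps = 0;
        for (int steps : stepsPerThread) totalSteps += steps;
        List<Integer> order = new ArrayList<>();
        if (totalSteps == 0) return order;

        Random random = new Random(seed);

        // Initial priorities: a random permutation of depth .. depth+n-1 (higher runs first).
        List<Integer> shuffled = new ArrayList<>();
        for (int i = 0; i < n; i++) shuffled.add(depth + i);
        Collections.shuffle(shuffled, random);
        int[] priority = new int[n];
        for (int i = 0; i < n; i++) priority[i] = shuffled.get(i);

        // depth-1 change points at random steps; change point i assigns the low priority depth-i.
        Map<Integer, Integer> lowPriorityAtStep = new HashMap<>();
        for (int i = 1; i < depth; i++) {
            lowPriorityAtStep.put(1 + random.nextInt(totalSteps), depth - i);
        }

        int[] remaining = stepsPerThread.clone();
        for (int step = 1; step <= totalSteps; step++) {
            int chosen = -1;
            for (int t = 0; t < n; t++) {   // pick the unfinished thread with the highest priority
                if (remaining[t] > 0 && (chosen < 0 || priority[t] > priority[chosen])) chosen = t;
            }
            order.add(chosen);
            remaining[chosen]--;
            Integer lowered = lowPriorityAtStep.get(step);
            if (lowered != null) priority[chosen] = lowered;   // demote at a change point
        }
        return order;
    }
}
```

With depth 1 there are no change points, so each thread simply runs to completion in priority order. Larger depths add preemptions at random places, which is what lets a priority-based search reach orderings that plain sequential runs never produce.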

In practice, POS — Partial Order Sampling — is the standout algorithm. In the OOPSLA evaluation, Fray running POS found reproducible bugs in 363 tests across Kafka, Lucene, and Guava, requiring an average of just 190 iterations to identify each one. That is a meaningful number: 190 test executions taking perhaps a few seconds each adds up to minutes of CI time, not hours.

[Figure: Benchmark evaluation on SCTBench and JaConTeBe (53 programs). Source: OOPSLA 2025 paper, Table 3. Percentages represent the proportion of the 53 known-bug benchmarks where each tool detected the bug within a fixed iteration budget.]

4. Real Bugs Found in Real Projects

Numbers in papers can feel abstract. What makes Fray’s results particularly striking is where it found bugs: not in toy programs, but in Apache Kafka, Apache Lucene, and Google Guava — three of the most actively maintained and thoroughly tested open-source Java projects in existence. Fray successfully discovered 18 real-world concurrency bugs that can cause 371 of the existing tests to fail under specific interleavings.

The team reported all 18 bugs to the respective project maintainers with detailed reproduction instructions. At the time of the paper’s publication, 11 had been confirmed and 7 had already been fixed. The breakdown by bug type is revealing: six were atomicity violations (a sequence of operations that must happen together being interleaved), five were order violations (a dependency on a specific thread ordering that was never enforced), five were thread leaks (threads that never terminated under certain interleavings), one involved an unhandled spurious wakeup, and one was still under investigation.

AWS Labs published a blog post describing how Fray helped them find and diagnose concurrency bugs in Apache Lucene by running existing off-the-shelf unit tests with Fray’s POS algorithm — no new test code required. This is precisely the tool’s intended use case: point it at tests you already have, and let it find what standard execution misses.

[Figure: Real-world evaluation on Kafka, Lucene, and Guava. Bug types discovered across the 18 reported concurrency bugs. Source: OOPSLA 2025 paper, Section 5.3.]

5. Getting Started: Integration in Five Minutes

Fray is designed for low-friction adoption. If you are already using JUnit 5, the integration is a two-step annotation change. You do not need to rewrite your tests or mock out your threads — Fray wraps the existing test execution transparently.

Gradle setup

Add the Fray plugin to your build.gradle or build.gradle.kts file:

plugins {
    id("org.pastalab.fray.gradle") version "0.7.3"
}

Then add the JUnit integration dependency to your test scope:

dependencies {
    testImplementation("org.pastalab.fray:fray-junit:0.7.3")
}

Maven setup

<plugin>
  <groupId>org.pastalab.fray.maven</groupId>
  <artifactId>fray-plugins-maven</artifactId>
  <version>0.7.3</version>
  <executions>
    <execution>
      <id>prepare-fray</id>
      <goals><goal>prepare-fray</goal></goals>
    </execution>
  </executions>
</plugin>

<dependency>
  <groupId>org.pastalab.fray</groupId>
  <artifactId>fray-junit</artifactId>
  <version>0.7.3</version>
  <scope>test</scope>
</dependency>

Annotating a JUnit 5 test

Mark your existing test class with @ExtendWith(FrayTestExtension.class) and the specific test methods you want Fray to analyse with @ConcurrencyTest. Fray will run each annotated method multiple times, varying the thread schedule on each iteration:

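import org.junit.jupiter.api.extension.ExtendWith;
import org.pastalab.fray.junit.junit5.FrayTestExtension;
import org.pastalab.fray.junit.junit5.annotations.ConcurrencyTest;

import static org.junit.jupiter.api.Assertions.assertEquals;

@ExtendWith(FrayTestExtension.class)
public class AccountTransferTest {

    // Fray reruns this method up to 200 times, each under a different thread schedule.
    @ConcurrencyTest(iterations = 200)
    public void transferShouldNeverLoseMoney() throws InterruptedException {
        Account a = new Account(100);
        Account b = new Account(100);

        // Two opposing transfers that contend for the same pair of accounts.
        Thread t1 = new Thread(() -> a.transferTo(b, 50));
        Thread t2 = new Thread(() -> b.transferTo(a, 30));

        t1.start();
        t2.start();
        t1.join(); // join() throws InterruptedException, hence the throws clause
        t2.join();

        // JUnit assertion rather than the assert keyword, which is skipped unless -ea is set.
        assertEquals(200, a.balance() + b.balance());
    }
}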
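
The Account class is not shown in the test above. As an illustration only, here is a hypothetical version with a classic lock-ordering flaw. It is an assumption made for this example, not code from Fray or from any of the evaluated projects. Run sequentially it behaves correctly, which is why ordinary test runs tend to pass.

```java
final class Account {
    private int balance;

    Account(int openingBalance) { this.balance = openingBalance; }

    synchronized int balance() { return balance; }

    /**
     * Moves money while holding both account locks. The locks are taken in
     * "this, then target" order, so a.transferTo(b, ..) and b.transferTo(a, ..)
     * running concurrently can each take their own lock and then wait forever
     * for the other one.
     */
    void transferTo(Account target, int amount) {
        synchronized (this) {
            synchronized (target) {
                this.balance -= amount;
                target.balance += amount;
            }
        }
    }
}
```

The failing interleaving needs t1 to hold a’s lock and t2 to hold b’s lock at the same moment, which is a deadlock rather than a wrong total. A common fix is to acquire the two locks in a fixed global order, for example by a unique account id.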

Replaying a failure

When Fray finds a bug, it writes a recording file into the report folder — typically target/fray/fray-report/ for Maven builds. To replay that exact failure on demand, pass the recording path back to the annotation:

@ConcurrencyTest(
    replay = "target/fray/fray-report/recording"
)

Running the test now will reproduce the failing interleaving every single time, giving you a stable target to debug against. This is, in many ways, the most practically valuable part of Fray: not just finding the bug, but eliminating the “it only happens sometimes” excuse entirely.

Fray also ships an IntelliJ IDEA debugger plugin that can load a replay file and step through the recorded thread interleaving inside the IDE, showing you exactly which thread was running at each point. For teams that prefer visual debugging over log analysis, this is worth exploring separately.

6. Virtual Threads and Why Fray Matters Right Now

Virtual threads — introduced as a preview in JDK 19 and stabilised in JDK 21 — have been one of the most practically impactful Java features in years. They make it cheap to have thousands of concurrent tasks in flight simultaneously, which is great for throughput but subtly dangerous for correctness. When your code was running on a handful of platform threads, certain timing-dependent bugs simply never had the opportunity to manifest — the scheduler just never hit the unlucky ordering. Move that same code to virtual threads and suddenly you have hundreds more threads competing for the same resources, and your latent concurrency bugs start appearing in tests for the first time.
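
A small sketch of that effect, assuming JDK 21 or later. The class name is hypothetical. Both counters are AtomicInteger instances, so there is no data race, but the first one is updated with a separate read and write, and with enough virtual threads in flight some of those updates can overwrite each other.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class VirtualThreadContention {
    /** Returns {brokenTotal, safeTotal} after `tasks` virtual threads each increment both counters once. */
    static int[] run(int tasks) {
        AtomicInteger broken = new AtomicInteger();
        AtomicInteger safe = new AtomicInteger();
        // One virtual thread per task; close() waits for all submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    broken.set(broken.get() + 1);  // read, then write: another thread can slip in between
                    safe.incrementAndGet();        // single atomic read-modify-write
                });
            }
        }
        return new int[] { broken.get(), safe.get() };
    }
}
```

The safe total always equals the number of tasks. The broken total may or may not fall short on any given run, which is exactly the kind of intermittent result that is hard to reproduce without a controlled scheduler.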

This is precisely the scenario where Fray shines. Fray supports JDK versions up to 25, works with virtual threads, and does not require you to know in advance which part of your code is broken. You point it at an existing test, tell it to run 200 iterations with POS, and it will systematically explore the interleaving space that your single-run CI pass never touches. For teams that made the move to virtual threads and suddenly started seeing intermittent test failures they cannot reproduce, Fray is the most direct path to an answer.

7. How Fray Compares to the Alternatives

It is worth being clear about where Fray sits relative to other tools teams might already know, because there is meaningful overlap in what these tools promise but important differences in what they deliver.

Tool	Approach	Modern Java Support	Replay Bugs?	Finds Logical Races?
Fray (recommended)	Bytecode instrumentation + shadow locking	JDK 11–25 ✓	Yes, deterministic	Yes: atomicity and ordering
JPF	Custom JVM (VM hacking)	Fails on modern JDK	Partial	Yes
rr + chaos	OS-level record & replay	Linux only	Yes	Limited by OS scheduling granularity
Lincheck	Concurrency primitive mocking	Active ✓	Partial	Limited: breaks with third-party libs
ThreadSanitizer	Data race detection	JVM port exists	No	No: data races only

The honest caveat is that Fray is not a silver bullet. It does not find data races (for that, use a race detector). It does not do exhaustive state-space exploration — there is no guarantee it will find every possible bug in a finite run. And its assumption that the target code is data-race free means that if your code has unsynchronised memory access, the results might be unsound. For most production Java code that uses proper synchronisation, however, these limitations are rarely the bottleneck. The bugs that matter most are the logical concurrency errors — the ones that pass code review, compile cleanly, and only show up at scale — and those are exactly what Fray is designed to find.

8. What We Have Learned

Concurrency bugs have always been the category of failure that teams learn to live with rather than fix, because the tools to reproduce them reliably simply did not exist for modern Java. Fray changes that. By using shadow locking to take control of thread scheduling at the bytecode level, without replacing or mocking any concurrency primitives, the CMU PASTA lab has built something that previous tools only promised: a concurrency tester that actually runs on contemporary Java, works with existing JUnit tests, finds significantly more bugs than both JPF and rr’s chaos mode, and produces a deterministic replay file the moment it finds something. The empirical results speak plainly: 18 real-world bugs reported in Kafka, Lucene, and Guava (11 of them confirmed by maintainers at publication time), and 371 tests shown to fail under specific interleavings, all from projects with extensive existing test suites. If your team moved to virtual threads in JDK 21 and started noticing test instability you cannot pin down, adding @ConcurrencyTest to your suspicious tests and running 200 iterations with POS is now a legitimate, low-effort first step.
