# Experiment 088: SQLITE_ENABLE_SETLK_TIMEOUT — kernel-blocking WAL waits
Date: 2026-04-20
Status: Rejected (multi-run confirmation, downgraded from Accepted)
Archive: archive/exp-088-setlk-timeout
> Numbering note: originally authored as "exp 084," renumbered to 088
> to avoid collision with the concurrently-developed
> exp-084-columnar-redux branch. Numbers 083, 085, 086 are already
> claimed by other in-flight branches; 087 was used by a parallel
> VM-pragma audit. 088 was the next free slot on the remote at the
> time of the rename.
## Problem
Exp 080 (dispatch budget) surfaced a 34 ms single-insert outlier in the
p99/max tail (2000× the 16 μs median). The prime suspect was a passive
WAL checkpoint firing mid-insert and blocking the writer. SQLite's
default handling under lock contention in WAL mode is to call
sqlite3WalBusy / the busy_timeout retry loop, which spin-retries with
sqlite3_sleep-style backoff. That loop keeps the thread runnable, so
the OS scheduler still races it against the checkpointer and produces
large tails when things go wrong.
The research pass research-2026-04-20-peer-and-platform-survey.md §T1
flagged sqlite3_setlk_timeout() — new in SQLite 3.50.0 — as a
low-touch way to turn that polling retry into a kernel blocking wait on
VFSes that support blocking locks (natively, OFD locks on Linux; see
the platform note below). The kernel then wakes the waiter immediately
when the checkpointer releases the lock, instead of waiting for the
next poll tick.
We ship SQLite 3.51.3 via sqlite3mc_amalgamation, which already
contains the API. The feature is gated behind a compile flag
(SQLITE_ENABLE_SETLK_TIMEOUT), and has to be opted into per connection
via a runtime call.
## Hypothesis
Enabling SQLITE_ENABLE_SETLK_TIMEOUT and calling
sqlite3_setlk_timeout(db, 10000, 0) after every successful
sqlite3_open_v2 should:
- leave p50 untouched (no lock contention in the happy path),
- meaningfully cut p99 and max on write-heavy workloads where the
checkpointer can collide with an in-flight write, and
- also improve reader p99 because readers block in the kernel when they
hit a checkpointer-held lock instead of spinning through
busy_timeout.
We expected the single_insert 34 ms outlier (and similar outliers on
merge_rounds / noop) to collapse toward the p90 band.
## Approach
Three small edits, gated and ordered to preserve the existing
busy_timeout as a graceful fallback:
- hook/build.dart — added 'SQLITE_ENABLE_SETLK_TIMEOUT': null, to the
  defines: map alongside the other SQLITE_ENABLE_* flags.
- native/resqlite.c — inside open_connection, after
  PRAGMA busy_timeout = 5000, added:
```c
#ifdef SQLITE_ENABLE_SETLK_TIMEOUT
  int setlk_rc = sqlite3_setlk_timeout(db, 10000, 0);
  if (setlk_rc != SQLITE_OK) {
    fprintf(stderr,
            "resqlite: sqlite3_setlk_timeout returned %d (non-fatal)\n",
            setlk_rc);
  }
#endif
```
The call applies to both the writer and every reader connection
because they all flow through open_connection. We intentionally do
not pass SQLITE_SETLK_BLOCK_ON_CONNECT — we want eligible
in-transaction lock waits to block, not connection-open waits on the
checkpoint that runs when the last other connection closes.
- No FFI binding change — the call lives entirely in C, so
lib/src/native/resqlite_bindings.dart and the _exportedSymbols
list in hook/build.dart stay untouched.
- busy_timeout = 5000 remains in place, so on VFSes that don't support
  blocking locks the call is a documented no-op and behaviour is
  unchanged.
## Results

### Multi-run confirmation (5 runs per side, medians of percentiles)
The initial single-run numbers (§ "Initial single-run results" below) showed
huge p99/max wins on every write workload. Single-run p99/max is noisy — one
GC pause or scheduler hiccup can swing it by an order of magnitude — so we
reran 5× per variant and took the median of each percentile across the 5
runs on each side. This is the methodology referenced in
benchmark/EXPERIMENTS.md § "Three things to keep in mind" point 1.
Baseline glob: benchmark/profile/results/baseline-exp088-run{1..5}.json
Candidate glob: benchmark/profile/results/exp-088-run{1..5}.json
Aggregator: benchmark/profile/diff_multirun.dart
Median-across-runs, baseline vs candidate:
| workload · op | metric | baseline median | candidate median | Δ | Δ% | baseline CV |
|---|---|---|---|---|---|---|
| single_insert · execute | p50 | 17 μs | 18 μs | +1 μs | +5.9% | 2.4% |
| single_insert · execute | p90 | 23 μs | 31 μs | +8 μs | +34.8% | 7.0% |
| single_insert · execute | p99 | 67 μs | 182 μs | +115 μs | +171.6% | 22.2% |
| single_insert · execute | max | 1246 μs | 3918 μs | +2672 μs | +214.4% | 22.8% |
| single_insert · execute | work | 6 μs | 6 μs | +0 μs | +0.0% | 14.9% |
| merge_rounds · executeBatch | p50 | 105 μs | 109 μs | +4 μs | +3.8% | 0.9% |
| merge_rounds · executeBatch | p90 | 121 μs | 141 μs | +20 μs | +16.5% | 5.5% |
| merge_rounds · executeBatch | p99 | 294 μs | 381 μs | +87 μs | +29.6% | 17.4% |
| merge_rounds · executeBatch | max | 759 μs | 980 μs | +221 μs | +29.1% | 18.8% |
| merge_rounds · executeBatch | work | 95 μs | 97 μs | +2 μs | +2.1% | 1.2% |
| noop · execute | p50 | 11 μs | 12 μs | +1 μs | +9.1% | 6.9% |
| noop · execute | p90 | 20 μs | 22 μs | +2 μs | +10.0% | 6.6% |
| noop · execute | p99 | 65 μs | 128 μs | +63 μs | +96.9% | 16.8% |
| noop · execute | max | 605 μs | 3807 μs | +3202 μs | +529.3% | 84.0% |
| noop · select | p50 | 7 μs | 7 μs | +0 μs | +0.0% | 7.4% |
| noop · select | p90 | 15 μs | 16 μs | +1 μs | +6.7% | 9.7% |
| noop · select | p99 | 48 μs | 98 μs | +50 μs | +104.2% | 13.8% |
| noop · select | max | 970 μs | 5757 μs | +4787 μs | +493.5% | 143.7% |
| point_query · select | p50 | 7 μs | 7 μs | +0 μs | +0.0% | 5.6% |
| point_query · select | p90 | 11 μs | 12 μs | +1 μs | +9.1% | 5.7% |
| point_query · select | p99 | 27 μs | 47 μs | +20 μs | +74.1% | 20.6% |
| point_query · select | max | 565 μs | 1944 μs | +1379 μs | +244.1% | 117.6% |
CV = coefficient of variation of the baseline's 5 per-run values for that
metric. Lower CV means the metric is stable run-to-run; higher CV means a
single-run number on that metric is largely noise. The max CVs on the
baseline range from 22.8 % (single_insert) to 143.7 % (noop select), which
tells you single-run max is effectively unusable as a signal on this bench
— one bad run dominates everything. p99 is less volatile but still noisy
(13–23 % on writes, up to 20 % on point_query).
Findings — the single-run wins do not replicate:
- p50 is unchanged to very slightly worse on every workload (+0 to
+1 μs, within one-tick noise of the dispatch floor). Consistent with the
hypothesis that p50 is not lock-bound — also consistent with p50 being
the one metric where setlk_timeout shouldn't matter.
- p90 ranges from flat to ~+35 % worse on write workloads. The
single-run numbers claimed p90 wins of -24 % to -64 %; neither
direction's magnitude is defensible at the multi-run median.
- p99 is 74 – 172 % WORSE at the median, the opposite of the
single-run claim. Candidate p99 on single_insert is 182 μs vs baseline
67 μs. The original baseline's 678 μs p99 and 116 409 μs max never
reappear in any of the 5 re-runs (baseline max range: 1104–1985 μs) —
that 116 ms baseline outlier was a single bad measurement, not a
property of the code under test.
- max is 29 – 529 % WORSE at the median, though max CV is so high
(up to 144 %) that the median isn't meaningful — we mostly learn that
the candidate does not reliably suppress tail spikes.
- The single-run point_query read was only half noise: single-run
showed +25 % p50 / +12.7 % max, which we dismissed as within noise;
multi-run shows +0 % p50 but a persistent +74 % p99 / +244 % max. The
p50 part was indeed noise; the p99/max regression was real, just the
opposite sign of what the hypothesis predicted.
- noop · execute max went from baseline 605 μs median to candidate
3807 μs median — a catastrophic regression on the no-op path, where
setlk_timeout should be a pure no-op. This suggests the call is
actually doing something and it's not good.
Per-run p99 variability (coefficient of variation across the 5 baseline
runs) is 13.8 – 22.2 % on the write workloads. That alone means a one-run
p99 delta smaller than ~25 % is indistinguishable from noise — and the
single-run deltas we originally reported (−21 % to −81 % on p99) sit
well inside that noise envelope on the candidate side too.
### Initial single-run results (kept for historical reference)
Original single-run numbers from
benchmark/profile/results/baseline-exp088.json vs exp-088.json:
    ## merge_rounds
    TIME executeBatch:
      p50    125μs →   119μs      -6μs  (-4.8%)
      p90    663μs →   501μs    -162μs (-24.4%)
      p99   3366μs →  2661μs    -705μs (-20.9%)
      max  44588μs →  9761μs  -34827μs (-78.1%)

    ## noop
    TIME execute:
      p50     13μs →    11μs      -2μs (-15.4%)
      p99   1619μs →   418μs   -1201μs (-74.2%)
      max 319996μs → 12888μs -307108μs (-96.0%)
    TIME select:
      p50      8μs →     6μs      -2μs (-25.0%)
      p99   1021μs →   230μs    -791μs (-77.5%)
      max 203748μs → 29901μs -173847μs (-85.3%)

    ## point_query
    TIME select:
      p50      8μs →    10μs      +2μs (+25.0%)
      p99    157μs →   130μs     -27μs (-17.2%)
      max   4664μs →  5258μs    +594μs (+12.7%)

    ## single_insert
    TIME execute:
      p50     17μs →    18μs      +1μs  (+5.9%)
      p99    678μs →   131μs    -547μs (-80.7%)
      max 116409μs →  3322μs -113087μs (-97.1%)

These numbers look excellent in isolation but do not survive 5× re-runs.
The baseline outliers (678 μs p99 / 116 ms max on single_insert; 320 ms
max on noop execute; 204 ms max on noop select) do not appear in any of
the 5 baseline re-runs, meaning the original single-run baseline captured
a transient bad state (cold cache, first-run JIT, unrelated macOS
scheduling) that we mistook for a property of main. The candidate p99/max
on the 5 re-runs land worse than the baseline p99/max, which is the
opposite of what the single run suggested.
## Decision
Rejected. Downgraded from "Accepted" after multi-run confirmation.
The large single-run p99/max wins were noise — the original baseline
captured a transient outlier that never reproduced in 5 re-runs. At the
median-of-5-runs level, the candidate is measurably worse than baseline
on every tail metric on every workload: p99 +74 % to +172 % worse on
writes, max +29 % to +529 % worse, and noop regresses even though it has
no WAL lock contention to resolve.
The fact that noop · execute and noop · select regress at p99/max is
the strongest evidence this is not a neutral change. These workloads
have no concurrent writer-plus-checkpointer contention for setlk_timeout
to suppress, so the flag should be an observationally-invisible no-op.
It isn't.
## Root cause (verified in SQLite source)
Enabling SQLITE_ENABLE_SETLK_TIMEOUT is not just "add a timeout
option to an otherwise-unchanged lock path." It changes the SHM
locking architecture itself. From the amalgamation source in
third_party/sqlite3mc/sqlite3mc_amalgamation.c around line 43893:
> Normally, when SQLITE_ENABLE_SETLK_TIMEOUT is not defined, mutex
> pShmMutex is used to protect the aLock[] array and the right to
> call fcntl() on unixShmNode.hShm to obtain or release locks.
>
> If SQLITE_ENABLE_SETLK_TIMEOUT is defined though, we use an array
> of mutexes — one for each locking slot. To read or write locking
> slot aLock[iSlot], the caller must hold the corresponding mutex
> aMutex[iSlot]. Similarly, to call fcntl() to obtain or release
> a lock corresponding to slot iSlot, mutex aMutex[iSlot] must be
> held.
Concretely, per WAL lock operation:
| | Flag off (baseline) | Flag on (candidate) |
|---|---|---|
| SHM mutex structure | 1 shared pShmMutex | SQLITE_SHM_NLOCK (= 8) per-slot mutexes |
| fcntl syscall | F_SETLK (non-blocking) + busy-retry | F_SETLKW (blocking with timeout bookkeeping) |
| Per-lock state | none | iBusyTimeout read/saved/restored per op |
| unixShmNode struct size | smaller | SQLITE_SHM_NLOCK extra mutex pointers |
This fully explains the noop regression. An UPDATE ... WHERE 1 = 0
still runs a WAL commit, which acquires the WAL-index SHM write lock
even with zero frames to append. In the baseline, that is one
pShmMutex lock + one non-blocking F_SETLK + release — two
lightweight ops. In the candidate, the same commit goes through the
per-slot mutex array and issues F_SETLKW, which is a heavier syscall
path even when the lock is immediately grantable, plus the
iBusyTimeout save/restore dance around it. The cost is paid on
every write, contended or not. The regression shape (tens of μs added
to p99 of every write workload, matching noop) matches this
prediction exactly.
Readers pay the same kind of cost on their WAL read-lock slot
registration, which is why noop · select also regressed.
## Platform note
The feature was designed primarily for Linux with F_OFD_SETLKW (OFD
locks + itimer-backed timeout), where the blocking path is native and
cheap. macOS does not support F_OFD_* locks — it falls back to
POSIX F_SETLKW, which is heavier. So the same experiment on
Android/Linux with heavy concurrent readers plus active checkpointing
could still be a win — but on macOS with a single-writer
bench, the per-op cost dominates any tail benefit.
## Before revisiting
Prerequisites for a re-attempt:
- A proper concurrent-reader + writer + checkpointer workload. The
profile harness is effectively single-isolate-at-a-time today, so
the scenario the flag targets is never exercised.
- A runtime VFS probe (sqlite3_vfs_find + capability check) so we
  can gate the call on platforms where blocking locks are native.
- A much shorter timeout (e.g. 50–100 ms) so that if anything does
block, the worst case is bounded below busy_timeout (5 s) rather
than well above it.
Source edits are kept in the branch but not merged. The +1 μs p50
cost would be tolerable on its own, but the tail regressions — plus
the mechanism above that reproduces on non-contended workloads —
make this a clear reject at the multi-run median.
## Protocol lessons
Two items worth carrying forward to future experiments:
- A compile flag that appears to "enable an optional runtime call"
may restructure internal data structures. Always read the
implementation to see what the flag actually changes inside the
library, not just what the advertised feature does. One
#ifdef block can swap a global mutex for an N-mutex array.
- Noop regressions are the cleanest tell. When a candidate
regresses on a workload that has no mechanism to benefit from it,
the per-op cost is real and any apparent "win" elsewhere is
likely either noise or a coincident second-order effect. If noop
regresses, reject or investigate before trusting any gains.
## Reproducing
Single-run comparison (fast, but do not trust the p99/max numbers — see
the confirmation section above for why):
    git checkout exp-088-setlk-timeout
    dart run -DRESQLITE_PROFILE=true benchmark/run_profile.dart \
      --out=benchmark/profile/results/exp-088.json
    dart run benchmark/profile/diff.dart \
      benchmark/profile/results/baseline-exp088.json \
      benchmark/profile/results/exp-088.json

Multi-run confirmation (what the "Rejected" decision is based on) — run
the candidate branch for the 5 candidate JSONs, then check out the
parent and re-run for the 5 baseline JSONs (or revert
hook/build.dart + native/resqlite.c for the baseline runs without
changing branches):
    # Candidate (branch HEAD)
    for i in 1 2 3 4 5; do
      dart run -DRESQLITE_PROFILE=true benchmark/run_profile.dart \
        --out=benchmark/profile/results/exp-088-run${i}.json
    done

    # Baseline (checkout or revert source files)
    git checkout HEAD~1 -- hook/build.dart native/resqlite.c
    for i in 1 2 3 4 5; do
      dart run -DRESQLITE_PROFILE=true benchmark/run_profile.dart \
        --out=benchmark/profile/results/baseline-exp088-run${i}.json
    done
    git checkout HEAD -- hook/build.dart native/resqlite.c

    # Aggregate medians across runs
    dart run benchmark/profile/diff_multirun.dart \
      --baseline='benchmark/profile/results/baseline-exp088-run*.json' \
      --candidate='benchmark/profile/results/exp-088-run*.json'