Experiment 159: Writer request pipelining + persistent reply port

Date: 2026-06-09

Status: In Review

Direction: stream-rerun-dispatch

Problem

Exp 147 split writer-side burst wall and found SQLite-facing calls are a

minority of write cost (9.4% on A11c overlap, 18.1% keyed-PK); after

subtracting stream invalidation, residual writer/request wall is the

largest bucket (71.8% overlap, 63.3% keyed-PK, 73.4% even with zero

streams). Exp 148 showed that batching reader replies reduces callback

counters without moving measured elapsed. A closed residual-split branch

(PR #143, not merged) subdivided the residual further and concluded that

no single writer-side sub-bucket clears the decision threshold — "future

stream-dispatch work needs a workload-shape or scheduling-model change

rather than another sub-bucket optimization."

This experiment is that scheduling-model change. Three structural costs

sat on every write round-trip:

  1. A fresh RawReceivePort per request (Writer._request): one port

allocate/register/teardown cycle per write. The reader pool removed its

equivalent in exp 040; the writer never did.

  2. A guaranteed microtask hop before every send: _request awaited

_workerPort.future — a future that resolved once at spawn and never

again — so every send deferred to the microtask queue before reaching

SendPort.send.

  3. Async completers on the reply path: the reader pool switched its

per-worker completers to Completer.sync (exp 032); the writer's reply

path still paid the extra scheduling hops.
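
Costs 2 and 3 can be reproduced outside the writer. The following is a minimal sketch, not the writer's code: alreadyResolved stands in for _workerPort.future, and the function names are made up for illustration. It only relies on two documented dart:async behaviors: await always yields, even on a completed future, and Completer.sync runs listeners inside complete().

```dart
import 'dart:async';

/// Awaiting an already-completed future still defers the continuation to the
/// microtask queue, so the "send" behind the await runs after a direct one.
Future<List<String>> awaitHopOrder() async {
  final log = <String>[];
  // Stand-in (assumption) for a future that resolved once at spawn.
  final alreadyResolved = Future<int>.value(0);

  Future<void> sendAfterAwait() async {
    await alreadyResolved; // always yields, even though the value is ready
    log.add('send after await');
  }

  final pending = sendAfterAwait();
  log.add('send direct'); // no await in front of it, so it runs first
  await pending;
  return log;
}

/// A listener on Completer.sync runs inside complete(); a listener on a plain
/// Completer runs in a later microtask.
Future<List<String>> completerOrder() async {
  final log = <String>[];
  final plain = Completer<void>();
  final sync = Completer<void>.sync();
  plain.future.then((_) => log.add('plain listener'));
  sync.future.then((_) => log.add('sync listener'));
  plain.complete();
  sync.complete();
  log.add('after complete()');
  await plain.future; // let the deferred listener run
  return log;
}

Future<void> main() async {
  print(await awaitHopOrder());
  // [send direct, send after await]
  print(await completerOrder());
  // [sync listener, after complete(), plain listener]
}
```

Completer.sync is only safe when complete() is called from a place where running caller code immediately is acceptable, such as the tail of a port handler.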

Separately, the write lock was held across the entire round-trip for

standalone writes, so concurrent db.execute() callers fully serialized:

send → worker exec → reply → main-isolate resume → unlock → next send.

The worker idled between requests even when more writes were queued on

the main isolate.

Hypothesis

Caching the worker SendPort, replacing per-request reply ports with one

persistent port + FIFO completer queue, and completing synchronously

removes fixed scheduling overhead from every write. Releasing the write

lock after the send (instead of after the reply) lets concurrent

standalone writes pipeline through the worker's port FIFO, overlapping

worker-side execution with main-isolate reply processing.

Safety argument: the writer isolate processes its port in FIFO order and

sends exactly one reply per request (handlers reply or throw; the

entrypoint converts throws into replies), so a FIFO queue of completers

matches replies to callers. A standalone write sent before a BEGIN is

fully processed at txDepth == 0 — including its dirty-set harvest —

before the transaction opens, so holding the lock across the write's

reply adds no exclusion a later BEGIN actually needs. Transactions still

hold the lock from BEGIN through COMMIT/ROLLBACK, so no standalone write

can be sent into an open transaction.
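
The FIFO matching argument can be sketched in isolation. Everything named below (PipelinedWriter, workerMain, the ['ok', ...] / ['err', ...] reply shape, the 'boom' request) is hypothetical and chosen for illustration; it is not the repo's Writer. The sketch assumes only what the safety argument states: the worker handles its port in order and sends exactly one reply per request, converting throws into replies.

```dart
import 'dart:async';
import 'dart:collection';
import 'dart:isolate';

/// Hypothetical worker: handles requests in arrival order and always sends
/// exactly one reply, turning a throw into an error reply.
void workerMain(SendPort replyTo) {
  final inbox = RawReceivePort();
  inbox.handler = (Object? msg) {
    if (msg == null) {
      inbox.close(); // shutdown signal
      return;
    }
    try {
      if (msg == 'boom') throw StateError('bad request');
      replyTo.send(['ok', msg]);
    } catch (e) {
      replyTo.send(['err', '$e']);
    }
  };
  replyTo.send(inbox.sendPort); // handshake: hand the request port back
}

class PipelinedWriter {
  PipelinedWriter._(this._worker, this._replies, this._pending);

  final SendPort _worker; // cached once, so no await before send
  final RawReceivePort _replies; // one persistent reply port
  final Queue<Completer<Object?>> _pending; // FIFO of waiting callers

  static Future<PipelinedWriter> spawn() async {
    final pending = Queue<Completer<Object?>>();
    final handshake = Completer<SendPort>();
    final replies = RawReceivePort();
    replies.handler = (Object? msg) {
      if (msg is SendPort) {
        handshake.complete(msg);
        return;
      }
      final reply = msg as List;
      // Replies arrive in request order, so the head of the queue owns this one.
      final caller = pending.removeFirst();
      if (reply[0] == 'ok') {
        caller.complete(reply[1]);
      } else {
        caller.completeError(StateError(reply[1] as String));
      }
    };
    await Isolate.spawn(workerMain, replies.sendPort);
    return PipelinedWriter._(await handshake.future, replies, pending);
  }

  Future<Object?> request(Object msg) {
    final completer = Completer<Object?>.sync();
    _pending.add(completer); // enqueue before send so ordering matches
    _worker.send(msg);
    return completer.future;
  }

  void close() {
    _worker.send(null);
    _replies.close();
  }
}

Future<void> main() async {
  final writer = await PipelinedWriter.spawn();
  // Three requests in flight at once; the middle one fails on the worker.
  final results = await Future.wait([
    for (final msg in ['a', 'boom', 'c'])
      writer
          .request(msg)
          .then<Object?>((v) => v, onError: (Object e) => 'error'),
  ]);
  print(results); // [a, error, c]
  writer.close();
}
```

The error reply consumes only its own queue slot, so the request behind it still resolves with its own result.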

Approach

(no awaits before SendPort.send).

pending queue; completers are Completer.sync (reader-pool pattern).

Database.execute / Database.executeBatch call) acquire the mutex,

send, release in finally, and await the reply outside the lock. The

raw lock-held sends are named executeInTransaction /

executeBatchInTransaction / selectInTransaction — used by

Transaction.*, which runs under the enclosing transaction's lock —

and carry assert(_mutex.isLocked) so calling one without the lock

fails loudly in debug builds. Database.transaction keeps locked()

across the full transaction.

isolated; an error reply consumes exactly its own FIFO slot.

with sequential-awaited, concurrent-burst, and transaction-guardrail

shapes — the burst shape is the first benchmark in the repo where a

caller has more than one standalone write outstanding.
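
The lock-scope change described in this list (acquire, send, release in finally, await the reply outside the lock) can be sketched as follows. SimpleMutex, fakeSend, and the 5 ms reply delay are assumptions made for illustration; the repo's mutex and transport are not shown here.

```dart
import 'dart:async';
import 'dart:collection';

/// Hypothetical minimal async mutex; the real mutex may differ.
class SimpleMutex {
  bool _locked = false;
  final Queue<Completer<void>> _waiters = Queue();

  bool get isLocked => _locked;

  Future<void> acquire() {
    if (!_locked) {
      _locked = true;
      return Future<void>.value();
    }
    final waiter = Completer<void>();
    _waiters.add(waiter);
    return waiter.future;
  }

  void release() {
    assert(_locked);
    if (_waiters.isEmpty) {
      _locked = false;
    } else {
      _waiters.removeFirst().complete(); // hand the lock to the next waiter
    }
  }
}

/// Stand-in for the worker round-trip: logs the send, replies 5 ms later.
Future<String> fakeSend(List<String> log, String sql) {
  log.add('send $sql');
  return Future<String>.delayed(const Duration(milliseconds: 5), () {
    log.add('reply $sql');
    return sql;
  });
}

/// Old shape: the lock covers the whole round-trip.
Future<String> writeHoldingLock(
    SimpleMutex m, List<String> log, String sql) async {
  await m.acquire();
  try {
    return await fakeSend(log, sql);
  } finally {
    m.release();
  }
}

/// Sends under the lock and releases it as soon as the send has happened.
Future<String> sendThenRelease(SimpleMutex m, List<String> log, String sql) {
  assert(m.isLocked); // fails loudly in debug builds if called unlocked
  try {
    return fakeSend(log, sql);
  } finally {
    m.release();
  }
}

/// New shape: the reply is awaited outside the lock.
Future<String> writePipelined(
    SimpleMutex m, List<String> log, String sql) async {
  await m.acquire();
  final reply = sendThenRelease(m, log, sql);
  return await reply;
}

Future<List<String>> run(
    Future<String> Function(SimpleMutex, List<String>, String) write) async {
  final log = <String>[];
  final mutex = SimpleMutex();
  await Future.wait([for (final sql in ['w1', 'w2', 'w3']) write(mutex, log, sql)]);
  return log;
}

Future<void> main() async {
  print(await run(writeHoldingLock));
  print(await run(writePipelined));
}
```

With the lock held across the reply, each send waits for the previous reply. With the lock released after the send, all three sends are logged before the first reply, which is the overlap the concurrent-burst shape is meant to measure.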

Results

Focused benchmark (writer_pipelining.dart, 7 rounds, medians)

| shape | baseline | candidate | delta |
|---|---|---|---|
| concurrent-burst (10×200 writes), pass 1 | 113.2 ms | 72.6 ms | −36% |
| concurrent-burst (10×200 writes), pass 2 | 100.5 ms | 55.3 ms | −45% |
| sequential-awaited (2000 writes), pass 2 | 72.7 ms | 74.5 ms | neutral |
| transaction-guardrail (50×10), pass 2 | 13.0 ms | 12.2 ms | neutral |

Writer wall split audit (exp 147 harness, single pass, back-to-back)

| workload | baseline wall | candidate wall | residual_us baseline → candidate |
|---|---|---|---|
| A11c baseline (0 streams) | 96.4 ms | 78.2 ms | 70,729 → 55,385 |
| A11c disjoint | 101.0 ms | 83.8 ms | 56,185 → 45,729 |
| A11c overlap | 202.5 ms | 192.1 ms | 143,527 → 139,549 |
| keyed PK subscriptions | 45.8 ms | 39.1 ms | 28,296 → 22,583 |

Residual writer/request wall drops on every workload; the pure

round-trip shape (0 streams) improves most, consistent with the change

targeting fixed per-round-trip overhead. Single-pass: direction signal

only.
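
To put a size on "improves most", the relative residual reduction can be computed from the residual_us column of the audit table. The helper name reductionPct is made up for this sketch; the inputs are the table's own numbers.

```dart
/// Relative reduction from baseline to candidate, in percent.
double reductionPct(num baseline, num candidate) =>
    (baseline - candidate) / baseline * 100;

// residual_us [baseline, candidate], copied from the audit table.
const residualUs = {
  'A11c baseline (0 streams)': [70729, 55385],
  'A11c disjoint': [56185, 45729],
  'A11c overlap': [143527, 139549],
  'keyed PK subscriptions': [28296, 22583],
};

void main() {
  residualUs.forEach((workload, us) {
    final pct = reductionPct(us[0], us[1]).toStringAsFixed(1);
    print('$workload: residual down $pct%');
  });
  // A11c baseline (0 streams): residual down 21.7%
  // A11c disjoint: residual down 18.6%
  // A11c overlap: residual down 2.8%
  // keyed PK subscriptions: residual down 20.2%
}
```

These are single-pass figures, so they carry the same direction-only caveat as the table they come from.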

Tracelite A/B (stream-rerun-dispatch direction, two passes)

The canonical stream scenarios issue writes sequentially (await per

write plus an event-loop yield), so the pipelining component never

engages there — these scenarios act as guardrails for the

persistent-port / sync-completion components.

Pass 1 (baseline collected first, candidate second) flagged

high-cardinality-fanout at +19.4% (CI 21.0..122 ms) and

many-streams-writer-throughput at +11.9% (CI 22.1..120 ms), and marked

keyed-PK too-noisy (max CV 49.9%). The candidate phase showed within-run CVs of

0.20–0.46 against the baseline phase's 0.01–0.06 — a contamination

signature, with the two sides collected in disjoint time blocks.
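
For reference, a within-run CV is the standard deviation of a run's samples divided by their mean. The sketch below assumes the sample (n − 1) standard deviation; the tracelite harness may use a different estimator, and coefficientOfVariation is a hypothetical name. The sample values in main are illustrative, not measurements from this experiment.

```dart
import 'dart:math';

/// Coefficient of variation: sample standard deviation divided by the mean.
/// Assumes at least two samples and a non-zero mean.
double coefficientOfVariation(List<double> samples) {
  final mean = samples.reduce((a, b) => a + b) / samples.length;
  final sumSquares = samples
      .map((x) => (x - mean) * (x - mean))
      .reduce((a, b) => a + b);
  final sampleStdDev = sqrt(sumSquares / (samples.length - 1));
  return sampleStdDev / mean;
}

void main() {
  // Illustrative values only (not from the experiment).
  final steady = [100.0, 101.0, 99.0, 100.0];
  final drifting = [100.0, 140.0, 90.0, 170.0];
  print(coefficientOfVariation(steady).toStringAsFixed(3));
  print(coefficientOfVariation(drifting).toStringAsFixed(3));
}
```

A phase whose CV is an order of magnitude above its counterpart's, as in pass 1, points at the environment during that time block rather than at the code under test.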

Pass 2 (order flipped: exp-159 code collected first, main second)

showed all three scenarios neutral with tight CVs (0.01–0.03 for the

exp-159 phase):

| scenario | delta (main vs exp-159) | 95% CI | verdict |
|---|---|---|---|
| high-cardinality-fanout | +1.02% | −1.73..9.11 ms | neutral |
| many-streams-writer-throughput | −0.26% | −12.9..9.89 ms | neutral |
| keyed-pk-subscriptions | +4.00% | −9.88..34.5 ms | neutral |

exp-159's absolute medians in the clean pass (hcf 361.9–362.8 ms,

many-streams 576.5–582.3 ms) are equal to or better than main's own

medians from pass 1 (364.8–371.5 / 579.8–596.0 ms), confirming the pass-1

flags were environment drift, not the change. This follows exp 144's

two-independent-passes rule for regression flags; the order flip is what

makes the second pass discriminating.

Decision

In Review (accept-shaped). The change is a pure overhead removal on

the per-write round-trip plus a scheduling-model change for concurrent

standalone writes:

at — improve 36–45% on the focused benchmark.

keyed-PK (the exp 148 killer).

every audit workload.

two new tests (rollback isolation, FIFO error slot consumption).

Future Notes

sequentially, so the pipelining win is only visible in the focused

concurrent-burst benchmark. Promoting a concurrent-write scenario into

release or tracelite coverage (exp 116 pattern) would make regressions

in this path publicly visible.

itself (port wake + event-loop scheduling in both isolates). Further

reduction needs request batching across calls (group commit) or a

different transport (shared memory, when the Dart SDK allows it).

then all candidate runs) is vulnerable to time-correlated machine

drift. An order-flipped second pass discriminates drift from real

regressions cheaply.