Experiment 080: Dispatch budget research pass (Phase 1)

Date: 2026-04-18

Status: Phase 1 complete — measurements taken, candidates identified, no implementation yet

Goal

Before committing to any new optimization experiments (Phase 2+), figure

out where each millisecond actually goes on the workloads where resqlite

currently trails sqlite3: single inserts, point queries, merge rounds.

Methodology

Two layers of instrumentation, both designed to leave production code

near-untouched:

Layer 1 — ProfiledDatabase composition wrapper

benchmark/profile/profiled_database.dart — wraps Database (a final class, so composition rather than subclassing) and records per-call wall time via Stopwatch. No production code changed. Records go into ProfileSample objects serialized to JSON for reproducibility.
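
A minimal sketch of the wrapper's shape, assuming an execute method with roughly this signature on Database (the real resqlite surface may differ, and its import is elided):

```dart
import 'dart:convert';

// Stand-in for the real (final) resqlite Database class; only the
// member we wrap is sketched here.
abstract class Database {
  Future<void> execute(String sql, [List<Object?> args]);
}

class ProfileSample {
  ProfileSample(this.op, this.micros);
  final String op;
  final int micros;
  Map<String, Object> toJson() => {'op': op, 'us': micros};
}

class ProfiledDatabase {
  ProfiledDatabase(this._inner);
  final Database _inner; // composed, not subclassed (Database is final)
  final List<ProfileSample> samples = [];

  Future<T> _timed<T>(String op, Future<T> Function() body) async {
    final sw = Stopwatch()..start();
    try {
      return await body();
    } finally {
      samples.add(ProfileSample(op, sw.elapsedMicroseconds));
    }
  }

  Future<void> execute(String sql, [List<Object?> args = const []]) =>
      _timed('execute', () => _inner.execute(sql, args));

  String samplesAsJson() =>
      jsonEncode([for (final s in samples) s.toJson()]);
}
```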

Layer 2 — Timeline markers at isolate boundaries

Two production code edits, both one-liners:

Timeline.startSync() / finishSync() around the per-message dispatch, one pair per worker isolate.

Near-zero cost when no tracer is attached (a single load + branch in

Timeline.startSync). Enables cross-isolate breakdown in DevTools when

run with dart --observe — no custom protocol needed.
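
In context, the edit looks like this; the dispatch loop is paraphrased and the marker name is illustrative — only the startSync/finishSync pair is the actual one-line change:

```dart
import 'dart:developer';

// Paraphrased worker-isolate message loop.
Future<void> onRequest(Object request) async {
  Timeline.startSync('resqlite.dispatch'); // added marker (name illustrative)
  try {
    // ...decode the request, run it against sqlite3, post the reply...
  } finally {
    Timeline.finishSync(); // added marker
  }
}
```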

Baseline: noop dispatch floor

Added two "empty SQL" workloads to isolate pure round-trip cost:

- SELECT 1: reader-path noop, pure round-trip.
- UPDATE … WHERE 1=0: writer-path noop (acquires writer mutex, runs authorizer + preupdate, commits empty frame).
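
Roughly, in harness terms; the writer noop's table and column are placeholders, and `select` on the wrapper is assumed to mirror `execute`:

```dart
// Reader noop: pure round-trip, never touches a table.
Future<void> noopReader(ProfiledDatabase db) async {
  await db.select('SELECT 1'); // `select` assumed alongside `execute`
}

// Writer noop: full writer path (mutex, authorizer + preupdate,
// empty commit) with zero rows matched. `t`/`x` are placeholders.
Future<void> noopWriter(ProfiledDatabase db) async {
  await db.execute('UPDATE t SET x = x WHERE 1 = 0');
}
```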

Results

100 iterations per workload. All numbers in microseconds. Median + p90

+ p99 from ~10K–50K samples per op.
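
The percentile readout itself is nothing exotic; a nearest-rank sketch of what the harness computes (the real harness may interpolate differently):

```dart
/// Nearest-rank percentile over per-op timings in microseconds.
int percentile(List<int> samplesMicros, double p) {
  final sorted = [...samplesMicros]..sort();
  return sorted[(p * (sorted.length - 1)).round()];
}

// p50/p90/p99 as reported below:
// percentile(s, 0.50), percentile(s, 0.90), percentile(s, 0.99)
```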

Dispatch floors (what every call costs before doing any real work)

| Path | Floor (p50) | Floor (p90) | Floor (p99) |
| --- | --- | --- | --- |
| Reader (SELECT 1) | 7μs | 15μs | 45μs |
| Writer (UPDATE WHERE 1=0) | 10μs | 19μs | 58μs |

The 3μs gap between reader and writer is attributable to the writer's

extra per-call machinery: mutex acquisition, preupdate hook enable/

disable, dirtyTables assembly, transaction state tracking.

Target workloads

| Workload | Median total | Dispatch | Actual work | Dispatch % |
| --- | --- | --- | --- | --- |
| Single insert | 16μs | 10μs | ~6μs | 63% |
| Point query (PK lookup) | 7μs | 7μs | ~0μs | ~100% |
| Merge rounds (100-row batch) | 107μs | 10μs | ~97μs | 9% |

Immediately apparent:

  1. Point queries are at the dispatch floor. The 7μs per lookup is

essentially 100% isolate round-trip — the actual `SELECT * WHERE id

= ?` does almost no measurable work on our end (single-row result,

stmt cached, reader's work is indistinguishable from noop). To make

point queries faster we must make dispatch faster — no algorithm

change or query-plan change will help.

  2. Single inserts are ~63% dispatch-bound. The 6μs of "real work"

(prepare reuse via cache, bind, step, commit) is nearly

irreducible. Halving dispatch would halve total per-insert time.

  3. Batched writes amortize dispatch cleanly. At 100 rows per batch, dispatch is 9% of total. This is why executeBatch is already fast — batching is the fix, and we already ship it (see the sketch below). No work needed here.
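
For the record, the amortization arithmetic, with an assumed executeBatch signature and placeholder table/columns:

```dart
// One isolate round-trip carries the whole batch: the ~10μs dispatch
// floor is paid once instead of per row.
Future<void> insert100(Database db, List<List<Object?>> rows) =>
    db.executeBatch('INSERT INTO t (a, b) VALUES (?, ?)', rows);

// 100-row batch:      ~10μs dispatch + ~97μs work ≈ 107μs (~1.1μs/row)
// 100 single inserts: 100 × 16μs ≈ 1600μs                 (16μs/row)
```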

Implied ceiling vs sqlite3

Using the public dashboard's normalized-PRAGMA numbers:

| Workload | sqlite3 | resqlite | Gap | Achievable | Method |
| --- | --- | --- | --- | --- | --- |
| Point query | 5.2μs | 7μs | 1.35× | ~5.5μs (ceiling) | Cut dispatch 20–30% |
| Single insert | 8.9μs | 16μs | 1.8× | ~10–12μs | Cut dispatch 30–50% |
| Merge rounds | 95μs | 107μs | 1.13× | ~95–100μs | Minor C-side tuning |

We cannot beat sqlite3 on these workloads without removing the

writer isolate (which kills the Flutter UI-thread story). But the gaps

can narrow meaningfully: the 1.8× single-insert gap could become

~1.2× with a 30–50% dispatch reduction.

Tail latency observations

p99 is 3–10× median across every workload. One outlier on single

insert was 34 ms (vs 16μs median — 2000×!). Candidates:

- Checkpointing (exp 029): if a passive checkpoint fires mid-insert, it blocks the writer.
- GC: allocation churn (message lists, result lists) can trigger generational GC.
- Scheduling: the OS can leave the writer isolate unscheduled for longer than usual.

A separate p99-reduction pass might be worth more than absolute

median-reduction work, because Flutter UI quality is dominated by

worst-case frame budget. A 2× p99 improvement ≫ a 10% p50 improvement

for perceived smoothness.

Benchmark-fidelity side finding

The Sync Burst merge-rounds benchmark reports 251μs per batch, but

my isolated profile shows 107μs per batch. The gap (~144μs) is the

cost of having an active SELECT COUNT(*) stream during the write:

each committed batch triggers the stream-engine path (writeGen bump,

_scheduleReQuery, reader hash-recompute, etc.), and that work adds up.

That's not a bug — it's a legitimate real-world cost. But it means

benchmark numbers for "write cost" on scenarios with active streams

are actually measuring "write + stream invalidation". Useful to know

when attributing gains/losses across experiments.

Candidate optimizations (Phase 2 portfolio)

Ordered by (expected impact × confidence) / effort.

Tier 1 — highest expected impact

C1. Reduce writer dispatch floor (10μs → target 6–7μs)

- Preupdate hook: skip the enable if no streams are active? (Partial exp 077 territory, but the hook enable/disable itself isn't what 077 optimized.) See the sketch below.
- dirtyTables assembly: is the zero-write path still paying for a traversal?
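
A sketch of the first idea; WriterState and every member on it are hypothetical names, not resqlite internals:

```dart
// Hypothetical shape of "skip the preupdate enable when nobody listens".
class WriterState {
  int activeStreamCount = 0;
  bool preupdateEnabled = false;

  void enablePreupdateHook() {/* FFI call */}
  void disablePreupdateHook() {/* FFI call */}

  // Called once per write dispatch. Today the enable/disable happens
  // unconditionally; the candidate makes it observer-dependent.
  void syncPreupdateHook() {
    if (activeStreamCount > 0 && !preupdateEnabled) {
      enablePreupdateHook();
      preupdateEnabled = true;
    } else if (activeStreamCount == 0 && preupdateEnabled) {
      disablePreupdateHook();
      preupdateEnabled = false;
    }
  }
}
```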

C2. Reduce reader dispatch floor (7μs → target 5μs)

- The reader floor is the noop baseline itself, so any reduction here is pure point-query QPS gain.
- Attribute the 7μs across its four legs (main→reader send, reader receive, reader→main send, main receive). Probably need more Timeline markers or manual instrumentation; see the sketch below.
- At this scale, per-call request+response allocation may be visible.
- Scope: one measurement experiment (breaking a single call into its legs) + one optimization experiment based on findings. 3–5 day effort together.
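
One low-tech option for the leg split: stamp messages with Timeline.now (monotonic microseconds, comparable across isolates within one process). All message and field names here are hypothetical:

```dart
import 'dart:developer' show Timeline;

// Request stamped at send; the reader stamps receipt and reply.
class TimedRequest {
  TimedRequest(this.sql) : mainSentUs = Timeline.now; // leg 1 starts
  final String sql;
  final int mainSentUs;
}

class TimedResponse {
  TimedResponse(this.rows, this.readerReceivedUs, this.readerSentUs);
  final List<Object?> rows;
  final int readerReceivedUs; // end of leg 1 (reader receive)
  final int readerSentUs;     // start of leg 3 (reader→main send)
  // Leg 4 ends at Timeline.now when main receives the reply.
}
```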

Tier 2 — worth measuring before committing

C3. Tail-latency / p99 investigation

- Instrument checkpoint timing relative to ongoing requests; correlate with p99 spikes. Either (a) tune the checkpoint cadence, or (b) offload checkpoints to a dedicated mini-isolate so they never block the writer. See the sketch below.
- Effort: days, including measurement.
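
The measurement half could be as simple as recording checkpoint windows and flagging overlapping requests; the checkpoint hooks implied here are hypothetical:

```dart
// (startUs, endUs) windows recorded from hypothetical checkpoint hooks.
final List<(int, int)> checkpointWindows = [];

/// True if a request's lifetime overlapped any checkpoint window:
/// these are the candidates for the p99 spikes.
bool overlappedCheckpoint(int startUs, int endUs) =>
    checkpointWindows.any((w) => startUs < w.$2 && endUs > w.$1);
```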

C4. Stream-invalidation cost on batched writes

- Each committed batch triggers a full re-query even when hash suppression will drop the emission. Could we defer the actual selectIfChanged dispatch if a newer batch is already queued on the writer? See the sketch below.
- Adjacent to the existing hash-suppression work, but targets a different mechanism (defer the re-query dispatch itself, not defer the emission).
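
Shape of the deferral, with hypothetical names throughout (WriterQueue.onDrained does not exist today):

```dart
// Coalesce stream re-queries: per committed batch, mark the stream
// dirty; dispatch selectIfChanged once the writer queue drains
// instead of once per batch.
class StreamEntry {
  bool requeryPending = false;

  void onBatchCommitted(WriterQueue queue) {
    if (requeryPending) return; // a re-query is already scheduled
    requeryPending = true;
    queue.onDrained(() {
      requeryPending = false;
      dispatchSelectIfChanged(); // existing hash-suppressed path
    });
  }

  void dispatchSelectIfChanged() {/* existing stream-engine path */}
}

abstract class WriterQueue {
  void onDrained(void Function() callback); // hypothetical hook
}
```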

Tier 3 — speculative

C5. Revisit exp 055 (columnar typed arrays) with memory harness

- Memory-harness benchmarks show resqlite p90 RSS deltas of 7–33 MB on 10K-row ops. Columnar arrays would target that directly; see the sketch below.
- Not a latency play, but it could be a second-axis win.
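
Concretely, columnar means one typed buffer per column instead of a Map per row; column names here are illustrative:

```dart
import 'dart:typed_data';

// 10K rows as List<Map<String, Object?>> means one Map plus boxed
// values per row. Columnar: one contiguous typed buffer per column.
class ColumnarResult {
  ColumnarResult(int rowCount)
      : ids = Int64List(rowCount),      // 8 bytes/value, no boxing
        scores = Float64List(rowCount); // ditto
  final Int64List ids;
  final Float64List scores;
}
```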

C6. FFI call consolidation

- A single statement execution still spans several FFI calls (prepare, bind, step, reset, finalize). exp 009 batched some of these. Are there still multi-FFI paths that could be collapsed into a single C entry point? See the sketch below.
- Probably best explored inside C1/C2.
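
In dart:ffi terms the consolidation would look like this; resqlite_exec_prepared is a hypothetical combined entry point, not an existing symbol:

```dart
import 'dart:ffi';

// One dart→C transition doing bind+step+reset, instead of three.
typedef ExecPreparedNative = Int32 Function(
    Pointer<Void> stmt, Pointer<Void> packedArgs);
typedef ExecPreparedDart = int Function(
    Pointer<Void> stmt, Pointer<Void> packedArgs);

// final execPrepared = dylib.lookupFunction<ExecPreparedNative,
//     ExecPreparedDart>('resqlite_exec_prepared'); // hypothetical symbol
```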

Tier 4 — explicitly deferred

Fidelity wins this phase delivered

Beyond "we know where time goes," this instrumentation pays ongoing

dividends:

  1. The 2 Timeline markers let any future profiling session (DevTools,

dart --observe) see cross-isolate breakdowns without re-adding

instrumentation. Standard Dart idiom, essentially free.

  2. ProfiledDatabase wrapper is reusable — any future experiment

can instantiate it to capture per-call timing for a specific

workload. No permanent production code needed.

  3. The dispatch_budget.dart harness produces a reproducible JSON

that can be diff'd across experiments. Run before/after an

optimization to measure its exact impact on dispatch vs work split,

not just aggregate wall time.

  4. Baseline noop floors (7μs reader / 10μs writer) — gives every

future experiment a "this is the floor, don't expect to go below"

anchor. Exp 063 (SelectOne fast path) was rejected for "below

benchmark floor" — if we'd had these numbers, we'd have known it

was because the benchmark was noise-limited AT the dispatch floor,

not because the optimization was useless.

  5. The noop-subtract technique: total_time - noop_baseline =

actual_work_time. Lets us split dispatch from work cleanly without

needing in-isolate instrumentation beyond what's there. This is

the biggest methodological unlock — future experiments can report

"this saves X μs of work on top of Y μs of unavoidable dispatch"

instead of just raw total-time deltas (see the sketch below).
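
As a harness-side helper, using the floors measured above:

```dart
/// total_time - noop_baseline = actual_work_time.
({int dispatchUs, int workUs}) splitCost(int totalUs, int noopFloorUs) =>
    (dispatchUs: noopFloorUs, workUs: totalUs - noopFloorUs);

// Example from the table above: single insert on the writer path.
// splitCost(16, 10) → (dispatchUs: 10, workUs: 6)
```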

Recommendation

Proceed to Phase 2 with C1 as the first experiment. Concrete

hypothesis ("writer's 3μs overhead over reader is from zero-write-path

authorizer/preupdate/dirtyTables work"), measurable acceptance

criterion (writer floor ≤ 8μs), well-scoped effort (2–3 days).

If C1 is accepted, C2 (reader floor reduction) is the next logical

target. C3 (tail latency) can run in parallel because it's orthogonal

(p99 vs p50).

Explicitly not recommending:

- Further merge-rounds tuning: the remaining gap vs sqlite3 there is irreducible given the architecture.

Reproducing this analysis

```sh
# Check out this branch
git checkout exp-080-dispatch-profile

# Run the harness
dart run benchmark/profile/dispatch_budget.dart

# For cross-isolate timeline (optional — requires DevTools):
dart --observe --profile-period=100 benchmark/profile/dispatch_budget.dart
# Open the service URL in DevTools → Performance tab → record.

# Raw JSON output goes to:
ls -t benchmark/profile/results/dispatch_budget_*.json
```

Production code footprint: 2 added Timeline markers (one per worker

isolate), 2 new imports. All other code lives in benchmark/profile/

and can be deleted without consequence.